Python Forum
Selecting the first occurrence of a duplicate
#1
Hi all,

I'm new to coding and thought I would try and practice a little website scraping.

I've come across a case where an element (?) I'm trying to retrieve is duplicated, and I'm not sure how to extract only one instance of the value.


Here's the code I'm using (please note that "my practice url" does contain an actual url of a website in my code!):

from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "[my practice url]"

# Download the page HTML
webpage = urlopen(Request(url)).read()

soup = BeautifulSoup(webpage, 'html.parser')

# Collect every <div class="col price"> on the page
prices = soup.find_all("div", {"class": "col price"})

for price in prices:
    income = price.span.text
    print(income)
To illustrate what's happening, if I comment out the loop and print(prices) instead, here's an extract of what I get:

Output:
</div>, <div class="col price"><span>$139,501</span></div>, <div class="col price"> <span>$139,501</span> </div>, <div class="col price"><span>$137,349</span></div>, <div class="col price"> <span>$137,349</span> </div>, <div class="col price"><span>$132,955</span></div>, <div class="col price"> <span>$132,955</span> </div>, <div class="col price"><span>$129,000</span></div>, <div class="col price"> <span>$129,000</span>
So as you can see, within each line, the price amount is duplicated between the span tag. So, if I was to run the code above with the loop enabled, I'd see:

Output:
$139,501
$139,501
$137,349
$137,349
$132,955
$132,955
$129,000
$129,000
What I'm trying to achieve is to obtain one instance of each number, but without any experience, I'm stuck.

I read about pandas and I've installed that and tried adding: import pandas as pd to the top of my code, but after reading more about it and watching a few videos, I'm not sure how to apply it to my code.

Could you please advise me on what needs to be done?

Thanking you :)
perfringo write May-22-2021, 05:16 AM:
Nobody expects the Spanish Inquisition! Our chief weapon is surprise! Surprise and fear. Fear and surprise. Let me tell you something: when you're looking at your thread tonight and manic silence meets you don't come cryin' to me. Instead do use respective tags while posting code, output and errors (refer to BBCode help). This empowers others to help you. And... Always Look on the Bright Side of Life: I added them this time but if in the future you do it all by yourself I feel happy.
#2
First, instead of using urllib.request, use requests, which is installed with pip install requests.
prices can be accessed by index, e.g. prices[1] for the second item.
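
As a rough illustration of what indexing means here: the list below is a hypothetical stand-in that holds only the price strings from the output above, not the real tags returned from the page.

```python
# Hypothetical stand-in for the scraped values: plain strings instead of tags.
prices = ["$139,501", "$139,501", "$137,349", "$137,349"]

first = prices[0]    # indexes start at 0, so this is the first item
second = prices[1]   # the second item, which duplicates the first here
last = prices[-1]    # negative indexes count from the end of the list

print(first, second, last)
```

Indexing only picks out one item at a time, so on its own it does not remove the duplicates; it is the building block for stepping through every other item.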
#3
(May-22-2021, 10:13 PM)Larz60+ Wrote: first, instead of using urllib.request, use requests which is installed with pip install requests
prices can be accessed by index, ex: prices[1] for second item

Hi Larz60+,

Thanks very much for helping out.

I've changed the code to use the requests function as you suggested, but after spending many hours fiddling and trialing it in different spots with your suggestion about using prices[1], I've failed. To be honest I don't know where to include it so it works correctly.

Here's my latest revised code:

import requests
from bs4 import BeautifulSoup

url = "[my practice url]"

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.find_all("div", {"class": "col price"})

for price in prices:
    income = prices[1]  # has no effect: overwritten on the next line
    income = price.span.text
    print(income)
The result from running this code is as follows:

Output:
$1,618,347
$1,618,347
$1,185,566
$1,185,566
$1,123,378
$1,123,378
$1,007,209
$1,007,209
$949,749
$949,749
$590,029
$590,029
$281,798
$281,798
As you can see, I'm still getting the repeated text on each loop (e.g. $1,618,347 printed twice, $1,185,566 printed twice, etc.).

Would you be able to spell it out where/how to modify my code to return a non duplicated result please? eg ($1,618,347, $1,185,566, $1,123,378, $1,007,209 etc)

Looking forward to finally having that 'ah ha' moment! :)

Thanks for your time and help.
#4
I wasn't suggesting that you use prices[1]; I was just showing that prices can be indexed.
You can find out how many prices there are with the len function: numprices = len(prices)
#5
(May-23-2021, 11:36 PM)Larz60+ Wrote: I wasn't suggesting that you use prices[1], I was just showing that prices can be indexed.
you can find out how many prices there are with the len function: numprices = len(prices)

Sorry I am totally lost. Can you please tell me what I need to add (and where) to get this working?

I've spent hours and hours trying to figure this out but unfortunately as a beginner, I have no idea.

Thanking you.
#6
Hi Knight2000, Larz60+ explained "prices" is a list which can be indexed: prices[0], prices[1], prices[2], ... prices[len(prices)-1] .
What you need to do is print only the prices with even indexes: prices[0], prices[2], prices[4], ... .
This can be done like this:
prices = ["$139,501",
"$139,501",
"$137,349",
"$137,349",
"$132,955",
"$132,955",
"$129,000",
"$129,000"]

for i in range(0, len(prices), 2):
    print(f"index={i}: {prices[i]}")
Output:
index=0: $139,501
index=2: $137,349
index=4: $132,955
index=6: $129,000
Now try to incorporate this in the previous code and let us know if it works. If not: show the code and the result (and the complete error message).
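
A slice with a step is another way to express the same idea. This is only a sketch on the sample list of strings, and it assumes the duplicates always arrive as adjacent pairs, as they do in the output shown in this thread.

```python
# Sample data copied from the output in this thread.
prices = [
    "$139,501", "$139,501",
    "$137,349", "$137,349",
    "$132,955", "$132,955",
    "$129,000", "$129,000",
]

# prices[start:stop:step] with step 2 keeps indexes 0, 2, 4, ...
every_other = prices[::2]

for price in every_other:
    print(price)
```

The result of find_all behaves like a list, so the same slice should work on the scraped tags as well, with .span.text applied to each item inside the loop.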
#7
Hi ibreeden

Thanks so much for chiming in on this.

It finally works!! :D

I used this with the previous code:

for i in range(0, len(prices), 2):
    print(prices[i].span.text)
I'm not sure if that's the correct or best way to go about it, but the results are:

Output:
$139,501
$137,349
$132,955
$129,000
Perfect. That's what I wanted to see.

Thanks so much for your guidance and time (and you too Larz60+). Can't wait to learn more and experiment.
#8
(May-24-2021, 11:05 AM)knight2000 Wrote: I'm not sure if that's the correct or best way to go about it, but the results are:
You could probably fix this earlier, when parsing out the values, so that you don't get duplicates in the first place, for example by using CSS selectors with BeautifulSoup's select() and select_one() to target the right tags.
range(len(sequence)) is considered bad style in most cases,
but it can be justified here because it uses the step parameter of range(), which enumerate() does not have.
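
The real page structure isn't shown in this thread, so the following is only a sketch of how a CSS selector could narrow the match. The wrapper classes "desktop" and "mobile" are hypothetical, invented to stand in for whatever actually distinguishes the two copies of each price on the page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each row holds the same price twice, once per layout.
# The "row", "desktop" and "mobile" class names are assumptions for illustration.
html = """
<div class="row">
  <div class="desktop"><div class="col price"><span>$139,501</span></div></div>
  <div class="mobile"><div class="col price"> <span>$139,501</span> </div></div>
</div>
<div class="row">
  <div class="desktop"><div class="col price"><span>$137,349</span></div></div>
  <div class="mobile"><div class="col price"> <span>$137,349</span> </div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns every match; restricting to one wrapper skips the other copy.
spans = soup.select("div.desktop div.col.price > span")
texts = [span.get_text(strip=True) for span in spans]

# select_one() returns only the first match (or None if nothing matches).
first = soup.select_one("div.col.price > span").get_text(strip=True)

print(texts)
print(first)
```

On a real page, the browser's developer tools can show which parent element separates the two copies, and that element would take the place of the hypothetical "desktop" class in the selector.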

Another way is to remove the duplicates first, then loop.
prices = [
    "$139,501",
    "$139,501",
    "$137,349",
    "$137,349",
    "$132,955",
    "$132,955",
    "$129,000",
    "$129,000",
]

for item in dict.fromkeys(prices):
    print(item)
Output:
$139,501
$137,349
$132,955
$129,000
To just remove duplicates without caring about order, use set().
>>> set(prices)
{'$132,955', '$139,501', '$129,000', '$137,349'}
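
One thing to keep in mind is that removing duplicates by value also drops genuine repeats. The sketch below uses hypothetical data in which a third listing happens to have the same price as the first, and compares value-based removal with keeping every other item.

```python
# Hypothetical data: three listings, each printed twice;
# the third listing costs the same as the first.
prices = [
    "$139,501", "$139,501",
    "$137,349", "$137,349",
    "$139,501", "$139,501",
]

# dict keys are unique and keep insertion order (guaranteed since Python 3.7).
by_value = list(dict.fromkeys(prices))

# Keeping every other item removes only the adjacent copy.
by_position = prices[::2]

# A set is also unique by value, but its iteration order is not defined.
as_set = set(prices)

print(by_value)
print(by_position)
```

Here by_value holds two prices while by_position holds three, so which approach fits depends on whether two listings with the same price should count once or twice.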
#9
(May-24-2021, 11:50 AM)snippsat Wrote:
To just remove duplicates without caring about order, use set().
>>> set(prices)
{'$132,955', '$139,501', '$129,000', '$137,349'}

Hi snippsat

Thank you for giving me some pointers to improve my code to higher standards. Haven't heard of the 'set' function before. Will look up to try and learn more about it. Interesting.

Thanks again.
Cheers.