Selecting the first occurrence of a duplicate

Selecting the first occurrence of a duplicate - knight2000 - May-22-2021

Hi all,

I'm new to coding and thought I would try and practice a little website scraping.

I've come across a case where the element I'm trying to retrieve is duplicated, and I'm not sure how to extract only one instance of the value.

Here's the code I'm using (please note that "my practice url" does contain an actual url of a website in my code!):

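from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = "[my practice url]"

# Fetch the page. This line was missing from the post; it is an assumed
# reconstruction of how `webpage` was defined.
webpage = urlopen(Request(url)).read()

soup = BeautifulSoup(webpage, 'html.parser')

# Collect every <div class="col price"> on the page
prices = soup.find_all("div", {"class": "col price"})

for price in prices:
    income = price.span.text  # text of the <span> inside the div
    print(income)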

To illustrate what's happening: if I comment out the loop and run print(prices) instead, here's an extract of what I get:

Output:
</div>, <div class="col price"><span>$139,501</span></div>, <div class="col price">
<span>$139,501</span>
</div>, <div class="col price"><span>$137,349</span></div>, <div class="col price">
<span>$137,349</span>
</div>, <div class="col price"><span>$132,955</span></div>, <div class="col price">
<span>$132,955</span>
</div>, <div class="col price"><span>$129,000</span></div>, <div class="col price">
<span>$129,000</span>

So as you can see, each price appears twice: every <div class="col price"> block is followed by a second copy containing the same span. If I run the code above with the loop enabled, I see:

Output:
$139,501
$139,501
$137,349
$137,349
$132,955
$132,955
$129,000
$129,000

What I'm trying to achieve is to obtain one instance of each number, but without any experience, I'm stuck.

I read about pandas, installed it, and tried adding import pandas as pd to the top of my code, but after reading more about it and watching a few videos, I'm still not sure how to apply it to my code.

Could you please advise me on what needs to be done?

Thanking you :)

RE: Selecting the first occurrence of a duplicate - Larz60+ - May-22-2021

First, instead of using urllib.request, use requests, which is installed with pip install requests.
prices can be accessed by index, e.g. prices[1] for the second item.

RE: Selecting the first occurrence of a duplicate - knight2000 - May-23-2021

Hi Larz60+,

Thanks very much for helping out.

I've changed the code to use the requests library as you suggested, but after spending many hours trying your prices[1] suggestion in different spots, I've failed. To be honest, I don't know where to put it so that it works correctly.

Here's my latest revised code:

import requests
from bs4 import BeautifulSoup

url = "[my practice url]"

page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
prices = soup.find_all("div", {"class": "col price"})

for price in prices:
    income = prices[1]  # has no effect: overwritten by the next line
    income = price.span.text
    print(income)

The result from running this code is as follows:

Output:
$1,618,347
$1,618,347
$1,185,566
$1,185,566
$1,123,378
$1,123,378
$1,007,209
$1,007,209
$949,749
$949,749
$590,029
$590,029
$281,798
$281,798

As you can see, I'm still getting the repeated text on each loop (e.g. $1,618,347 printed twice, $1,185,566 printed twice, etc.).

Would you be able to spell out where and how to modify my code to return a non-duplicated result, please? (e.g. $1,618,347, $1,185,566, $1,123,378, $1,007,209, etc.)

Looking forward to finally having that 'ah ha' moment! :)

Thanks for your time and help.

RE: Selecting the first occurrence of a duplicate - Larz60+ - May-23-2021

I wasn't suggesting that you use prices[1]; I was just showing that prices can be indexed.
You can find out how many prices there are with the len function: numprices = len(prices)
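
As a minimal sketch of what indexing and len look like in practice (the list below is a hypothetical stand-in for the scraped tags, using a few of the price strings from the output above):

```python
# Hypothetical stand-in for the scraped results: plain strings instead of tags.
prices = ["$1,618,347", "$1,618,347", "$1,185,566", "$1,185,566"]

numprices = len(prices)        # number of items in the list
first = prices[0]              # indexes start at 0
second = prices[1]             # second item (here, the duplicate of the first)
last = prices[numprices - 1]   # last item; prices[-1] is equivalent

print(numprices, first, second, last)
```

Indexing only reads one item from the list; on its own it does not skip or remove anything in a loop.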

RE: Selecting the first occurrence of a duplicate - knight2000 - May-24-2021

Sorry I am totally lost. Can you please tell me what I need to add (and where) to get this working?

I've spent hours and hours trying to figure this out but unfortunately as a beginner, I have no idea.

Thanking you.

RE: Selecting the first occurrence of a duplicate - ibreeden - May-24-2021

Hi Knight2000, Larz60+ explained that "prices" is a list which can be indexed: prices[0], prices[1], prices[2], ... prices[len(prices)-1].
Since every price appears twice in a row, what you need to do is print only the prices at even indexes: prices[0], prices[2], prices[4], ...
This can be done like this:

prices = ["$139,501",
"$139,501",
"$137,349",
"$137,349",
"$132,955",
"$132,955",
"$129,000",
"$129,000"]

for i in range(0, len(prices), 2):
    print(f"index={i}: {prices[i]}")

Output:
index=0: $139,501
index=2: $137,349
index=4: $132,955
index=6: $129,000

Now try to incorporate this in the previous code and let us know if it works. If not: show the code and the result (and the complete error message).
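
The same every-second-item selection can also be written as a slice. This is only a sketch, and it assumes, as in the output shown earlier in the thread, that each price is always followed immediately by exactly one duplicate. The list of strings stands in for the scraped tags.

```python
# Stand-in data taken from the output earlier in the thread.
prices = ["$139,501", "$139,501", "$137,349", "$137,349",
          "$132,955", "$132,955", "$129,000", "$129,000"]

# prices[::2] keeps indexes 0, 2, 4, ... (start at 0, step of 2).
firsts = prices[::2]

for price in firsts:
    print(price)
```

With the real scraped list, each item is a tag rather than a string, so the loop body would presumably be print(price.span.text) instead.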

RE: Selecting the first occurrence of a duplicate - knight2000 - May-24-2021

Hi ibreeden

Thanks so much for chiming in on this.

It finally works!! :D

I used this with the previous code:

for i in range(0, len(prices), 2):
    print(prices[i].span.text)

I'm not sure if that's the correct or best way to go about it, but the results are:

Output:
$139,501
$137,349
$132,955
$129,000

Perfect, that's what I was wanting to see.

Thanks so much for your guidance and time (and you too Larz60+). Can't wait to learn more and experiment.

RE: Selecting the first occurrence of a duplicate - snippsat - May-24-2021

(May-24-2021, 11:05 AM)knight2000 Wrote: I'm not sure if that's the correct or best way to go about it, but the results are:

You could probably fix this earlier, at the parsing stage, so that no duplicates are collected in the first place, for example by using CSS selectors with BeautifulSoup's select() and select_one() to target only the right tags.
range(len(sequence)) is considered bad style in most cases,
but it can maybe be justified here because it uses the step parameter of range(), which enumerate() does not have.
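
As a rough sketch of the CSS-selector idea: one possible reason for this kind of duplication is that each listing holds the same price in two different containers. The page structure is not shown in this thread, so the HTML and the wrapper class names listing, summary and details below are purely hypothetical; on a real page the right selector has to be found by inspecting the HTML.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the "listing", "summary" and "details" classes are
# assumptions for illustration, not taken from the actual site.
html = """
<div class="listing">
  <div class="summary"><div class="col price"><span>$139,501</span></div></div>
  <div class="details"><div class="col price">
<span>$139,501</span>
</div></div>
</div>
<div class="listing">
  <div class="summary"><div class="col price"><span>$137,349</span></div></div>
  <div class="details"><div class="col price">
<span>$137,349</span>
</div></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Select only the price spans that sit inside a "summary" wrapper,
# so the copies inside "details" are never collected.
spans = soup.select("div.summary div.col.price > span")
incomes = [span.get_text(strip=True) for span in spans]
print(incomes)

# select_one() returns just the first match (or None if nothing matches).
first_span = soup.select_one("div.summary div.col.price > span")
print(first_span.get_text(strip=True))
```

If the two copies turn out to be indistinguishable in the markup, a selector cannot separate them, and filtering the list afterwards remains the practical option.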

Another way is to remove the duplicates first, then loop.

prices = [
    "$139,501",
    "$139,501",
    "$137,349",
    "$137,349",
    "$132,955",
    "$132,955",
    "$129,000",
    "$129,000",
]

for item in dict.fromkeys(prices):
    print(item)

Output:
$139,501
$137,349
$132,955
$129,000

To just remove duplicates without caring about order, use set().

>>> set(prices)
{'$132,955', '$139,501', '$129,000', '$137,349'}
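
One caveat worth illustrating: removing duplicates by value also merges prices that are legitimately repeated, for example if two different listings happened to have the same price. The data below is made up to show the difference; it is not from the site.

```python
# Made-up data: two different listings share the price "$139,501",
# and every listing appears twice in the scraped list.
prices = ["$139,501", "$139,501",
          "$137,349", "$137,349",
          "$139,501", "$139,501"]

by_value = list(dict.fromkeys(prices))  # unique values, first-seen order kept
by_position = prices[::2]               # every second item, repeats kept

print(by_value)
print(by_position)
```

Here by_value ends up with two prices and by_position with three, so which approach fits depends on whether equal prices from different listings should be kept.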

RE: Selecting the first occurrence of a duplicate - knight2000 - May-25-2021

Hi snippsat

Thank you for giving me some pointers to improve my code to a higher standard. I haven't heard of the 'set' function before. I'll look it up and try to learn more about it. Interesting.

Thanks again.
Cheers.