Python Forum
For Loop Returning 3 Results When There Should Be 1 - Printable Version



For Loop Returning 3 Results When There Should Be 1 - knight2000 - Sep-26-2021

Hi Guys,

After trying to figure this one out for over 8 hours, I thought I would get a fresh perspective from someone.

I'm practicing some web scraping and have a fairly simple goal: find an object and, if it exists, extract some data from it (shipping information); if it doesn't exist, enter a placeholder such as " ". (I'm going to be using pandas, so I need to record something when the object can't be found; otherwise I know I'll get the "ValueError Arrays Must be All Same Length" error.)

I've tried many things to do this, but I'm unable to successfully:
1) capture where the object doesn't exist; and
2) accurately get data from when the object does exist.

My current iteration of the code is:

from bs4 import BeautifulSoup

with open("out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    for item in soup:
        mt2 = soup.find('span', {'class': 'w_A w_C w_B mr1 mt1 ph1'})
        if mt2 is None:
            print('There is no record')
        else:
            print (mt2)
When I run this, I get:
Output:
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
I'm not sure why I'm getting 3 instances of this when the data only contains 1? (The object I'm looking for is "w_A w_C w_B mr1 mt1 ph1")

Additionally, there is one record in the dataset that doesn't contain the object, but the output never shows my print statement ('There is no record').

Could someone please shed some light on what I'm doing incorrectly?

Thank you.


RE: For Loop Returning 3 Results When There Should Be 1 - SamHobbs - Sep-26-2021

Can you use the following?

from bs4 import BeautifulSoup
 
with open(r"out_of_stock2.html", encoding="utf8") as fp:
	soup = BeautifulSoup(fp, 'html.parser')
	print(len(soup))
	mt2 = soup.find('span', {'class': 'w_A w_C w_B mr1 mt1 ph1'})
	if mt2 is None:
		print('There is no record')
	else:
		print (mt2)



RE: For Loop Returning 3 Results When There Should Be 1 - knight2000 - Sep-26-2021

Hi Sam,

Thanks for chiming in.

I tried your code and got:

Output:
3
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
It's still not reporting the record where the object is missing (it should find one instance of the object and one record with no instance of it), so it's still failing at:
if mt2 is None:
		print('There is no record')
The reason I used a loop is the real file contains about 40 records (I've just taken a sample of two records to troubleshoot), so I thought a loop would be required to go through each and look for that object?


RE: For Loop Returning 3 Results When There Should Be 1 - SamHobbs - Sep-26-2021

I assume that len(soup) being 3 explains why you are getting 3 when you expect 1, but the BeautifulSoup documentation is not clear about what it is.
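One plausible reading of that 3, sketched under assumptions: the length of a BeautifulSoup object is the number of its direct children, and iterating over it yields those same children. The real out_of_stock2.html is not shown in the thread, so the two-record markup below is a hypothetical stand-in. With html.parser, the newline between the two top-level tags counts as a child too, which would give 3.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample; the real file is not shown in the thread.
html = '<div class="rec">first</div>\n<div class="rec">second</div>'

soup = BeautifulSoup(html, "html.parser")

# len(soup) counts the direct children of the parsed document.
# Iterating over soup walks those same children, text nodes included.
children = list(soup)
print(len(soup))
for child in children:
    print(type(child).__name__, repr(str(child)))
```

If this is what happens in the real file, the original for loop simply ran its body three times, once per top-level child, and repeated the same document-wide search each time.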


RE: For Loop Returning 3 Results When There Should Be 1 - snippsat - Sep-26-2021

You should not loop over the soup object, knight2000; it's not needed and can give unwanted results.
The length depends on the parser used, so if I use lxml (recommended) as the parser, the length will be one.
from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(len(soup))
    mt2 = soup.find('span', class_="w_A w_C w_B mr1 mt1 ph1")
    if mt2 is None:
        print('There is no record')
    else:
        print (mt2)
Output:
1
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
It's easier to use class_="w_A w_C w_B mr1 mt1 ph1" than to pass a dictionary.
Then you can just copy the CSS class from the website and add one _ after class.
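A small sketch of one caveat with this shortcut: a multi-word class_ string is compared against the class attribute as written, so the same classes in a different order do not match. A CSS selector passed to select() matches on the set of classes instead. The markup below is made up for illustration and is not taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the second span has the same classes in another order.
html = (
    '<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>'
    '<span class="ph1 mt1 mr1 w_B w_C w_A">2-day shipping</span>'
)
soup = BeautifulSoup(html, "html.parser")

# Exact string match on the class attribute: order matters.
exact = soup.find_all("span", class_="w_A w_C w_B mr1 mt1 ph1")

# CSS selector: every listed class must be present, in any order.
by_selector = soup.select("span.w_A.w_C.w_B.mr1.mt1.ph1")

print([s.text for s in exact])
print([s.text for s in by_selector])
```

For a page where the class string is always written the same way, the class_ form copied from the site is the simpler choice.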


RE: For Loop Returning 3 Results When There Should Be 1 - deanhystad - Sep-26-2021

You are doing something similar to this:
soup = {'A':1, 'B':2, 'C':3}
class_ = 'B'
for item in soup:
    mt2 = soup.get(class_)
    if mt2:
        print(mt2)
    else:
        print('There is no record')
Output:
2
2
2
In this example and yours you will get a different item each time you iterate through soup, but soup either contains "class_" or not, and that is independent of the current item.

How you fix this depends on what you want to get from soup. From your description I think you would find all span and iterate through those items, comparing the item's class against your pattern. Something like this:
from bs4 import BeautifulSoup
 
with open("out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    for item in soup.find('span'):
        if item['class_'] == "w_A w_C w_B mr1 mt1 ph1":
            print(item)
        else:
            print ('No match')



RE: For Loop Returning 3 Results When There Should Be 1 - knight2000 - Sep-26-2021

You're spot on, Sam. After replying to you, I mulled it over and realized that the 3 from your code was a clue as to why I was getting 3 results.

(Sep-26-2021, 05:29 AM)SamHobbs Wrote: I assume that len(soup) being 3 explains why you are getting 3 when you expect 1, but the BeautifulSoup documentation is not clear about what it is.



RE: For Loop Returning 3 Results When There Should Be 1 - knight2000 - Sep-26-2021

Hi snippsat,

Thank you for your advice about not looping over soup. I had tried over 30 different methods to get this data, and most of them didn't loop over soup, but after all those failures I tried soup, and of course that didn't work either! Good to know never to use it for looping.

Also, thank you for teaching me the easier way to call a class. That's so much easier than what I've always done. I have seen your method before, but as I'm still learning, I didn't want to try to learn too many variations and confuse myself more.

With regards to parser, I've only ever used one: html.parser.

So I followed your suggestion to use lxml and tried the following code:

from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    # Collect one div per record first
    ph1 = soup.find_all('div', class_='h-100 pb1-xl pr4-xl pv1 ph1')
    for item in ph1:
        # Search inside this record only, not the whole document
        mt1_ph1 = item.find('span', class_='w_A w_C w_B mr1 mt1 ph1')
        if mt1_ph1 is None:
            print('No data')
        else:
            print(mt1_ph1.text)
The result it returned:
Output:
No data
1-day shipping
You fixed it! Thank you so much. I've been banging my head against a wall for 2 days trying to figure it out, and honestly I probably wouldn't have thought of trying your option. Really appreciate it.
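Since the stated goal was to feed pandas columns of equal length, here is a minimal sketch of the same per-record search collecting one value per record, with "" as the placeholder. The function name shipping_per_record and the sample markup are hypothetical; only the two class strings come from the thread.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two records, only the second has the shipping span.
html = """
<div class="h-100 pb1-xl pr4-xl pv1 ph1"><p>Item A</p></div>
<div class="h-100 pb1-xl pr4-xl pv1 ph1">
  <p>Item B</p><span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
</div>
"""


def shipping_per_record(markup):
    """Return one shipping string per record, "" where the span is missing."""
    soup = BeautifulSoup(markup, "html.parser")
    records = soup.find_all("div", class_="h-100 pb1-xl pr4-xl pv1 ph1")
    values = []
    for record in records:
        span = record.find("span", class_="w_A w_C w_B mr1 mt1 ph1")
        # Always append exactly one value so the list length equals the record count.
        values.append(span.get_text(strip=True) if span is not None else "")
    return values


shipping = shipping_per_record(html)
print(shipping)
```

Because every record contributes exactly one entry, a list built this way should line up with other per-record columns when they are passed to a DataFrame together.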






RE: For Loop Returning 3 Results When There Should Be 1 - knight2000 - Sep-26-2021

Hi deanhystad,

Thanks a lot for explaining it to me. I've read your reply a few times to try to understand it.

I tried your code but I got an error:
Error:
if item['class_'] == "w_A w_C w_B mr1 mt1 ph1":
TypeError: string indices must be integers
To be honest, I'm not too sure what that means, but I seem to have had success with the code by changing the parser from html.parser to lxml.

Thank you for the time you invested in helping me.

Have a great one.
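A likely explanation of that TypeError, offered as a sketch rather than a definitive diagnosis: soup.find('span') returns a single tag, and iterating over a tag yields its children, here the text inside the span. Indexing that string with 'class_' raises the error. In addition, the attribute key is "class" (class_ is only the keyword-argument spelling), and its value is a list of class names. The markup below is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with one matching span and two non-matching ones.
html = (
    '<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>'
    '<span class="other">Pickup</span>'
    '<span>No class at all</span>'
)
soup = BeautifulSoup(html, "html.parser")

# The class attribute is stored as a list, so compare against a list.
wanted = "w_A w_C w_B mr1 mt1 ph1".split()

results = []
for item in soup.find_all("span"):     # find_all returns every span tag
    classes = item.get("class", [])    # [] when the span has no class attribute
    if classes == wanted:
        results.append(item.get_text())
    else:
        results.append("No match")

print(results)
```

Using item.get("class", []) instead of item["class"] avoids a KeyError on spans that have no class attribute.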




RE: For Loop Returning 3 Results When There Should Be 1 - SamHobbs - Sep-26-2021

(Sep-26-2021, 11:05 AM)knight2000 Wrote: So I followed your suggestion to use lxml and tried the following code:
In your fixed code you first find the relevant div elements and then look for the relevant span element inside each one, and I think the requirement for the div elements was not in the original question. You were saying the code did not determine when there was no match, and I could not understand what that meant.
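One way to read the "no match" problem, sketched with made-up markup: a search on the whole document returns the first matching span anywhere, so it is never None as long as any record contains one. Only a search scoped to each record can report a record without a match. The class names rec and ship are short stand-ins, not the ones from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first record has no span, the second has one.
html = """
<div class="rec"><p>Item A</p></div>
<div class="rec"><p>Item B</p><span class="ship">1-day shipping</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Document-wide search: finds the span in the second record,
# so the missing span in the first record goes unnoticed.
whole_document = soup.find("span", class_="ship")

# Per-record search: each div is searched on its own.
per_record = [
    div.find("span", class_="ship")
    for div in soup.find_all("div", class_="rec")
]

print(whole_document is None)
print([hit is None for hit in per_record])
```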