Python Forum
For Loop Returning 3 Results When There Should Be 1
#1
Hi Guys,

After trying to figure this one out for over 8 hours, I thought I would get a fresh perspective from someone.

I'm practicing some web scraping and have what should be a pretty easy goal: find an object and, if it exists, extract some data from it (shipping information); if it doesn't exist, enter something like " ". (I'm going to be using pandas, so I need to record something when the object can't be found, otherwise I know I'll get the "ValueError: arrays must all be same length" error.)

I've tried many things to do this, but I'm unable to successfully:
1) capture where the object doesn't exist; and
2) accurately get data from when the object does exist.

My current iteration of the code is:

from bs4 import BeautifulSoup

with open("out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    for item in soup:
        mt2 = soup.find('span', {'class': 'w_A w_C w_B mr1 mt1 ph1'})
        if mt2 is None:
            print('There is no record')
        else:
            print (mt2)
When I run this, I get:
Output:
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
I'm not sure why I'm getting 3 instances of this when the data only contains 1. (The object I'm looking for is the span with class "w_A w_C w_B mr1 mt1 ph1".)

Additionally, there is one record in the dataset that doesn't contain the object, but the output never shows my print statement ('There is no record').

Could someone please shed some light on what I'm doing incorrectly?

Thank you.

Attached Files

.html   out_of_stock2.html (Size: 7.61 KB / Downloads: 73)
#2
Can you use the following?

from bs4 import BeautifulSoup
 
with open(r"out_of_stock2.html", encoding="utf8") as fp:
	soup = BeautifulSoup(fp, 'html.parser')
	print(len(soup))
	mt2 = soup.find('span', {'class': 'w_A w_C w_B mr1 mt1 ph1'})
	if mt2 is None:
		print('There is no record')
	else:
		print (mt2)
#3
Hi Sam,

Thanks for chiming in.

I tried your code and got:

Output:
3
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
It's still not reporting the record where the object is missing. There should be one record where the object is found and one where it isn't, so it's still failing at:
if mt2 is None:
    print('There is no record')
The reason I used a loop is that the real file contains about 40 records (I've just taken a sample of two to troubleshoot), so I thought a loop would be required to go through each record and look for that object.
#4
I assume that len(soup) being 3 explains why you are getting 3 when you expect 1, but the BeautifulSoup documentation is not clear about what it is.
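
A minimal sketch of what len(soup) appears to count, assuming it behaves like any other bs4 tag (its length is the number of direct children). The inline HTML here is a hypothetical stand-in for out_of_stock2.html, not the attached file:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for out_of_stock2.html: two top-level tags
# separated by a newline, which the parser keeps as a text node.
html = '<div class="a">first</div>\n<div class="b">second</div>'
soup = BeautifulSoup(html, "html.parser")

# len(soup) is the number of direct children of the document,
# and iterating over soup visits exactly those children.
top_level = list(soup)
print(len(soup), len(top_level))

# soup.find() searches the whole document, so it returns the same
# first match no matter which top-level node the loop is on.
matches = [soup.find("div") for _ in soup]
print(matches)
```

So a loop over soup that calls soup.find() inside it just repeats the same lookup once per top-level node, which would explain the repeated output.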
#5
You should not loop over the soup object, knight2000; it's not needed and can give unwanted results.
The length depends on the parser used, so if I use lxml (recommended) as the parser, the length will be one.
from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    print(len(soup))
    mt2 = soup.find('span', class_="w_A w_C w_B mr1 mt1 ph1")
    if mt2 is None:
        print('There is no record')
    else:
        print (mt2)
Output:
1
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
It's easier to use class_="w_A w_C w_B mr1 mt1 ph1" than to pass a dictionary.
Then you can just copy the CSS class from the website and add one _ to the class keyword.
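
A small sketch comparing the two spellings, under the assumption that the span looks like the one in the output above (the HTML string is made up for illustration). It also shows a CSS selector via select_one, which is not used elsewhere in this thread but does not depend on the order of the classes:

```python
from bs4 import BeautifulSoup

# Hypothetical one-line document containing the span from the thread.
html = '<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>'
soup = BeautifulSoup(html, "html.parser")

# Dictionary form and keyword form find the same tag.
by_dict = soup.find("span", {"class": "w_A w_C w_B mr1 mt1 ph1"})
by_keyword = soup.find("span", class_="w_A w_C w_B mr1 mt1 ph1")

# A multi-class string is matched against the exact attribute value,
# so the same classes in a different order do not match.
reordered = soup.find("span", class_="ph1 mt1 mr1 w_B w_C w_A")

# A CSS selector matches on class membership, whatever the order.
by_selector = soup.select_one("span.ph1.mt1.mr1")

print(by_keyword.get_text(), reordered, by_selector.get_text())
```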
#6
You are doing something similar to this:
soup = {'A':1, 'B':2, 'C':3}
class_ = 'B'
for item in soup:
    mt2 = soup.get(class_)
    if mt2:
        print(mt2)
    else:
        print('There is no record')
Output:
2
2
2
In this example and yours you will get a different item each time you iterate through soup, but soup either contains "class_" or not, and that is independent of the current item.

How you fix this depends on what you want to get from soup. From your description I think you would find all span and iterate through those items, comparing the item's class against your pattern. Something like this:
from bs4 import BeautifulSoup
 
with open("out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
    for item in soup.find('span'):
        if item['class_'] == "w_A w_C w_B mr1 mt1 ph1":
            print(item)
        else:
            print ('No match')
#7
You're spot on Sam. After replying to you, I was mulling over it and realized that the 3 from your code definitely gave a clue as to why I was getting 3 results.
#8
Hi snippsat,

Thank you for your advice about not looping over soup. I had tried over 30 different methods to get this data and most of them didn't loop over soup, but after all those failures I tried soup as well, and of course that didn't work either! Good to know never to loop over it.

Also, thank you for teaching me the easier way to call a class. That's so much easier than what I've always done. I have seen your method before, but as I'm still learning, I didn't want to try to learn too many variations and confuse myself more.

With regards to parser, I've only ever used one: html.parser.

So I followed your suggestion to use lxml and tried the following code:

from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'lxml')
    ph1 = soup.find_all('div', class_ ='h-100 pb1-xl pr4-xl pv1 ph1')
    for item in ph1:
        mt1_ph1 = item.find('span', class_ = 'w_A w_C w_B mr1 mt1 ph1')
        if mt1_ph1 is None:
            print('No data')
        else:
            print(mt1_ph1.text)
The result it returned:
Output:
No data
1-day shipping
You fixed it! Thank you so much. I've been banging my head against a wall for 2 days trying to figure it out, and honestly probably wouldn't have thought of trying your option. Really appreciate it.
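
For the pandas step mentioned in the first post, a rough sketch of collecting one value per record so every column ends up the same length. The HTML string is a hypothetical two-record sample (the attached file is not reproduced here), and the list name shipping is just for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample: the first record has no shipping span.
html = """
<div class="h-100 pb1-xl pr4-xl pv1 ph1"><span class="other">Out of stock</span></div>
<div class="h-100 pb1-xl pr4-xl pv1 ph1"><span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

shipping = []
for record in soup.find_all("div", class_="h-100 pb1-xl pr4-xl pv1 ph1"):
    # Search inside the current record, not the whole document.
    tag = record.find("span", class_="w_A w_C w_B mr1 mt1 ph1")
    # Append a placeholder when the span is missing so the list
    # always has exactly one entry per record.
    shipping.append(tag.get_text(strip=True) if tag is not None else "")

print(shipping)
```

Because the list gets exactly one entry per record, it can sit next to other per-record lists in a DataFrame without the length mismatch error.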
#9
Hi deanhystad,

Thanks a lot for explaining it to me. I've read your reply a few times to try and understand it.

I tried your code but I seem to have got an error:
Error:
    if item['class_'] == "w_A w_C w_B mr1 mt1 ph1":
TypeError: string indices must be integers
To be honest, not too sure what that means, but I seemed to have had success with the code by changing the parser from html to lxml.
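
One likely reading of that TypeError, offered as an assumption since it depends on the file contents: soup.find('span') returns a single tag, so looping over it walks that tag's children, and the first child is the text "1-day shipping". Indexing a string with 'class_' is what raises the error. A hedged sketch of the same idea using find_all and the real attribute name "class", on a made-up HTML string:

```python
from bs4 import BeautifulSoup

# Hypothetical sample with one matching and one non-matching span.
html = (
    '<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>'
    '<span class="other">Out of stock</span>'
)
soup = BeautifulSoup(html, "html.parser")
target = "w_A w_C w_B mr1 mt1 ph1"

# Iterating over a single tag yields its children; here that is a string.
first_child = soup.find("span").contents[0]
print(type(first_child).__name__)

results = []
for span in soup.find_all("span"):
    # The attribute is named "class"; bs4 returns it as a list of names.
    classes = " ".join(span.get("class", []))
    results.append(span.get_text() if classes == target else "No match")

print(results)
```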

Thank you for the time you invested in helping me.

Have a great one.
#10
(Sep-26-2021, 11:05 AM)knight2000 Wrote: So I followed your suggestion to use lxml and tried the following code:
In your fixed code you first find relevant div elements then look for a relevant span element and I think the requirement for the div elements was not in the original question. You were saying the code does not determine when there is not a match and I could not understand what that means.