Posts: 68
Threads: 21
Joined: May 2021
Hi Guys,
After trying to figure this one out for over 8 hours, I thought I would get a fresh perspective from someone.
I'm practicing some web scraping and I've got a scenario with a pretty easy goal: I'm trying to find an object and, if it exists, extract some data from it (shipping information); if it doesn't exist, enter something like " " (because I'm going to be using pandas, I need to do something when it can't find the object, else I know I'll get the "ValueError: arrays must all be same length" error).
I've tried many things to do this, but I'm unable to successfully:
1) capture where the object doesn't exist; and
2) accurately get data from when the object does exist.
My current iteration of the code is:
from bs4 import BeautifulSoup

with open("out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

for item in soup:
    mt2 = soup.find('span', {'class': 'w_A w_C w_B mr1 mt1 ph1'})
    if mt2 is None:
        print('There is no record')
    else:
        print(mt2)

When I run this, I get:
Output:
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
I'm not sure why I'm getting 3 instances of this when the data only contains 1? (The object I'm looking for is "w_A w_C w_B mr1 mt1 ph1")
Additionally, there is one record in the dataset that doesn't contain the object, but the output never shows my print statement ('There is no record').
Could someone please shed some light on what I'm doing incorrectly?
Thank you.
Posts: 52
Threads: 3
Joined: Sep 2021
Can you use the following?
from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

print(len(soup))
mt2 = soup.find('span', {'class': 'w_A w_C w_B mr1 mt1 ph1'})
if mt2 is None:
    print('There is no record')
else:
    print(mt2)
Posts: 68
Threads: 21
Joined: May 2021
Sep-26-2021, 04:24 AM
Hi Sam,
Thanks for chiming in.
I tried your code and got:
Output: 3
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
It's still not reporting the record where it can't find the object (it should find one instance of the object and one record with no instance of it), so it's still failing at:
if mt2 is None:
    print('There is no record')
The reason I used a loop is that the real file contains about 40 records (I've just taken a sample of two records to troubleshoot), so I thought a loop would be required to go through each one and look for that object?
Posts: 52
Threads: 3
Joined: Sep 2021
I assume that len(soup) being 3 explains why you are getting 3 when you expect 1 but the BeautifulSoup documentation is not clear about what it is.
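As far as I can tell, len(soup) is simply the number of top-level nodes in the parsed document, and iterating over soup walks those same nodes. The sketch below illustrates that with a small inline document; the HTML string is a hypothetical stand-in, since the real out_of_stock2.html is not shown in this thread, so the exact count for that file may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for out_of_stock2.html (the real file is not shown here).
html = "<!DOCTYPE html>\n<html><body><span class='w_A'>1-day shipping</span></body></html>"

soup = BeautifulSoup(html, "html.parser")

# len(soup) counts the top-level nodes of the parsed document
# (for example a doctype, a whitespace string, and the <html> tag).
top_level = list(soup)
print(len(soup), [type(node).__name__ for node in top_level])

# soup.find() always searches the whole tree, so calling it once per
# top-level node just repeats the same answer that many times.
results = [soup.find("span") for _ in soup]
print(results)
```

So a loop over soup that calls soup.find() inside it prints the same span once per top-level node, which would match the repeated output in the first post.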
Posts: 7,313
Threads: 123
Joined: Sep 2016
Sep-26-2021, 06:36 AM
You should not loop over the soup object, knight2000, as it's not needed and can give unwanted results.
The length depends on the parser used, so if I use lxml (recommended) as the parser, the length will be one.
from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'lxml')

print(len(soup))
mt2 = soup.find('span', class_="w_A w_C w_B mr1 mt1 ph1")
if mt2 is None:
    print('There is no record')
else:
    print(mt2)

Output:
1
<span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
It's easier to use class_="w_A w_C w_B mr1 mt1 ph1" than to make it a dictionary call.
Then you can just copy the CSS class from the website and add one _ to the keyword.
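A minimal sketch of the two spellings side by side, using a one-line HTML string that is made up for illustration. Both calls should return the same tag, and find() returns None when nothing matches. Note that a multi-class string like this matches only when the class attribute has exactly that value, in that order.

```python
from bs4 import BeautifulSoup

# Hypothetical one-line sample; only the class string comes from the thread.
html = '<div><span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span></div>'
soup = BeautifulSoup(html, "html.parser")

# Dictionary form, as used in the first post.
by_dict = soup.find("span", {"class": "w_A w_C w_B mr1 mt1 ph1"})

# Keyword form: class is a reserved word in Python, hence the trailing underscore.
by_keyword = soup.find("span", class_="w_A w_C w_B mr1 mt1 ph1")

# A class that is not in the document gives None rather than raising an error.
missing = soup.find("span", class_="no-such-class")

print(by_dict, by_keyword, missing)
```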
Posts: 6,779
Threads: 20
Joined: Feb 2020
You are doing something similar to this:
soup = {'A':1, 'B':2, 'C':3}
class_ = 'B'
for item in soup:
    mt2 = soup.get(class_)
    if mt2:
        print(mt2)
    else:
        print('There is no record')

Output:
2
2
2
In this example and yours you will get a different item each time you iterate through soup, but soup either contains "class_" or not, and that is independent of the current item.
How you fix this depends on what you want to get from soup. From your description I think you would find all span and iterate through those items, comparing the item's class against your pattern. Something like this:
from bs4 import BeautifulSoup

with open("out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

for item in soup.find('span'):
    if item['class_'] == "w_A w_C w_B mr1 mt1 ph1":
        print(item)
    else:
        print('No match')
Posts: 68
Threads: 21
Joined: May 2021
You're spot on Sam. After replying to you, I was mulling over it and realized that the 3 from your code definitely gave a clue as to why I was getting 3 results.
Posts: 68
Threads: 21
Joined: May 2021
Hi snippsat,
Thank you for your advice about not looping over soup. I had tried over 30 different methods to get this data and most of them didn't loop over soup, but by the end of all those failures I tried looping over soup, and of course that didn't work either! But it's good to know never to use it for looping.
Also, thank you for teaching me the easier way to call a class. That's soooo much easier than what I've always done. I have seen your method before, but as I'm still learning, I didn't want to try and learn too many variations and confuse myself more.
With regards to parser, I've only ever used one: html.parser.
So I followed your suggestion to use lxml and tried the following code:
from bs4 import BeautifulSoup

with open(r"out_of_stock2.html", encoding="utf8") as fp:
    soup = BeautifulSoup(fp, 'lxml')

ph1 = soup.find_all('div', class_='h-100 pb1-xl pr4-xl pv1 ph1')
for item in ph1:
    mt1_ph1 = item.find('span', class_='w_A w_C w_B mr1 mt1 ph1')
    if mt1_ph1 is None:
        print('No data')
    else:
        print(mt1_ph1.text)

The result it returned:
Output:
No data
1-day shipping
You fixed it! Thank you so much. I've spent 2 days trying to figure it out, and honestly probably wouldn't have thought of trying your option. Really appreciate it.
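For the pandas step mentioned in the first post, one possible way to keep every column the same length is to append exactly one value per record, using an empty string when the span is missing. This is only a sketch: the two-record HTML below is invented for illustration (only the class strings come from the code above), and the "shipping" column name is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical two-record sample; the class names are copied from the thread,
# but the surrounding structure is assumed.
html = """
<div class="h-100 pb1-xl pr4-xl pv1 ph1"></div>
<div class="h-100 pb1-xl pr4-xl pv1 ph1">
  <span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for record in soup.find_all("div", class_="h-100 pb1-xl pr4-xl pv1 ph1"):
    # Search inside this record only, so a missing span is detected per record.
    shipping = record.find("span", class_="w_A w_C w_B mr1 mt1 ph1")
    # One entry per record, with "" as the placeholder when nothing is found.
    rows.append({"shipping": shipping.get_text(strip=True) if shipping is not None else ""})

print(rows)
```

A list of dictionaries like rows can then be passed to pandas.DataFrame, and because every record contributes one entry, the column lengths should always agree.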
Posts: 68
Threads: 21
Joined: May 2021
Hi deanhystad,
Thanks a lot for explaining it to me- I've read your reply a few times to try and understand it.
I tried your code but I seem to have got an error:
Error: if item['class_'] == "w_A w_C w_B mr1 mt1 ph1":
TypeError: string indices must be integers
To be honest, not too sure what that means, but I seemed to have had success with the code by changing the parser from html to lxml.
Thank you for the time you invested in helping me.
Have a great one.
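On the TypeError above: soup.find('span') returns a single tag, so looping over it walks that tag's children, which here is just the text "1-day shipping". Indexing a string with 'class_' is what raises the error. The attribute is also stored under 'class' (not 'class_'), and Beautiful Soup returns it as a list of class names. A possible corrected version of that loop, using find_all and a made-up sample document, might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical sample; only the first class string comes from the thread.
html = (
    '<p><span class="w_A w_C w_B mr1 mt1 ph1">1-day shipping</span>'
    '<span class="other">x</span><span>y</span></p>'
)
soup = BeautifulSoup(html, "html.parser")

# Beautiful Soup stores the class attribute as a list of names.
wanted = "w_A w_C w_B mr1 mt1 ph1".split()

matches = []
for item in soup.find_all("span"):      # find_all returns every <span> tag
    classes = item.get("class", [])     # [] when the tag has no class attribute
    if classes == wanted:
        matches.append(item.get_text())
    else:
        matches.append("No match")

print(matches)
```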
Posts: 52
Threads: 3
Joined: Sep 2021
(Sep-26-2021, 11:05 AM)knight2000 Wrote: So I followed your suggestion to use lxml and tried the following code:
In your fixed code you first find the relevant div elements and then look for the relevant span element inside each one, and I think the requirement for the div elements was not in the original question. You were saying the code does not determine when there is not a match, and I could not understand what that meant.