Posts: 7
Threads: 4
Joined: Apr 2020
Dears,
I'm very new to the Python language and have spent some time teaching myself. Every time I face an error I try to dig for the solution online, but this time I really gave up.
I'm trying to do some web scraping and I'm stuck with an error that has driven me crazy. I will show the code and the result.
import requests
from bs4 import BeautifulSoup
from Data import row
# Collect and parse first page
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
soup = BeautifulSoup(page.text, 'html.parser')
# Pull all text from the BodyText div
artist_name_list = soup.find(class_='BodyText')
# Pull text from all instances of <a> tag within BodyText div
artist_name_list_items = artist_name_list.find_all('a')
# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.prettify())
Error:
Traceback (most recent call last):
  File "C:/Users/HP/PycharmProjects/PyShop/Test1.py", line 15, in <module>
    artist_name_list_items = artist_name_list.find_all('a')
AttributeError: 'NoneType' object has no attribute 'find_all'
I'm running Python 3.8, any suggestions?
Posts: 443
Threads: 1
Joined: Sep 2018
This error means the object you're working with is None, and that is the crux of the problem. artist_name_list is assigned on line 12. Since it's None, that means soup.find() returned None. First, review the documentation for the find() method to see what it returns and when/why it returns None. Second, review the HTML you're parsing to make sure the argument passed to soup.find() will actually match something.
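A quick way to see this for yourself is to test the result of find() before calling anything on it. A minimal sketch, reusing the page and soup variables from your script above (the printed message is just an example):
artist_name_list = soup.find(class_='BodyText')
if artist_name_list is None:
    # find() returns None when no tag with class="BodyText" was found in the parsed HTML
    print('No BodyText element found; HTTP status was', page.status_code)
else:
    artist_name_list_items = artist_name_list.find_all('a')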
Posts: 7,313
Threads: 123
Joined: Sep 2016
Start at the top, that is, look at what the page request returns.
>>> page
<Response [445]>
>>> page.status_code
445
So 445 means the request was rejected.
A simple user agent will fix this:
headers = {'User-agent': 'Mozilla/5.0'}
>>> page
<Response [200]>
>>> page.status_code
200
The full script then looks like this:
import requests
from bs4 import BeautifulSoup
#from Data import row

# Collect and parse first page
headers = {'User-agent': 'Mozilla/5.0'}
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# Pull all text from the BodyText div
artist_name_list = soup.find(class_='BodyText')
# Pull text from all instances of <a> tag within BodyText div
artist_name_list_items = artist_name_list.find_all('a')
# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.text)
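If you want the script to stop with a clear error when the site rejects the request, you could also check the response before parsing. A small sketch, using requests' raise_for_status() (not part of the original script), which raises an HTTPError for 4xx/5xx answers such as the 445 above:
import requests
from bs4 import BeautifulSoup

headers = {'User-agent': 'Mozilla/5.0'}
url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm'
page = requests.get(url, headers=headers)
# Raise immediately if the request was rejected instead of parsing an error page
page.raise_for_status()
soup = BeautifulSoup(page.content, 'html.parser')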
Posts: 7
Threads: 4
Joined: Apr 2020
Apr-11-2020, 12:36 AM
(This post was last modified: Apr-11-2020, 12:43 AM by BadWhite.)
(Apr-10-2020, 10:18 PM)stullis Wrote: This error means the object you're working with is None, and that is the crux of the problem. artist_name_list is assigned on line 12. Since it's None, that means soup.find() returned None. First, review the documentation for the find() method to see what it returns and when/why it returns None. Second, review the HTML you're parsing to make sure the argument passed to soup.find() will actually match something.
Thanks for your feedback. Actually, that's what I have read on some websites, but I didn't and still don't know how to figure out whether soup.find() returns None or something else.
(Apr-10-2020, 11:40 PM)snippsat Wrote: Start at the top, that is, look at what the page request returns. [...] A simple user agent will fix this [...]
Thanks for the reply, but have you tried to run the code? Because when I did, nothing happened; it just kept running without showing anything, an empty screen for almost 15 minutes.
Also, by the way, why did you add the "headers" variable?
Posts: 7,313
Threads: 123
Joined: Sep 2016
Apr-11-2020, 08:07 AM
(This post was last modified: Apr-11-2020, 08:07 AM by snippsat.)
(Apr-11-2020, 12:36 AM)BadWhite Wrote: but have you tried to run the code? Yes.
import requests
from bs4 import BeautifulSoup
#from Data import row
# Collect and parse first page
headers = {'User-agent': 'Mozilla/5.0'}
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm', headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
# Pull all text from the BodyText div
artist_name_list = soup.find(class_='BodyText')
# Pull text from all instances of <a> tag within BodyText div
artist_name_list_items = artist_name_list.find_all('a')
# Create for loop to print out all artists' names
for artist_name in artist_name_list_items:
    print(artist_name.text)
Output:
Zabaglia, Niccola
Zaccone, Fabian
Zadkine, Ossip
Zaech, Bernhard
Zagar, Jacob
Zagroba, Idalia
Zaidenberg, A.
Zaidenberg, Arthur
Zaisinger, Matthäus
Zajac, Jack
Zak, Eugène
Zakharov, Gurii Fillipovich
Zakowortny, Igor
Zalce, Alfredo
Zalopany, Michele
Zammiello, Craig
Zammitt, Norman
Zampieri, Domenico
Zampieri, called Domenichino, Domenico
Zanartú, Enrique Antunez
Zanchi, Antonio
Zanetti, Anton Maria
Zanetti Borzino, Leopoldina
Zanetti I, Antonio Maria, conte
Zanguidi, Jacopo
Zanini, Giuseppe
Zanini-Viola, Giuseppe
Zanotti, Giampietro
Zao Wou-Ki
Zas-Zie
Zie-Zor
nextpage
BadWhite Wrote: why you have added "headers" variable?
That is what I explained first: without a user agent the site returns 445, "The request was rejected".
import requests
from bs4 import BeautifulSoup
#from Data import row
# Collect and parse first page
page = requests.get('https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm')
print(page.status_code)
Output:
445
When we get this, no more scraping is possible. By using a user agent we identify as a browser, in this case Firefox.
Then we get 200 OK and can continue to scrape.
The problem must be something on your side; here is a run in another environment, Colab.
As you see, it works fine there too.
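To see the difference directly, here is a small comparison sketch (the 'Mozilla/5.0' string is just a minimal browser-like user agent, as used above):
import requests

url = 'https://web.archive.org/web/20121007172955/https://www.nga.gov/collection/anZ1.htm'
# Without a user agent the site answers 445 (request rejected)
print(requests.get(url).status_code)
# With a browser-like user agent it answers 200 and the page can be scraped
print(requests.get(url, headers={'User-agent': 'Mozilla/5.0'}).status_code)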
Posts: 7
Threads: 4
Joined: Apr 2020
(Apr-11-2020, 08:07 AM)snippsat Wrote: (Apr-11-2020, 12:36 AM)BadWhite Wrote: but have you tried to run the code? Yes. [...] As you see, it works fine there too.
Thanks man, you are the best.
Let me bother you with a small question; it might be a little silly.
Why is there an underscore after the class keyword, like here:
artist_name_list = soup.find(class_='BodyText')
Why not just class?
Posts: 7,313
Threads: 123
Joined: Sep 2016
Apr-11-2020, 05:59 PM
(This post was last modified: Apr-11-2020, 05:59 PM by snippsat.)
class is a reserved word in Python.
So by adding the underscore, class_, bs4 understands that you are searching by CSS class and not using the Python keyword.
This is simpler (closer to the source HTML if you copy it) than the older dictionary method.
Both still work.
# New way
soup.find(class_='BodyText')
# Older way
soup.find(attrs={"class": "BodyText"})
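If you prefer CSS selector syntax, bs4 also has select(), which sidesteps the keyword clash entirely. Just as an alternative sketch, not something used in the code above:
# CSS selector: all <a> tags inside the element with class "BodyText"
for a in soup.select('.BodyText a'):
    print(a.text)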