Python Forum
BeautifulSoup 'NoneType' object has no attribute 'text'
#1
I have some Python scripts running on 2-3 Amazon Web Services instances that scrape records from a few websites. The code has been running fine on these AWS instances.

I created a new local VM today and installed the exact same Python libraries (BeautifulSoup, Selenium, ChromeDriver, GeckoDriver, etc.) on it as exist on the AWS instances. Every single line in the unchanged code that uses BeautifulSoup4 to read fields of data from the pages now fails, giving me...

'NoneType' object has no attribute 'text'

Any thoughts as to what could be going on here? I've verified numerous times that I have all of the appropriate software/packages/libraries/etc. installed on the new VM just as they are on the AWS instances where the code actually runs correctly.

Thanks in advance for any recommendations/ideas.
Reply
#2
It seems that the code isn't the issue if it's identical to the AWS implementations. I have a couple questions to narrow down the possibilities:

1. Are you getting any positive results? Likewise, is the VM configured to retrieve HTTP responses?

2. Does the code have any dependency on AWS? I'm not versed in AWS (I should fix that sometime), but any API dependencies could be mucking up the works. Though, I imagine any such problem would raise an exception.
Reply
#3
Thank you so much for your feedback and the questions.

I have a few "odd" areas of the pages I'm scraping with the site that the Python script is written for that aren't very reliable to scrape using Beautiful Soup, so I have always used Selenium (getting some data using XPATH), and that means of scraping a few fields of data continues to work just fine on the newly created VM... no issues there. It's just the other 95% of the site that I have been scraping for an eternity using Beautiful Soup that all of a sudden is telling me the "NoneType object has no attribute 'text'" messages for all fields that I try to access. And there's definitely data there. I can flip over to the AWS instance and run the exact same code and it scrapes perfectly. Very odd.

I've gone back through my Python script and all logic revolving around that and there's not anything that's dependent on the code needing to run on an AWS instance. We're trying to move this daily scrape away from an AWS instance and to a local VM to save some $$$.

I'm still stumped.
Reply
#4
Quote:'NoneType' object has no attribute 'text'
(Sep-12-2018, 02:33 AM)bmccollum Wrote: using Beautiful Soup that all of a sudden is telling me the "NoneType object has no attribute 'text'" messages for all fields that I try to access. And there's definitely data there. I can flip over to the AWS instance and run the exact same code and it scrapes perfectly.

If you're using Selenium, you might need a longer delay between the page load and the scraping code. It's possible that your local VM is slower than its AWS counterpart. That would be one reason why a scrape would return None.
Reply
#5
I'll take a shot at the delay trick. From memory, I think there's currently about a 15 second delay to begin with, but I can always double that to see if that makes a difference. Will chime back in w/what I find. Thanks again!
Reply
#6
Are you using WebDriverWait for the presence of an element or time.sleep?
Reply
#7
I'm not remoted in to the AWS instance right now, so I'm not completely sure about time.sleep vs WebDriverWait. If for some reason I'm using time.sleep, I may need to switch to WebDriverWait to try this out instead. Whatever I'm using, it works flawlessly for scraping hundreds, and sometimes thousands, of pages on this site on a daily basis... just not on the new, local VM. Thanks again. Will update this once I check back in on the code in the a.m. to see about time.sleep() vs WebDriverWait.

Actually, before I close this, it *seems* like I'm using time.sleep(), as I don't think I could figure out how to implement WebDriverWait as most of the pages are loading via clicking on page #s after the initial page is loaded, and the URL in the browser stays exactly the same throughout the entire process, regardless of what page # I've just clicked on to load another 50 records. Hope that makes sense.
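
For pages where the URL never changes, one option is to poll for a change in the page content rather than sleeping for a fixed time. The sketch below is only an illustration of that idea using the standard library: `wait_until` and `FakePage` are hypothetical names, and `FakePage` is a stand-in for the browser, not part of Selenium. Selenium's own WebDriverWait with an expected condition such as staleness_of polls in a similar way, so in real code that would likely replace the helper here.

```python
import time


def wait_until(condition, timeout=20.0, poll=0.5):
    """Call condition() until it returns a truthy value or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll)


class FakePage:
    """Hypothetical stand-in for a page whose table is swapped shortly after a click."""

    def __init__(self):
        self._ready_at = None

    def click_next(self):
        # Pretend the next 50 records arrive 0.2 seconds after the click.
        self._ready_at = time.monotonic() + 0.2

    def first_row(self):
        if self._ready_at is not None and time.monotonic() >= self._ready_at:
            return "row 51"
        return "row 1"


page = FakePage()
old_first = page.first_row()   # remember what the current page shows
page.click_next()              # URL stays the same, content changes later


def first_row_changed():
    current = page.first_row()
    return current if current != old_first else None


new_first = wait_until(first_row_changed, timeout=5, poll=0.05)
print(new_first)
```

The point of the design is that the wait ends as soon as the old content is replaced, and fails loudly with a TimeoutError if it never is, instead of silently parsing a half-loaded page.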
Reply
#8
I'm sort of at the end of my rope on this one. I've put in time.sleep() statements as long as 4-5 min. to give the page a ton of time to load so that Beautiful Soup can retrieve the HTML successfully, and I still get the 'NoneType' object has no attribute 'text' msg.

Code is still working perfectly fine on the Amazon Web Services VM, again with exactly the same software/libraries/etc. installed on both my new (local) VM and the long-standing AWS instance.

I can get data all day long with the couple of statements I have in the Python script that gets a few bits of info. from each page using XPATH as opposed to Beautiful Soup.

Any last suggestions / thoughts on this?

Thanks.
Reply
#9
15 seconds was a long enough delay to test. I don't think it's improperly installed libraries. You seem to be getting the page but getting None when scraping. You can confirm that you are getting the page by printing out the HTML and checking whether it's correct.

Forget for the moment that your AWS instance works, and trace the error backward from the end to the start. By that I mean: what is returning None? Is it the BeautifulSoup() object itself or a narrower tag within the soup? If it's the soup, then you are probably not getting the page at all, but if it's an embedded tag, then maybe for some reason your local machine is triggering the site to respond with different markup. Why is it returning None? Start printing out the variables to locate which tag is not being found.
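
A rough sketch of that kind of check, assuming the scraping code chains several find() calls: the helper below walks the chain one step at a time and reports the first lookup that comes back as None. The name `first_missing_step`, the sample HTML, and the class names are all made up for illustration; they are not from the original script.

```python
from bs4 import BeautifulSoup


def first_missing_step(soup, steps):
    """Walk a chain of find() calls.

    Returns (tag, None) if every lookup succeeds, or (None, failed_step)
    where failed_step is the (name, attrs) pair that returned None.
    """
    node = soup
    for name, attrs in steps:
        found = node.find(name, attrs=attrs)
        if found is None:
            return None, (name, attrs)
        node = found
    return node, None


# Hypothetical markup standing in for the real page.
html = """
<html><body>
  <div class="record"><span class="name">Example Co.</span></div>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# This chain matches the markup, so the text is available.
node, failed = first_missing_step(
    soup, [("div", {"class": "record"}), ("span", {"class": "name"})]
)
print(node.text if node is not None else f"lookup failed at {failed}")

# This chain asks for a class that is not in the markup.
node2, failed2 = first_missing_step(
    soup, [("div", {"class": "record"}), ("span", {"class": "price"})]
)
print(node2.text if node2 is not None else f"lookup failed at {failed2}")
```

Calling .text directly on the result of the second chain is what would raise 'NoneType' object has no attribute 'text'; the helper instead names the step that did not match.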

It's hard to give you the exact solution without the site you are parsing and the code you are parsing it with. If you could reproduce the HTML and the parsing code, we could pinpoint the problem more easily. As it is, we're just taking shots in the dark.
Reply
#10
When I look at the HTML retrieved by Beautiful Soup on the new/local VM, there are results in there. I think what I need to do is save a copy of what the Beautiful Soup object gets as a whole from the page when the code executes on the AWS instance, do the same with what the Beautiful Soup object gets when the code executes on the local VM, and see whether there's any difference for the exact same page I'm trying to load/read on both machines.
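
One way that comparison might look, as a minimal sketch: it assumes the HTML from each machine has already been captured as a string (for example by writing soup.prettify() to a file on each side and reading both back). The function name `diff_html`, the labels aws.html and local.html, and the sample markup are placeholders, not anything from the real site.

```python
import difflib


def diff_html(aws_html, local_html, context=2):
    """Return unified-diff lines between the two HTML dumps (empty list if identical)."""
    return list(
        difflib.unified_diff(
            aws_html.splitlines(),
            local_html.splitlines(),
            fromfile="aws.html",
            tofile="local.html",
            n=context,
            lineterm="",
        )
    )


# Placeholder dumps: same data, but the class names differ between machines.
aws_html = '<div class="record">\n<span class="name">Example Co.</span>\n</div>'
local_html = '<div class="rec">\n<span class="nm">Example Co.</span>\n</div>'

for line in diff_html(aws_html, local_html):
    print(line)
```

If the diff shows different tag or class names for the same records, the find() calls written against the AWS markup would return None on the local VM even though the data is present.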

I'll keep digging and will post back if I unearth anything. Thanks for the help/recommendations.
Reply

