Python Forum
My first Python scraping script not working...
#1
I'm learning Python and have a huge interest in bots and scraping.

I wrote the code below to extract the h1 text from a web page, but when I run it in the shell I get the error "No module named urllib2". Most of this code is from the internet... what do I need to do so that urllib2 is found?

Here is my python code:

import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://amazon.com'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs{class': 'name'})
name = name_box.text.strip()
print(name)
Does anybody know the issue? (sorry if I'm being stupid here).

bs4 cannot be found either... Any clue?
Reply
#2
urllib2 exists only in Python 2.x.

Use requests instead:

pip install requests beautifulsoup4
And check our tutorials section for web scraping
https://python-forum.io/Thread-Web-Scraping-part-1
Reply
#3
I ran "pip install requests beautifulsoup4" in the terminal

and changed the code in my file to:

import urllib.request
from bs4 import BeautifulSoup
quote_page = 'https://amazon.com'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs{class': 'name'})
name = name_box.text.strip()
print(name)
It still says the bs4 module cannot be found. I don't get what I'm doing wrong here...
Reply
#4
I don't know where you got the code from, but it's not going to work. Instead of "writing" programs by copy/paste, learn the basics so you can write them yourself.

import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://amazon.com'

# urllib.request.urlopen returns a response object; call its read() method to get the content of the web page
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'html.parser')

# you can replace this with: name_box = soup.find('h1', class_='name')
name_box = soup.find('h1', attrs={'class': 'name'})  # the original was missing the opening quote in 'class' and the '=' after attrs; could be a typo
name = name_box.text.strip()
print(name)
I can't tell you anything about the missing bs4 module. Are you on Windows?
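As a rough, network-free illustration of the two find() spellings mentioned in the comments above, here is a minimal sketch. The HTML string and the heading_text helper are made up for the example and stand in for a real downloaded page. It also shows that find() returns None when nothing matches, which is worth checking before calling .text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, standing in for a downloaded page.
html = """
<html><body>
  <h1 class="name">Example Product</h1>
  <h1 class="other">Something else</h1>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Two equivalent ways to filter on the class attribute.
by_attrs = soup.find("h1", attrs={"class": "name"})
by_keyword = soup.find("h1", class_="name")

# find() returns None when nothing matches.
missing = soup.find("h1", class_="does-not-exist")


def heading_text(tag):
    """Return the stripped text of a tag, or None if the tag was not found."""
    return tag.text.strip() if tag is not None else None


print(heading_text(by_attrs))
print(heading_text(by_keyword))
print(heading_text(missing))
```

On a real page such as the Amazon front page, there may be no h1 with class "name" at all, in which case the None check is what keeps the script from failing with an AttributeError.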
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply
#5
What Python version do you use? Did you install Python 3 alongside the pre-installed Python 2, so that you now have two Python installations?
https://docs.python.org/3/using/mac.html...-macpython
I guess you are trying to run this code with python3 but installed requests and bs4 for the python2 installation.
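One way to check this from inside Python is sketched below, using only the standard library. The is_installed helper is a name made up for this example. Run the script with the same interpreter that runs the scraper.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if this interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Which Python is actually running this script?
print("Interpreter:", sys.executable)
print("Version:", sys.version.split()[0])

# Are the scraping modules visible to this interpreter?
for name in ("bs4", "requests"):
    print(name, "available:", is_installed(name))
```

If a module shows up as unavailable, installing with "python3 -m pip install requests beautifulsoup4" targets the interpreter that python3 refers to, rather than whichever installation a bare pip command happens to belong to.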
Reply
#6
Thanks guys. With your help, I managed to figure out the "modules don't exist" issues.

Edit: The Python I downloaded was straight from Python.org, the version was 3.6.4

I also changed my code to @Wavics' sample code (my original code came from an article on how to make a simple scraper). I have been learning Python for a solid two days now and I'm really enjoying it. I just wanted to make something satisfying to keep me going toward the main prize, which is being able to make beautiful software in the future.

Anyway, enough ranting from me...

Python's shell finally let me run the script... but now it fails with this error:

[Image: Screen_Shot_2018_02_17_at_06_55_09.png]

Any idea? Thank you for your help guys.
Reply
#7
Please don't post images. Copy/paste the full traceback in error tags. Also, post the latest version of the code that produces the error, in code tags.
Reply
#8
Do not use urllib; always use Requests. Also, amazon.com is a difficult site to start with.
To check that everything works:
from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select('head > title')[0].text)
Output:
Welcome to Python.org
If you change it to url = 'https://amazon.com'
Output:
Robot Check
So, as mentioned, Amazon is a difficult site to start with.
Switching to Selenium may well be needed (Amazon uses a lot of JavaScript).
That may or may not get past the Robot Check.
Reply
#9
Try
import requests
requests.packages.urllib3.disable_warnings()

import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context
But I would use the requests module instead of urllib.request; otherwise you are just asking to complicate your code drastically. Not many people use the standard library to make bots, so by going against the grain you are going to get errors that we haven't seen in years. There is a reason the requests library was made: it simplifies and automates the boilerplate code.
Reply
#10
Thanks again for your replies guys.

My current code (took the latest sample):
from bs4 import BeautifulSoup
import requests
 
url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select('head > title')[0].text)
The current error:
Error:
Traceback (most recent call last):
  File "/Users/Matt/Desktop/python bible/scraper-3.py", line 6, in <module>
    soup = BeautifulSoup(url_get.content, '1xml1')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bs4/__init__.py", line 165, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: 1xml1. Do you need to install a parser library?
I tried to be independent and find the reason for the error, but I've had no luck. I WILL GET THERE EVENTUALLY, lol... Thanks again and sorry for all the questions. Cannot wait until I can troubleshoot effectively myself.
Reply
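Two things stand out in the traceback above. The parser name in the failing line reads '1xml1' (with the digit 1) rather than 'lxml' as in the posted code, so the file on disk probably differs from what was pasted. Separately, the 'lxml' parser only works when the lxml package is installed (for example with pip install lxml), while 'html.parser' ships with Python. A minimal sketch of choosing a parser name defensively follows. The pick_parser helper is hypothetical, and it assumes the parser name matches the name of the module that provides it, which holds for lxml.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if a module of that name can be found, else `fallback`."""
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


parser = pick_parser()
print("Using parser:", parser)
```

The returned string can then be passed as the second argument to BeautifulSoup, as in BeautifulSoup(url_get.content, parser).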