My first Python scraping script not working...

MattH · (This post was last modified: Feb-17-2018, 04:09 AM by MattH.)

I'm learning Python and have a huge interest in bots and scraping.

I made the code below to extract the h1 text from a web page, but an error comes up when running it in the shell saying "No module named urllib2". Most of this code is from the internet... what do I do to make urllib2 found?

Here is my python code:

import urllib2
from bs4 import BeautifulSoup
quote_page = 'https://amazon.com'
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs{class': 'name'})
name = name_box.text.strip()
print(name)

Does anybody know the issue? (sorry if I'm being stupid here).

bs4 also cannot be found... Any clue?

***metulburr*** · (This post was last modified: Feb-17-2018, 04:27 AM by metulburr.)

urllib2 is Python 2.x only.

Use requests instead

pip install requests beautifulsoup4

And check our tutorials section for web scraping
https://python-forum.io/Thread-Web-Scraping-part-1

MattH · Feb-17-2018, 06:04 AM

- I installed in terminal "pip install requests beautifulsoup4"

and changed the code in my file to:

import urllib.request
from bs4 import BeautifulSoup
quote_page = 'https://amazon.com'
page = urllib.request.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
name_box = soup.find('h1', attrs{class': 'name'})
name = name_box.text.strip()
print(name)

It still says bs4 module cannot be found. Don't get what I'm getting wrong here...

wavic · (This post was last modified: Feb-17-2018, 06:41 AM by wavic.)

I don't know where you got the code from, but it's not going to work. Instead of the copy/paste way of "writing" programs, learn so you can do it by yourself.

import urllib.request
from bs4 import BeautifulSoup

quote_page = 'https://amazon.com'

# urllib.request.urlopen returns a response object; call its read() method to get the content of the web page
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'html.parser')

# you can replace this with: name_box = soup.find('h1', class_='name')
name_box = soup.find('h1', attrs={'class': 'name'})  # the original had a missing quote ('class') and no = after attrs. could be a typo
name = name_box.text.strip()
print(name)

I can't tell anything about the missing bs4 module. Windows?

**buran** · (This post was last modified: Feb-17-2018, 06:50 AM by buran.)

What Python version do you use? Did you install Python 3 alongside the pre-installed Python 2, so that you have two Python installations?
https://docs.python.org/3/using/mac.html...-macpython
I guess you are trying to run this code with python3 but installed requests and bs4 for the python2 installation.
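
A quick way to check the two-installations theory is to ask the interpreter that actually runs the script where it lives and whether it can see the modules. This is only a minimal sketch using the standard library; `report_environment` is a made-up helper name, not something from requests or bs4.

```python
import importlib.util
import sys


def report_environment(modules=("requests", "bs4")):
    """Describe the running interpreter and which modules it can import."""
    status = {
        # Path of the Python binary that is executing this script
        "executable": sys.executable,
        "version": "{}.{}.{}".format(*sys.version_info[:3]),
    }
    for name in modules:
        # find_spec returns None when a top-level module cannot be found
        status[name] = importlib.util.find_spec(name) is not None
    return status


if __name__ == "__main__":
    for key, value in report_environment().items():
        print(f"{key}: {value}")
```

If a module shows False, installing it through the same interpreter (for example `python3 -m pip install requests beautifulsoup4`) targets that installation rather than another one.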

MattH · (This post was last modified: Feb-17-2018, 07:01 AM by MattH.)

Thanks guys. With your help, I managed to figure out the "modules don't exist" issues.

Edit: The Python I downloaded was straight from Python.org, the version was 3.6.4

- I also changed my code to @wavic's sample code (I got my code from an article on how to make a simple scraper). I have been learning Python for a solid two days now and I'm really enjoying it. I just wanted to make something that gave me satisfaction to keep me going for the main prize, which is to be able to make beautiful software in the future.

Anyway, enough ranting from me...

Python's shell finally let me run the script... but now it errors with this:

[Image: Screen_Shot_2018_02_17_at_06_55_09.png]

Any idea? Thank you for your help guys.

**buran** · Feb-17-2018, 07:32 AM

Please don't post images. Copy/paste the full traceback in error tags. Also, post the latest version of the code that produces the error, in code tags.

***snippsat*** · (This post was last modified: Feb-17-2018, 11:45 AM by snippsat.)

Do not use urllib; always use Requests. Also, amazon.com is a difficult site to start with.
To check that everything works:

from bs4 import BeautifulSoup
import requests

url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select('head > title')[0].text)

Output:
Welcome to Python.org

If you change to url = 'https://amazon.com':

Output:
Robot Check

So, as mentioned, Amazon is a difficult site to start with.
Switching to Selenium may well be needed (Amazon uses a lot of JavaScript).
That may or may not get past the Robot Check.
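
To avoid parsing a block page by accident, the title can be checked before looking for anything else. The sketch below is an assumption-laden illustration: `TitleParser` and `looks_blocked` are hypothetical names, the "Robot Check" marker is simply the title shown in the output above, and it uses the standard-library `html.parser` so that it runs without extra packages.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_blocked(html, marker="Robot Check"):
    """Return True if the page title contains the marker (case-insensitive)."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return marker.lower() in parser.title.lower()
```

With requests this could be called as `looks_blocked(url_get.text)` before building the soup.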

***metulburr*** · (This post was last modified: Feb-17-2018, 01:05 PM by metulburr.)

Try

import requests
requests.packages.urllib3.disable_warnings()

import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    # Legacy Python that doesn't verify HTTPS certificates by default
    pass
else:
    # Handle target environment that doesn't support HTTPS verification
    ssl._create_default_https_context = _create_unverified_https_context

But I would use the requests module instead of urllib.request, otherwise you are just asking to complicate your code drastically. Not many people use the standard libraries to make bots, so you're going to get errors that we haven't seen in years by going against the grain of the majority. There is a reason why someone made the requests library: it simplifies and automates the boilerplate code.

MattH · (This post was last modified: Feb-17-2018, 05:48 PM by MattH.)

Thanks again for your replies guys.

My current code (took the latest sample):

from bs4 import BeautifulSoup
import requests
 
url = 'https://www.python.org/'
url_get = requests.get(url)
soup = BeautifulSoup(url_get.content, 'lxml')
print(soup.select('head > title')[0].text)

The current error:

Traceback (most recent call last):
  File "/Users/Matt/Desktop/python bible/scraper-3.py", line 6, in <module>
    soup = BeautifulSoup(url_get.content, '1xml1')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/bs4/__init__.py", line 165, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: 1xml1. Do you need to install a parser library?

I tried to be independent and find the reason for the error, but I've had no luck. I WILL GET THERE EVENTUALLY, lol... Thanks again and sorry for all the questions. Cannot wait until I can troubleshoot effectively myself.
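
Two things stand out in the traceback above. The parser name it quotes is `1xml1` (with the digit one), while the posted code uses `lxml` (with the letter L), so the file being run seems to differ from the code shown. Also, `lxml` is a separate package (`pip install lxml`), whereas `html.parser` ships with Python. A minimal sketch of picking a parser that is actually available follows; `pick_parser` is a hypothetical helper, and it assumes the parser name matches the package name, which holds for `lxml`.

```python
import importlib.util


def pick_parser(preferred="lxml", fallback="html.parser"):
    """Return `preferred` if that package is importable, else `fallback`."""
    # find_spec returns None when the package is not installed
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback
```

It could then be used as `soup = BeautifulSoup(url_get.content, pick_parser())`.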
