Python Forum

Full Version: urlib - to use or not to use ( for web scraping )?
Al Sweigart, in Automate the Boring Stuff with Python, strongly suggests avoiding urllib2. I'm asking because I came into possession of a book called "Web Scraping with Python" by Ryan Mitchell, and her book uses it from the first page. Should I just skip it?

Also, if you have any recommendation on books or video tutorials about Web Scraping I'll be glad to hear it.
Most people now use the requests module, as it handles the boilerplate code for you.
https://pypi.org/project/requests/
You can always learn urllib regardless, of course. Check out our web scraping tutorials for the requests module:
https://python-forum.io/Forum-Web-Scraping
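To show what "doing the boilerplate for you" means in practice, here is a minimal sketch of a requests call (the URL is just an example page):

```python
import requests  # third-party: pip install requests

# requests handles connection setup, redirects, and text decoding for you.
response = requests.get("https://example.com", timeout=10)
response.raise_for_status()        # raise an exception on a 4xx/5xx status
print(response.status_code)        # e.g. 200
print(response.text[:60])          # decoded body as a str, not raw bytes
```

Compare that with urllib.request, where you decode the bytes and check the status yourself.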
I have used it, but rarely, and can't remember exactly why.
I find that I can pretty much do what I need with selenium and beautiful soup (usually use lxml with soup)
and always requests
Earlier I used urllib (urlopen) for downloading files. Note that urlopen lives in urllib.request only in Python 3; in Python 2 it was in urllib itself:

import platform

vers = platform.python_version()
print("Python " + vers)
if vers[0] == "2":
    from urllib import urlopen          # Python 2 location
else:
    from urllib.request import urlopen  # Python 3 location
Not request ... it's requests, a separate and wonderful package.
urllib.request is different from requests. urllib.request is in the standard library, whereas requests is a third-party library that normally has to be installed through pip (pip install requests). There are a lot of third-party libs that most people install alongside Python. bs4 (BeautifulSoup) and selenium usually go hand in hand with requests, with selenium covering JavaScript-rendered pages that plain requests can't scrape.
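A sketch of the usual requests + BeautifulSoup pairing described above (the URL is just an example; lxml is an optional faster parser, "html.parser" is the stdlib fallback):

```python
import requests                  # pip install requests
from bs4 import BeautifulSoup    # pip install beautifulsoup4

resp = requests.get("https://example.com", timeout=10)
resp.raise_for_status()

# Parse the HTML; swap "html.parser" for "lxml" if lxml is installed.
soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.string)                 # the page's <title> text
for link in soup.find_all("a"):          # every anchor tag on the page
    print(link.get("href"))
```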
In other words, maybe reading that book is not the best idea. I'm familiar with the forum tutorials; I'll check for video tutorials and books myself. Currently I'm studying documentation on web scraping (requests, BeautifulSoup, CSS selectors; about to start on selenium). Maybe I should also look for some finished project on GitHub to see how it's done...

p.s. Do you more often use .find() or CSS selectors? Is there any important difference?
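On the .find() vs CSS selector question: BeautifulSoup supports both, and they can usually express the same query; select()/select_one() just takes CSS syntax. A small sketch with made-up HTML:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = '<div class="post"><a href="/page1">first</a><a href="/page2">second</a></div>'
soup = BeautifulSoup(html, "html.parser")

# Equivalent queries, two syntaxes:
a1 = soup.find("a", href=True)            # method-call style
a2 = soup.select_one("div.post a[href]")  # CSS selector style
print(a1["href"], a2["href"])             # /page1 /page1

# Multiple matches: find_all vs select
print([a.text for a in soup.find_all("a")])        # ['first', 'second']
print([a.text for a in soup.select("div.post a")]) # ['first', 'second']
```

The practical difference is mostly taste: CSS selectors are compact for nested/class-based queries, while find()/find_all() is easier for keyword-argument filters.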
Keep reading the book. What you use to fetch the webpage is insignificant in most cases. I use requests and Selenium most of the time, but I used urllib in my earlier Python scripts and it works just fine; I didn't know about requests back then. Another reason you may want to use the built-in module is when you don't have permission to install anything on a machine, so you must use what's already installed. You have to know at least the basics of it.
Quote:In other words maybe reading that book is not the best idea.
I wouldn't judge the book on this. Unless it's very recent and the author is adamant about the urllib thing, you can replace that small portion of code. If the book is older, urllib was the common page fetcher before requests, so it's totally understandable that it would, and should, have been suggested.

It still works and can still be used; it's just that requests offers many more conveniences.
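To illustrate that the stdlib alone is enough for a simple fetch, a minimal Python 3 urllib.request example (again using example.com as a stand-in URL):

```python
from urllib.request import urlopen  # standard library, nothing to install

# urlopen returns a file-like response; read() gives bytes, so decode yourself.
with urlopen("https://example.com", timeout=10) as resp:
    print(resp.status)                   # e.g. 200
    body = resp.read().decode("utf-8")   # manual decoding, unlike requests
print(body[:60])
```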
The book is from 2015.