Python Forum
Web scraping errors
#1
Hi!
Can someone please help a newbie here?
I'm trying to download all the images from xkcd.com, using the code from the book Automate the Boring Stuff with Python. The code does work, but after a while it stops with an error.

Can someone please tell me what to do? I have searched for an answer but can't find a way to make my code keep running or skip over the errors.

Here's the error message (in red):

Output:
Downloading image https://imgs.xkcd.com/comics/election_night.png...
Downloading page https://xkcd.com/2067/...
Downloading image https:/2067/asset/challengers_header.png...
Error:
Traceback (most recent call last):
  File "C:/Users/Bruker/Desktop/IBE151 Practic. Program/Assignment/Web Scraping/scraping_test3.py", line 27, in <module>
    res = requests.get(comicUrl)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 449, in prepare_request
    p.prepare(
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\models.py", line 391, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:/2067/asset/challengers_header.png': No host supplied
Here is my code:
import requests
import bs4
import os

url = 'https://xkcd.com'               # starting url
os.makedirs('xkcd2', exist_ok=True)    # store comics in ./xkcd2
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        pass

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # TODO: Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        try:
            res.raise_for_status()
        except Exception as exc:
            print('There was a problem: %s' % (exc))
            pass

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.
    imageFile = open(os.path.join('xkcd2', os.path.basename(comicUrl)),'wb')
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')

    # TODO: Get the Prev button's url.

print('Done.')
#2
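Check line 24 of your script (comicUrl = 'https:' + comicElem[0].get('src')): you don't construct a proper URL there. On page 2067 the src of the image is /2067/asset/challengers_header.png, a path relative to the site root with no host in it, so prepending 'https:' gives 'https:/2067/asset/challengers_header.png', which is exactly the invalid URL in your traceback.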
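One possible way to build the image URL is urllib.parse.urljoin from the standard library, which resolves the src against the URL of the page it was found on. This is only a minimal sketch: build_image_url is a hypothetical helper name, not something from the book or your script, and the page number used for the election_night comic is an assumption based on your output.

```python
from urllib.parse import urljoin


def build_image_url(page_url, src):
    """Resolve an <img> src against the URL of the page it came from.

    build_image_url is a hypothetical helper, not part of the original script.
    """
    # urljoin copes with scheme-relative ('//host/path'), root-relative
    # ('/path') and already-absolute src values.
    return urljoin(page_url, src)


# The two src shapes implied by the output in the question.
# The page URL for election_night.png is assumed to be https://xkcd.com/2068/.
print(build_image_url('https://xkcd.com/2068/',
                      '//imgs.xkcd.com/comics/election_night.png'))
print(build_image_url('https://xkcd.com/2067/',
                      '/2067/asset/challengers_header.png'))
```

In the script this would replace the 'https:' + ... concatenation with something like comicUrl = build_image_url(url, comicElem[0].get('src')). If you also want the loop to keep going when a single download fails, catching requests.exceptions.RequestException (InvalidURL is a subclass of it) around the requests.get call is one option, as long as the code still moves on to the Prev link afterwards.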