Web scraping errors

julan2020

Hi!
Can someone please help a newbie here?
I'm trying to download all the images from xkcd.com, using the code from the book Automate the Boring Stuff with Python. The code does work, but after a while it stops with an error.

Can someone please tell me what to do? I have searched for an answer but can't find a way to make my code keep running or skip over the errors.

Here's the output and the error message:

Output:
Downloading image https://imgs.xkcd.com/comics/election_night.png...
Downloading page https://xkcd.com/2067/...
Downloading image https:/2067/asset/challengers_header.png...

Error:
Traceback (most recent call last):
  File "C:/Users/Bruker/Desktop/IBE151 Practic. Program/Assignment/Web Scraping/scraping_test3.py", line 27, in <module>
    res = requests.get(comicUrl)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 516, in request
    prep = self.prepare_request(req)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\sessions.py", line 449, in prepare_request
    p.prepare(
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\models.py", line 314, in prepare
    self.prepare_url(url, params)
  File "C:\Users\Bruker\AppData\Local\Programs\Python\Python38-32\lib\site-packages\requests\models.py", line 391, in prepare_url
    raise InvalidURL("Invalid URL %r: No host supplied" % url)
requests.exceptions.InvalidURL: Invalid URL 'https:/2067/asset/challengers_header.png': No host supplied

Here is my code:

import requests
import bs4
import os

url = 'https://xkcd.com'               # starting url
os.makedirs('xkcd2', exist_ok=True)    # store comics in ./xkcd2
while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    try:
        res.raise_for_status()
    except Exception as exc:
        print('There was a problem: %s' % (exc))
        pass

    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # TODO: Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if comicElem == []:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % (comicUrl))
        res = requests.get(comicUrl)
        try:
            res.raise_for_status()
        except Exception as exc:
            print('There was a problem: %s' % (exc))
            pass

    # TODO: Download the image.

    # TODO: Save the image to ./xkcd.
    imageFile = open(os.path.join('xkcd2', os.path.basename(comicUrl)),'wb')
    for chunk in res.iter_content(100000):
        imageFile.write(chunk)
    imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')

    # TODO: Get the Prev button's url.

print('Done.')

**buran** · Oct-29-2020, 06:06 AM

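Check line 24: you don't construct a proper URL. On page 2067 the image's `src` is a path (`/2067/asset/challengers_header.png`) rather than the usual `//imgs.xkcd.com/...` form, so prepending `'https:'` gives `https:/2067/asset/challengers_header.png`, which has no host.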
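
One way to build the URL so that it works for both kinds of `src` is `urllib.parse.urljoin` from the standard library, which resolves a link against the page it was found on. This is only a minimal sketch: the helper name `resolve_image_url` is made up for illustration, and the two `src` values are inferred from the output above rather than read from the live pages.

```python
from urllib.parse import urljoin


def resolve_image_url(page_url, src):
    """Return an absolute URL for an <img> src found on page_url.

    Hypothetical helper, not part of the book's code.
    """
    return urljoin(page_url, src)


page_url = 'https://xkcd.com/2067/'

# Protocol-relative src (the usual form): only the scheme is taken from the page.
print(resolve_image_url(page_url, '//imgs.xkcd.com/comics/election_night.png'))
# -> https://imgs.xkcd.com/comics/election_night.png

# Root-relative src (the one that failed): scheme and host are taken from the page.
print(resolve_image_url(page_url, '/2067/asset/challengers_header.png'))
# -> https://xkcd.com/2067/asset/challengers_header.png
```

In the script, this would mean replacing line 24 with something like `comicUrl = urljoin(url, comicElem[0].get('src'))`. Note also that `InvalidURL` is raised inside `requests.get` itself, before `raise_for_status()` is reached, so the existing `try` block does not catch it; the `requests.get` call would have to sit inside the `try` for that exception to be caught at all.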
