Python Forum

Web crawler not returning links
Hello, I am new to coding and have tried to write a web crawler that retrieves the URLs from a web page of my choosing. However, when I run the program it does not print the URLs, and it does not report any error either. If someone could take a look at my code and tell me what I have done wrong, your help would be greatly appreciated. (Note: the code is being run in the latest version of the PyCharm IDE, if that is relevant.)

import requests
from bs4  import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "" + str(page) #url here
    source_code= requests.geturl
    plain_text = source_code.txt
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll("a", {"class": "rc"}):
        href ="" + link.get("href") # place website name in the quotes

Your code has never been run, at least not successfully!
You have several major issues. The code should look like this:
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "" + str(page)  # url here
        source_code = requests.get(url)
        plain_text = source_code.content
        print(f'size: {len(plain_text)}')
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll("a"):
            href = link.get("href")  # None if the tag has no href attribute
            if href:
                print("" + href)  # place website name in the quotes
        page += 1

Watch your indentation, and look at the code changes; there are quite a few.

This code results in (partial listing):
Thank you for your help, Larz60+; it is working now. If I may ask out of curiosity, what does the "lxml" do? Also, for future reference, why was using the class of the element incorrect on line 15?
lxml is a parser. If you get an error when trying to use it, you need to install it with
pip install lxml
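
To illustrate both follow-up questions, here is a minimal sketch that runs against a small made-up HTML snippet rather than a real site (the markup, link paths, and class names are assumptions for illustration only). The second argument to BeautifulSoup names the parser that turns the raw HTML into a searchable tree: "html.parser" ships with Python, while "lxml" is a third-party parser that has to be installed separately. Filtering on {"class": "rc"} is not wrong in itself, but it keeps only the <a> tags that carry that class, so on a page whose links do not use it the loop has nothing to iterate over.

```python
from bs4 import BeautifulSoup

# Hypothetical page content, invented for this example.
html = """
<html><body>
  <a class="rc" href="/first">first</a>
  <a href="/second">second</a>
  <a class="other" href="/third">third</a>
</body></html>
"""

# "html.parser" is the parser bundled with Python; swapping in "lxml"
# works the same way here once lxml is installed.
soup = BeautifulSoup(html, "html.parser")

# No filter: every <a> tag in the document is returned.
all_links = [a.get("href") for a in soup.find_all("a")]

# Class filter: only <a> tags whose class attribute includes "rc".
rc_links = [a.get("href") for a in soup.find_all("a", {"class": "rc"})]

print(all_links)
print(rc_links)
```

If the filtered list comes back empty on a real page, it is worth checking the page source to confirm which class, if any, the links actually use.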