Python Forum
Web crawler not returning links
#1
Hello, I am new to coding and have tried to make a web crawler to retrieve the URLs from a web page of my choosing. However, when I run the program it does not print the URLs, yet it does not report any error. If someone could take a look at my code and tell me what I have done wrong, your help would be greatly appreciated. (Note: the code is being run in the latest version of the PyCharm IDE, if that is relevant.)

import requests
from bs4  import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(page) #url here
    source_code= requests.geturl
    plain_text = source_code.txt
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll("a", {"class": "rc"}):
        href ="https://www.google.com" + link.get("href") # place website name in the qoutes
        print(href)
        

spider(1)
#2
Your code has never been run, at least not successfully!
You have several major issues. The code should look like:
import requests
from bs4 import BeautifulSoup


def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(page)  # url here
        source_code = requests.get(url)
        plain_text = source_code.content
        print(f'size: {len(plain_text)}')
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll("a"):
            href = "https://www.google.com" + link.get("href")  # place website name in the quotes
            print(href)
        page += 1


spider(1)
Watch your indentation and look at the code changes; there are quite a few.

This code results in (partial listing):
Output:
https://www.google.comhttps://www.google.com/search?hl=en&tbm=isch&source=og&tab=wi
https://www.google.comhttps://maps.google.com/maps?hl=en&tab=wl
https://www.google.comhttps://play.google.com/?hl=en&tab=w8
https://www.google.comhttps://www.youtube.com/results?gl=US&tab=w1
https://www.google.comhttps://news.google.com/nwshp?hl=en&tab=wn
https://www.google.comhttps://mail.google.com/mail/?tab=wm
https://www.google.comhttps://drive.google.com/?tab=wo
https://www.google.comhttps://www.google.com/intl/en/options/
#3
Thank you for your help Larz60+, it is working now. If I may ask out of curiosity, what does the "lxml" do? Also, for future reference, why was using the class of the element incorrect on line 15?
#4
lxml is a parser. If you get an error when trying to use it, you need to install it with:
pip install lxml
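For illustration, here is a minimal sketch of what the second argument to BeautifulSoup does. It selects which parser turns the raw HTML into a tree; "html.parser" is bundled with Python, while "lxml" is a faster third-party parser that has to be installed separately. The sample HTML below is made up just for the demo:

```python
from bs4 import BeautifulSoup

# A tiny made-up snippet standing in for a downloaded page.
html = '<a class="rc" href="/search?q=python">Python</a>'

# "html.parser" ships with Python, so this always works.
# Swapping in "lxml" gives the same tree here, but only if
# lxml is installed (pip install lxml).
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
print(link.get("href"))   # prints: /search?q=python
print(link.get("class"))  # prints: ['rc']
```

For well-formed HTML like this, both parsers produce the same result; the difference shows up in speed and in how each one recovers from broken markup.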