Python Forum
Web crawler not returning links
#1
Hello, I am new to coding and have tried to make a web crawler to retrieve the URLs from a web page of my choosing. However, when I run the program it does not print the URLs, yet it does not report any error. If someone could take a look at my code and tell me what I have done wrong, your help would be greatly appreciated. (Note: the code is being run in the latest version of the PyCharm IDE, if that is relevant.)

import requests
from bs4  import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(page) #url here
    source_code= requests.geturl
    plain_text = source_code.txt
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll("a", {"class": "rc"}):
        href ="https://www.google.com" + link.get("href") # place website name in the qoutes
        print(href)
        

spider(1)
#2
Your code has never been run, at least not successfully!
You have several major issues. The code should look like:
import requests
from bs4 import BeautifulSoup


def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(page)  # url here
        source_code = requests.get(url)
        plain_text = source_code.content
        print(f'size: {len(plain_text)}')
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll("a"):
            href = "https://www.google.com" + link.get("href")  # place website name in the quotes
            print(href)
        page += 1


spider(1)
Watch your indentation and look at the code changes; there are quite a few.

This code results in (partial listing):
Output:
https://www.google.comhttps://www.google.com/search?hl=en&tbm=isch&source=og&tab=wi
https://www.google.comhttps://maps.google.com/maps?hl=en&tab=wl
https://www.google.comhttps://play.google.com/?hl=en&tab=w8
https://www.google.comhttps://www.youtube.com/results?gl=US&tab=w1
https://www.google.comhttps://news.google.com/nwshp?hl=en&tab=wn
https://www.google.comhttps://mail.google.com/mail/?tab=wm
https://www.google.comhttps://drive.google.com/?tab=wo
https://www.google.comhttps://www.google.com/intl/en/options/
#3
Thank you for your help Larz60+, it is working now. If I may ask out of curiosity, what does the "lxml" do? Also, for future reference, why was using the class of the element incorrect on line 15?
#4
lxml is a parser. If you get an error when trying to use it, you need to install it with:
pip install lxml
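For illustration, here is a minimal sketch of what the second argument to BeautifulSoup does. It selects which parser turns the raw HTML into a tree; "html.parser" is bundled with Python, while "lxml" is a faster third-party parser that has to be installed separately. The sample HTML below is made up just for the demo:

```python
from bs4 import BeautifulSoup

# A tiny made-up snippet standing in for a downloaded page.
html = '<a class="rc" href="/search?q=python">Python</a>'

# "html.parser" ships with Python, so this always works.
# Swapping in "lxml" gives the same tree here, but only if
# lxml is installed (pip install lxml).
soup = BeautifulSoup(html, "html.parser")

link = soup.find("a")
print(link.get("href"))   # prints: /search?q=python
print(link.get("class"))  # prints: ['rc']
```

For well-formed HTML like this, both parsers produce the same result; the difference shows up in speed and in how each one recovers from broken markup.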