Python Forum
Web crawler not returning links
#1
Hello, I am new to coding and have tried to make a web crawler that retrieves the URLs from a web page of my choosing. However, when I run the program it does not print the URLs, yet it does not report any specific error either. If someone could take a look at my code and tell me what I have done wrong, your help would be greatly appreciated. (Note: the code is being run in the latest version of the PyCharm IDE, if that is relevant.)

import requests
from bs4  import BeautifulSoup
 
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(page) #url here
    source_code= requests.geturl
    plain_text = source_code.txt
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll("a", {"class": "rc"}):
        href ="https://www.google.com" + link.get("href") # place website name in the qoutes
        print(href)
         
 
spider(1)
#2
Your code has never been run, at least not successfully!
You have several major issues. The code should look like:
import requests
from bs4 import BeautifulSoup
 
 
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(
            page)  # url here
        source_code = requests.get(url)
        plain_text = source_code.content
        print(f'size: {len(plain_text)}')
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll("a"):
            href = "https://www.google.com" + link.get("href")  # place website name in the quotes
            print(href)
        page += 1
 
 
spider(1)
Watch your indentation and look at the code changes; there are quite a few.

This code results in (partial listing):
Output:
https://www.google.comhttps://www.google.com/search?hl=en&tbm=isch&source=og&tab=wi
https://www.google.comhttps://maps.google.com/maps?hl=en&tab=wl
https://www.google.comhttps://play.google.com/?hl=en&tab=w8
https://www.google.comhttps://www.youtube.com/results?gl=US&tab=w1
https://www.google.comhttps://news.google.com/nwshp?hl=en&tab=wn
https://www.google.comhttps://mail.google.com/mail/?tab=wm
https://www.google.comhttps://drive.google.com/?tab=wo
https://www.google.comhttps://www.google.com/intl/en/options/
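The `page += 1` at the end of the loop is the change that matters most for termination; without it the `while` condition never becomes false. Here is a network-free sketch of just the pagination logic (the `start=` query parameter is only illustrative, not taken from the code above):

```python
def build_page_urls(max_pages):
    """Collect one search URL per page; mirrors the loop shape in spider()."""
    page = 1
    urls = []
    while page <= max_pages:
        # Without `page += 1` below, `page` stays 1 and this loop never ends.
        urls.append("https://www.google.com/search?q=example&start=" + str(page))
        page += 1
    return urls

print(build_page_urls(3))
```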
#3
Thank you for your help, Larz60+; it is working now. If I may ask out of curiosity, what does the "lxml" argument do? Also, for future reference, why was filtering on the element's class ({"class": "rc"}) incorrect?
#4
lxml is a parser. If you get an error when trying to use it, you need to install it with
pip install lxml
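For comparison, Python's built-in "html.parser" works without installing anything; "lxml" is simply faster and more lenient with malformed markup. A minimal sketch (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><a href="/first">one</a><a href="/second">two</a></body></html>'

# "html.parser" ships with Python; swap in "lxml" once it is installed.
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    print(link.get("href"))
```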

