Python Forum
Web crawler not returning links
#1
Hello, I am new to coding and have tried to make a web crawler that retrieves the URLs from a web page of my choosing. However, when I run the program it does not print the URLs, yet it does not report any specific error either. If someone could take a look at my code and tell me what I have done wrong, your help would be greatly appreciated. (Note: the code is being run in the latest version of the PyCharm IDE, if that is relevant.)

import requests
from bs4  import BeautifulSoup
 
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(page) #url here
    source_code= requests.geturl
    plain_text = source_code.txt
    soup = BeautifulSoup(plain_text)
    for link in soup.findAll("a", {"class": "rc"}):
        href ="https://www.google.com" + link.get("href") # place website name in the qoutes
        print(href)
         
 
spider(1)
#2
Your code has never been run, at least not successfully!
You have several major issues. The code should look like:
import requests
from bs4 import BeautifulSoup
 
 
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = "https://www.google.com/search?q=making+a+clock+in+python&ie=utf-8&oe=utf-8&client=firefox-b-1" + str(
            page)  # url here
        source_code = requests.get(url)
        plain_text = source_code.content
        print(f'size: {len(plain_text)}')
        soup = BeautifulSoup(plain_text, "lxml")
        for link in soup.findAll("a"):
            href = "https://www.google.com" + link.get("href")  # place website name in the quotes
            print(href)
        page += 1
 
 
spider(1)
Watch your indentation and look at the code changes; there are quite a few.

This code results in (partial listing):
Output:
https://www.google.comhttps://www.google.com/search?hl=en&tbm=isch&source=og&tab=wi
https://www.google.comhttps://maps.google.com/maps?hl=en&tab=wl
https://www.google.comhttps://play.google.com/?hl=en&tab=w8
https://www.google.comhttps://www.youtube.com/results?gl=US&tab=w1
https://www.google.comhttps://news.google.com/nwshp?hl=en&tab=wn
https://www.google.comhttps://mail.google.com/mail/?tab=wm
https://www.google.comhttps://drive.google.com/?tab=wo
https://www.google.comhttps://www.google.com/intl/en/options/
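The `page += 1` at the end of the loop is the change that matters most for termination; without it the `while` condition never becomes false. Here is a network-free sketch of just the pagination logic (the `start=` query parameter is only illustrative, not taken from the code above):

```python
def build_page_urls(max_pages):
    """Collect one search URL per page; mirrors the loop shape in spider()."""
    page = 1
    urls = []
    while page <= max_pages:
        # Without `page += 1` below, `page` stays 1 and this loop never ends.
        urls.append("https://www.google.com/search?q=example&start=" + str(page))
        page += 1
    return urls

print(build_page_urls(3))
```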
#3
Thank you for your help, Larz60+; it is working now. If I may ask out of curiosity, what does the "lxml" argument do? Also, for future reference, why was filtering on the element's class ({"class": "rc"}) incorrect?
#4
lxml is a parser. If you get an error when trying to use it, you need to install it with
pip install lxml
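For comparison, Python's built-in "html.parser" works without installing anything; "lxml" is simply faster and more lenient with malformed markup. A minimal sketch (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><a href="/first">one</a><a href="/second">two</a></body></html>'

# "html.parser" ships with Python; swap in "lxml" once it is installed.
soup = BeautifulSoup(html, "html.parser")
for link in soup.find_all("a"):
    print(link.get("href"))
```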

