Python Forum
No links are getting added to the parsed_links variable
#1
I am working on this web crawler and am using the parsed_links variable to store links found in crawled pages.  When I print the variable, though, it is empty.  What is going wrong?  I also tried adding links directly to self.links.

git.r3df0x.com/snippets/1

import requests
from bs4 import BeautifulSoup
import threading
import sys
import re

### DEBUGGING VARIABLES
domain = 'https://en.wikipedia.org'
all_urls = []
crawled_urls = []
operator_email = '[email protected]'
stealth_mode = False
### These will be removed in the main version
## and replaced with ways of taking input
## from the user.

system_user_agent = 'Stratofortress web crawler - Version 0.1 - Operator ' + operator_email
stealth_user_agent = 'Mozilla/5.0 (X11; Linux x86_64; rv:10.0) Gecko/20100101 Firefox/10.0'
user_agent = ''
if (stealth_mode == True):
    user_agent = stealth_user_agent
else:
    user_agent = system_user_agent

class indexer(threading.Thread):
    def __init__(self, url):
        self.__url = url
        self.links = []
        self.domain = 'unset'
    def __get(self, url, stealth=False):
        if (stealth == True):
            user_agent = stealth_user_agent
        else:
            user_agent = system_user_agent
        return requests.get(url, headers={'user-agent': user_agent})
    def run(self):
        parsed_links = []
        r = self.__get(self.__url)
        html = r.text
        soup = BeautifulSoup(html)
        links = soup.findAll('a', href=True)
        for link in links:
            print link['href']
            #if (link['href'].startswith('http://') or link['href'].startswith('https://')):
            if (re.match('(http|https):\/\/*', link['href'])):
                found_domain = link['href'].split('//')[1].split('/')[0]
                print '======= DOMAIN ==== from http(s):// =====> ' + found_domain
                if (found_domain == domain):
                    parsed_links.append(link['href'])
            if (re.match('^(\/\/)*', link['href'])):
                print '======== Matched // URL ===========> ' + link['href']
        print parsed_links

def main():
    i = indexer(domain)
    i.run()

if __name__ == '__main__':
    main()
#2
You run the indexer in another thread, but then never wait for the thread to finish.  So maybe the script finishes before the first page is even finished transferring over the network?

i = indexer(domain)
i.run()

# indexer is now running in a different thread, but you're not waiting for it to finish
# to wait for it to be done, use the .join() method:
i.join()
#3
I added that in main() and it still causes a problem:

Traceback (most recent call last):
  File "crawler.py", line 63, in <module>
    main()
  File "crawler.py", line 60, in main
    i.join()
  File "C:\Python27\lib\threading.py", line 929, in join
    raise RuntimeError("cannot join thread before it is started")
RuntimeError: cannot join thread before it is started
#4
Oh, that's my bad.  You're not supposed to call run(), you call start(), which then starts a new thread, and calls run() from within that thread.

Which means you never started the thread, you just ran it directly in the main thread.  So that's not even the problem.  
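For completeness, here's roughly what the start/join flow looks like, boiled down to a runnable sketch (the run() body is just a placeholder, not your crawling logic).  One other thing your code currently skips: start() only works after threading.Thread's own __init__ has run, so the parent constructor has to be called first.

import threading

class indexer(threading.Thread):
    def __init__(self, url):
        # start() relies on Thread's internal setup, so the parent
        # constructor must be called before anything else
        threading.Thread.__init__(self)
        self.__url = url
        self.links = []

    def run(self):
        # placeholder for the real crawling logic
        self.links.append(self.__url)

def main():
    i = indexer('https://en.wikipedia.org')
    i.start()   # spawns the worker thread, which calls run() for you
    i.join()    # wait here until run() has finished
    print(i.links)

if __name__ == '__main__':
    main()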

...ok.  So it looks like you've got self.links but never use it.  Within run(), you've got a local parsed_links list, which you print out.
If you look at where you're trying to append to parsed_links, you'll see this check: if (found_domain == domain):, which is never true, since found_domain never has a scheme attached.

To illustrate, I took out all the print statements from your code, and added this one:
            if (re.match('(http|https):\/\/*', link['href'])):
                found_domain = link['href'].split('//')[1].split('/')[0]
                print("{0}: {1} == {2}".format(found_domain==domain, domain, found_domain))
                if (found_domain == domain):
                    parsed_links.append(link['href'])
And here's a subset of the output (hopefully the problem is immediately apparent):
Output:
False: https://en.wikipedia.org == lists.wikimedia.org
False: https://en.wikipedia.org == lists.wikimedia.org
False: https://en.wikipedia.org == wikimediafoundation.org
False: https://en.wikipedia.org == commons.wikimedia.org
False: https://en.wikipedia.org == www.mediawiki.org
False: https://en.wikipedia.org == meta.wikimedia.org
False: https://en.wikipedia.org == en.wikibooks.org
False: https://en.wikipedia.org == www.wikidata.org
False: https://en.wikipedia.org == en.wikinews.org
False: https://en.wikipedia.org == en.wikiquote.org
False: https://en.wikipedia.org == en.wikisource.org
False: https://en.wikipedia.org == sk.wikipedia.org
False: https://en.wikipedia.org == sl.wikipedia.org
False: https://en.wikipedia.org == th.wikipedia.org
False: https://en.wikipedia.org == meta.wikimedia.org
False: https://en.wikipedia.org == en.wikipedia.org
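If you do want to keep a same-domain check, one way to fix it (just a sketch; same_host is a helper name I made up) is to compare host names on both sides instead of comparing a bare host against a full URL:

try:
    from urlparse import urlparse        # Python 2
except ImportError:
    from urllib.parse import urlparse    # Python 3

def same_host(href, base_url):
    # Compare only the network locations, so the scheme on base_url
    # no longer makes the comparison fail
    return urlparse(href).netloc == urlparse(base_url).netloc

print(same_host('https://en.wikipedia.org/wiki/Main_Page', 'https://en.wikipedia.org'))  # True
print(same_host('https://lists.wikimedia.org/', 'https://en.wikipedia.org'))             # False

Then the append line becomes if same_host(link['href'], domain): parsed_links.append(link['href']).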
#5
I don't remember what that condition was for, so I'm commenting it out until I decide what to do with it. I'm going to rewrite it to take all the links, store them, and crawl them.
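Roughly what I have in mind for turning every href into an absolute URL before storing it (just a sketch, nothing tested yet; the variable names are placeholders):

try:
    from urlparse import urljoin         # Python 2
except ImportError:
    from urllib.parse import urljoin     # Python 3

page_url = 'https://en.wikipedia.org/wiki/Main_Page'
hrefs = ['/wiki/Web_crawler',                        # relative
         '//en.wikipedia.org/wiki/URL',              # protocol-relative
         'https://en.wikipedia.org/wiki/Hyperlink']  # already absolute

# urljoin resolves relative and protocol-relative hrefs against the
# page they were found on and leaves absolute URLs untouched
absolute_links = [urljoin(page_url, h) for h in hrefs]
print(absolute_links)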

Thanks for helping.
#6
If I had to guess, you were trying to crawl only that domain, not whatever other domains it links to.
But since you're building a crawler, you probably want it to trawl through any domain it can, unless you were specifically trying to build a better Wikipedia search engine, or something.
#7
There's going to be a feature for crawling either one domain or any domain; multiple domains would be preferable. Just one application is to create something better than Google Alerts, since that's dependent on Google crawling sites.
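Roughly what I'm picturing for that toggle, built on a host check like the one above (restrict_to_domain and should_crawl are placeholder names):

def should_crawl(href, base_url, restrict_to_domain=True):
    # With the restriction off, follow every link; otherwise only
    # follow links whose host matches base_url's host
    if not restrict_to_domain:
        return True
    return same_host(href, base_url)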

