Python Forum

Web Scraping Application
Hello everybody! I have a web crawler project, this is the main part of the code:
# crawler
import urllib.request
import re
# prepare_expression, buildNaryParseTree, nary_tree_tolist and str_regex
# come from the project's parse_tree module
from parse_tree import *

url = '''some url here'''
s = set()      # URLs already visited
List = []      # parse trees collected so far

def f_go(List, s, url, iter_cnt):
    try:
        if url in s:    # skip pages we have already crawled
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            print("here", url)
        h = html.decode("utf-8")
        # parse the page and store its tree as a nested list
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst1 = []
        lst2 = nary_tree_tolist(ntr, lst1)
        List.append(lst2)
        # pull every URL matching str_regex out of the page and recurse
        l1 = [tok.group() for tok in re.finditer(str_regex, h)]
        for exp in l1:
            f_go(List, s, exp, iter_cnt + 1)
    except Exception:   # give up on this page and carry on with the rest
        return
I'm afraid that this is going to be slow (even if I add concurrency), and I'm seriously thinking about rewriting it in Go. All it does is open URLs recursively in a loop, parse them (that part will be handled in a database), and save the result to a data structure. Is there any chance to improve speed here?
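To show the direction I had in mind for the concurrent version, here is a rough sketch using a thread pool; extract_links() is a made-up placeholder for the regex scan, not part of the code above:
# Sketch only: breadth-first crawl with a thread pool.
# extract_links(html) is a hypothetical helper standing in for the regex scan.
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Download one page; return (url, html) or (url, None) on failure.
    try:
        with urllib.request.urlopen(url) as response:
            return url, response.read().decode("utf-8")
    except Exception:
        return url, None

def crawl(start_url, max_pages=100, workers=8):
    seen = {start_url}
    frontier = [start_url]
    pages = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while frontier and len(pages) < max_pages:
            results = pool.map(fetch, frontier)   # fetch one level of links concurrently
            frontier = []
            for url, html in results:
                if html is None:
                    continue
                pages[url] = html
                for link in extract_links(html):  # hypothetical helper
                    if link not in seen:
                        seen.add(link)
                        frontier.append(link)
    return pages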
I would say that you are using the wrong tools; you can look at my tutorials here: part-1, part-2.
In part-2 I talk a little about concurrency.

lxml is one of the fastest parsers in any language (it has a C library at its core).
It can be used through BeautifulSoup, as BeautifulSoup(html, 'lxml'), or on its own.
Use Requests; then you can drop the decode step, since you get a correctly encoded page back.
Regex is a really bad tool for HTML; there is a funny answer about that.

Scrapy is fast; it has built-in concurrency through Twisted.
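A minimal sketch of what that looks like, with Requests doing the download and lxml doing the parsing through BeautifulSoup (the URL is just a placeholder):
# Sketch: fetch a page with Requests and pull the links out with
# BeautifulSoup using the lxml parser.
import requests
from bs4 import BeautifulSoup

response = requests.get('https://python-forum.io/')
soup = BeautifulSoup(response.text, 'lxml')   # response.text is already decoded
links = [a['href'] for a in soup.find_all('a', href=True)]
print(links[:5])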
(Feb-10-2017, 09:46 AM)snippsat Wrote: I would say that you are using the wrong tools; you can look at my tutorials here: part-1, part-2.
Thanks for the reply! Looks like I don't need to repeat the work; there are tools that do it.
lxml looks impressive, absolutely worth checking out! I know the regex is the weak link here, but I'm not using it to parse; I just extract a domain, for example, give me everything under science.bbc.com. To be fair, I don't see an alternative to regex here!
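Roughly what I mean, as a sketch only (the pattern and names are made up for illustration):
# Sketch: pull URLs out of the page text and keep only the ones under a
# given domain, e.g. science.bbc.com.  Pattern and names are illustrative.
import re

url_pattern = re.compile(r'https?://[^\s"\'<>]+')

def urls_in_domain(html, domain):
    # Keep every matched URL whose host is the domain or one of its subdomains.
    found = []
    for url in url_pattern.findall(html):
        host = url.split('/')[2]
        if host == domain or host.endswith('.' + domain):
            found.append(url)
    return found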
(Feb-10-2017, 09:58 AM)lion137 Wrote: I just extract a domain, for example, give me everything under science.bbc.com. To be fair, I don't see an alternative to regex here!
Regex is okay here; whether that is what slows you down has to be measured.
Remember to use re.compile() when you use a pattern repeatedly.
$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
There are libraries for this too; whether they are faster has to be measured with a tool like timeit.
The profile module is another tool for looking at what takes the time in your function.
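A quick sketch of a profile run, using the names from your code above (f_go, List, s, url):
# Sketch: profile one crawl call with cProfile, sorted by cumulative time.
import cProfile

cProfile.run('f_go(List, s, url, 0)', sort='cumulative')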
Quick test of tldextract with Python 3.6.
>>> import tldextract

>>> url = 'https://python-forum.io/'
>>> ext = tldextract.extract(url)
>>> ext.registered_domain
'python-forum.io'
Thanks, I will definitely profile it if concurrency doesn't do the job. I use Jupyter Notebook; there are awesome tools there to do it (and more)!
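For example, in a notebook cell (a sketch only; str_regex, h, f_go, List, s and url are the names from my code above):
# %timeit times a single expression, %prun profiles a call with cProfile.
%timeit list(re.finditer(str_regex, h))
%prun f_go(List, s, url, 0)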