Python Forum
Web Scraping Application
#1
Hello everybody! I have a web crawler project, this is the main part of the code:
# crawler
import re
import urllib.request

# prepare_expression, buildNaryParseTree, nary_tree_tolist and str_regex
# are assumed to come from this module
from parse_tree import *

url = '''some url here'''
s = set()     # URLs already visited
List = []     # parse trees collected so far

def f_go(List, s, url, iter_cnt):
    try:
        if url in s:          # skip URLs we have already crawled
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            print("here", url)
        h = html.decode("utf-8")
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst1 = []
        lst2 = nary_tree_tolist(ntr, lst1)
        List.append(lst2)
        # collect every URL-like substring in the page and recurse into it
        l1 = [m.group() for m in re.finditer(str_regex, h)]
        for exp in l1:
            f_go(List, s, exp, iter_cnt + 1)
    except Exception:   # a bare except would also swallow KeyboardInterrupt
        return
I'm afraid that this is going to be slow (even if I add concurrency), and I'm seriously thinking about rewriting it in Go. All it does is open URLs recursively in a loop, parse them (this will be done in a database), and save the results to a data structure. Is there a chance to improve speed here?
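To be concrete, by "adding concurrency" I mean something along these lines (a rough sketch; `fetch` here is a made-up placeholder for the real download, and the parsing step is left out):

```python
# Rough sketch: download a batch of URLs in parallel with a thread pool.
# fetch() is a made-up placeholder; the real version would call
# urllib.request.urlopen(url).read().
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    return "page for " + url  # placeholder download

def crawl_batch(urls, seen):
    new = []
    for u in urls:
        if u not in seen:     # skip URLs already visited
            seen.add(u)
            new.append(u)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page
        return dict(zip(new, pool.map(fetch, new)))

pages = crawl_batch(["http://a", "http://b", "http://a"], set())
print(pages)
```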
Reply
#2
I would say that you are using the wrong tools; you can look at my tutorials here: part-1, part-2.
In part-2 I talk a little about concurrency.

lxml is one of the fastest parsers in any language (it has a C library at its core).
It can be used through BeautifulSoup, BeautifulSoup(html, 'lxml'), or alone.
Use Requests; then you can drop the decode stuff, as you get a correctly encoded page back.
Regex is a really bad tool for HTML; a funny answer.

Scrapy is fast; it has built-in concurrency through Twisted.
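A minimal sketch of that combo (the inline HTML here just stands in for a downloaded page; with Requests you would get it as requests.get(url).text):

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a downloaded page; with Requests you would
# fetch it as html = requests.get(url).text (already correctly decoded).
html = """
<html><body>
  <a href="https://python-forum.io/">Forum</a>
  <a href="https://docs.python.org/">Docs</a>
</body></html>
"""

soup = BeautifulSoup(html, 'lxml')              # lxml as the fast backend
links = [a['href'] for a in soup.find_all('a')]  # all link targets on the page
print(links)
```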
Reply
#3
(Feb-10-2017, 09:46 AM)snippsat Wrote: I would say that you are using the wrong tools; you can look at my tutorials here: part-1, part-2. [...] Scrapy is fast; it has built-in concurrency through Twisted.
Thanks for the reply! Looks like I don't need to repeat the work; there are already tools that do it.
lxml looks impressive, absolutely worth checking out! I know regex is the weak link here, but I'm not using it to parse. I just extract a domain, for example, "give me everything under science.bbc.com". To be fair, I don't see an alternative to regex here!
Reply
#4
(Feb-10-2017, 09:58 AM)lion137 Wrote: I just extract a domain, for example, "give me everything under science.bbc.com". To be fair, I don't see an alternative to regex here!
Regex is okay here; whether that is what slows you down has to be measured.
Remember to use re.compile() when applying a pattern repeatedly.
$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
There are libraries for this too; whether they are faster has to be measured with a tool like timeit.
Profiling is another way to see what takes time in your function.
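For example, a quick cProfile run (slow_part here is just a made-up stand-in for whatever the crawler's hot loop is):

```python
import cProfile
import io
import pstats
import re

def slow_part():
    # made-up stand-in for the slow part of the crawler
    pat = re.compile(r"\d+")
    return sum(int(m.group()) for m in pat.finditer("1 22 333 " * 1000))

profiler = cProfile.Profile()
profiler.enable()
result = slow_part()
profiler.disable()

# print the five entries with the highest cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```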
Quick test of tldextract with Python 3.6.
>>> import tldextract

>>> url = 'https://python-forum.io/'
>>> ext = tldextract.extract(url)
>>> ext.registered_domain
'python-forum.io'
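For completeness, if you only need the raw hostname rather than the registered domain, the stdlib can do it without any third-party package (a small sketch; the URL is made up):

```python
from urllib.parse import urlparse

# Made-up URL just for illustration.
url = "https://science.bbc.com/articles/42"
host = urlparse(url).hostname  # network location without port or credentials
print(host)  # science.bbc.com
```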
Reply
#5
Thanks, I will definitely profile it if concurrency doesn't do the job. I use Jupyter Notebook; there are awesome tools for that (and more)!
Reply

