Python Forum
Web Scraping Application
#1
Hello everybody! I have a web crawler project, this is the main part of the code:
# crawler
import re
import urllib.request

# prepare_expression, buildNaryParseTree, nary_tree_tolist and str_regex
# are assumed to come from this module
from parse_tree import *

url = '''some url here'''
s = set()     # URLs already visited
List = []     # parse trees collected so far

def f_go(List, s, url, iter_cnt):
    try:
        if url in s:          # skip URLs we have already crawled
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            print("here", url)
        h = html.decode("utf-8")
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst1 = []
        lst2 = nary_tree_tolist(ntr, lst1)
        List.append(lst2)
        # collect every URL-like substring in the page and recurse into it
        l1 = [m.group() for m in re.finditer(str_regex, h)]
        for exp in l1:
            f_go(List, s, exp, iter_cnt + 1)
    except Exception:   # a bare except would also swallow KeyboardInterrupt
        return
I'm afraid that this is going to be slow (even if I add concurrency), and I'm seriously thinking about rewriting it in Go. All it does is open URLs recursively in a loop, parse them (this will be done in a database), and save the results to a data structure. Is there a chance to improve speed here?
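To be concrete, by "adding concurrency" I mean something along these lines (a rough sketch; `fetch` here is a made-up placeholder for the real download, and the parsing step is left out):

```python
# Rough sketch: download a batch of URLs in parallel with a thread pool.
# fetch() is a made-up placeholder; the real version would call
# urllib.request.urlopen(url).read().
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    return "page for " + url  # placeholder download

def crawl_batch(urls, seen):
    new = []
    for u in urls:
        if u not in seen:     # skip URLs already visited
            seen.add(u)
            new.append(u)
    with ThreadPoolExecutor(max_workers=8) as pool:
        # pool.map preserves input order, so zip pairs each URL with its page
        return dict(zip(new, pool.map(fetch, new)))

pages = crawl_batch(["http://a", "http://b", "http://a"], set())
print(pages)
```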
Reply
#2
I would say that you are using the wrong tools; you can look at my tutorials here: part-1, part-2.
In part-2 I talk a little about concurrency.

lxml is one of the fastest parsers in any language (it has a C library at its core).
It can be used through BeautifulSoup, BeautifulSoup(html, 'lxml'), or alone.
Use Requests; then you can drop the decode stuff, as you get a correctly encoded page back.
Regex is a really bad tool for HTML; a funny answer.

Scrapy is fast; it has built-in concurrency through Twisted.
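A minimal sketch of that combo (the inline HTML here just stands in for a downloaded page; with Requests you would get it as requests.get(url).text):

```python
from bs4 import BeautifulSoup

# Inline HTML stands in for a downloaded page; with Requests you would
# fetch it as html = requests.get(url).text (already correctly decoded).
html = """
<html><body>
  <a href="https://python-forum.io/">Forum</a>
  <a href="https://docs.python.org/">Docs</a>
</body></html>
"""

soup = BeautifulSoup(html, 'lxml')              # lxml as the fast backend
links = [a['href'] for a in soup.find_all('a')]  # all link targets on the page
print(links)
```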
Reply
#3
(Feb-10-2017, 09:46 AM)snippsat Wrote: I would say that you are using the wrong tools; you can look at my tutorials here: part-1, part-2. [...] Scrapy is fast; it has built-in concurrency through Twisted.
Thanks for the reply! Looks like I don't need to repeat the work; there are already tools that do it.
lxml looks impressive, absolutely worth checking out! I know regex is the weak link here, but I'm not using it to parse. I just extract a domain, for example, "give me everything under science.bbc.com". To be fair, I don't see an alternative to regex here!
Reply
#4
(Feb-10-2017, 09:58 AM)lion137 Wrote: I just extract a domain, for example, "give me everything under science.bbc.com". To be fair, I don't see an alternative to regex here!
Regex is okay here; whether that is what slows you down has to be measured.
Remember to use re.compile() when applying a pattern repeatedly.
$ python -m timeit -s "import re" "re.match('hello', 'hello world')"
100000 loops, best of 3: 3.82 usec per loop

$ python -m timeit -s "import re; h=re.compile('hello')" "h.match('hello world')"
1000000 loops, best of 3: 1.26 usec per loop
There are libraries for this too; whether they are faster has to be measured with a tool like timeit.
Profiling is another way to see what takes time in your function.
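For example, a quick cProfile run (slow_part here is just a made-up stand-in for whatever the crawler's hot loop is):

```python
import cProfile
import io
import pstats
import re

def slow_part():
    # made-up stand-in for the slow part of the crawler
    pat = re.compile(r"\d+")
    return sum(int(m.group()) for m in pat.finditer("1 22 333 " * 1000))

profiler = cProfile.Profile()
profiler.enable()
result = slow_part()
profiler.disable()

# print the five entries with the highest cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```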
Quick test of tldextract with Python 3.6.
>>> import tldextract

>>> url = 'https://python-forum.io/'
>>> ext = tldextract.extract(url)
>>> ext.registered_domain
'python-forum.io'
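For completeness, if you only need the raw hostname rather than the registered domain, the stdlib can do it without any third-party package (a small sketch; the URL is made up):

```python
from urllib.parse import urlparse

# Made-up URL just for illustration.
url = "https://science.bbc.com/articles/42"
host = urlparse(url).hostname  # network location without port or credentials
print(host)  # science.bbc.com
```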
Reply
#5
Thanks, I will definitely profile it if concurrency doesn't do the job. I use Jupyter Notebook; there are awesome tools for that (and more)!
Reply

