Python Forum
Web Scraping Application
#1
Hello everybody! I have a web crawler project; this is the main part of the code:
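# crawler
import re
import urllib.request

# parse_tree is the poster's own module. str_regex (the link pattern) is not
# defined in this snippet and is assumed to come from that module as well.
from parse_tree import *

url = '''some url here'''
s = set()   # URLs that have already been visited
List = []   # one parsed-tree list per crawled page


def f_go(List, s, url, iter_cnt):
    # iter_cnt is the current recursion depth; it is passed along but is not
    # yet used to limit how deep the crawl goes.
    try:
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            print("here", url)
        h = html.decode("utf-8")
        lst0 = prepare_expression(list(h))
        ntr = buildNaryParseTree(lst0)
        lst1 = []
        lst2 = nary_tree_tolist(ntr, lst1)
        List.append(lst2)
        # Collect every substring of the page that matches the link pattern.
        l1 = [tok.group(0) for tok in re.finditer(str_regex, h)]
        for exp in l1:
            f_go(List, s, exp, iter_cnt + 1)
    except Exception:
        # Skip pages that fail to download, decode, or parse. Unlike a bare
        # "except:", this still lets Ctrl+C stop the crawl.
        return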
I'm afraid this is going to be slow (even if I add concurrency), and I'm seriously thinking about rewriting it in Go. All it does is open URLs recursively, parse each page (this will be done in the database), and save the result to a data structure. Is there a chance to improve the speed here?
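For reference, a crawler like this typically spends most of its time waiting on the network rather than running Python code, so overlapping the downloads with a thread pool may help before a rewrite is needed. Below is a minimal sketch of that idea, not a drop-in replacement: LINK_RE, fetch, extract_links, crawl, max_depth and max_workers are all hypothetical names, the link pattern is only a stand-in for str_regex (which is not shown above), and the parse_tree step is left out.

```python
import re
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Hypothetical link pattern; a stand-in for the str_regex used in the post.
LINK_RE = re.compile(r'href="(https?://[^"]+)"')


def fetch(url, timeout=10):
    """Download one page and return it as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def extract_links(html):
    """Return every absolute URL matched by LINK_RE."""
    return LINK_RE.findall(html)


def crawl(start_url, max_depth=2, max_workers=8, fetch=fetch):
    """Breadth-first crawl that downloads each depth level in parallel."""
    seen = {start_url}       # URLs already scheduled, to avoid repeats
    pages = {}               # url -> downloaded text
    frontier = [start_url]   # URLs to download at the current depth
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for _ in range(max_depth + 1):
            if not frontier:
                break
            # Submit the whole level at once so the downloads overlap.
            futures = [(u, pool.submit(fetch, u)) for u in frontier]
            next_frontier = []
            for u, future in futures:
                try:
                    html = future.result()
                except Exception:
                    continue  # skip pages that fail to download
                pages[u] = html
                for link in extract_links(html):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return pages
```

Calling crawl('''some url here''', max_depth=1) would return a dict mapping each downloaded URL to its text, which could then be passed to the parsing step. The fetch parameter is injectable so the crawl logic can be tried against a fake in-memory site without touching the network, and max_depth bounds the crawl explicitly instead of relying on the recursion limit.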


Messages In This Thread
Web Scraping Application - by lion137 - Feb-10-2017, 09:00 AM
RE: Web Scraping Application - by snippsat - Feb-10-2017, 09:46 AM
RE: Web Scraping Application - by lion137 - Feb-10-2017, 09:58 AM
RE: Web Scraping Application - by snippsat - Feb-10-2017, 10:44 AM
RE: Web Scraping Application - by lion137 - Feb-10-2017, 11:53 AM
