Feb-10-2017, 09:00 AM
Hello everybody! I have a web crawler project; this is the main part of the code:
```python
# crawler
import urllib.request
from parse_tree import *
import re

url = '''some url here'''
s = set()
List = []

def f_go(List, s, url, iter_cnt):
    try:
        if url in s:
            return
        s.add(url)
        with urllib.request.urlopen(url) as response:
            html = response.read()
            print("here", url)
            h = html.decode("utf-8")
            lst0 = prepare_expression(list(h))
            ntr = buildNaryParseTree(lst0)
            lst1 = []
            lst2 = nary_tree_tolist(ntr, lst1)
            List.append(lst2)
            f1 = re.finditer(str_regex, h)
            l1 = []
            for tok in f1:
                ind1 = tok.span()
                l1.append(h[ind1[0]:ind1[1]])
            for exp in l1:
                f_go(List, s, exp, iter_cnt + 1)
    except:
        return
```
I'm afraid that this is going to be slow (even if I add concurrency), and I'm seriously thinking about rewriting it in Go. All it does is open URLs recursively in a loop, parse each page (the parsing will be done in a database), and save the results to a data structure. Is there a chance to improve speed here?
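For what it's worth, a crawl like this is usually dominated by network waits rather than CPU, so a thread pool tends to buy more than a language switch. Below is a minimal sketch of a breadth-first crawl using `concurrent.futures`, not your actual parser: `fetch()` and the `PAGES` table are placeholders standing in for `urllib.request.urlopen` and real pages (so the sketch runs offline), and `URL_RE` is an assumed stand-in for your `str_regex`.

```python
import re
from concurrent.futures import ThreadPoolExecutor

URL_RE = re.compile(r'href="(http[^"]+)"')  # assumed stand-in for str_regex

# Canned pages so the sketch runs without a network; in practice fetch()
# would call urllib.request.urlopen(url).read().decode("utf-8").
PAGES = {
    "http://a": '<a href="http://b"></a><a href="http://c"></a>',
    "http://b": '<a href="http://c"></a>',
    "http://c": "",
}

def fetch(url):
    return PAGES.get(url, "")

def crawl(start, max_depth=3, workers=8):
    seen = {start}          # plays the role of your set s
    frontier = [start]
    results = []            # plays the role of your List
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_depth):
            if not frontier:
                break
            # Fetch the whole frontier in parallel instead of one URL at a time.
            pages = list(pool.map(fetch, frontier))
            next_frontier = []
            for url, html in zip(frontier, pages):
                results.append((url, html))  # parsing / DB write would go here
                for link in URL_RE.findall(html):
                    if link not in seen:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return results

print(len(crawl("http://a")))  # prints 3
```

The design change is from depth-first recursion to a level-by-level frontier: each round submits every pending URL to the pool at once, so the slow I/O overlaps, while the `seen` set still guarantees each URL is fetched only once. The same shape also converts naturally to `asyncio` if you outgrow threads.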