Just finished a module that scrapes data from 460,000 HTML files and inserts that data into an SQLite3 database.
Now, this is a rather nerdy question. The module does all of this in '75073918 function calls in 104.138 seconds' according to the profiler, which is pretty fast, but I want to take it to the Bonneville Salt Flats. As the profiler output below (in python tags, for scrolling) shows, 66.227 seconds of that time is spent executing sqlite3 cursors (expected behavior).
    464100 items to go
    464000 items to go
    463900 items to go
    ...

             75073918 function calls in 104.138 seconds

       Ordered by: standard name

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.608    0.608  104.138  104.138 <string>:1(<module>)
            3    0.000    0.000    0.003    0.001 BusinessPaths.py:6(__init__)
            1    0.000    0.000  103.530  103.530 CreateCompanyDatabase.py:13(__init__)
            1   13.409   13.409   91.754   91.754 CreateCompanyDatabase.py:29(process_corporations)
            1    0.000    0.000    0.000    0.000 CreateDict.py:6(__init__)
            1    0.000    0.000    0.000    0.000 CreateTables.py:14(db_connect)
            1    0.000    0.000    1.390    1.390 CreateTables.py:21(create_tables)
            5    0.000    0.000    1.389    0.278 CreateTables.py:54(create_table)
      4947381    7.495    0.000   72.333    0.000 CreateTables.py:68(insert_data)
            1    0.000    0.000    0.001    0.001 CreateTables.py:7(__init__)
            1    0.000    0.000    0.490    0.490 CreateTables.py:87(db_close)
            5    0.000    0.000    0.000    0.000 CreateTables.py:94(db_commit)
            1    0.000    0.000    0.001    0.001 PrettifyPage.py:8(__init__)
            1    0.002    0.002    9.892    9.892 __init__.py:274(load)
            1    0.000    0.000    9.448    9.448 __init__.py:299(loads)
            1    0.000    0.000    0.000    0.000 _bootlocale.py:33(getpreferredencoding)
            1    0.000    0.000    0.000    0.000 codecs.py:260(__init__)
            1    0.000    0.000    0.000    0.000 codecs.py:309(__init__)
            1    0.000    0.000    0.148    0.148 codecs.py:319(decode)
            1    0.000    0.000    9.448    9.448 decoder.py:332(decode)
            1    9.448    9.448    9.448    9.448 decoder.py:343(raw_decode)
           60    0.000    0.000    0.000    0.000 pathlib.py:1001(_init)
            1    0.000    0.000    0.000    0.000 pathlib.py:1028(_opener)
           45    0.000    0.000    0.001    0.000 pathlib.py:1146(stat)
            1    0.000    0.000    0.000    0.000 pathlib.py:1167(open)
           45    0.000    0.000    0.003    0.000 pathlib.py:1234(mkdir)
           45    0.000    0.000    0.001    0.000 pathlib.py:1336(is_dir)
           60    0.000    0.000    0.000    0.000 pathlib.py:281(splitroot)
           60    0.000    0.000    0.000    0.000 pathlib.py:53(parse_parts)
           60    0.000    0.000    0.000    0.000 pathlib.py:624(_parse_args)
            3    0.000    0.000    0.000    0.000 pathlib.py:644(_from_parts)
           57    0.000    0.000    0.000    0.000 pathlib.py:657(_from_parsed_parts)
           47    0.000    0.000    0.000    0.000 pathlib.py:667(_format_parsed_parts)
           57    0.000    0.000    0.000    0.000 pathlib.py:678(_make_child)
           93    0.000    0.000    0.000    0.000 pathlib.py:684(__str__)
           93    0.000    0.000    0.000    0.000 pathlib.py:694(__fspath__)
           57    0.000    0.000    0.000    0.000 pathlib.py:891(__truediv__)
           57    0.000    0.000    0.000    0.000 pathlib.py:91(join_parsed_parts)
            3    0.000    0.000    0.000    0.000 pathlib.py:991(__new__)
            5    0.000    0.000    0.000    0.000 posixpath.py:154(dirname)
            5    0.000    0.000    0.000    0.000 posixpath.py:338(normpath)
            5    0.000    0.000    0.000    0.000 posixpath.py:376(abspath)
           10    0.000    0.000    0.000    0.000 posixpath.py:41(_get_sep)
            5    0.000    0.000    0.000    0.000 posixpath.py:64(isabs)
           60    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x8a8ee0}
            1    0.148    0.148    0.148    0.148 {built-in method _codecs.utf_8_decode}
            1    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
            1    0.000    0.000    0.000    0.000 {built-in method _sqlite3.connect}
           45    0.000    0.000    0.000    0.000 {built-in method _stat.S_ISDIR}
            1    0.000    0.000  104.138  104.138 {built-in method builtins.exec}
          136    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
           12    0.000    0.000    0.000    0.000 {built-in method builtins.len}
         4641    0.082    0.000    0.082    0.000 {built-in method builtins.print}
            1    0.000    0.000    0.000    0.000 {built-in method io.open}
            5    0.000    0.000    0.000    0.000 {built-in method posix.chdir}
           80    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
           45    0.001    0.000    0.001    0.000 {built-in method posix.mkdir}
            1    0.000    0.000    0.000    0.000 {built-in method posix.open}
           45    0.001    0.000    0.001    0.000 {built-in method posix.stat}
           60    0.000    0.000    0.000    0.000 {built-in method sys.intern}
     32354491    1.933    0.000    1.933    0.000 {method 'append' of 'list' objects}
            1    0.001    0.001    0.001    0.001 {method 'close' of 'sqlite3.Connection' objects}
            6    0.490    0.082    0.490    0.082 {method 'commit' of 'sqlite3.Connection' objects}
            1    0.000    0.000    0.000    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
            1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
            2    0.000    0.000    0.000    0.000 {method 'end' of 're.Match' objects}
      4947391   66.227    0.000   66.227    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
       464124    0.055    0.000    0.055    0.000 {method 'items' of 'dict' objects}
           57    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
            1    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
            2    0.000    0.000    0.000    0.000 {method 'match' of 're.Pattern' objects}
            1    0.294    0.294    0.441    0.441 {method 'read' of '_io.TextIOWrapper' objects}
     32354326    3.942    0.000    3.942    0.000 {method 'replace' of 'str' objects}
           60    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
            5    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
            5    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
            8    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
           16    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}

Looking at the profiler output, if I could reduce this bottleneck I could squeeze more time out of the process.
So, I thought pre-compiled cursors, but I guess that's not an option in sqlite3 (as it is in Oracle); the closest workaround I can see is sketched below. Failing that, a hare-brained idea: how about concurrent.futures?
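On the pre-compiled front, the nearest thing I can find in sqlite3 is executemany, which as far as I can tell prepares the INSERT once and reuses it for every parameter tuple. A minimal sketch of what I mean (the table and column names are placeholders, not my real schema from CreateTables.py):

    import sqlite3

    conn = sqlite3.connect("companies.db")  # placeholder filename
    cur = conn.cursor()

    # Rows collected while parsing the HTML files; the schema is made up
    # for the sketch.
    rows = [("Acme Corp", "123 Main St"), ("Globex", "456 Oak Ave")]

    # One executemany replaces a pile of individual cursor.execute calls:
    # the statement is compiled once and re-run for each tuple.
    cur.executemany("INSERT INTO company (name, address) VALUES (?, ?)", rows)

    conn.commit()
    conn.close()

Even batching a few thousand rows at a time would collapse most of the 4,947,391 execute calls the profiler counted.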
So, before I go off on a tangent and spend a day or two attempting this without clobbering my database, I was wondering whether anyone has successfully attempted the same.

Is this possible?
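To make the concurrent.futures idea concrete, the pattern I have in mind is roughly the following: worker processes only parse the HTML, and a single connection in the main process does all the writing, so two processes never touch the database at once. Everything here (parse_file, the directory, the schema) is a placeholder for my real code:

    import sqlite3
    from concurrent.futures import ProcessPoolExecutor, as_completed
    from pathlib import Path

    def parse_file(path):
        # Placeholder for the real scraping logic; returns the rows
        # extracted from one HTML file.
        return [("Acme Corp", "123 Main St")]

    if __name__ == "__main__":
        files = list(Path("html_files").glob("*.html"))  # placeholder directory

        conn = sqlite3.connect("companies.db")  # placeholder filename
        cur = conn.cursor()

        # Workers only parse; the main process holds the sole sqlite3
        # connection, so there is never more than one writer.
        with ProcessPoolExecutor() as pool:
            futures = [pool.submit(parse_file, f) for f in files]
            for future in as_completed(futures):
                cur.executemany(
                    "INSERT INTO company (name, address) VALUES (?, ?)",
                    future.result(),
                )

        conn.commit()
        conn.close()

Whether the parallel parsing buys anything depends on how much of the remaining ~38 seconds (104.138 total minus the 66.227 in execute) it accounts for.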