Python Forum
inserting data into sqlite3 database using concurrent futures
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
inserting data into sqlite3 database using concurrent futures
#1
Just finished a module that scrapes data from 460,000 html files and inserts that data into an Sqlite3 database.
Now this is a rather nerdy question as the module does this all in: '75073918 function calls in 104.138 seconds' from profiler which is pretty fast, but I want to take it to the Bonneville Salt Flats, as the profiler output shown below in python tags (for scrolling capability) shows
66.227 seconds of that time is spent executing sqlite3 cursors (expected behavior)
464100 items to go
464000 items to go
463900 items to go
...
         75073918 function calls in 104.138 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.608    0.608  104.138  104.138 <string>:1(<module>)
        3    0.000    0.000    0.003    0.001 BusinessPaths.py:6(__init__)
        1    0.000    0.000  103.530  103.530 CreateCompanyDatabase.py:13(__init__)
        1   13.409   13.409   91.754   91.754 CreateCompanyDatabase.py:29(process_corporations)
        1    0.000    0.000    0.000    0.000 CreateDict.py:6(__init__)
        1    0.000    0.000    0.000    0.000 CreateTables.py:14(db_connect)
        1    0.000    0.000    1.390    1.390 CreateTables.py:21(create_tables)
        5    0.000    0.000    1.389    0.278 CreateTables.py:54(create_table)
  4947381    7.495    0.000   72.333    0.000 CreateTables.py:68(insert_data)
        1    0.000    0.000    0.001    0.001 CreateTables.py:7(__init__)
        1    0.000    0.000    0.490    0.490 CreateTables.py:87(db_close)
        5    0.000    0.000    0.000    0.000 CreateTables.py:94(db_commit)
        1    0.000    0.000    0.001    0.001 PrettifyPage.py:8(__init__)
        1    0.002    0.002    9.892    9.892 __init__.py:274(load)
        1    0.000    0.000    9.448    9.448 __init__.py:299(loads)
        1    0.000    0.000    0.000    0.000 _bootlocale.py:33(getpreferredencoding)
        1    0.000    0.000    0.000    0.000 codecs.py:260(__init__)
        1    0.000    0.000    0.000    0.000 codecs.py:309(__init__)
        1    0.000    0.000    0.148    0.148 codecs.py:319(decode)
        1    0.000    0.000    9.448    9.448 decoder.py:332(decode)
        1    9.448    9.448    9.448    9.448 decoder.py:343(raw_decode)
       60    0.000    0.000    0.000    0.000 pathlib.py:1001(_init)
        1    0.000    0.000    0.000    0.000 pathlib.py:1028(_opener)
       45    0.000    0.000    0.001    0.000 pathlib.py:1146(stat)
        1    0.000    0.000    0.000    0.000 pathlib.py:1167(open)
       45    0.000    0.000    0.003    0.000 pathlib.py:1234(mkdir)
       45    0.000    0.000    0.001    0.000 pathlib.py:1336(is_dir)
       60    0.000    0.000    0.000    0.000 pathlib.py:281(splitroot)
       60    0.000    0.000    0.000    0.000 pathlib.py:53(parse_parts)
       60    0.000    0.000    0.000    0.000 pathlib.py:624(_parse_args)
        3    0.000    0.000    0.000    0.000 pathlib.py:644(_from_parts)
       57    0.000    0.000    0.000    0.000 pathlib.py:657(_from_parsed_parts)
       47    0.000    0.000    0.000    0.000 pathlib.py:667(_format_parsed_parts)
       57    0.000    0.000    0.000    0.000 pathlib.py:678(_make_child)
       93    0.000    0.000    0.000    0.000 pathlib.py:684(__str__)
       93    0.000    0.000    0.000    0.000 pathlib.py:694(__fspath__)
       57    0.000    0.000    0.000    0.000 pathlib.py:891(__truediv__)
       57    0.000    0.000    0.000    0.000 pathlib.py:91(join_parsed_parts)
        3    0.000    0.000    0.000    0.000 pathlib.py:991(__new__)
        5    0.000    0.000    0.000    0.000 posixpath.py:154(dirname)
        5    0.000    0.000    0.000    0.000 posixpath.py:338(normpath)
        5    0.000    0.000    0.000    0.000 posixpath.py:376(abspath)
       10    0.000    0.000    0.000    0.000 posixpath.py:41(_get_sep)
        5    0.000    0.000    0.000    0.000 posixpath.py:64(isabs)
       60    0.000    0.000    0.000    0.000 {built-in method __new__ of type object at 0x8a8ee0}
        1    0.148    0.148    0.148    0.148 {built-in method _codecs.utf_8_decode}
        1    0.000    0.000    0.000    0.000 {built-in method _locale.nl_langinfo}
        1    0.000    0.000    0.000    0.000 {built-in method _sqlite3.connect}
       45    0.000    0.000    0.000    0.000 {built-in method _stat.S_ISDIR}
        1    0.000    0.000  104.138  104.138 {built-in method builtins.exec}
      136    0.000    0.000    0.000    0.000 {built-in method builtins.isinstance}
       12    0.000    0.000    0.000    0.000 {built-in method builtins.len}
     4641    0.082    0.000    0.082    0.000 {built-in method builtins.print}
        1    0.000    0.000    0.000    0.000 {built-in method io.open}
        5    0.000    0.000    0.000    0.000 {built-in method posix.chdir}
       80    0.000    0.000    0.000    0.000 {built-in method posix.fspath}
       45    0.001    0.000    0.001    0.000 {built-in method posix.mkdir}
        1    0.000    0.000    0.000    0.000 {built-in method posix.open}
       45    0.001    0.000    0.001    0.000 {built-in method posix.stat}
       60    0.000    0.000    0.000    0.000 {built-in method sys.intern}
 32354491    1.933    0.000    1.933    0.000 {method 'append' of 'list' objects}
        1    0.001    0.001    0.001    0.001 {method 'close' of 'sqlite3.Connection' objects}
        6    0.490    0.082    0.490    0.082 {method 'commit' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'cursor' of 'sqlite3.Connection' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'end' of 're.Match' objects}
  4947391   66.227    0.000   66.227    0.000 {method 'execute' of 'sqlite3.Cursor' objects}
   464124    0.055    0.000    0.055    0.000 {method 'items' of 'dict' objects}
       57    0.000    0.000    0.000    0.000 {method 'join' of 'str' objects}
        1    0.000    0.000    0.000    0.000 {method 'keys' of 'dict' objects}
        2    0.000    0.000    0.000    0.000 {method 'match' of 're.Pattern' objects}
        1    0.294    0.294    0.441    0.441 {method 'read' of '_io.TextIOWrapper' objects}
 32354326    3.942    0.000    3.942    0.000 {method 'replace' of 'str' objects}
       60    0.000    0.000    0.000    0.000 {method 'reverse' of 'list' objects}
        5    0.000    0.000    0.000    0.000 {method 'rfind' of 'str' objects}
        5    0.000    0.000    0.000    0.000 {method 'rstrip' of 'str' objects}
        8    0.000    0.000    0.000    0.000 {method 'split' of 'str' objects}
       16    0.000    0.000    0.000    0.000 {method 'startswith' of 'str' objects}
looking at the profiler if I could reduce this bottleneck, I could squeeze more time out of the process.

So, I thought pre-compiled cursors, but I guess that's not an option (as in Oracle), so a hair-brained idea, how about concurrent futures.
So before I go off on a tangent spending perhaps a day or two attempting to do this without clobbering my database, I was wondering if anyone has successfully attempted same.

Is this possible??
Reply
#2
sqlite is actually threadsafe, as long as it was compiled with the default thread safety level. So you could load up what you want to insert in a queue, and have a couple worker threads/processes insert from the queue.

*looks up what concurrent futures are* Oh, that's what you're already talking about doing. Ok nevermind lol
Reply
#3
So, I'm going to do it. This is for my local April Maker Presentation. I'll post results along with the Jupiter notebook docs after the meeting. I'm curious as to how many seconds I can slice off from the 66 that It now takes.
Reply
#4
I wrote a small project which scrapes a bunch of web pages and gets some data. The first try was straight forward and took about 30 sec. Then I used some multiprocessing and reduced the time under 6 sec.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  unable to view sqlite3 database in VSC Editor KatMac 2 5,061 May-27-2021, 05:13 AM
Last Post: buran

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020