Python Forum
Multi-threading and file locking
#1
I have code that runs correctly, but I don't feel it is professional. I'd like your opinion on how to improve it.

Code:
I create a lot of threads to web-scrape some information I am interested in. There are 5000 threads to trigger in total, but I run them 50 at a time through a bounded queue:
from queue import Queue
q = Queue(maxsize=50)
Target: collect all the information and save it to a CSV file.
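
Roughly, the dispatch looks like this untested sketch (the worker body and the URLs are placeholders for the real scraping logic):
from queue import Queue
import threading

q = Queue(maxsize=50)  # at most 50 threads alive at a time

def worker(url):
    try:
        pass  # real scraping logic goes here
    finally:
        q.get()        # free a slot for the next thread
        q.task_done()

for i in range(5000):
    q.put(None)  # blocks while 50 workers are already running
    threading.Thread(target=worker, args=(f"https://example.com/page/{i}",)).start()

q.join()  # wait for the remaining workers to finish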

First project: I created a DataFrame using pandas. Many threads updated different entries of the DataFrame (no thread modified another thread's entry), and yet I noticed that some entries were not filled correctly. I learned that pandas DataFrames are not thread-safe.

Second project: I made each thread create its own CSV file (holding its part of the information), and after all the threads finish, I run code that merges all the CSV files into one. It is running perfectly now.
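
The merge step at the end is essentially this (the parts/ directory and file layout are made up for the example):
import glob
import pandas as pd

# each thread wrote its own file into parts/, all with identical columns
parts = [pd.read_csv(fn) for fn in glob.glob("parts/*.csv")]
merged = pd.concat(parts, ignore_index=True)
merged.to_csv("combined.csv", index=False)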

Third project: I didn't implement this solution, but I was thinking of keeping the threading system, creating one SQLite DB, and having each thread lock it while writing.
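
What I had in mind is roughly this untested sketch (results.db and the schema are only for illustration):
import sqlite3
import threading

# one shared connection; check_same_thread=False permits use from
# several threads as long as the lock serializes every access
conn = sqlite3.connect("results.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")
db_lock = threading.Lock()

def save_row(url, data):
    with db_lock:  # only one thread touches the connection at a time
        conn.execute("INSERT INTO results VALUES (?, ?)", (url, data))
        conn.commit()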

I was wondering what would be the most professional way to do the threading and store the results in SQLite from multiple threads.
#2
I think you just need to create one (and only one) thread that does the writing. Access to that thread from the other threads needs to be synchronized. I am a beginner in Python and others can guide you better, but you can probably use a lock to do the synchronization.
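
Something like this untested sketch; here a Queue does the synchronization instead of a bare lock, so the writer thread is the only one that ever touches the database (the table and file name are made up):
import queue
import sqlite3
import threading

rows = queue.Queue()
DONE = object()  # sentinel that tells the writer to stop

def writer():
    # the only thread that ever opens the database
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")
    for row in iter(rows.get, DONE):
        conn.execute("INSERT INTO results VALUES (?, ?)", row)
    conn.commit()
    conn.close()

t = threading.Thread(target=writer)
t.start()

# any scraper thread just does rows.put((url, data)); no lock needed
rows.put(("https://example.com/page/1", "scraped data"))

rows.put(DONE)  # shut the writer down once all scrapers are finished
t.join()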
#3
I wrote a tutorial on automatically creating a database from CSV files.
Note: the project uses an SQLite database. Other DBMSs can be used with a simple change to the model.

The project demonstrates dynamic model creation and a subsequent database load, achieving the following:

  1. Given a path to a directory that contains all of the CSV files (one for each table), do:
    • Determine the columns and sizes required for each.
    • Auto-create an SQLAlchemy model class representing the table that will be associated with the data from each CSV file.
    • Using pandas, populate the database.

To show the simplicity of the code, here's the database load module:
from FlightPaths import FlightPaths
import pandas as pd
import DbModel


class LoadDatabase:
    def __init__(self):
        self.fpath = FlightPaths()

        self.Model = DbModel

    def dispatch(self):
        self.load_all()

    def load_all(self):
        datafiles = self.fpath.ourairport_datafiles

        # collect every .csv file in the data directory
        filelist = [fn for fn in datafiles.iterdir() if fn.is_file() and fn.suffix == '.csv']
        for fn in filelist:
            # derive the table name from the file name
            tablename = fn.stem.replace('-', '_')
            df = pd.read_csv(fn)
            print(f"engine: {self.Model.engine}")
            # create (or replace) the table and load the data
            df.to_sql(tablename, con=self.Model.engine, if_exists='replace')


def main():
    lDb = LoadDatabase()
    lDb.load_all()


if __name__ == '__main__':
    main()
If you are interested, you will find the tutorial here: https://python-forum.io/thread-33843.html
#4
(Oct-15-2021, 06:06 PM)Larz60+ Wrote: I wrote a tutorial on automatically creating a database from CSV files. [...] If you are interested, you will find the tutorial here: https://python-forum.io/thread-33843.html

Thanks a lot. I was doing this work manually. This helps.

What about the threading system?
#5
I'm not an expert on threading; I have used multiprocessing, see: https://docs.python.org/3/library/multiprocessing.html.
Multiprocessing is similar to threading, except that it sidesteps the GIL, which can be a problem when threading.
Sorry I cannot give you in-depth details, as I haven't used either in quite some time.
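
For what it's worth, the Pool interface looks like this (scrape_one is just a stand-in):
from multiprocessing import Pool

def scrape_one(url):
    # stand-in for real work; each call runs in a separate process with its own GIL
    return len(url)

if __name__ == '__main__':
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    with Pool(processes=8) as pool:  # 8 worker processes
        results = pool.map(scrape_one, urls)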

I think locking the whole database can get you into trouble. However, if you switch to PostgreSQL, you can use row-level locking, which locks only the row you are using, and in most cases that is quite safe.
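
Untested sketch of what row-level locking looks like with psycopg2 (the table and connection string are made up):
import psycopg2

conn = psycopg2.connect("dbname=scraper user=scraper")

with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # FOR UPDATE locks just this row until the transaction ends
        cur.execute("SELECT data FROM results WHERE url = %s FOR UPDATE",
                    ("https://example.com/page/1",))
        cur.execute("UPDATE results SET data = %s WHERE url = %s",
                    ("new data", "https://example.com/page/1"))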

