Python Forum
Multi-threading and file locking
#1
I have code that runs correctly, but I don't feel it is professional. I'd like your opinion on how to improve it.

Code:
I create a lot of threads to web-scrape some information I am interested in. There are 5000 threads to trigger in total, but I run them 50 at a time through a bounded queue:
from queue import Queue
q = Queue(maxsize=50)
Target: collect all the information and save it to a CSV file.
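
Roughly, the dispatch looks like this untested sketch (the worker body and the URLs are placeholders for the real scraping logic):
from queue import Queue
import threading

q = Queue(maxsize=50)  # at most 50 threads alive at a time

def worker(url):
    try:
        pass  # real scraping logic goes here
    finally:
        q.get()        # free a slot for the next thread
        q.task_done()

for i in range(5000):
    q.put(None)  # blocks while 50 workers are already running
    threading.Thread(target=worker, args=(f"https://example.com/page/{i}",)).start()

q.join()  # wait for the remaining workers to finish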

First project: I created a DataFrame using pandas. Many threads updated different entries of the DataFrame (no thread modified another thread's entry), and yet I noticed that some entries were not filled correctly. I learned that pandas DataFrames are not thread-safe.

Second project: I made each thread create its own CSV file (holding its part of the information), and after all the threads finish, I run code that merges all the CSV files into one. It is running perfectly now.
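
The merge step at the end is essentially this (the parts/ directory and file layout are made up for the example):
import glob
import pandas as pd

# each thread wrote its own file into parts/, all with identical columns
parts = [pd.read_csv(fn) for fn in glob.glob("parts/*.csv")]
merged = pd.concat(parts, ignore_index=True)
merged.to_csv("combined.csv", index=False)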

Third project: I didn't implement this solution, but I was thinking of keeping the threading system, creating one SQLite DB, and having each thread lock it while writing.
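
What I had in mind is roughly this untested sketch (results.db and the schema are only for illustration):
import sqlite3
import threading

# one shared connection; check_same_thread=False permits use from
# several threads as long as the lock serializes every access
conn = sqlite3.connect("results.db", check_same_thread=False)
conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")
db_lock = threading.Lock()

def save_row(url, data):
    with db_lock:  # only one thread touches the connection at a time
        conn.execute("INSERT INTO results VALUES (?, ?)", (url, data))
        conn.commit()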

I was wondering what would be the most professional way to do the threading and store the results in SQLite from multiple threads.
#2
I think you just need to create one (and only one) thread that does the writing. Access to that thread from the other threads needs to be synchronized. I am a beginner in Python and others can guide you better, but you can probably use a lock to do the synchronization.
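
Something like this untested sketch; here a Queue does the synchronization instead of a bare lock, so the writer thread is the only one that ever touches the database (the table and file name are made up):
import queue
import sqlite3
import threading

rows = queue.Queue()
DONE = object()  # sentinel that tells the writer to stop

def writer():
    # the only thread that ever opens the database
    conn = sqlite3.connect("results.db")
    conn.execute("CREATE TABLE IF NOT EXISTS results (url TEXT, data TEXT)")
    for row in iter(rows.get, DONE):
        conn.execute("INSERT INTO results VALUES (?, ?)", row)
    conn.commit()
    conn.close()

t = threading.Thread(target=writer)
t.start()

# any scraper thread just does rows.put((url, data)); no lock needed
rows.put(("https://example.com/page/1", "scraped data"))

rows.put(DONE)  # shut the writer down once all scrapers are finished
t.join()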
#3
I wrote a tutorial on automatically creating a database from CSV files.
Note: the project uses an SQLite database. Other DBMSs can be used with a simple change to the model.

The project demonstrates dynamic model creation and a subsequent database load, achieving the following:

  1. Given a path to a directory that contains all of the CSV files (one for each table), do:
    • Determine the columns and sizes required for each.
    • Auto-create an SQLAlchemy model class representing the table that will be associated with the data from each CSV file.
    • Using pandas, populate the database.

To show the simplicity of the code, here's the database load module:
from FlightPaths import FlightPaths
import pandas as pd
import DbModel


class LoadDatabase:
    def __init__(self):
        self.fpath = FlightPaths()

        self.Model = DbModel

    def dispatch(self):
        self.load_all()

    def load_all(self):
        datafiles = self.fpath.ourairport_datafiles

        # collect every .csv file in the data directory
        filelist = [fn for fn in datafiles.iterdir() if fn.is_file() and fn.suffix == '.csv']
        for fn in filelist:
            # derive the table name from the file name
            tablename = fn.stem.replace('-', '_')
            df = pd.read_csv(fn)
            print(f"engine: {self.Model.engine}")
            # create (or replace) the table and load the data
            df.to_sql(tablename, con=self.Model.engine, if_exists='replace')


def main():
    lDb = LoadDatabase()
    lDb.load_all()


if __name__ == '__main__':
    main()
If you are interested, you will find the tutorial here: https://python-forum.io/thread-33843.html
#4
(Oct-15-2021, 06:06 PM)Larz60+ Wrote: I wrote a tutorial on automatically creating a database from CSV files. [...] If you are interested, you will find the tutorial here: https://python-forum.io/thread-33843.html

Thanks a lot. I was doing this work manually. This helps.

What about the threading system?
#5
I'm not an expert on threading; I have used multiprocessing, see: https://docs.python.org/3/library/multiprocessing.html.
Multiprocessing is similar to threading, except that it sidesteps the GIL, which can be a problem when threading.
Sorry I cannot give you in-depth details, as I haven't used either in quite some time.
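
For what it's worth, the Pool interface looks like this (scrape_one is just a stand-in):
from multiprocessing import Pool

def scrape_one(url):
    # stand-in for real work; each call runs in a separate process with its own GIL
    return len(url)

if __name__ == '__main__':
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    with Pool(processes=8) as pool:  # 8 worker processes
        results = pool.map(scrape_one, urls)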

I think locking the whole database can get you into trouble. However, if you switch to PostgreSQL, you can use row-level locking, which locks only the row you are using, and in most cases that is quite safe.
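
Untested sketch of what row-level locking looks like with psycopg2 (the table and connection string are made up):
import psycopg2

conn = psycopg2.connect("dbname=scraper user=scraper")

with conn:  # commits on success, rolls back on error
    with conn.cursor() as cur:
        # FOR UPDATE locks just this row until the transaction ends
        cur.execute("SELECT data FROM results WHERE url = %s FOR UPDATE",
                    ("https://example.com/page/1",))
        cur.execute("UPDATE results SET data = %s WHERE url = %s",
                    ("new data", "https://example.com/page/1"))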

