Python Forum
advice needed
#1
Hi,
I could use some advice! Prayer cards are an endless source of challenges.
In our centre, over the years, zillions of documents have been filed on a server
according to a certain (dir/subdir/etc.) system that made sense at the time.

Today, new OCR techniques have led to new insights and a different approach.
I have an SQL database with the OCRed content
of a certain percentage of the documents on the server, and that percentage is growing.

To make it simple:
the tables in the database have a field for the document filename, but not the server path to it.
(Of course I have a temporary solution for current users, but that is not sustainable, given the volumes involved.)

My plan:
I can write a program (in the style of "for root, dirs, files in os.walk(...)") that inserts the path to every document into an indexed table.
That table is then used as a GPS to find the doc and show it to the user.
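Roughly what I have in mind, as a minimal sketch (database, table and column names are placeholders, and the share path is made up):

import os
import sqlite3

conn = sqlite3.connect("doc_index.db")
conn.execute("CREATE TABLE IF NOT EXISTS doc_paths (filename TEXT, full_path TEXT)")

# Walk the whole tree once and remember where every file lives.
rows = []
for root, dirs, files in os.walk(r"\\server\documents"):
    for name in files:
        rows.append((name, os.path.join(root, name)))

conn.executemany("INSERT INTO doc_paths (filename, full_path) VALUES (?, ?)", rows)
conn.execute("CREATE INDEX IF NOT EXISTS idx_filename ON doc_paths (filename)")
conn.commit()
conn.close()

Looking a document up would then just be a SELECT on the filename column.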
The question: is this a good plan, or does Python offer other (better) possibilities?
I don't need a document management system: I don't care about fields like "author, size, issue date, number of pages, etc."
I just want to know where a document is on the server.
Any insights you'd like to share?
Thanks
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#2
You could perhaps look for file indexers in Python.
#3
(Oct-03-2023, 09:56 AM)Gribouillis Wrote: file indexers in Python.
Thanks, I had a look. Things like pyFileIndex, PyLucene and others.
What I found was that they do more or less what I plan to do,
sometimes with extra bells and whistles, but there is no magic.
So I'm on the right track, but I can't create a directory listing every time;
I need a permanent, indexed table.
I've looked at "fixed length" text files, doesn't really help.
The only thing stopping me from doing it in SQL is:
- traversing the file system with os.walk is lightning fast,
- but inserting just 50,000 records into an indexed SQL table takes forever.
There would be multiple tables, each with a few million records at least.
Guess I'll do some more tests.
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#4
If you only want to locate the files in the file system, you could use tools such as the plocate and updatedb.plocate commands on Linux. The updatedb command updates a database containing the file information, and the plocate command finds files in that database.

The drawback is that you'd need to call the external command plocate in a subprocess to query the database, unless you find a way to read the database directly from Python. I know that some Python modules were able to read mlocate databases in the past, but the more recent plocate databases may have a different structure. Also, it may not be a better idea than using plocate directly.
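For example, something along these lines (untested; it assumes plocate is installed and updatedb has been run recently):

import subprocess

def locate(pattern):
    # Ask plocate for every indexed path that contains the pattern.
    result = subprocess.run(["plocate", pattern],
                            capture_output=True, text=True, check=False)
    return result.stdout.splitlines()

# e.g. locate("card_0001.tif") -> list of matching full paths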
#5
(Oct-03-2023, 02:31 PM)Gribouillis Wrote: unless you find a way to read the database directly from Python.
Good idea, worth trying.
I am currently investigating whether pywin32 has a command that will do the trick.
I have to resolve a few problems with "import pywin32" first.
thx,
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
#6
I have tried different things, but these "popen or plocate" equivalents seem very
complicated and tricky. And how this will behave on different user machines remains to be seen.

In the meantime I picked up some tips and tricks on how to speed up SQLite inserts using PRAGMA settings
and batch inserts. Inserting 53,000 indexed records no longer takes forever: ten minutes at most, i.e. about 300,000 per hour.
I guess I'll live with that.
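For reference, this is roughly the kind of setup I ended up testing (the PRAGMA values and names are just my experiment, not a recommendation):

import sqlite3

rows = [("card_0001.tif", r"\\server\docs\a\card_0001.tif"),   # sample data
        ("card_0002.tif", r"\\server\docs\b\card_0002.tif")]

conn = sqlite3.connect("doc_index.db")
conn.execute("CREATE TABLE IF NOT EXISTS doc_paths (filename TEXT, full_path TEXT)")
conn.execute("PRAGMA journal_mode = WAL")    # fewer syncs than the default rollback journal
conn.execute("PRAGMA synchronous = NORMAL")  # faster than FULL, still reasonably safe
with conn:                                   # one transaction around the whole batch
    conn.executemany("INSERT INTO doc_paths (filename, full_path) VALUES (?, ?)", rows)
conn.execute("CREATE INDEX IF NOT EXISTS idx_filename ON doc_paths (filename)")
conn.commit()
conn.close()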
thx,
Paul
Edit: there is one strange parameter: 'PRAGMA synchronous = 0'.
If you switch syncing off, inserting 53,000 records is almost instantaneous, like writing a .txt file.
The caveat is that any problem with the program, the OS or the environment while it is running
can cause database corruption. Not a good idea... except...
If it takes only a few seconds for zillions of records, I can create a new database every time.
And if it gets corrupted, no problem, we'll just do it again.
Time is money, even when dealing with prayer cards.
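So the rebuild could look something like this (a sketch only; file and table names are placeholders):

import os
import sqlite3

DB = "doc_index.db"
if os.path.exists(DB):
    os.remove(DB)                        # throw away the old index, we rebuild it from scratch
conn = sqlite3.connect(DB)
conn.execute("PRAGMA synchronous = 0")   # no syncing to disk: very fast, corruption is acceptable here
conn.execute("CREATE TABLE doc_paths (filename TEXT, full_path TEXT)")
# ... executemany() the (filename, full_path) rows as above ...
conn.commit()
conn.close()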
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.