Python Forum
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
SQL FTS 5
#1
Hi,
After OCR, I am left with texts that contain data.
I analyse these texts with python and put the relevant data into records that are inserted into an SQL db. (classic style)
Only very recently I have discovered there is such a thing as FTS 5, which I implemented immediately.
It is indeed lightning fast.
I read that i can make my queries more sophisticated with FTS 5, but the rare and incomplete docs I find,
give me more questions than answers. Hence my question:
Where can I find a complete and definitive overview of FTS 5 syntax, compatible with python of course.
e.g. In one place they talk about 'phrases' and 'tokens', in other places i cannot find confirmation of this type of syntax.
e.g. Somewhere they talk about wildcarding, but they do not drive the nail home, causing more "how to " questions.
thx,
Paul
Gribouillis likes this post
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#2
Nice information. I didn't know about FTS5. This page contains the query syntax in BNF form (it is also the first result in my search engine).
Reply
#3
Yes, thanks.
I have seen that, but I was hoping for something like 'FTS5 for semi-dummies". Cool
Things like:
"FTS queries are made up of phrases. A phrase is an ordered list of one or more tokens. A string is transformed into a phrase by passing it to the FTS table tokenizer." ...
Leave me with more questions:
- What is a token,
- Why an "ordered list", what does that mean,
- What is a tokenizer ...

I'll try to make some sense of this, especially the wildcarding.
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#4
There is a dummy video on:
https://mycelial.com/blog/sqlite-full-text-search/

But no structured manual Angry
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#5
I think the difficult part is that of tokenizers, because apparently FTS5 allows using custom tokenizers.
Reply
#6
sqlite documentation has verbose writeup of the extension here: https://www.sqlite.org/fts5.html
I don't know if this will be of value or not, you decide.
Reply
#7
Hi,
I'm piecing together what I need to use this to the full, because it is exactly what I need for my data.
My experience so far:
1) Creating db, table, fields in table: no problem actually easier than the classic method.
2) Inserting data, no problem, just as slow as always (100.000 records / hour )
3) Search for words 'XXX' and 'YYY':
sql = f'SELECT data FROM myTable WHERE datafield MATCH "{XXX} {YYY}"' => OK, but:
- Are XXX, YYY what they call tokens ?
- How many can I have with this syntax ? 2: I tried already, 3 ....more ?
- I understand, syntaxed like this, that this means, with a blank in between, XXX AND YYY.
- It will find words like ZXXXZ and KYYYX
- What is a tokenizer for then ?
Just for starters ...
Paul Wink
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#8
(May-17-2023, 05:25 PM)DPaul Wrote: Some painstaking tests reveal that I must correct:
- It will find words like ZXXXZ and KYYYX
Correction:
It will NOT find words like ZXXXZ and KYYYX
Gets a bit confusing. Confused
So MATCH does not behave like python "IN <string>" statement.

"ZXXX*" and "KYY*" will be found. But not e.g. "*XXXZ".
More tests needed!
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#9
Hi,
Providing better documentation, or a better insight into everything below, I may have figured this out for my needs.
The difficulty is in the queries.
sql = f'SELECT filename FROM myTable WHERE datafield MATCH "{search1} {search2} {search3}"'
This works, even with 3 AND searchwords , although I find the wording in the SQLite FTS5 Extension confusing.
e.g. paragr. 3.3
"the first two queries in the following block will match any document that contains the token "one" immediately followed by the token "two"

What is "immediately followed" ? I have matches where search 1,2 and 3 are scattered in the text.

This will also work with wildcarding but only with the start of the word as prefix.
e.g. "Johns*" -> will find Johnson, but "*son" is not allowed with MATCH.

If you want to do wildcarding as suffix, you need to declare the table at the outset with the "trigram" tokenizer. Like so:

create virtual table if not exists myTable using fts5 (filename,data,register, tokenize='trigram')
Then you can search for LIKE "%son%" , but only one token allowed in the select statement, not 2 or 3. Edit: Unless you use the AND operator, but that is unclear to me.

I found that the MATCH syntax also works on tables initiated as trigram.

There maybe other goodies hidden in this FTS5 extension, but if I add more
possibilities, users may get very confused. :-)
Paul
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#10
Searching for documentation about this I came across the duckdb database, which also implements a Full Text Search capability. I've never heard of this database. It may be an interesting alternative to sqlite.
Reply


Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020