fuzzywuzzy search string in text file

fuzzywuzzy search string in text file - marfer - Aug-02-2021

I have several text files in a folder and I would like to find a file by searching for a specific string. I started by using regular expressions:

import os, re
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f): 
            if re.findall(r'john', line):
                print(file)
    f.close()

This code works. However, some of my text files contain typos (for example, "johnn" instead of "john"), so I thought that fuzzywuzzy would be a better approach.

#pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extract("john", line):
                print(file)
    f.close()

This code only prints the files in the folder, and not even in order of match quality. Could anyone explain what the problem is in my code?

RE: fuzzywuzzy search string in text file - deanhystad - Aug-02-2021

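process.extract() returns a list of (match, score) tuples ordered by score. Its second argument is meant to be a collection of choices, not a single line of text. Because the returned list is not empty, the if test passes for every line. You can see what it returns with a small modification to your script.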

from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            # Print the full return value instead of testing it.
            print(process.extract("john", line))

I think fuzz.ratio() is a better choice for what you are doing. Or, if you want the best match, use process with a list of all the files in the folder, and then use the score to determine whether the best match is good enough.

RE: fuzzywuzzy search string in text file - marfer - Aug-02-2021

deanhystad, thank you for your comment, but I still was not able to understand how to obtain the best match. I also tried:

process.extractOne("john", line, scorer=fuzz.token_set_ratio)

but Python only prints the list of files in the folder.

RE: fuzzywuzzy search string in text file - deanhystad - Aug-02-2021

I was confused myself. I thought you were looking for matching file names, not files that contain matching text. That is a much tougher problem.

This is not going to give you good results:

process.extractOne("john", line, scorer=fuzz.token_set_ratio)

The second argument is supposed to be a list of choices, not a line of text. For extractOne('john', ['John', 'Joan', 'Jean', 'Jon']), fuzzywuzzy would calculate the score for each word and return a tuple containing the best match and its score. To do this for an entire file you'll need to split the file into a list of words. If the files aren't very big you could do something like this:

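import os
from fuzzywuzzy import fuzz, process

best_match = (None, 0, None)  # (word, score, file) of the best match so far.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())  # Unique whitespace-separated words.
        match = process.extractOne("john", words, scorer=fuzz.token_set_ratio)
        # extractOne returns None when there are no choices (empty file).
        if match and match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)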

This would find the best matching file. If you can have multiple matching files you can forget about "best_match" and just report files with a match score better than some threshold value.
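
The threshold variant could be sketched roughly like this. It is only an illustration: score(), files_matching() and the cutoff of 80 are made-up names and values, and difflib.SequenceMatcher from the standard library stands in for fuzzywuzzy's scorer so the snippet runs without installing anything.

```python
import os
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def files_matching(folder, query, threshold=80):
    """Return (file, word, score) for each file whose best word reaches threshold."""
    hits = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if not os.path.isfile(path):
            continue  # Skip sub-folders.
        with open(path, encoding="utf-8", errors="ignore") as f:
            words = set(f.read().split())
        if not words:
            continue  # Empty file: nothing to compare.
        best = max(words, key=lambda w: score(query, w))
        best_score = score(query, best)
        if best_score >= threshold:
            hits.append((name, best, best_score))
    # Best-scoring files first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Calling files_matching('C:/Users/mydirectory', 'john') would list every file containing a word close enough to "john". To use fuzzywuzzy itself, the body of score() could be replaced with a call to fuzz.ratio(a, b); the right cutoff depends on your data.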

If read().split() is not doing a good enough job tokenizing the file, you could investigate using the natural language toolkit (NLTK).

I used "set()" so fuzzywuzzy doesn't have to check duplicate words. This is going to be slow enough already.

When using "with open" the file automatically closes when you leave the context. You don't have to call f.close().

RE: fuzzywuzzy search string in text file - marfer - Aug-02-2021

I was doing

for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extractOne("john", line, scorer=fuzz.token_set_ratio):
                print(file)
    f.close()

RE: fuzzywuzzy search string in text file - deanhystad - Aug-02-2021

Sorry, I was editing my response. See above.

You cannot use the process.extractOne() return value as a True/False test. extractOne returns a tuple (match, score), and the only tuple that evaluates to False is the empty tuple (). Even if every choice results in a matching score of 0, extractOne will return a tuple containing the best match and the score.

Proof that only an empty tuple is False.

def evaluate(thing):
    if thing:
        print(f'{thing} is True')
    else:
        print(f'{thing} is False')

evaluate((False, False))
evaluate((False,))
evaluate(tuple())

Output:
(False, False) is True
(False,) is True
() is False

RE: fuzzywuzzy search string in text file - marfer - Aug-02-2021

Thank you, deanhystad, for your code and for explaining why my code was not working. Your code works well, and from it I also understood how to report the files with a match score better than some threshold value.
I have just one last question. What if I want to search for more than one string, to improve my matching? I tried

keywords=["john", "mary"] #I know that there is one text file talking about john and mary
best_match = (None, 0, None)  # Remember the best match.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())
        match = process.extractOne(keywords, words, scorer=fuzz.token_set_ratio)
        if match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)

but I obtained the error "expected string or bytes-like object".

RE: fuzzywuzzy search string in text file - deanhystad - Aug-02-2021

There is not a lot of documentation online, but you can go to the project page and look at the source.

Quote:
def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):
    """Find the single best match above a score in a list of choices.

    This is a convenience method which returns the single best choice.
    See extract() for the full arguments list.

    Args:
        query: A string to match against
        choices: A list or dictionary of choices, suitable for use with
            extract().
        processor: Optional function for transforming choices before matching.
            See extract().
        scorer: Scoring function for extract().
        score_cutoff: Optional argument for score threshold. If the best
            match is found, but it is not greater than this number, then
            return None anyway ("not a good enough match"). Defaults to 0.

    Returns:
        A tuple containing a single match and its score, if a match
        was found that was above score_cutoff. Otherwise, returns None.
    """
    best_list = extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
    try:
        return max(best_list, key=lambda i: i[1])
    except ValueError:
        return None

According to this, "query" (the word you are looking for) is a string. If you want to look for one of many words I think you need to repeat the extractOne() call once for each query word. You'll have to come up with a "matching calculus" to decide which is the best fit. I'd start with computing the average match score of the query words.
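
One possible shape for that averaging idea, as a rough sketch only: average_keyword_score() and score() are hypothetical helpers, the two sample word sets are invented, and difflib.SequenceMatcher from the standard library stands in for a fuzzywuzzy scorer so the snippet runs on its own.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def average_keyword_score(keywords, words):
    """Average, over the keywords, of each keyword's best score among words."""
    if not keywords or not words:
        return 0.0
    # One "best match" lookup per keyword, like one extractOne() call each.
    best_scores = [max(score(k, w) for w in words) for k in keywords]
    return sum(best_scores) / len(best_scores)


# Invented sample data: the first text mentions both names (with typos),
# the second mentions only one of them.
doc_a = set("Johnn met Marry at the station".split())
doc_b = set("John went home alone".split())
print(average_keyword_score(["john", "mary"], doc_a))
print(average_keyword_score(["john", "mary"], doc_b))
```

With an average, a file that mentions both names with small typos ranks above a file that has one exact name and nothing resembling the other. Taking the minimum instead of the average would be stricter, since every keyword would then have to match well.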

RE: fuzzywuzzy search string in text file - marfer - Aug-02-2021

And if I search for "john smith", it only searches for "john". Thank you for your comments and support. I think I will be able to obtain what I want using regular expressions and your code for fuzzywuzzy. Thank you!

RE: fuzzywuzzy search string in text file - deanhystad - Aug-03-2021

You cannot find "john smith" because that is two words and the "tokenizer" code splits up the text file into individual words. I would try the various ratio functions and see how well "john smith" matches the entire text file content treated as one long string. Take a look at this datacamp article:

https://www.datacamp.com/community/tutorials/fuzzy-string-python

If your files are really short I think token_set_ratio looks promising.
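
A different way to handle multi-word queries, not taken from the suggestions above, is to compare the phrase against every run of consecutive words of the same length. The sketch below assumes that approach; best_phrase_match() and score() are made-up names, and difflib.SequenceMatcher from the standard library stands in for a fuzzywuzzy ratio function so nothing needs to be installed.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Similarity on a 0-100 scale; a stdlib stand-in for fuzz.ratio."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_phrase_match(phrase, text):
    """Return (window, score) for the run of words in text closest to phrase."""
    size = len(phrase.split())
    words = text.split()
    if size == 0 or len(words) < size:
        return (None, 0)  # Nothing to compare against.
    # Every run of `size` consecutive words, joined back into a string.
    windows = (" ".join(words[i:i + size]) for i in range(len(words) - size + 1))
    return max(((w, score(phrase, w)) for w in windows), key=lambda pair: pair[1])


window, value = best_phrase_match("john smith", "Yesterday Johnn Smiht called the office")
print(window, value)
```

This keeps the two words together, so "john smith" is scored against "Johnn Smiht" as a unit rather than against single tokens. It costs one comparison per word position, which should be acceptable for short files.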