fuzzywuzzy search string in text file

marfer · Aug-02-2021, 01:50 PM

I have several text files in a folder and I would like to find a file by searching for a specific string. I started by using regular expressions:

import os, re
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f): 
            if re.findall(r'john', line):
                print(file)
    f.close()

This code works. However, some of my text files contain typos (for example, "johnn" instead of "john"), so I thought fuzzywuzzy would be a better approach.

#pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extract("john", line):
                print(file)
    f.close()

This code just prints every file in the folder, and not even in order of match quality. Could anyone explain what the problem is with my code?

**deanhystad** · Aug-02-2021, 02:35 PM

extract is going to return a list of matches ordered by score. If line contains only 1 string, process will return that string along with the score. You can see this with a small mod to your script.

from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            print(process.extract("john", line))
    f.close()

I think fuzz.ratio is a better choice for what you are doing. Or, if you want the best match, use process with a list of all the files in the folder, and then use the score to decide whether the best match is good enough.
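
As a rough sketch of the ratio idea: score each word on a line against the query and accept the line if any word clears a threshold. The `similarity` helper below is a stand-in built on the standard library's difflib so the snippet runs without extra installs; `fuzz.ratio` could be used in its place (note that it does not lowercase its inputs). The name `line_mentions` and the threshold of 80 are assumptions for illustration, not values from this thread.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive 0-100 similarity score (stand-in for fuzz.ratio)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def line_mentions(line, query, threshold=80):
    """True if any whitespace-separated word on the line scores >= threshold."""
    return any(similarity(word, query) >= threshold for word in line.split())


print(line_mentions("my friend johnn called", "john"))  # True ("johnn" scores 89)
print(line_mentions("nothing relevant here", "john"))   # False
```

The threshold is the knob to tune: too low and unrelated short words start matching, too high and real typos are missed.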

marfer · Aug-02-2021, 04:04 PM

deanhystad, thank you for your comment, but I still was not able to understand how to obtain the best match. I also tried:

process.extractOne("john", line, scorer=fuzz.token_set_ratio)

but Python still only prints the list of files in the folder.

**deanhystad** · (This post was last modified: Aug-02-2021, 05:56 PM by deanhystad.)

I was confused myself. I thought you were looking for matching file names, not files that contain matching text. That is a much tougher problem.

This is not going to give you good results:

process.extractOne("john", line, scorer=fuzz.token_set_ratio)
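
One way to see why: a Python str iterates character by character, so any function that loops over its choices sees single characters when it is handed a whole line, which is presumably what happens here. The sketch below uses a hypothetical `best_choice` helper (built on difflib, not fuzzywuzzy's actual implementation) to show the difference between passing a list of words and passing the raw line.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_choice(query, choices):
    """Hypothetical stand-in for extractOne: the best (choice, score) pair."""
    scored = ((choice, similarity(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


line = "Joan met Jon and John"
print(best_choice("john", line.split()))  # ('John', 100): choices are words
print(best_choice("john", line))          # ('J', 40): choices are single characters
```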

line is supposed to be a list of choices. For extractOne('john', ['John', 'Joan', 'Jean', 'Jon']), fuzzywuzzy would calculate the score for each word and return a tuple containing the best match and its score. To do this for an entire file you'll need to split the file into a list of words. If the files aren't very big you could do something like this:

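import os
from fuzzywuzzy import fuzz, process

best_match = (None, 0, None)  # (word, score, file) of the best match so far.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())  # Unique whitespace-separated tokens.
        match = process.extractOne("john", words, scorer=fuzz.token_set_ratio)
        # extractOne returns None when there are no choices (an empty file).
        if match is not None and match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)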

This would find the best matching file. If you can have multiple matching files you can forget about "best_match" and just report files with a match score better than some threshold value.
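
The threshold variant could look roughly like the sketch below. It uses a difflib-based `similarity` helper as a stand-in for a fuzzywuzzy scorer so it runs on the standard library alone; the function name `files_mentioning`, the default threshold of 80, and the UTF-8 encoding are all assumptions for illustration.

```python
import os
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def files_mentioning(directory, query, threshold=80):
    """Return (filename, best_word, score) for files whose best word reaches threshold."""
    hits = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # Skip subdirectories.
        with open(path, encoding="utf-8", errors="ignore") as f:
            words = set(f.read().split())
        if not words:
            continue  # Empty file: nothing to score.
        score, best = max((similarity(query, word), word) for word in words)
        if score >= threshold:
            hits.append((name, best, score))
    # Highest-scoring files first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)
```

Calling `files_mentioning('C:/Users/mydirectory', 'john')` would then return every file with a close-enough word, best matches first, instead of a single winner.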

If read().split() is not doing a good enough job tokenizing the file, you could investigate using the natural language toolkit (NLTK).
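
Before reaching for NLTK, a regular expression may already be enough. The sketch below is not NLTK, just a lighter stand-in: it lowercases the text and keeps runs of letters and digits (with internal apostrophes), so punctuation stuck to a word no longer hurts the score. The `tokenize` name and the pattern are assumptions for illustration.

```python
import re


def tokenize(text):
    """Return the set of lowercase words, with surrounding punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower()))


sample = 'Then "John," said Mary, went to John\'s house.'
print(sample.split())            # plain split keeps tokens such as '"John,"' and 'house.'
print(sorted(tokenize(sample)))  # punctuation removed, everything lowercased
```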

I used "set()" so fuzzywuzzy doesn't have to check duplicate words. This is going to be slow enough already.

When using "with open" the file automatically closes when you leave the context. You don't have to call f.close().
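
A quick way to convince yourself, using a throwaway file in a temporary folder (the file name is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "note.txt")
    with open(path, "w") as f:
        f.write("john")
        print(f.closed)  # False: still inside the with block
    print(f.closed)      # True: closed automatically on leaving the block
```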

marfer · Aug-02-2021, 04:50 PM

I was doing

for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extractOne("john", line, scorer=fuzz.token_set_ratio):
                print(file)
    f.close()

**deanhystad** · (This post was last modified: Aug-02-2021, 05:05 PM by deanhystad.)

Sorry, I was editing my response. See above.

You cannot use the process.extractOne() return value as a True/False test. extractOne returns a tuple (match, score), and the only tuple that evaluates to False is an empty tuple, (). Even if every choice results in a matching score of 0, extractOne will return a tuple containing the best match and the score.

Proof that only an empty tuple is False.

def evaluate(thing):
    if thing:
        print(f'{thing} is True')
    else:
        print(f'{thing} is False')

evaluate((False, False))
evaluate((False,))
evaluate(tuple())

Output:
(False, False) is True
(False,) is True
() is False
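
So the test has to look at the score rather than at the tuple itself. A minimal sketch, where `is_good_match` and the threshold of 80 are made-up names and values for illustration, and `match` is either None or a (choice, score) tuple like the one extractOne returns:

```python
def is_good_match(match, threshold=80):
    """True only if there is a match and its score reaches the threshold."""
    return match is not None and match[1] >= threshold


print(is_good_match(("johnn", 89)))  # True
print(is_good_match(("zebra", 0)))   # False, even though bool(("zebra", 0)) is True
print(is_good_match(None))           # False
```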

marfer · Aug-02-2021, 06:21 PM

Thank you, deanhystad, for your code and for explaining why my code was not working. Your code works well, and from it I also understood how to report the files with a match score better than some threshold value.
I have just one last question: what if I want to search for more than one string, to improve my matching? I tried

keywords=["john", "mary"] #I know that there is one text file talking about john and mary
best_match = (None, 0, None)  # Remember the best match.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())
        match = process.extractOne(keywords, words, scorer=fuzz.token_set_ratio)
        if match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)

but I obtained the error "expected string or bytes-like object".

**deanhystad** · (This post was last modified: Aug-02-2021, 07:11 PM by deanhystad.)

There is not a lot of documentation online, but you can go to the project page and look at the source.

Quote:
def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):
    """Find the single best match above a score in a list of choices.
    This is a convenience method which returns the single best choice.
    See extract() for the full arguments list.
    Args:
        query: A string to match against
        choices: A list or dictionary of choices, suitable for use with
            extract().
        processor: Optional function for transforming choices before matching.
            See extract().
        scorer: Scoring function for extract().
        score_cutoff: Optional argument for score threshold. If the best
            match is found, but it is not greater than this number, then
            return None anyway ("not a good enough match"). Defaults to 0.
    Returns:
        A tuple containing a single match and its score, if a match
        was found that was above score_cutoff. Otherwise, returns None.
    """
    best_list = extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
    try:
        return max(best_list, key=lambda i: i[1])
    except ValueError:
        return None

According to this, "query" (the word you are looking for) is a string. If you want to look for one of many words I think you need to repeat the extractOne() call once for each query word. You'll have to come up with a "matching calculus" to decide which is the best fit. I'd start with computing the average match score of the query words.
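
The averaging idea could be sketched like this: find the best word score for each keyword separately, then take the mean. The `similarity` helper is a difflib-based stand-in for a fuzzywuzzy scorer, and `average_keyword_score` is a hypothetical name; a per-keyword `process.extractOne(keyword, words)` call would play the same role as the inner `max`.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def average_keyword_score(keywords, words):
    """Mean of each keyword's best score among words; 0.0 if either is empty."""
    if not keywords or not words:
        return 0.0
    best_scores = [max(similarity(keyword, word) for word in words)
                   for keyword in keywords]
    return sum(best_scores) / len(best_scores)


both = average_keyword_score(["john", "mary"], {"johnn", "and", "marry"})
one = average_keyword_score(["john", "mary"], {"johnn", "went", "home"})
print(both, one)  # the file mentioning both names scores higher
```

An average rewards files that mention every keyword; taking the minimum instead would be stricter and require all keywords to match well.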

marfer · Aug-02-2021, 09:22 PM

And if I search for "john smith", it only searches for "john". Thank you for your comments and support. I think I will be able to obtain what I want using regular expressions and your code for fuzzywuzzy. Thank you!

**deanhystad** · Aug-03-2021, 02:41 AM

You cannot find "john smith" because that is two words and the "tokenizer" code splits up the text file into individual words. I would try the various ratio functions and see how well "john smith" matches the entire text file content treated as one long string. Take a look at this datacamp article:

https://www.datacamp.com/community/tutor...ing-python

If your files are really short I think token_set_ratio looks promising.
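
A different approach from the whole-file ratio, in case that turns out too noisy on longer files: compare the query against every run of consecutive words that is as long as the query, so "john smith" is scored against two-word windows. This is only a sketch; `best_phrase_score` is a hypothetical helper, and `similarity` is a difflib-based stand-in for a fuzzywuzzy scorer.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_phrase_score(query, text):
    """Best score of query against any window of as many words as the query has."""
    n = len(query.split())
    if n == 0:
        return 0  # Nothing to search for.
    words = text.split()
    # At least one window, even if the text has fewer words than the query.
    starts = range(max(len(words) - n + 1, 1))
    windows = [" ".join(words[i:i + n]) for i in starts]
    return max(similarity(query, window) for window in windows)


print(best_phrase_score("john smith", "yesterday jonh smith called"))  # 90
print(best_phrase_score("john smith", "nothing about anyone"))
```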
