Python Forum
fuzzywuzzy search string in text file
#1
I have several text files in a folder and I would like to find a file by searching for a specific string. I started by using regular expressions:
import os, re
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f): 
            if re.findall(r'john', line):
                print(file)
    f.close()
This code works; however, some of my text files contain typos (for example, "johnn" instead of "john"), so I thought fuzzywuzzy would be a better approach.
#pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extract("john", line):
                print(file)
    f.close()
This code just prints every file in the folder, and not even in order of match quality. Could anyone explain what the problem is in my code?
Reply
#2
extract is going to return a list of matches ordered by score. If line contains only 1 string, process will return that string along with the score. You can see this with a small mod to your script.
from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            print(process.extract("john", line))
    f.close()
I think ratio is a better choice for what you are doing. Or, if you want the best match, use process with a list of all the files in the folder, and then use the score to decide whether the best match is good enough.
Reply
#3
deanhystad, thank you for your comment, but I still was not able to understand how to obtain the best match. I also tried:
process.extractOne("john", line, scorer=fuzz.token_set_ratio)
but Python just prints the list of files in the folder.
Reply
#4
I was confused myself. I thought you were looking for matching file names. Not files that contain matching text. That is a much tougher problem.

This is not going to give you good results:
process.extractOne("john", line, scorer=fuzz.token_set_ratio)
line is supposed to be a list of choices. For extractOne('john', ['John', 'Joan', 'Jean', 'Jon']), fuzzywuzzy would calculate the score for each word and return a tuple containing the best match and its score. To do this for an entire file you'll need to split the file into a list of words. If the files aren't very big you could do something like this:
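import os
from fuzzywuzzy import fuzz, process

best_match = (None, 0, None)  # (matched word, score, file name) of the best match so far.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())  # Unique words in the file.
        match = process.extractOne("john", words, scorer=fuzz.token_set_ratio)
        # match is None when the file has no words, so check it before indexing.
        if match and match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)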
This would find the best matching file. If you can have multiple matching files you can forget about "best_match" and just report files with a match score better than some threshold value.
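The threshold variant described above could look roughly like this. It is only a sketch: `similarity` uses the standard library's difflib.SequenceMatcher as a stand-in for a fuzzywuzzy scorer so the example runs without extra packages, and `files_matching`, the `texts` dictionary, and the threshold of 80 are names and values made up for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def files_matching(query, texts, threshold=80):
    """Return (filename, best_word, score) for every text that has a word
    scoring at least `threshold` against `query`.

    `texts` maps a file name to that file's content.
    """
    hits = []
    for name, content in texts.items():
        words = set(content.split())  # skip duplicate words
        if not words:
            continue  # empty file, nothing to score
        best_word = max(words, key=lambda w: similarity(query, w))
        score = similarity(query, best_word)
        if score >= threshold:
            hits.append((name, best_word, score))
    # Best matches first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


texts = {
    "a.txt": "meeting notes from johnn about the budget",
    "b.txt": "nothing relevant in this one",
}
print(files_matching("john", texts))
```

To run it on a real folder you would build `texts` by reading each file, as in the loop above. The right threshold depends on your data, so expect to tune it.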

If read().split() is not doing a good enough job tokenizing the file, you could investigate using the natural language toolkit (NLTK).

I used "set()" so fuzzywuzzy doesn't have to check duplicate words. This is going to be slow enough already.

When using "with open" the file automatically closes when you leave the context. You don't have to call f.close().
Reply
#5
I was doing
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extractOne("john", line, scorer=fuzz.token_set_ratio):
                print(file)
    f.close()
Reply
#6
Sorry, I was editing my response. See above.

You cannot use the process.extractOne() return value as a True/False test. extractOne returns a tuple (match, score), and the only tuple that evaluates to False is the empty tuple (). Even if every choice scores 0, extractOne still returns a tuple containing the best match and its score.

Proof that only an empty tuple is False.
def evaluate(thing):
    if thing:
        print(f'{thing} is True')
    else:
        print(f'{thing} is False')

evaluate((False, False))
evaluate((False,))
evaluate(tuple())
Output:
(False, False) is True
(False,) is True
() is False
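So the test has to look at the score rather than at the tuple itself. A small sketch of that pattern follows. `extract_one` is a made-up stand-in built on difflib that only mirrors the (match, score) return shape; it is not fuzzywuzzy's implementation, and the cutoff of 80 is arbitrary.

```python
from difflib import SequenceMatcher


def extract_one(query, choices):
    """Stand-in returning (best_choice, score), or None if choices is empty."""
    scored = [
        (choice, round(100 * SequenceMatcher(None, query, choice).ratio()))
        for choice in choices
    ]
    return max(scored, key=lambda pair: pair[1]) if scored else None


def is_good_match(result, cutoff=80):
    """True only when a result exists and its score reaches the cutoff."""
    return result is not None and result[1] >= cutoff


poor = extract_one("john", ["zzz", "qqq"])
good = extract_one("john", ["johnn", "mary"])

# The tuple is truthy in both cases; only the score check tells them apart.
print(bool(poor), is_good_match(poor))
print(bool(good), is_good_match(good))
```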
Reply
#7
Thank you, deanhystad, for your code and for explaining why mine was not working. Your code works well, and from it I also understood how to report the files with a match score better than some threshold value.
I have just one last question. What if I want to search for more than one string, to improve my matching? I tried
keywords=["john", "mary"] #I know that there is one text file talking about john and mary
best_match = (None, 0, None)  # Remember the best match.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())
        match = process.extractOne(keywords, words, scorer=fuzz.token_set_ratio)
        if match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)
but I obtained the error "expected string or bytes-like object".
Reply
#8
Not a lot of documentation online but you can go to the project page and look at the source.
Quote:
def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):
    """Find the single best match above a score in a list of choices.

    This is a convenience method which returns the single best choice.
    See extract() for the full arguments list.

    Args:
        query: A string to match against
        choices: A list or dictionary of choices, suitable for use with
            extract().
        processor: Optional function for transforming choices before matching.
            See extract().
        scorer: Scoring function for extract().
        score_cutoff: Optional argument for score threshold. If the best
            match is found, but it is not greater than this number, then
            return None anyway ("not a good enough match"). Defaults to 0.

    Returns:
        A tuple containing a single match and its score, if a match
        was found that was above score_cutoff. Otherwise, returns None.
    """
    best_list = extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
    try:
        return max(best_list, key=lambda i: i[1])
    except ValueError:
        return None
According to this, "query" (the word you are looking for) is a string. If you want to look for one of many words I think you need to repeat the extractOne() call once for each query word. You'll have to come up with a "matching calculus" to decide which is the best fit. I'd start with computing the average match score of the query words.
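One possible version of that "average match score" idea, as a sketch only. difflib.SequenceMatcher stands in for a fuzzywuzzy scorer so the example has no dependencies, and `average_keyword_score`, `best_file`, and the sample texts are hypothetical names and data.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def average_keyword_score(keywords, text):
    """Average, over all keywords, of the best score any word in text gets."""
    words = set(text.split())
    if not words or not keywords:
        return 0.0
    best_scores = [max(similarity(k, w) for w in words) for k in keywords]
    return sum(best_scores) / len(best_scores)


def best_file(keywords, texts):
    """Return (filename, average_score) for the highest-scoring text.

    `texts` maps a file name to its content and must not be empty.
    """
    scored = {name: average_keyword_score(keywords, body)
              for name, body in texts.items()}
    name = max(scored, key=scored.get)
    return name, scored[name]


texts = {
    "a.txt": "johnn met marry at the station",
    "b.txt": "john went home alone",
}
print(best_file(["john", "mary"], texts))
```

Averaging rewards files where every keyword has a reasonable match. Taking the minimum of the per-keyword scores instead would be stricter, since one missing keyword would then sink the file.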
Reply
#9
And if I search for "john smith", it only searches for "john". Thank you for your comments and support. I think I will be able to obtain what I want using regular expressions and your code for fuzzywuzzy. Thank you!
Reply
#10
You cannot find "john smith" because that is two words and the "tokenizer" code splits up the text file into individual words. I would try the various ratio functions and see how well "john smith" matches the entire text file content treated as one long string. Take a look at this datacamp article:

https://www.datacamp.com/community/tutor...ing-python

If your files are really short I think token_set_ratio looks promising.
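A different technique from the one suggested above, in case whole-file matching turns out too coarse: slide a window of as many words as the query has across the text and score each window, so "john smith" is compared against two-word runs. This is a sketch under assumptions. difflib stands in for a fuzzywuzzy ratio function, and `best_phrase_window` is a hypothetical helper.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_phrase_window(phrase, text):
    """Return (window_text, score) for the best-scoring run of consecutive
    words, or None if the text has fewer words than the phrase."""
    size = len(phrase.split())
    words = text.split()
    if size == 0 or len(words) < size:
        return None
    # Every run of `size` consecutive words, joined back into a string.
    windows = (" ".join(words[i:i + size])
               for i in range(len(words) - size + 1))
    return max(((w, similarity(phrase, w)) for w in windows),
               key=lambda pair: pair[1])


print(best_phrase_window("john smith", "yesterday jon smyth called the office"))
```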
Reply

