Python Forum
fuzzywuzzy search string in text file
#1
I have several text files in a folder and I would like to find a file by searching for a specific string. I started by using regular expressions:
import os, re
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f): 
            if re.findall(r'john', line):
                print(file)
    f.close()
This code works; however, some of my text files contain typos (for example, "johnn" instead of "john"), so I thought fuzzywuzzy would be a better approach.
#pip install fuzzywuzzy
from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extract("john", line):
                print(file)
    f.close()
This code just prints every file in the folder, and not even in order of match quality. Could anyone explain what the problem in my code is?
Reply
#2
process.extract() returns a list of matches ordered by score. If line contains only one string, it returns that string along with its score. You can see this with a small mod to your script.
from fuzzywuzzy import fuzz, process
import os
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            print(process.extract("john", line))
    f.close()
I think fuzz.ratio is a better choice for what you are doing. Or, if you want the best match, use process with a list of all the files in the folder, and then use the score to decide whether the best match is good enough.
Reply
#3
deanhystad, thank you for your comment, but I still don't understand how to obtain the best match. I also tried:
process.extractOne("john", line, scorer=fuzz.token_set_ratio)
but Python still just prints the list of files in the folder.
Reply
#4
I was confused myself. I thought you were looking for matching file names, not files that contain matching text. That is a much tougher problem.

This is not going to give you good results:
process.extractOne("john", line, scorer=fuzz.token_set_ratio)
The second argument is supposed to be a list of choices, not a line of text. For extractOne('john', ['John', 'Joan', 'Jean', 'Jon']), fuzzywuzzy calculates the score for each word and returns a tuple containing the best match and its score. To do this for an entire file you'll need to split the file into a list of words. If the files aren't very big you could do something like this:
import os
from fuzzywuzzy import fuzz, process

best_match = (None, 0, None)  # (word, score, file) of the best match so far
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())  # unique whitespace-separated tokens
    # extractOne returns (match, score), or None if there are no words to score
    match = process.extractOne("john", words, scorer=fuzz.token_set_ratio)
    if match and match[1] > best_match[1]:
        best_match = (match[0], match[1], file)
print(best_match)
This would find the best matching file. If you can have multiple matching files you can forget about "best_match" and just report files with a match score better than some threshold value.

If read().split() is not doing a good enough job tokenizing the file, you could investigate using the natural language toolkit (NLTK).

I used "set()" so fuzzywuzzy doesn't have to check duplicate words. This is going to be slow enough already.

When using "with open" the file automatically closes when you leave the context. You don't have to call f.close().
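The threshold idea mentioned above can be sketched roughly as follows. To keep the example runnable without installing anything, it uses difflib.SequenceMatcher from the standard library as a stand-in scorer rather than fuzzywuzzy, so the scores will not be identical to fuzz.token_set_ratio. The names similarity and files_matching and the default threshold of 80 are made up for illustration.

```python
import os
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def files_matching(directory, query, threshold=80):
    """Return (filename, best_word, score) for files whose best word scores >= threshold."""
    hits = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip sub-directories
        with open(path, encoding="utf-8", errors="ignore") as f:
            words = set(f.read().split())
        if not words:
            continue  # empty file: nothing to score
        best_word = max(words, key=lambda w: similarity(query, w))
        score = similarity(query, best_word)
        if score >= threshold:
            hits.append((name, best_word, score))
    return hits
```

Calling files_matching('C:/Users/mydirectory', 'john') would then list every file that contains a close-enough word, instead of only the single best file. The right threshold depends on your data, so treat 80 as a starting guess.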
Reply
#5
I was doing
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        for num,line in enumerate(f):
            if process.extractOne("john", line, scorer=fuzz.token_set_ratio):
                print(file)
    f.close()
Reply
#6
Sorry, I was editing my response. See above.

You cannot use the process.extractOne() return value as a True/False test. extractOne returns a tuple (match, score), and the only tuple that evaluates to False is the empty tuple (). Even if every choice results in a matching score of 0, extractOne will return a tuple containing the best match and the score.

Proof that only an empty tuple is False.
def evaluate(thing):
    if thing:
        print(f'{thing} is True')
    else:
        print(f'{thing} is False')

evaluate((False, False))
evaluate((False,))
evaluate(tuple())
Output:
(False, False) is True
(False,) is True
() is False
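Given that, one way to write the test is to compare the score explicitly instead of relying on truthiness. This is only a sketch: is_good_match, the threshold of 80, and the two sample result tuples are hypothetical, chosen to have the same (match, score) shape that extractOne returns.

```python
def is_good_match(result, threshold=80):
    """result is a (match, score) tuple, or None when nothing was found."""
    return result is not None and result[1] >= threshold


weak = ("jam", 25)      # hypothetical low-scoring result
strong = ("johnn", 89)  # hypothetical high-scoring result

# Both tuples are truthy, but only one clears the threshold.
print(bool(weak), is_good_match(weak))
print(bool(strong), is_good_match(strong))
print(is_good_match(None))
```

In the original loop, this would mean replacing "if process.extractOne(...):" with a check on the second element of the returned tuple.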
Reply
#7
Thank you, deanhystad, for your code and for explaining why mine was not working. Your code works well, and from it I also understood how to report the files with a match score better than some threshold value.
I have just one last question: what if I want to search for more than one string to improve my matching? I tried
keywords=["john", "mary"] #I know that there is one text file talking about john and mary
best_match = (None, 0, None)  # Remember the best match.
for file in os.listdir('C:/Users/mydirectory'):
    with open('C:/Users/mydirectory/'+file) as f:
        words = set(f.read().split())
        match = process.extractOne(keywords, words, scorer=fuzz.token_set_ratio)
        if match[1] > best_match[1]:
            best_match = (match[0], match[1], file)
print(best_match)
but I obtained the error "expected string or bytes-like object".
Reply
#8
Not a lot of documentation online but you can go to the project page and look at the source.
Quote:
def extractOne(query, choices, processor=default_processor, scorer=default_scorer, score_cutoff=0):
    """Find the single best match above a score in a list of choices.

    This is a convenience method which returns the single best choice.
    See extract() for the full arguments list.

    Args:
        query: A string to match against
        choices: A list or dictionary of choices, suitable for use with
            extract().
        processor: Optional function for transforming choices before matching.
            See extract().
        scorer: Scoring function for extract().
        score_cutoff: Optional argument for score threshold. If the best
            match is found, but it is not greater than this number, then
            return None anyway ("not a good enough match"). Defaults to 0.

    Returns:
        A tuple containing a single match and its score, if a match
        was found that was above score_cutoff. Otherwise, returns None.
    """
    best_list = extractWithoutOrder(query, choices, processor, scorer, score_cutoff)
    try:
        return max(best_list, key=lambda i: i[1])
    except ValueError:
        return None
According to this, "query" (the word you are looking for) is a string. If you want to look for one of many words I think you need to repeat the extractOne() call once for each query word. You'll have to come up with a "matching calculus" to decide which is the best fit. I'd start with computing the average match score of the query words.
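Here is a rough sketch of the "average match score" idea: score each keyword separately against the words in a file, keep each keyword's best score, and average those. It uses difflib.SequenceMatcher from the standard library as a stand-in scorer so it runs without fuzzywuzzy; with fuzzywuzzy you would call process.extractOne once per keyword instead. The function names are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def average_keyword_score(keywords, words):
    """Average, over the keywords, of each keyword's best score against any word."""
    if not keywords or not words:
        return 0.0
    best_scores = [max(similarity(k, w) for w in words) for k in keywords]
    return sum(best_scores) / len(best_scores)


# A file mentioning both names (one with a typo) should outrank a file with only one.
file_a = {"johnn", "and", "mary", "went"}
file_b = {"john", "walked", "alone"}
print(average_keyword_score(["john", "mary"], file_a))
print(average_keyword_score(["john", "mary"], file_b))
```

Averaging is only one possible choice. Taking the minimum of the per-keyword scores instead would require every keyword to match well.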
Reply
#9
And if I search for "john smith", it only searches for "john". Thank you for your comments and support. I think I will be able to obtain what I want using regular expressions and your code for fuzzywuzzy. Thank you!
Reply
#10
You cannot find "john smith" because that is two words and the "tokenizer" code splits up the text file into individual words. I would try the various ratio functions and see how well "john smith" matches the entire text file content treated as one long string. Take a look at this datacamp article:

https://www.datacamp.com/community/tutor...ing-python

If your files are really short I think token_set_ratio looks promising.
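If scoring against the whole file as one string turns out too noisy, another option (a different technique from the one suggested above) is to slide a window of as many words as the query has over the text and score each window. The sketch below assumes that approach and uses difflib.SequenceMatcher from the standard library as a stand-in for a fuzzywuzzy ratio function. best_phrase_match is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score (stand-in for a fuzzywuzzy scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_phrase_match(query, text):
    """Score every run of len(query.split()) consecutive words; return (window, score)."""
    n = len(query.split())
    tokens = text.split()
    if n == 0 or len(tokens) < n:
        return (None, 0)  # text is too short to contain the phrase
    windows = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    best = max(windows, key=lambda w: similarity(query, w))
    return (best, similarity(query, best))


print(best_phrase_match("john smith", "yesterday johnn smiht called the office"))
```

This keeps multi-word queries intact instead of splitting everything into single words, at the cost of one score computation per window.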
Reply

