Filtering files using 'any()'

tester_V · May-05-2021, 07:34 PM

To bowlofred:
Thank you for the code,
it works fine.
For some reason, I thought you could test the whole file with 'any()' without opening it.

Thanks again man!

bowlofred · (This post was last modified: May-05-2021, 07:59 PM by bowlofred.)

(May-05-2021, 07:25 PM)tester_V Wrote: The files are big and many, I do not want to do 'readlines' and if I'll do open each file then it is like doing 'for' loop, and I do not need 'any()'

I thought I can test each file as a whole file without checking each line.

Thank you.

If the files aren't large, you can just read the entire file into a string and then check if you have a string match. There is no loop necessary.

with open(filename, "r") as f:
    if matchstring in f.read():
        ...

But if the files are large, you don't want to load in the entire file into a single string. That implies that you will need to loop over chunks of the file. If it's a "sane" text file, you can assume newlines every so often and let the file iterator hand you one line at a time. Loop over each and exit when you get your first match. any() would be nice about doing that.

with open(filename, "r") as f:
    # f is the file iterator.  The expression inside the parens is a generator comprehension, so
    # the entire file is not read into memory, just each line is matched.  any() will exit after any
    # match is successful and doesn't have to read the rest of the file.
    if any(matchstring in line for line in f):
        ...
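
To filter a list of files, the any() check above could be wrapped in a small helper. This is only a sketch: the names file_contains and filter_files are made up for illustration, and it assumes the files are text that can be read as UTF-8 (undecodable bytes are replaced instead of raising an error).

```python
def file_contains(filename, matchstring, encoding="utf-8"):
    """Return True as soon as one line of the file contains matchstring."""
    # errors="replace" is an assumption so that odd bytes do not raise
    with open(filename, "r", encoding=encoding, errors="replace") as f:
        # any() stops reading at the first matching line
        return any(matchstring in line for line in f)


def filter_files(filenames, matchstring):
    """Return only the filenames whose content contains matchstring."""
    return [name for name in filenames if file_contains(name, matchstring)]
```

Each file is still opened and read line by line, but reading stops at the first match, so a hit near the top of a big file is cheap.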

**Gribouillis** · May-05-2021, 08:48 PM

You could perhaps delegate the search to a subprocess using the findstr command (in Windows) or grep (in Linux)
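
A rough sketch of that idea, assuming grep or findstr is on the PATH. The function names are hypothetical, and the way the search string is passed to findstr (as one /C: argument) is an assumption that may need adjusting for strings with spaces or quotes.

```python
import platform
import subprocess


def build_search_command(needle, filenames, system=None):
    """Build a command that lists the files containing the literal string needle."""
    system = system or platform.system()
    if system == "Windows":
        # /M: print only the names of matching files, /C: literal search string
        return ["findstr", "/M", "/C:" + needle, *filenames]
    # -l: print only the names of matching files, -F: fixed string, not a regex
    return ["grep", "-l", "-F", "-e", needle, "--", *filenames]


def files_matching(needle, filenames):
    """Run the external tool and return the file names it reports."""
    if not filenames:
        return []
    result = subprocess.run(
        build_search_command(needle, filenames),
        capture_output=True,
        text=True,
    )
    # Both tools exit with 1 when nothing matched, so the exit code is not checked
    return result.stdout.splitlines()
```

With very many files the command line can get too long, so the list may have to be split into batches.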

tester_V · May-06-2021, 02:46 AM

To Gribouillis,
I never thought about 'findstr', I did not know it exists.
Thank you!

To bowlofred,
Thanks man! You are Da Man bro!

**perfringo** · May-06-2021, 09:29 AM

There is also the built-in mmap module, which can be used to search for text inside a file:

import mmap

with open('my_file', 'rb') as f:
    # Map the file read-only; the with block closes the mapping afterwards
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
        if mapping.find(b'latest') != -1:
            print("Found 'latest'")

It can easily be written as a function and called on the files needed.
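
One way such a function might look. The name file_contains_bytes is made up here, and the search string has to be bytes because mmap works on raw bytes, not decoded text.

```python
import mmap
import os


def file_contains_bytes(filename, needle):
    """Return True if the bytes needle occur anywhere in the file."""
    with open(filename, 'rb') as f:
        # mmap cannot map an empty file, and an empty file cannot match anyway
        if os.fstat(f.fileno()).st_size == 0:
            return False
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            return mapping.find(needle) != -1
```

It could then be called as file_contains_bytes('my_file', b'latest') for each file of interest.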

***snippsat*** · May-06-2021, 10:32 AM

tester_V Wrote:The files are big and many, I do not want to do 'readlines' and if I'll do open each file then it is like doing 'for' loop, and I do not need 'any()

I can write some code that brings most of this together. It follows what @bowlofred mentioned last, by not reading the whole file into memory.

import os

def find_files(file_type):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def find_in_file(files, search_words):
    for file in files:
        with open(file, encoding='utf-8') as f:
            for index, line in enumerate(f, 1):
                if any(word in line for word in search_words):
                    print(f'Found <{line.strip()}> in file <{file}> on line <{index}>')

if __name__ == '__main__':
    path = r'E:\div_code\new\cat_pic'
    search_words  = ['cat', 'dog']
    file_type = '.txt'
    files = find_files(file_type)
    find_in_file(files, search_words)

Output:
Found <cat> in file <Cat info.txt> on line <3>
Found <dog> in file <Dog types.txt> on line <4>
Found <dog55> in file <Dog types.txt> on line <6>

While I am at it, I can make it more powerful by doing something similar to findstr/grep, as mentioned by @Gribouillis.
This uses regex, so now I can write a pattern that handles all kinds of matches.
Here dog must be followed by exactly two digits to match.

import os
import re

def find_files(file_type):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def find_in_file(files, pattern):
    for file in files:
        with open(file, encoding='utf-8') as f:
            for index, line in enumerate(f, 1):
                for match in re.finditer(pattern, line):
                    print(f'Found <{line.strip()}> in file <{file}> on line <{index}>')

if __name__ == '__main__':
    path = r'E:\div_code\new\cat_pic'
    pattern = re.compile(r'cat|dog\d\d(?!\d)')
    file_type = '.txt'
    files = find_files(file_type)
    find_in_file(files, pattern)

Output:
Found <cat> in file <Cat info.txt> on line <3>
Found <dog55> in file <Dog types.txt> on line <6>
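
To see what the pattern accepts, here is a small check on a few made-up sample strings (the samples are not from the files above). The negative lookahead (?!\d) is what rejects a third digit.

```python
import re

pattern = re.compile(r'cat|dog\d\d(?!\d)')

# Hypothetical sample strings, only to show what matches
samples = ['cat', 'dog', 'dog5', 'dog55', 'dog555']
matched = [text for text in samples if pattern.search(text)]
print(matched)  # ['cat', 'dog55']
```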

DeaD_EyE · May-06-2021, 10:42 AM

(May-05-2021, 07:25 PM)tester_V Wrote: I thought I can test each file as a whole file without checking each line.

Tell me something about the content of any book. You're not allowed to read it.

How should the computer know if the word you're seeking, is at the beginning, in the middle or at the end?
You've to iterate over the content/lines to find the word.

If you know the exact position where you expect the word, you don't need to iterate.
Do you know the exact position?
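
If the position really is known, a sketch like this jumps straight to it without reading the rest of the file. The name has_bytes_at is made up, and the offset is assumed to be a byte offset from the start of the file.

```python
def has_bytes_at(filename, offset, expected):
    """Return True if the file holds the bytes expected at byte offset."""
    with open(filename, 'rb') as f:
        f.seek(offset)  # jump to the position, nothing before it is read
        return f.read(len(expected)) == expected
```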

tester_V · May-08-2021, 04:23 AM

To DeaD_EyE:
I see your point. I understand, and I knew I had to read it line by line to check if the "search" line is in the file...
I thought maybe you guys knew some magic Python tricks.

To snippsat:
Those are powerful examples man! Thank you! I appreciate your input!

**Gribouillis** · (This post was last modified: May-08-2021, 06:32 AM by Gribouillis.)

tester_V Wrote:I thought maybe you guys know some magick Python tricks

Here is one trick: for many years, there has been a module on PyPI named grin. It is a sort of grep command written in pure Python. It could be a fruitful idea to go and see how they solve the same problem (there is now also grin3, which may be the one to use).

***snippsat*** · (This post was last modified: May-08-2021, 12:11 PM by snippsat.)

(May-08-2021, 06:32 AM)Gribouillis Wrote: Here is one trick: for many years, there has been a module in Pypi named grin. It is some sort of grep command written in pure Python. It could be a fruitful idea to go and see how they solve the same problem (now there is also grin3

Looking at it, it's a more powerful version of what I have done in my code with regex, with added command-line support using argparse/regex.

I'm thinking I may turn my code into a project, using Click (my clear favorite) to add command-line functionality.
With just a simple regex change in my script, I can test it on alice_in_wonderland.txt.
How many times, and where, is "beautiful Soup" mentioned in Alice in Wonderland? 👀 It could be a Python reference.

One change in the code, using word boundaries (find whole-word exact matches).

pattern = re.compile(r'\bbeautiful Soup\b')

Output:
Found <Soup of the evening, beautiful Soup!> in file <alice_in_wonderland.txt> on line <2988>
Found <Soup of the evening, beautiful Soup!> in file <alice_in_wonderland.txt> on line <2989>
Found <Beautiful, beautiful Soup!> in file <alice_in_wonderland.txt> on line <2993>
Found <Pennyworth only of beautiful Soup?> in file <alice_in_wonderland.txt> on line <2998>
Found <Pennyworth only of beautiful Soup?> in file <alice_in_wonderland.txt> on line <2999>
Found <Beautiful, beautiful Soup!'> in file <alice_in_wonderland.txt> on line <3018>

How many chapters are there in Alice in Wonderland?

pattern = re.compile(r'CHAPTER.*?')

Output:
Found <CHAPTER I> in file <alice_in_wonderland.txt> on line <12>
Found <CHAPTER II> in file <alice_in_wonderland.txt> on line <247>
Found <CHAPTER III> in file <alice_in_wonderland.txt> on line <471>
Found <CHAPTER IV> in file <alice_in_wonderland.txt> on line <731>
Found <CHAPTER V> in file <alice_in_wonderland.txt> on line <1017>
Found <CHAPTER VI> in file <alice_in_wonderland.txt> on line <1329>
Found <CHAPTER VII> in file <alice_in_wonderland.txt> on line <1671>
Found <CHAPTER VIII> in file <alice_in_wonderland.txt> on line <2030>
Found <CHAPTER IX> in file <alice_in_wonderland.txt> on line <2354>
Found <CHAPTER X> in file <alice_in_wonderland.txt> on line <2693>
Found <CHAPTER XI> in file <alice_in_wonderland.txt> on line <3022>
Found <CHAPTER XII> in file <alice_in_wonderland.txt> on line <3301>
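
A side note on that pattern: the script prints the whole line, so the output looks complete, but the lazy .*? at the end matches as little as possible, which is nothing. The match itself is only the word CHAPTER. The second pattern below, with [IVXL]+ for the Roman numeral, is just an assumption about how one might capture the number too.

```python
import re

line = 'CHAPTER VII'

# Lazy quantifier at the end of a pattern matches the empty string
print(re.search(r'CHAPTER.*?', line).group())       # CHAPTER

# Hypothetical alternative that also captures the Roman numeral
print(re.search(r'CHAPTER [IVXL]+', line).group())  # CHAPTER VII
```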

Whole code.

import os
import re

def find_files(file_type):
    os.chdir(path)
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.name

def find_in_file(files, pattern):
    for file in files:
        with open(file, encoding='utf-8') as f:
            for index, line in enumerate(f, 1):
                for match in re.finditer(pattern, line):
                    print(f'Found <{line.strip()}> in file <{file}> on line <{index}>')

if __name__ == '__main__':
    path = r'E:\div_code\new\finditer_any'
    pattern = re.compile(r'CHAPTER.*?')
    file_type = '.txt'
    files = find_files(file_type)
    find_in_file(files, pattern)
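
As a rough idea of the command-line version mentioned above, here is a sketch using argparse from the standard library instead of Click. The argument names and the -i option are assumptions for illustration, not anything taken from grin. Unlike the script above, it reports each matching line once, even if the pattern matches several times on that line.

```python
import argparse
import re


def parse_args(argv=None):
    """Parse the pattern, the files and an optional -i flag."""
    parser = argparse.ArgumentParser(description='Search text files for a regex.')
    parser.add_argument('pattern', help='regular expression to search for')
    parser.add_argument('files', nargs='+', help='files to search')
    parser.add_argument('-i', '--ignore-case', action='store_true',
                        help='match case-insensitively')
    return parser.parse_args(argv)


def search(files, pattern):
    """Yield (file, line number, line) for every line that matches pattern."""
    for file in files:
        with open(file, encoding='utf-8') as f:
            for index, line in enumerate(f, 1):
                if pattern.search(line):
                    yield file, index, line.strip()


def main(argv=None):
    args = parse_args(argv)
    flags = re.IGNORECASE if args.ignore_case else 0
    pattern = re.compile(args.pattern, flags)
    for file, index, line in search(args.files, pattern):
        print(f'Found <{line}> in file <{file}> on line <{index}>')
```

Calling main(['CHAPTER', 'alice_in_wonderland.txt']) would run the same search as above. Adding a call to main() under an if __name__ == '__main__': guard would let the arguments come from the command line instead.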
