Python Forum
filtering files using 'any()'
#11
to 'bowlofred'
thank you for the code,
it works fine.
For some reason, I thought you could test the whole file with 'any()' without opening it.

Thanks again man!
#12
(May-05-2021, 07:25 PM)tester_V Wrote: The files are big and many, I do not want to do 'readlines' and if I'll do open each file then it is like doing 'for' loop, and I do not need 'any()'

I thought I can test each file as a whole file without checking each line.

Thank you.

If the files aren't large, you can just read the entire file into a string and then check if you have a string match. There is no loop necessary.

with open(filename, "r") as f:
    if matchstring in f.read():
        ...
But if the files are large, you don't want to load in the entire file into a single string. That implies that you will need to loop over chunks of the file. If it's a "sane" text file, you can assume newlines every so often and let the file iterator hand you one line at a time. Loop over each and exit when you get your first match. any() would be nice about doing that.

with open(filename, "r") as f:
    # f is the file iterator.  The expression inside the parens is a generator expression, so
    # the entire file is not read into memory; each line is read and tested in turn.  any() returns
    # as soon as one line matches and doesn't have to read the rest of the file.
    if any(matchstring in line for line in f):
        ...
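For a large file that has no regular newlines, the "loop over chunks" idea above could look something like the sketch below. The function name contains_chunked and the chunk_size parameter are made up for illustration; the file is opened in binary mode, so the search string has to be bytes.

```python
def contains_chunked(filename, needle, chunk_size=1024 * 1024):
    """Return True if the bytes `needle` occur anywhere in the file."""
    if not needle:
        return True
    overlap = len(needle) - 1
    tail = b""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                # End of file reached without a match.
                return False
            window = tail + chunk
            if needle in window:
                return True
            # Keep the last len(needle) - 1 bytes so a match that straddles
            # two chunks is still seen on the next pass.
            tail = window[-overlap:] if overlap else b""
```

Only one chunk plus a few carried-over bytes is held in memory at a time, and the function returns at the first match, much like any() does over lines.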
#13
You could perhaps delegate the search to a subprocess using the findstr command (in Windows) or grep (in Linux)
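A rough sketch of that delegation, assuming grep or findstr is on the PATH. The helper names build_search_command and found_by_external_tool are invented for this example, and the findstr argument handling (especially for search strings containing spaces) is an assumption that should be checked on a real Windows machine.

```python
import subprocess
import sys

def build_search_command(needle, filename):
    """Build a platform-specific command that searches for a literal string."""
    if sys.platform.startswith("win"):
        # /L = literal search, /C:text = use text as one search string
        return ["findstr", "/L", "/C:" + needle, filename]
    # -q = quiet, -F = fixed (literal) string, "--" ends option parsing
    return ["grep", "-q", "-F", "--", needle, filename]

def found_by_external_tool(needle, filename):
    """Return True if the external tool reports a match (exit status 0)."""
    result = subprocess.run(
        build_search_command(needle, filename),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Both tools exit with status 0 when a match is found, so the return code is enough to filter files. Note that the external tool still reads the file; the work is only moved out of Python.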
#14
To Gribouillis,
I never thought about 'findstr', I did not know it exists.
Thank you!

To bowlofred,
Thanks man! You are Da Man bro!
#15
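There is also the built-in mmap module, which can be used to search for text inside a file:

import mmap

# Open in binary mode: mmap works on bytes, so the search string is bytes too.
with open('my_file', 'rb') as f:
    # Using the mapping as a context manager closes it when the block ends.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
        if mapping.find(b'latest') != -1:
            print("Found 'latest'")
It can easily be wrapped in a function and called on whichever files need checking.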
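One possible way to wrap it in a function is sketched below. The name mmap_contains is made up here, and the empty-file check is an addition, since mmap raises an error when asked to map a file of length zero.

```python
import mmap
import os

def mmap_contains(filename, needle):
    """Return True if the bytes `needle` occur in the file."""
    # mmap cannot map an empty file, so handle that case up front.
    if os.path.getsize(filename) == 0:
        return False
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapping:
            return mapping.find(needle) != -1
```

It could then be used as a filter, for example any(mmap_contains(name, b'latest') for name in filenames), where filenames is whatever list of paths you already have.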
#16
tester_V Wrote: The files are big and many, I do not want to do 'readlines' and if I'll do open each file then it is like doing 'for' loop, and I do not need 'any()'
So I can write some code that brings most of this together. It follows what @bowlofred mentioned last, by not reading the whole file into memory.
import os

def find_files(path, file_type):
    # Yield the full path of every regular file in `path` with the given extension.
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.path

def find_in_file(files, search_words):
    for file in files:
        with open(file, encoding='utf-8') as f:
            # Files are read one line at a time, never loaded whole.
            for index, line in enumerate(f, 1):
                if any(word in line for word in search_words):
                    print(f'Found <{line.strip()}> in file <{os.path.basename(file)}> on line <{index}>')

if __name__ == '__main__':
    path = r'E:\div_code\new\cat_pic'
    search_words = ['cat', 'dog']
    file_type = '.txt'
    files = find_files(path, file_type)
    find_in_file(files, search_words)
Output:
Found <cat> in file <Cat info.txt> on line <3>
Found <dog> in file <Dog types.txt> on line <4>
Found <dog55> in file <Dog types.txt> on line <6>
While I am at it, I can make it more powerful by doing something similar to findstr/grep, as mentioned by @Gribouillis.
This uses regex, so now I can write a pattern that handles all kinds of matches.
Here dog must be followed by exactly two digits to match.
import os
import re

def find_files(path, file_type):
    # Yield the full path of every regular file in `path` with the given extension.
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.path

def find_in_file(files, pattern):
    for file in files:
        with open(file, encoding='utf-8') as f:
            for index, line in enumerate(f, 1):
                # Prints the line once for every match found on it.
                for match in pattern.finditer(line):
                    print(f'Found <{line.strip()}> in file <{os.path.basename(file)}> on line <{index}>')

if __name__ == '__main__':
    path = r'E:\div_code\new\cat_pic'
    pattern = re.compile(r'cat|dog\d\d(?!\d)')
    file_type = '.txt'
    files = find_files(path, file_type)
    find_in_file(files, pattern)
Output:
Found <cat> in file <Cat info.txt> on line <3>
Found <dog55> in file <Dog types.txt> on line <6>
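To see what the pattern accepts and rejects on its own, here is a small check. The sample strings are made up for illustration and are not taken from the files above.

```python
import re

# 'cat', or 'dog' followed by two digits that are not followed by a third digit.
pattern = re.compile(r'cat|dog\d\d(?!\d)')

samples = ['cat', 'dog', 'dog5', 'dog55', 'dog555']
matches = {s: bool(pattern.search(s)) for s in samples}

for sample, matched in matches.items():
    print(f'{sample!r}: {matched}')
```

The negative lookahead (?!\d) is what rejects 'dog555'. Without it, 'dog55' inside 'dog555' would still count as a match.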
#17
(May-05-2021, 07:25 PM)tester_V Wrote: I thought I can test each file as a whole file without checking each line.

Tell me something about the content of any book. You're not allowed to read it.

How should the computer know whether the word you're seeking is at the beginning, in the middle or at the end?
You have to iterate over the content/lines to find the word.

If you know the exact position where you expect the word, you don't need to iterate.
Do you know the exact position?
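If the position really is known, a sketch like the following jumps straight to it and reads only the bytes it needs. The function name word_at_offset and its parameters are hypothetical, and the offset is counted in bytes from the start of the file.

```python
def word_at_offset(filename, word, offset):
    """Return True if the bytes `word` start exactly at byte `offset`."""
    with open(filename, "rb") as f:
        # Jump directly to the expected position instead of iterating.
        f.seek(offset)
        return f.read(len(word)) == word
```

This reads len(word) bytes no matter how large the file is, but it only answers "is the word at this position?", not "is the word anywhere in the file?".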
#18
To DeaD_EyE:
I see your point. I understand, and I knew I had to read it line by line to check if the "search" line is in the file...
I thought maybe you guys knew some magic Python tricks.

To snippsat:
Those are powerful examples man! Thank you! I appreciate your input!
#19
tester_V Wrote:I thought maybe you guys know some magick Python tricks
Here is one trick: for many years, there has been a module on PyPI named grin. It is a sort of grep command written in pure Python. It could be fruitful to go and see how it solves the same problem (there is now also grin3, which may be the one to use).
#20
(May-08-2021, 06:32 AM)Gribouillis Wrote: Here is one trick: for many years, there has been a module in Pypi named grin. It is some sort of grep command written in pure Python. It could be a fruitful idea to go and see how they solve the same problem (now there is also grin3
Looking at it, it is a more powerful version of what I have done in my code with regex, with command-line support added using argparse.

I am thinking I may turn my code into a project, using Click (my clear favorite) to add command-line functionality.
With just a simple regex change in my script, I tested it on alice_in_wonderland.txt.
How many times, and where, is "beautiful Soup" mentioned in Alice in Wonderland? 👀 It could be a Python reference.
One change in the code, using word boundaries (to find a whole-word exact match).
pattern = re.compile(r'\bbeautiful Soup\b')
Output:
Found <Soup of the evening, beautiful Soup!> in file <alice_in_wonderland.txt> on line <2988>
Found <Soup of the evening, beautiful Soup!> in file <alice_in_wonderland.txt> on line <2989>
Found <Beautiful, beautiful Soup!> in file <alice_in_wonderland.txt> on line <2993>
Found <Pennyworth only of beautiful Soup?> in file <alice_in_wonderland.txt> on line <2998>
Found <Pennyworth only of beautiful Soup?> in file <alice_in_wonderland.txt> on line <2999>
Found <Beautiful, beautiful Soup!'> in file <alice_in_wonderland.txt> on line <3018>
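A quick way to see what the word boundaries do is to run the pattern against a few strings. The first two lines are from the output above; the last two are invented counterexamples.

```python
import re

pattern = re.compile(r'\bbeautiful Soup\b')

lines = [
    'Soup of the evening, beautiful Soup!',  # matches
    'Beautiful, beautiful Soup!',            # matches the lowercase occurrence
    'beautiful Soups',                       # no match: a letter follows 'Soup'
    'Beautiful Soup',                        # no match: the search is case-sensitive
]
hits = [line for line in lines if pattern.search(line)]
print(hits)
```

Passing re.IGNORECASE to re.compile would make the last line match as well, if that is wanted.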
How many CHAPTER headings are there in Alice in Wonderland?
pattern = re.compile(r'CHAPTER.*?')
Output:
Found <CHAPTER I> in file <alice_in_wonderland.txt> on line <12>
Found <CHAPTER II> in file <alice_in_wonderland.txt> on line <247>
Found <CHAPTER III> in file <alice_in_wonderland.txt> on line <471>
Found <CHAPTER IV> in file <alice_in_wonderland.txt> on line <731>
Found <CHAPTER V> in file <alice_in_wonderland.txt> on line <1017>
Found <CHAPTER VI> in file <alice_in_wonderland.txt> on line <1329>
Found <CHAPTER VII> in file <alice_in_wonderland.txt> on line <1671>
Found <CHAPTER VIII> in file <alice_in_wonderland.txt> on line <2030>
Found <CHAPTER IX> in file <alice_in_wonderland.txt> on line <2354>
Found <CHAPTER X> in file <alice_in_wonderland.txt> on line <2693>
Found <CHAPTER XI> in file <alice_in_wonderland.txt> on line <3022>
Found <CHAPTER XII> in file <alice_in_wonderland.txt> on line <3301>
Whole code.
import os
import re

def find_files(path, file_type):
    # Yield the full path of every regular file in `path` with the given extension.
    with os.scandir(path) as it:
        for entry in it:
            if entry.name.endswith(file_type) and entry.is_file():
                yield entry.path

def find_in_file(files, pattern):
    for file in files:
        with open(file, encoding='utf-8') as f:
            for index, line in enumerate(f, 1):
                # Prints the line once for every match found on it.
                for match in pattern.finditer(line):
                    print(f'Found <{line.strip()}> in file <{os.path.basename(file)}> on line <{index}>')

if __name__ == '__main__':
    path = r'E:\div_code\new\finditer_any'
    pattern = re.compile(r'CHAPTER.*?')
    file_type = '.txt'
    files = find_files(path, file_type)
    find_in_file(files, pattern)
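As a rough idea of the command-line version mentioned above, here is a sketch using argparse from the standard library rather than Click, so nothing extra needs installing. The argument names (path, pattern, --file-type) and the function names are invented for this example, and it reports each matching line once instead of once per match.

```python
import argparse
import os
import re

def iter_matches(path, file_type, pattern):
    """Yield (file name, line number, stripped line) for each matching line."""
    with os.scandir(path) as it:
        for entry in it:
            if entry.is_file() and entry.name.endswith(file_type):
                with open(entry.path, encoding='utf-8') as f:
                    for index, line in enumerate(f, 1):
                        if pattern.search(line):
                            yield entry.name, index, line.strip()

def main(argv=None):
    parser = argparse.ArgumentParser(description='Search files for a regex.')
    parser.add_argument('path', help='folder to search')
    parser.add_argument('pattern', help='regular expression to look for')
    parser.add_argument('--file-type', default='.txt', help='file extension')
    args = parser.parse_args(argv)

    pattern = re.compile(args.pattern)
    results = list(iter_matches(args.path, args.file_type, pattern))
    for name, index, line in results:
        print(f'Found <{line}> in file <{name}> on line <{index}>')
    return results
```

Calling main() with no arguments reads them from the command line; from Python it can also be called directly, for example main([r'E:\div_code\new\finditer_any', 'CHAPTER']).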