Python Forum

Full Version: Search string in multiple .gz files
Hi,
Please help me with a Python script to search for given strings (numbers, text, etc.) in multiple ".gz" text files.

The directory contains multiple ".gz" files, organized by date.
Output:
/bkup/TC/XYZ/20210818
File Names:
A_7235818.csv.gz
A_7235819.csv.gz
.
.
Output:
Content of sample file.
38486,22625,XYZ_06_0_20210817204446-3997
88279,77617,XYZ_06_0_20210817204846-3998
I get an error when running the code below.

import glob
import gzip

matched_lines = []

ZIPFILES='/bkup/TC/XYZ/20210818/*.gz'

grep = raw_input('Enter Search: ')

filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
     #print("#Starting " + gzfile)  #if you want to know which file is being processed
    with gzip.open( gzfile, 'rb') as f:
#    grep = raw_input('Enter Search: ')
    for line in f: # read file line by line
        if grep in line: # search for string in each line
            matched_lines.append(line) # keep a list of matched lines

file_content = ''.join(matched_lines) # join the matched lines

print(file_content)
Error:
$ ./srch6.py
  File "./srch6.py", line 17
    for line in f: # read file line by line
      ^
IndentationError: expected an indented block
The error is clear. You need to fix indentation.

(I did not try to run the following, so there may be additional errors):
import glob
import gzip
 
matched_lines = []
 
ZIPFILES='/bkup/TC/XYZ/20210818/*.gz'
 
grep = raw_input('Enter Search: ')
 
filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
    # print("#Starting " + gzfile)  #if you want to know which file is being processed
    with gzip.open( gzfile, 'rb') as f:
        # grep = raw_input('Enter Search: ')
        for line in f: # read file line by line
            if grep in line: # search for string in each line
                matched_lines.append(line) # keep a list of matched lines
 
file_content = ''.join(matched_lines) # join the matched lines
 
print(file_content)
(Aug-18-2021, 11:26 AM)Larz60+ Wrote: [ -> ]The error is clear. You need to fix indentation.

Thanks, the code now runs without errors, but I am not getting any search results.

import glob
import gzip

matched_lines = []

ZIPFILES='/bkup/TC/XYZ/20210818/*.gz'

grep = raw_input('Enter Search: ')

filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
    print("#Starting " + gzfile) #if you want to know which file is being processed
    with gzip.open( gzfile, 'rb') as f:
        # grep = raw_input('Enter Search: ')
        for line in f: # read file line by line
            if grep in line: # search for string in each line
                matched_lines.append(line) # keep a list of matched lines

file_content = ''.join(matched_lines) # join the matched lines

print(file_content)
It looks like an ancient example for Python 2, which is really out of date.

Here is a working example with some Python magic:
#!/usr/bin/env python3

# You should use Python 3 and not touch Python 2.
# The development of Python 2 has stopped and
# it won't get any security updates.

import gzip
import sys
from collections import defaultdict
from pathlib import Path


def get_matching_files(root, contains):
    """
    Generator that iterates over the gz-files in root and
    searches each file line by line for a matching text.

    For each match found, the generator yields:
    >>> gzfile, (line_number, line)
    """
    for gzfile in root.glob("*.gz"):
        # open in text mode
        # this may raise a UnicodeDecodeError
        # if the encoding is messed up

        with gzip.open(gzfile, "rt") as gz:
            for line_number, line in enumerate(gz, start=1):
                if contains in line:
                    yield gzfile, (line_number, line)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise SystemExit(f"python3 {sys.argv[0]} path_to_directory matching_text")

    zipfiles = Path(sys.argv[1])
    search = sys.argv[2]
    results = defaultdict(list)

    for gzfile, line in get_matching_files(zipfiles, search):
        # line is tuple of (line_number, line)
        results[gzfile].append(line)

    print(results)
The part to get the arguments should be done with argparse, click or typer.
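A rough sketch of what the argparse variant could look like. The function name parse_args and the argument names directory and text are my own choices for illustration and are not part of the script above. The search text XYZ_06_0 is just an example taken from the sample data.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the directory and the search text (names chosen for this sketch)."""
    parser = argparse.ArgumentParser(
        description="Search all .gz files in a directory for a text."
    )
    parser.add_argument(
        "directory", type=Path, help="directory containing the .gz files"
    )
    parser.add_argument("text", help="text to search for in each line")
    # argv=None makes argparse read sys.argv[1:], as a real script would
    return parser.parse_args(argv)


# Example with an explicit argument list instead of sys.argv:
args = parse_args(["/bkup/TC/XYZ/20210818", "XYZ_06_0"])
print(args.directory, args.text)
```

With this in place, argparse prints a usage message and handles --help on its own, so the manual len(sys.argv) check would no longer be needed.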
(Aug-18-2021, 02:35 PM)DeaD_EyE Wrote: [ -> ]It looks like an ancient example for Python 2, which is really out of date.

Apologies for the delayed response; we have since upgraded to Python 3.6.8.
I ran the above script after putting the exact file path into the line below:

for gzfile in root.glob("/bkup/TC/XYZ/20210818/*.gz"):

Output:
python3 ./srch.py path_to_directory matching_text
What's the reason for reimplementing zgrep?
(Aug-25-2021, 04:18 AM)ndc85430 Wrote: [ -> ]What's the reason for reimplementing zgrep?

Reimplementing it in Python is better like this:

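import subprocess


def zgrep(file, pattern):
    """Yield the lines of a gzip-compressed file that match pattern."""
    # Relies on the external zgrep command being installed.
    cmd = ["zgrep", pattern, file]
    # The context manager closes the pipe and waits for the process.
    with subprocess.Popen(cmd, stdout=subprocess.PIPE) as proc:
        for line in proc.stdout:
            yield line.decode(errors="replace")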
Code that relies on zgrep does not run on Windows.
In addition, it adds an external dependency to the Python program, and it's not Python.

The next question could be: why implement cat, sort, awk, sed, ... if we already have them on our machines?
An example of the rise of non-pythonic solutions: https://github.com/arunsivaramanneo/GPU-...wer.py#L24


Output:
python3 ./srch.py path_to_directory matching_text
Yes, what could this mean?
Have you tried python3 ./srch.py --help

This is the normal way command-line tools are controlled. They take options, arguments and parameters.
If you want to list a directory on Linux, you could type: ls -l /

The -l is an option, and the / is an argument that points to the target directory that ls should list.
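The usage message quoted above is what the script prints when it does not receive exactly two arguments. A small sketch of how a command line ends up in sys.argv may help. The invocation below is hypothetical: the search text XYZ_06_0 is only an example taken from the sample data, and shlex.split stands in for the splitting the shell does.

```python
import shlex

# Hypothetical invocation: the directory and the search text are passed
# on the command line instead of being edited into the script.
command = "python3 ./srch.py /bkup/TC/XYZ/20210818 XYZ_06_0"

# The shell splits the command on whitespace; Python receives everything
# after the interpreter name as sys.argv.
argv = shlex.split(command)[1:]

script, directory, search = argv
print(len(argv))   # 3, which is what the script's check expects
print(directory)
print(search)
```

Under that assumption, the glob pattern inside the script would stay "*.gz" and the directory would be given when calling the script.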
It seems that zgrep is just a shell script invoking gzip and grep. It could be easily rewritten in Python.
But the point is, it exists, so why bother reimplementing it?
As DeaD_EyE said above, to increase portability and to reduce dependencies. Python has a built-in gzip library, and a grep-like behavior can be obtained with re.search() . It means that a similar functionality can be obtained from the standard library. Of course someone has to make the effort (sorry I don't currently have time to do that).
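For what it's worth, a minimal sketch of that idea using only gzip and re.search() from the standard library could look like this. It is not a full zgrep replacement: the function names zgrep and zgrep_directory, the utf-8 default and the errors="replace" choice are assumptions made for this example.

```python
import gzip
import re
from pathlib import Path


def zgrep(path, pattern, encoding="utf-8"):
    """Yield (line_number, line) for each line of a .gz file matching pattern."""
    regex = re.compile(pattern)
    # errors="replace" keeps the search going if some bytes do not decode
    with gzip.open(path, "rt", encoding=encoding, errors="replace") as gz:
        for line_number, line in enumerate(gz, start=1):
            if regex.search(line):
                yield line_number, line.rstrip("\n")


def zgrep_directory(root, pattern):
    """Run zgrep over every *.gz file directly inside root."""
    for gzfile in sorted(Path(root).glob("*.gz")):
        for line_number, line in zgrep(gzfile, pattern):
            yield gzfile, line_number, line
```

Calling list(zgrep_directory("/bkup/TC/XYZ/20210818", "XYZ_06_0")) would then collect the matches from all files in that directory. Since the pattern is a regular expression, a literal search text should be passed through re.escape() first.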