Python Forum
Search string in multiple .gz files
#1
Hi,
Please help me with a Python script that searches for given strings (numbers, text, etc.) in multiple ".gz" text files.

The directory contains multiple ".gz" files, organized by date.
Output:
/bkup/TC/XYZ/20210818
File names:
A_7235818.csv.gz
A_7235819.csv.gz
. .
Output:
Content of a sample file:
38486,22625,XYZ_06_0_20210817204446-3997
88279,77617,XYZ_06_0_20210817204846-3998
I am getting an error when running the code below.

import glob
import gzip

matched_lines = []

ZIPFILES='/bkup/TC/XYZ/20210818/*.gz'

grep = raw_input('Enter Search: ')

filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
     #print("#Starting " + gzfile)  #if you want to know which file is being processed
    with gzip.open( gzfile, 'rb') as f:
#    grep = raw_input('Enter Search: ')
    for line in f: # read file line by line
        if grep in line: # search for string in each line
            matched_lines.append(line) # keep a list of matched lines

file_content = ''.join(matched_lines) # join the matched lines

print(file_content)
Error:
$ ./srch6.py
  File "./srch6.py", line 17
    for line in f: # read file line by line
      ^
IndentationError: expected an indented block
Larz60+ wrote Aug-18-2021, 11:20 AM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.
Fixed for you this time; please use BBCode tags in future posts.
#2
The error is clear. You need to fix indentation.

(I did not try to run the following, so there may be additional errors):
import glob
import gzip
 
matched_lines = []
 
ZIPFILES='/bkup/TC/XYZ/20210818/*.gz'
 
grep = raw_input('Enter Search: ')
 
filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
    # print("#Starting " + gzfile)  #if you want to know which file is being processed
    with gzip.open( gzfile, 'rb') as f:
        # grep = raw_input('Enter Search: ')
        for line in f: # read file line by line
            if grep in line: # search for string in each line
                matched_lines.append(line) # keep a list of matched lines
 
file_content = ''.join(matched_lines) # join the matched lines
 
print(file_content)
#3
(Aug-18-2021, 11:26 AM)Larz60+ Wrote: The error is clear. You need to fix indentation.

Thanks. The code now runs without errors, but I am not getting any search results.

import glob
import gzip

matched_lines = []

ZIPFILES='/bkup/TC/XYZ/20210818/*.gz'

grep = raw_input('Enter Search: ')

filelist = glob.glob(ZIPFILES)
for gzfile in filelist:
    print("#Starting " + gzfile) #if you want to know which file is being processed
    with gzip.open( gzfile, 'rb') as f:
        # grep = raw_input('Enter Search: ')
        for line in f: # read file line by line
            if grep in line: # search for string in each line
                matched_lines.append(line) # keep a list of matched lines

file_content = ''.join(matched_lines) # join the matched lines

print(file_content)
Larz60+ wrote Aug-18-2021, 06:18 PM:
Please post all code, output and errors (in their entirety) between their respective tags. Refer to the BBCode help topic on how to post. Use the "Preview Post" button to make sure the code is presented as you expect before hitting the "Post Reply/Thread" button.

Please, as requested previously, use BBCode tags in your posts; it is a forum requirement.
#4
It looks like an ancient example for Python 2, which is really out of date.

Here is a working example with some Python magic:
#!/usr/bin/env python3

# Use Python 3 and stay away from Python 2.
# Development of Python 2 has stopped and it
# no longer receives security updates.

import gzip
import sys
from collections import defaultdict
from pathlib import Path


def get_matching_files(root, contains):
    """
    Generator that iterates over the gz files in root and
    searches each file line by line for matching text.

    For each match, the generator yields:
    >>> gzfile, (line_number, line)
    """
    for gzfile in root.glob("*.gz"):
        # Open in text mode.
        # This may raise a UnicodeDecodeError
        # if the file's encoding is not the expected one.

        with gzip.open(gzfile, "rt") as gz:
            for line_number, line in enumerate(gz, start=1):
                if contains in line:
                    yield gzfile, (line_number, line)


if __name__ == "__main__":
    if len(sys.argv) != 3:
        raise SystemExit(f"python3 {sys.argv[0]} path_to_directory matching_text")

    zipfiles = Path(sys.argv[1])
    search = sys.argv[2]
    results = defaultdict(list)

    for gzfile, line in get_matching_files(zipfiles, search):
        # line is tuple of (line_number, line)
        results[gzfile].append(line)

    print(results)
The part to get the arguments should be done with argparse, click or typer.
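A minimal sketch of the argparse variant, assuming the same two positional arguments as the script above. The names parse_args, directory and text are made up for illustration, and "XYZ_06" is only an example search string taken from the sample data:

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two positional arguments the script above reads from sys.argv."""
    parser = argparse.ArgumentParser(
        description="Search gzip-compressed text files for a string."
    )
    # Hypothetical argument names; the original only uses sys.argv[1] and sys.argv[2].
    parser.add_argument("directory", type=Path, help="directory containing the .gz files")
    parser.add_argument("text", help="text to search for in each line")
    # With argv=None, argparse falls back to sys.argv[1:].
    return parser.parse_args(argv)


# Equivalent to running: python3 srch.py /bkup/TC/XYZ/20210818 XYZ_06
args = parse_args(["/bkup/TC/XYZ/20210818", "XYZ_06"])
print(args.directory, args.text)
```

In a real script you would call parse_args() without a list. argparse then prints a usage message on wrong input and handles --help by itself, so the manual len(sys.argv) check is no longer needed.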
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#5
(Aug-18-2021, 02:35 PM)DeaD_EyE Wrote: It looks like an ancient example for Python 2, which is really out of date.


Apologies for the delayed response; we first had to upgrade to Python 3.6.8.
I ran the above script after putting the exact file path into the line below:

for gzfile in root.glob("/bkup/TC/XYZ/20210818/*.gz"):

Output:
python3 ./srch.py path_to_directory matching_text
#6
What's the reason for reimplementing zgrep?
#7
(Aug-25-2021, 04:18 AM)ndc85430 Wrote: What's the reason for reimplementing zgrep?

The alternative to reimplementing it in Python is calling zgrep from Python, like this:

import subprocess


def zgrep(file, pattern):
    proc = subprocess.Popen(["zgrep", pattern, file], stdout=subprocess.PIPE)
    for line in proc.stdout:
        yield line.decode(errors="replace")
Code that relies on zgrep does not run on Windows.
In addition, it adds an external dependency to the Python program, and it is not Python.

The next question could be: why implement cat, sort, awk, sed, ... if we already have them on our machines?
An example of the growing number of non-pythonic solutions: https://github.com/arunsivaramanneo/GPU-...wer.py#L24


Output:
python3 ./srch.py path_to_directory matching_text
Yes, what could this mean?
Have you tried python3 ./srch.py --help?

This is the usual way command-line tools are controlled: they take options and arguments.
If you want to list a directory on Linux, you could type: ls -l /

The -l is an option and the / is an argument that points to the target directory ls should list.
In the same way, the script expects two arguments: the directory and the text to search for. The glob pattern inside the script does not need to be edited.
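To illustrate how a command line ends up in sys.argv, here is a small sketch. The command is hypothetical: the directory comes from the first post, and "XYZ_06" is only an example search string.

```python
import shlex

# Hypothetical command line for the script from post #4.
command = "python3 ./srch.py /bkup/TC/XYZ/20210818 XYZ_06"

# The shell splits the command into words; everything after the
# interpreter name is what the script receives as sys.argv.
argv = shlex.split(command)[1:]

script, directory, search = argv
print(len(argv))   # 3, so the `len(sys.argv) != 3` check passes
print(directory)
print(search)
```

Running the script without these two arguments gives a sys.argv of length 1, which is why only the usage line was printed.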
#8
It seems that zgrep is just a shell script invoking gzip and grep. It could be easily rewritten in Python.
#9
But the point is, it exists, so why bother reimplementing it?
#10
As DeaD_EyE said above, to increase portability and to reduce dependencies. Python has a built-in gzip library, and grep-like behavior can be obtained with re.search(). This means that similar functionality can be built from the standard library alone. Of course, someone has to make the effort (sorry, I don't currently have time to do that).
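A rough sketch of what such a standard-library version could look like, combining gzip.open() in text mode with a compiled regular expression. It is not a full zgrep replacement; the function name, the UTF-8 default and the errors="replace" choice are assumptions made for this example:

```python
import gzip
import re
from pathlib import Path


def zgrep(directory, pattern, encoding="utf-8"):
    """Yield (path, line_number, line) for every line that matches pattern."""
    regex = re.compile(pattern)
    for path in sorted(Path(directory).glob("*.gz")):
        # errors="replace" (an assumption) avoids a UnicodeDecodeError
        # on files whose encoding is not the expected one.
        with gzip.open(path, "rt", encoding=encoding, errors="replace") as handle:
            for line_number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path, line_number, line.rstrip("\n")
```

It could be used as `for path, number, line in zgrep("/bkup/TC/XYZ/20210818", "XYZ_06"): print(path, number, line)`. Because the pattern is treated as a regular expression, a literal search string containing special characters should be passed through re.escape() first.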