Python Forum
Pigz inside python - Reading compressed .gz file much faster
#1
Hello Pythoners-

I am a Linux admin, and one of our users was wondering how to make the script below faster using pigz or some other multi-threaded method. I have no Python experience. Can someone please share how to make the part below a little faster? She said it currently takes around 45 minutes to parse one compressed .gz file that is 1 GB in size.

if infile.endswith(".gz"):
    data = gzip.open(infile, 'rb')
else:
    data = open(infile, "r")
outfile = infile.split(".txt")[0] + "_step1.gz"
outdata = gzip.open(outfile, "wb")

## take line by line
for line in data:
    line1 = line.rstrip()
    if line.startswith("@"):
        ....
        ....
        ....
        ....
        ....
outdata.close()
data.close()
print ">Output file: " + outfile  # end of run
Thank you. This is not a homework task. This is a biology lab's problem.
#2
You can use python-magic.
Although this module is on PyPI, its name conflicts with other packages of the same name, so you have to download and install the wheel.
To get the wheel from PyPI and install it:
  • Go to: https://pypi.python.org/pypi/python-magic/
  • Download the wheel file (current version): python_magic-0.4.15-py2.py3-none-any.whl
  • Change directory to the one containing the wheel
  • From the command line, install with:
    pip install python_magic-0.4.15-py2.py3-none-any.whl

Once you have that package installed, use the following code to find the file type:
import magic

def check_filetype(filename):
    # Magic() takes only options; the file name is passed to from_file()
    f = magic.Magic(mime=True, uncompress=True)
    return f.from_file(filename)
This avoids having to load the entire compressed file.
It returns a string such as:
Output:
'text/plain'
See the documentation here: https://github.com/ahupp/python-magic
#3
Hey Larzo60+

Thanks, friend. We are okay with memory. The bottleneck is the read and write speed, which is where the time is being spent. Do you know whether python-magic helps in those areas?
#4
No, let me give you a sample for reading the files ... Be back soon.

Please answer this: what is the goal of reading a zip file in this way?
There may already be a package that does what you're trying to do in record time.

Example (built into python) see: https://docs.python.org/3.6/library/gzip.html
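
As a starting point, here is a minimal sketch of the same read-filter-write loop using only the built-in gzip module on Python 3. The function name copy_header_lines and the '@' filter are placeholders for the elided part of the original script, and compresslevel=6 is an assumed value rather than a measured optimum. gzip.open writes at compresslevel=9 by default, so a lower level may trade a somewhat larger output file for less time spent compressing.

```python
import gzip


def copy_header_lines(src_path, dst_path, compresslevel=6):
    """Copy lines that start with '@' from one .gz file to another.

    The '@' filter is a stand-in for the real per-line logic, which the
    original post does not show.
    """
    kept = 0
    # "rt"/"wt" open the files in text mode, so each line is a str and
    # startswith("@") works on Python 3.
    with gzip.open(src_path, "rt") as data, \
            gzip.open(dst_path, "wt", compresslevel=compresslevel) as outdata:
        for line in data:
            if line.startswith("@"):
                outdata.write(line)
                kept += 1
    return kept
```

The with statement closes both files even if the loop raises, which replaces the explicit close() calls at the end of the original script.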
#5
I honestly don't know. I was asked for help to make it faster. Decided to ask someone who knows the left and right of python. I have no clue @Larzo60+.

Thanks
#6
Hard to write something without knowing what the goal is.
#7
http://aripollak.com/pythongzipbenchmarks/

Looks like the speed depends pretty heavily on which version of Python you're running. You might also gain some improvement by wrapping the gzip object in io.BufferedReader.
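
A rough sketch of that wrapping, assuming Python 3 and a text file encoded as UTF-8. The helper name open_gzip_buffered and the 1 MiB buffer size are my own choices, not something from the benchmark page, and how much this helps (if at all) will depend on the Python version.

```python
import gzip
import io


def open_gzip_buffered(path, buffer_size=1024 * 1024, encoding="utf-8"):
    """Open a .gz file for line-by-line text reading through an extra buffer."""
    raw = gzip.open(path, "rb")
    # BufferedReader pulls large blocks from the gzip object, so the
    # per-line reads are served from memory instead of hitting gzip each time.
    buffered = io.BufferedReader(raw, buffer_size=buffer_size)
    # TextIOWrapper decodes bytes to str; closing it closes the whole chain.
    return io.TextIOWrapper(buffered, encoding=encoding)
```

It can then replace the gzip.open call in the loop, for example: with open_gzip_buffered(infile) as data: followed by for line in data:.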

I wouldn't mind seeing more of your code, though, as 45 minutes for 1 GB sounds excessive. Depending on what you're doing (and the power of the computer it's running on), maybe we can create a process queue and take advantage of multiple cores/processors.
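
Since pigz was mentioned, one low-effort way to spread the work over more than one core is to let pigz run as a separate process and talk to it through pipes. This is only a sketch under some assumptions: Python 3, UTF-8 text, and a pigz binary on the PATH (it falls back to the gzip module if pigz is missing). The names open_gz_reader and open_gz_writer and the default of 4 threads are made up for illustration. As far as I know, pigz compresses on several threads but decompresses mostly on one, so the bigger win is likely on the write side.

```python
import gzip
import io
import shutil
import subprocess
from contextlib import contextmanager


@contextmanager
def open_gz_reader(path, encoding="utf-8"):
    """Yield a text stream of a .gz file, decompressed by pigz if available."""
    pigz = shutil.which("pigz")
    if pigz is None:
        # pigz not installed: fall back to the standard library.
        with gzip.open(path, "rt", encoding=encoding) as handle:
            yield handle
        return
    # -d decompress, -c write to stdout; Python only parses the text.
    proc = subprocess.Popen([pigz, "-dc", path], stdout=subprocess.PIPE)
    try:
        with io.TextIOWrapper(proc.stdout, encoding=encoding) as handle:
            yield handle
    finally:
        proc.wait()


@contextmanager
def open_gz_writer(path, encoding="utf-8", threads=4):
    """Yield a text stream whose content is compressed to path by pigz."""
    pigz = shutil.which("pigz")
    if pigz is None:
        with gzip.open(path, "wt", encoding=encoding) as handle:
            yield handle
        return
    with open(path, "wb") as sink:
        # -c write to stdout (redirected to the file), -p number of threads.
        proc = subprocess.Popen([pigz, "-c", "-p", str(threads)],
                                stdin=subprocess.PIPE, stdout=sink)
        try:
            # Leaving this block closes stdin, which tells pigz to finish.
            with io.TextIOWrapper(proc.stdin, encoding=encoding) as handle:
                yield handle
        finally:
            proc.wait()
```

Usage would look like: with open_gz_reader(infile) as data, open_gz_writer(outfile) as outdata: and then the same for line in data: loop as before. The sketch does not check the pigz exit code, so a real script should probably inspect proc.returncode after the wait.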