accessing files inside a compressed tar file
it looks like the tarfile module is intended to always handle the individual files in an archive by reading them in from the file system (when creating a tar file) or writing them out to the file system (when extracting from a tar file). is there a way to use that module to access the individual files? an example use case is a script that reads in a compressed tar file, uncompresses it, extracts all of its members, uncompresses any contained files that are themselves compressed, rebuilds the tar file with those files now uncompressed, and compresses the new tar file.
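to make the use case concrete, here is a rough sketch of it done with named tar files on disk (file names are placeholders, and i'm assuming compressed members can be spotted by a .gz suffix):

import gzip
import io
import tarfile

# placeholder names; gzip members assumed to end in .gz
with tarfile.open("input.tar.gz", mode="r:gz") as src, \
        tarfile.open("output.tar.gz", mode="w:gz") as dst:
    for member in src:
        if not member.isfile():
            dst.addfile(member)                  # dirs, links, etc.
            continue
        data = src.extractfile(member).read()
        if member.name.endswith(".gz"):
            data = gzip.decompress(data)         # gunzip the member
            member.name = member.name[:-3]       # drop the .gz suffix
        member.size = len(data)
        dst.addfile(member, io.BytesIO(data))    # any file-like object works

what i can't see is how to do the same thing without the two named files.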
Skaperen Wrote: is there a way to use that module to access the individual files?
Can you explain the question further? It seems to me that TarFile.extractfile(member) accesses an individual file and returns an open file object. Doesn't it suit your needs?
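For example (archive and member names are made up here):

import tarfile

# extractfile() returns a read-only file object for a member,
# without extracting anything to the file system
with tarfile.open("archive.tar.gz", mode="r:gz") as tf:
    f = tf.extractfile("some/member.txt")    # a name or a TarInfo
    if f is not None:                        # None for non-regular files
        data = f.read()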
how do you get a file-like object to write a new member to an archive? also, how are the input archive's raw contents provided to one object and the output archive's raw contents obtained from the other, all at the same time? do i need to use coroutines, threads, or processes to avoid locking logic? let's say the input archive is arriving over a socket (size and member count unknown) and the output archive goes back out over the same socket as each compressed member is uncompressed (so the archive-level compression works better) ... all this being done with no writable file system space available.
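the closest thing i can find is tarfile's stream modes ("r|gz" and "w|gz"), which never seek, so they work on sockets and pipes ... a rough sketch (assuming sock is an already-connected socket), though it still buffers each whole member in ram because addfile() wants the size up front:

import gzip
import io
import tarfile

# assuming `sock` is an already-connected socket; the "|" stream
# modes read and write strictly sequentially, so no seeking is needed
rfile = sock.makefile("rb")
wfile = sock.makefile("wb")
with tarfile.open(fileobj=rfile, mode="r|gz") as src, \
        tarfile.open(fileobj=wfile, mode="w|gz") as dst:
    for member in src:
        if not member.isfile():
            dst.addfile(member)
            continue
        data = src.extractfile(member).read()    # whole member in ram
        if member.name.endswith(".gz"):
            data = gzip.decompress(data)
            member.name = member.name[:-3]
        member.size = len(data)
        dst.addfile(member, io.BytesIO(data))
wfile.flush()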

the way i did this in C was with stateful objects (opaque pointer in the first arg of each method call) with methods to provide data and to obtain data. obtaining data is pretty simple to do in Python ... how much data is ready is how much you get (an empty sequence would mean nothing is ready yet, not EOF). providing data would be slightly more complicated. a mutable sequence can be reduced in size by however much the object can make use of in that call. the other ways to do it, with immutable sequences, would be to return the number of items consumed and let the caller do the slicing, or to return a sliced sequence with the remaining data. in C, i used ring buffers, which led me to develop the virtual ring buffer (VRB, a way to optimize ring buffers in virtual memory environments). i think i prefer the mutable sequence (e.g. bytearray) way because it is easier on the caller.
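come to think of it, the standard library's zlib module already has push-style stateful objects of roughly the shape i described ... for example (the wbits value is just to accept gzip framing):

import zlib

# feed in whatever bytes have arrived, get back whatever output is
# ready; an empty result means "nothing ready yet", not EOF
d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)

def feed(chunk: bytes) -> bytes:
    out = d.decompress(chunk)
    # d.eof turns True at the real end of stream; d.unused_data then
    # holds any bytes that followed the compressed stream
    return out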
and i see no standard library module supporting cpio. i have a few hundred cpio files i'd like to convert to tar format.
Skaperen Wrote: all this being done with no writable file system space available.
This is really an uncommon use case. Most people have some space available to work with archives. I'm afraid the tarfile module needs some space to decompress the archives. You could perhaps create a memory disk to do this in RAM if the archive is small enough or purchase an external hard drive.
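For what it's worth, tarfile also accepts any seekable file-like object, so an io.BytesIO buffer can stand in for a disk if the whole archive fits in RAM. A minimal sketch, assuming raw already holds the bytes of a .tar.gz:

import io
import tarfile

# `raw` is assumed to hold the bytes of a .tar.gz received earlier
buf = io.BytesIO(raw)
with tarfile.open(fileobj=buf, mode="r:gz") as tf:
    for name in tf.getnames():
        print(name)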
Skaperen Wrote: i have a few hundred cpio files i'd like to convert to tar format.
Why not use linux tools directly for this? They'll be more efficient than python libraries. From what I read here and there, the tar command can extract cpio files. If you really want to use a python library, type cpio in pypi's search engine...
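If you want python only as the driver, an untested sketch like this (file names invented) shells out to cpio(1) and tar(1) for each archive:

import pathlib
import subprocess
import tempfile

# unpack each .cpio into a scratch directory, then re-pack it with tar
for cpio_path in pathlib.Path(".").glob("*.cpio"):
    tar_path = cpio_path.resolve().with_suffix(".tar")
    with tempfile.TemporaryDirectory() as scratch:
        with open(cpio_path, "rb") as f:
            # -i extract, -d create leading directories
            subprocess.run(["cpio", "-id"], stdin=f, cwd=scratch, check=True)
        subprocess.run(["tar", "-cf", str(tar_path), "."],
                       cwd=scratch, check=True)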
i guess i will have to use linux tools. i'm still thinking of the C way, where re-implementing something doesn't make it slower. one of the goals i have is to re-implement as much as possible in an architecture-portable way that does not require re-compiling. if the tar command can't be re-implemented using the tarfile library, and a solution is really just a front-end to the tar command, it's not what i want to do. i can use the tar command myself (in the code), probably more efficiently.

one of my big projects is building a cloud run-time that is ready to go on new architectures. almost every architecture is running in some IaaS or SaaS cloud service, somewhere. i saw 7 different ones in one early service about 15 years ago.