Python Forum
tarfile but not to a file
#1
i am designing a program that will send tarballs to AWS S3. the data needs to be buffered in tar format and, instead of being written to disk, handed to functions in the botocore module to be stored in an S3 object, with no more I/O than reading the disk files being archived and the network traffic with AWS. the problem is that i see no interface for writing the tarball to a buffer. i also need to stream this, since the size can be many gigabytes or even many terabytes (this may be run on an EC2 instance, where that kind of network capacity is possible). this will usually involve compression and may also involve encryption.

the only way i can think of doing this is to make a unix named pipe that the tarfile module will write to in a separate process. does anyone know of a cleaner way to do this that does not involve any OS features and is solved entirely in Python? Python3 will be used for this project.
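For reference, the named-pipe workaround described above can be sketched like this (POSIX only; a thread stands in for the separate process, and the reader just counts bytes where a real program would hand each chunk to the uploader). The file names here are made up for the demo:

```python
import os
import tarfile
import tempfile
import threading

# Sketch of the named-pipe idea: a writer thread streams a gzipped
# tarball into a FIFO while the main thread reads it chunk by chunk.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "data.txt")
with open(src, "wb") as f:
    f.write(b"example payload\n")

fifo = os.path.join(tmpdir, "tarpipe")
os.mkfifo(fifo)

def produce():
    # open() on a FIFO blocks until the reader side is opened
    with tarfile.open(fifo, mode="w|gz") as tar:
        tar.add(src, arcname="data.txt")

writer = threading.Thread(target=produce)
writer.start()

total = 0
with open(fifo, "rb") as stream:
    while chunk := stream.read(65536):
        total += len(chunk)  # here a real program would upload the chunk

writer.join()
os.remove(fifo)
os.remove(src)
os.rmdir(tmpdir)
```

It works, but it drags in OS-specific plumbing, which is exactly what the question hopes to avoid.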
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
You can do it. tarfile.open() also accepts a fileobj keyword argument.
You can make your own class to send the data, uncompressed or compressed, over the network.

Please read this: https://docs.python.org/3/library/tarfile.html
I used the alternative mode, e.g. 'w|bz2', which is a stream writer.
No random seeking happens in this mode.
You just need to implement write and tell:

import tarfile

class Sender:
    def sendall(self, data):
        print(data)  # placeholder: send the bytes over the network here

class FileObj:
    """Minimal write-only file object for tarfile's stream mode."""
    def __init__(self, sender):
        self.pos = 0
        self.sender = sender

    def tell(self):
        return self.pos

    def write(self, data):
        self.pos += len(data)
        self.sender.sendall(data)

tar = tarfile.open(fileobj=FileObj(Sender()), mode='w|bz2')
Then use tar.add('file').

Don't forget to close the file, otherwise the archive is incomplete.
tar.close() writes the last missing bytes.

To be safer, use a context manager. The TarFile supports it.
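Put together, a self-contained round trip looks like this (an in-memory stand-in collects the bytes a real sender would put on the network, and the compressed stream is read back to show it is a valid tarball):

```python
import io
import tarfile

class BufferSender:
    """Stand-in for a network sender; collects the chunks in memory."""
    def __init__(self):
        self.buf = io.BytesIO()

    def sendall(self, data):
        self.buf.write(data)

class FileObj:
    """Minimal write-only file object for tarfile's stream mode."""
    def __init__(self, sender):
        self.pos = 0
        self.sender = sender

    def tell(self):
        return self.pos

    def write(self, data):
        self.pos += len(data)
        self.sender.sendall(data)

sender = BufferSender()
with tarfile.open(fileobj=FileObj(sender), mode='w|bz2') as tar:
    # add a member from memory instead of disk, to keep the demo self-contained
    payload = b'hello stream'
    info = tarfile.TarInfo('hello.txt')
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# verify: the collected bz2 byte stream is a readable tarball
with tarfile.open(fileobj=io.BytesIO(sender.buf.getvalue()), mode='r|bz2') as tar:
    names = [member.name for member in tar]
```

The context manager guarantees close() runs, so the trailing blocks of the archive are always flushed through the sender.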

You need to modify the Sender class to upload the data to S3.
I don't have knowledge of S3.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#3
S3 only supports sequential writing; it doesn't even support appending. the only form of network resumption it supports is a kind of weird multipart upload, where you have to follow some odd part-sizing rules (i may try this for extremely large tarballs). that's OK for me, since all i will be doing is storing whole tarballs right from the beginning.

that, and reading them, either to get the headers or to restore files. yeah, it's an archival backup system. i am storing files as tarballs so i can keep file metadata that S3 cannot represent. i currently use S3 as a backup with the file data as the S3 object content; i can see the size and the time the object was stored in S3. symlinks and other special files aren't backed up at all, so i have a separate cron job that scans the filesystem for them and makes a tarball, which ends up being backed up like any other file. this new scheme will also let me store mod time, change time, owner, and group. or i might end up storing my own archival format instead of tar. either way, one part of the scheme is to save replaced or deleted data in a dated subarchive, giving a reverse incremental backup that can be trimmed as desired or restored from any backup date. S3 can't rename objects, but it can copy them to a new object damn fast (done at the storage system, which is probably a bunch of big SANs).
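One way to satisfy the multipart part-sizing rules is to buffer the tar stream into fixed-size parts inside the sender. A sketch, under the assumption that the real version would call boto3's create_multipart_upload / upload_part / complete_multipart_upload (those method names are real boto3 API, but the wiring here is illustrative; upload_part below just records part sizes):

```python
# S3 requires every multipart part except the last to be at least 5 MiB
PART_SIZE = 5 * 1024 * 1024

class MultipartSender:
    """Buffers a sequential byte stream into fixed-size upload parts."""
    def __init__(self, part_size=PART_SIZE):
        self.part_size = part_size
        self.buf = bytearray()
        self.parts = []

    def sendall(self, data):
        self.buf.extend(data)
        # flush every full part as soon as it is complete
        while len(self.buf) >= self.part_size:
            self.upload_part(bytes(self.buf[:self.part_size]))
            del self.buf[:self.part_size]

    def finish(self):
        # the final part is allowed to be smaller than part_size
        if self.buf:
            self.upload_part(bytes(self.buf))
            self.buf.clear()

    def upload_part(self, part):
        # placeholder: a real version would call client.upload_part(...)
        self.parts.append(len(part))

# demo with a tiny part size so the buffering is visible
demo = MultipartSender(part_size=1024)
demo.sendall(b'x' * 2500)
demo.finish()
```

This keeps memory bounded at roughly one part regardless of how many terabytes the tarball grows to.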

