Python Forum
Decompressing bz2 in multiple sub-directories
#21
Hello guys! I have a question about the code that you've helped me with before (for which I am super thankful to you all). I am running the code below on AWS, but sometimes my machine disconnects with a "broken pipe" error (there is no apparent reason for it, but the connection gets lost). How can I adjust the code so that, when I re-run it after a disconnection, it skips the files that have already been extracted and placed into the target folder? Otherwise, as far as I understand, the code starts the process all over again and overwrites all the already extracted files. Since there are 44,000 files to extract, this is very time consuming. Thank you in advance for your help!

import os
import bz2

file_counter = 0
# Walk the source tree and decompress every .json.bz2 file into the target folder
for dirpath, dirnames, files in os.walk('/home/ec2-user/Notebook/Source'):
    for filename in files:
        file_counter += 1
        if filename.endswith('.json.bz2'):
            filepath = os.path.join(dirpath, filename)
            newfilepath = os.path.join('/home/ec2-user/Notebook/Target', "{0}.json".format(file_counter))
            # The buffering argument of BZ2File was ignored in Python 3 and removed in 3.9, so it is omitted
            with open(newfilepath, 'wb') as new_file, bz2.BZ2File(filepath, 'rb') as file:
                # Copy in 100 KiB chunks until read() returns b''
                for data in iter(lambda: file.read(100 * 1024), b''):
                    new_file.write(data)
#22
You have to keep a list of already decompressed files.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#23
May I dare to ask how to do that?
#24
You could keep an ancillary file and, after every successfully processed file, append that bzip2 file's path to it. When the script starts, load this file and check against it while iterating. os.walk traverses in an arbitrary order, so when you restart the script with some files already processed, start your counter from the number of processed files, then increase the counter and extract only for files that are not in the ancillary file.
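
The idea above could be sketched roughly as follows. This is only a minimal sketch, not a tested drop-in replacement: the function names load_done and decompress_tree and the log_path argument are made up for illustration, and it assumes the script is always started with the same source path, so that the recorded paths match on the next run.

```python
import bz2
import os
import shutil


def load_done(log_path):
    """Return the set of source paths recorded in the log file (empty if none)."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as log:
        return {line.rstrip("\n") for line in log if line.strip()}


def decompress_tree(source_dir, target_dir, log_path):
    """Decompress every *.json.bz2 under source_dir, skipping logged files."""
    done = load_done(log_path)
    counter = len(done)  # continue numbering after the files already extracted
    os.makedirs(target_dir, exist_ok=True)
    with open(log_path, "a", encoding="utf-8") as log:
        for dirpath, _dirnames, filenames in os.walk(source_dir):
            for filename in filenames:
                filepath = os.path.join(dirpath, filename)
                if not filename.endswith(".json.bz2") or filepath in done:
                    continue
                counter += 1
                newfilepath = os.path.join(target_dir, "{0}.json".format(counter))
                with bz2.open(filepath, "rb") as src, open(newfilepath, "wb") as dst:
                    shutil.copyfileobj(src, dst, 100 * 1024)
                # Record the file only after it was written completely.
                log.write(filepath + "\n")
                log.flush()
    return counter
```

Unlike the original loop, the counter here only counts the .json.bz2 files that were actually extracted, so a half-written output left behind by a disconnect gets the same number as the next unprocessed file and is simply overwritten on the next run.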

But perhaps the simplest solution would be to remove your initial problem with the connection... I don't know exactly what you use to connect to your instance, but if you use ssh, install tmux or screen on your instance and run your script inside it. With tmux/screen you can detach from your session and log out without stopping your script, or reattach to a running session if you got disconnected. And if you don't use ssh, you should start using it.
#25
After each successful decompression, write the name of the file into another file.
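
One way to make "successful" precise is sketched below, under a few assumptions: the helper names decompress_atomically and mark_done and the ".part" suffix are invented for this example. The output is written under a temporary name and renamed only once it is complete, and the source name is logged only after that.

```python
import bz2
import os
import shutil


def decompress_atomically(filepath, newfilepath):
    """Decompress filepath so newfilepath only ever exists as a complete file."""
    partial = newfilepath + ".part"  # hypothetical suffix for unfinished output
    with bz2.open(filepath, "rb") as src, open(partial, "wb") as dst:
        shutil.copyfileobj(src, dst, 100 * 1024)
    os.replace(partial, newfilepath)  # rename only after the data is fully written


def mark_done(log_path, filepath):
    """Append filepath to the log and push it to disk before returning."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(filepath + "\n")
        log.flush()
        os.fsync(log.fileno())
```

Calling mark_done(log_path, filepath) right after decompress_atomically(filepath, newfilepath) returns means a name only reaches the log when its output file is already complete. If the connection drops in between, the worst case is that one file gets extracted again.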

+1 for the tmux session
#26
Thank you for feedback, guys!

I am using SSH to connect to the EC2 instance. To address the connection issue, I initially switched to Chrome (from Opera). That seemed to fix things for the time being. When it happened again, I added the following to the ssh_config on my client machine:
Host *
ServerAliveInterval 120

I am going to explore tmux now. Great suggestion :)
#27
You don't need to use a browser for ssh at all. Just use Terminal ...
#28
(Apr-13-2017, 05:20 PM)zivoni Wrote: You don't need to use a browser for ssh at all. Just use Terminal ...

Let me clarify. I am using the terminal to connect to EC2. Then I run jupyter notebook and open it in the browser to run the Python code. Please feel free to school me on what could be improved here :) I guess tmux comes into play here, and that is what I am currently looking into.
#29
While jupyter notebook is nice for interactive analysis/testing, it's likely more robust to run your code directly from the command line. Just save your code into a file and run it from the command line with
Output:
python3 script_file.py
And if you run it in a tmux session, you can detach from it, log out and disconnect while it keeps running.
#30
(Apr-13-2017, 05:15 PM)kiton Wrote: Thank you for feedback, guys!

I am using SSH to connect to the EC2 instance. To address the connection issue, I initially switched to Chrome (from Opera). That seemed to fix things for the time being. When it happened again, I added the following to the ssh_config on my client machine:
Host *
ServerAliveInterval 120

I am going to explore tmux now. Great suggestion :)

If you are using Chrome to download the files while you have an SSH connection, maybe you should switch to the rsync command: rsync -c host:/some/path/to/filepattern /some/local/directory/. With this you get:

* very robust download
* compressed transfer if you add the -z option (no need to BZIP)
* avoidance of re-downloading existing local files (based on size/time or checksum, depending on options)
Unless noted otherwise, code in my posts should be understood as "coding suggestions", and its use may require more neurones than the two necessary for Ctrl-C/Ctrl-V.