Apr-13-2017, 02:25 PM
Hello guys! I have a question related to the code you've helped me with before (for which I am super thankful to you all). I am running the code below on AWS, but sometimes my connection drops with a "broken pipe" error (no apparent reason, the connection just gets lost). How can I adjust the code so that when I re-run it after a disconnection, it skips the files that have already been extracted and placed into the target folder? As far as I understand, the code currently starts the process over from the beginning and overwrites all the already-extracted files. Since there are 44,000 files to extract, that is very time consuming. Thanks in advance for your help!
import os
import sys
import bz2
from bz2 import decompress

file_counter = 0
for dirpath, dirname, files in os.walk('/home/ec2-user/Notebook/Source'):
    for filename in files:
        file_counter += 1
        if filename.endswith('.json.bz2'):
            filepath = os.path.join(dirpath, filename)
            newfilepath = os.path.join('/home/ec2-user/Notebook/Target',
                                       "{0}.json".format(file_counter))
            with open(newfilepath, 'wb') as new_file, bz2.BZ2File(filepath, 'rb', 10000000) as file:
                for data in iter(lambda: file.read(100 * 1024), b''):
                    new_file.write(data)
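For reference, here is one way this could be made resumable, as a sketch rather than a tested solution. The root of the problem is the running counter: it increments for every file os.walk sees (including non-.bz2 ones), and the walk order is not guaranteed to be identical between runs, so output names cannot be matched back to their sources. If each output file is instead named after its source file, an os.path.exists() check can skip work already done. This assumes the same Source/Target paths as above and that source filenames are unique across subdirectories:

import os
import bz2

source_dir = '/home/ec2-user/Notebook/Source'
target_dir = '/home/ec2-user/Notebook/Target'

for dirpath, dirnames, files in os.walk(source_dir):
    for filename in files:
        if not filename.endswith('.json.bz2'):
            continue
        filepath = os.path.join(dirpath, filename)
        # Derive the output name from the input name ("foo.json.bz2" -> "foo.json")
        # so the source-to-target mapping is stable across runs, unlike a counter.
        newfilepath = os.path.join(target_dir, filename[:-len('.bz2')])
        # Skip anything already extracted on a previous run.
        if os.path.exists(newfilepath):
            continue
        with bz2.BZ2File(filepath, 'rb') as src, open(newfilepath, 'wb') as dst:
            for chunk in iter(lambda: src.read(100 * 1024), b''):
                dst.write(chunk)

One caveat: if a run dies in the middle of writing a file, the half-written output would be skipped on the next run as if it were complete. A common way around that is to write to a temporary name first and only os.rename() it to the final name after the copy finishes, so an existing file always means a finished file.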