Python Forum
Testing Zipfiles
#1
Over the holidays my main data drive crashed, requiring a deep data scan, which resulted in hundreds of thousands of files with extensions but no names.

I am trying to devise a utility to separate out the .epub and .java files (for now, more later), and it does appear to work, BUT it will crash even with exception handling (as I understand it) if it comes across a 'BAD' zipfile.

The easiest solution at this point is to first separate out ALL the zipfiles into GOOD and BAD categories.
The simple code below works fine for GOOD files, but halts after a BAD one is processed (moved).
Simply rerunning continues the parsing until the next BAD one. Ad infinitum.

I want it to keep looping until the entire directory is parsed, and either ignore or move the BAD zips.

#!python3

import os
import shutil
import zipfile
import glob
from zipfile import ZipFile
from zipfile import BadZipfile 


# location = 'c:/4/'
location = './'



for file in os.listdir(location):
   if file.endswith(".ZIP"):
       print(file)
       myfile = file
       try:
           with ZipFile(myfile, 'r') as zipObj:
               print("zipfile is OK")
       except BadZipfile:
           print("Does not work ")
           dirPath = 'BAD'
           if not os.path.isdir(dirPath): 
               print('The directory is not present. Creating a new one..')
               os.mkdir(dirPath)
           print("Myfile is", myfile ) 
           original = myfile
           target = 'BAD\\'
           shutil.move(original,target)
           os.remove(myfile)
           break
    
       print("Good Zip ")
       dirPath = 'GOOD'
       if not os.path.isdir(dirPath): 
           print('The directory is not present. Creating a new one..')
           os.mkdir(dirPath)
       print("Myfile is", myfile ) 
       original = myfile
       target = 'GOOD\\'
       ZipFile.close(zipObj)
       shutil.move(original,target)
Forgive the crudity of the code. I am still a n00b at Python and want to get the process working before I optimize.

Any ideas here?
#2
On line 34 you have break. Once you enter the except block, you will break out of the loop. You don't need the break.
You need to put the rest of the code in the else part.

i.e.
try:
    # code that may cause error
except BadZipfile:
    # handle the bad zip file
    # lines 24-33
else:
    # code for case when no exception occur
    # lines 36-45
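
A minimal runnable sketch of that try/except/else shape, applied to the sorting task from post #1. The function name sort_zips and the case-insensitive ".zip" match are assumptions made for illustration, not part of the original script. It uses only os, shutil, and zipfile from the standard library.

```python
import os
import shutil
import zipfile


def sort_zips(location="."):
    """Move each .zip file in `location` into a GOOD or BAD subdirectory.

    Returns a dict mapping file name to the bucket it was moved into.
    """
    results = {}
    for name in os.listdir(location):
        path = os.path.join(location, name)
        # Assumption: match the extension case-insensitively.
        if not (os.path.isfile(path) and name.lower().endswith(".zip")):
            continue
        try:
            # Opening the archive parses its central directory.
            with zipfile.ZipFile(path, "r"):
                pass
        except zipfile.BadZipFile:
            # No break here, so the loop carries on with the next file.
            bucket = "BAD"
        else:
            # Runs only when the try block raised nothing.
            bucket = "GOOD"
        target_dir = os.path.join(location, bucket)
        os.makedirs(target_dir, exist_ok=True)
        # The archive is already closed, so the move is safe on Windows too.
        shutil.move(path, os.path.join(target_dir, name))
        results[name] = bucket
    return results
```

Calling sort_zips("./") should process the whole directory in one run, since nothing in the except branch leaves the loop.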
If you can't explain it to a six year old, you don't understand it yourself, Albert Einstein
How to Ask Questions The Smart Way: link and another link
Create MCV example
Debug small programs

#3
Many thanks! Works like a charm:


#!python3

# This is a zip test script to test and categorize a single directory of zip files into /GOOD and /BAD subdirectories.
# It uses the Python zipfile module, and is subject to its abilities and limitations.
# More rigorous testing of /GOOD may be needed with zip / pkzip / 7zip test routines.


import os
import shutil
import zipfile
import glob
from zipfile import ZipFile
from zipfile import BadZipfile 


# location = 'c:/4/'
location = './'



for file in os.listdir(location):
   if file.endswith(".ZIP"):
       print(file)
       myfile = file
       try:
           with ZipFile(myfile, 'r') as zipObj:
               print("zipfile is OK")
       except BadZipfile:
           print("Does not work ")
           dirPath = 'BAD'
           if not os.path.isdir(dirPath): 
               print('The directory is not present. Creating a new one..')
               os.mkdir(dirPath)
           print("Myfile is", myfile ) 
           original = myfile
           target = 'BAD\\'
           shutil.copy(original,target)
           os.remove(myfile)
       else:
           print("Good Zip ")
           dirPath = 'GOOD'
           if not os.path.isdir(dirPath): 
               print('The directory is not present. Creating a new one..')
               os.mkdir(dirPath)
           print("Myfile is", myfile ) 
           original = myfile
           target = 'GOOD\\'
           ZipFile.close(zipObj)
           shutil.copy(original,target)
           os.remove(myfile)
    
(Code is in python brackets)

However, while it is now functionally correct (though crude), a problem has popped up with the zipfile module: false negatives. Some 'Good' files are actually extensively damaged, so badly that Multiarc.dll cannot view them, even though it can easily see the file structure of the 'Bad' files.

The undetected are indeed zipfiles as 7zip can unpack at least a small percentage of their original contents. But they are highly corrupt!

Is there a 'better' zipfile testing module?
#4
Have you considered using a module such as filetype to guess the type of a file without the extension?
#5
did you check https://docs.python.org/3/library/zipfil...le.testzip

also you can try https://docs.python.org/3/library/zipfil...is_zipfile and not depend on file extension.
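
As a rough sketch of how those two calls could be combined: zipfile.is_zipfile() looks at the file's contents rather than its name, and ZipFile.testzip() reads every member and checks its CRC, returning the name of the first bad member or None. Simply opening a ZipFile only parses the central directory, which is likely why damaged members slip through. The helper name check_zip and its status strings are hypothetical, and the extra exception types are caught on the assumption that some damage surfaces as something other than BadZipFile.

```python
import zipfile
import zlib


def check_zip(path):
    """Classify `path` as 'not-zip', 'unreadable', 'corrupt-member' or 'ok'."""
    # is_zipfile() inspects the file itself, so the extension is irrelevant.
    if not zipfile.is_zipfile(path):
        return "not-zip"
    try:
        with zipfile.ZipFile(path, "r") as archive:
            # Reads all member data; returns the first bad name, or None.
            first_bad = archive.testzip()
    except (zipfile.BadZipFile, zlib.error, EOFError, OSError):
        # Damage severe enough that the archive could not be read through.
        return "unreadable"
    return "ok" if first_bad is None else "corrupt-member"
```

Files reported as 'corrupt-member' or 'unreadable' could then be moved to BAD alongside those that fail to open at all.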
#6
The python-magic library might be of some use as well. I also found fleep.
#7
Yes, it would be good to consider trying another library for checking zip files. A docx file, for example, is a zip file too.
"As they say in Mexico 'dosvidaniya'. That makes two vidaniyas."
https://freedns.afraid.org
#8
Filetype indeed looks interesting, but as it might just parse the first few bytes for specific filetype 'tags' (^mz=exe, ^PDF=pdf, ^PK=Zip etc.), it might not be able to detect actual damage to the file. Fleep looks especially good for detecting and renaming the large number of GZ files mixed in with other archive types.

But it is smart enough to separate (hopefully) zips from epubs which will make it useful in my armory of tools for restoring from partition damage.
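
For illustration, here is a standard-library sketch of the signature-sniffing idea, not how filetype or fleep actually work internally. The names sniff_type and classify and the short signature table are assumptions. The EPUB check relies on the fact that an EPUB container is a zip archive holding a "mimetype" entry whose content is application/epub+zip. Because sniff_type only reads the first few bytes, a truncated or damaged archive still comes back as "zip".

```python
import zipfile
import zlib

# Leading byte signatures for a few common formats (far from exhaustive).
SIGNATURES = [
    (b"PK\x03\x04", "zip"),
    (b"%PDF", "pdf"),
    (b"\x1f\x8b", "gz"),
    (b"MZ", "exe"),
]

EPUB_MIMETYPE = b"application/epub+zip"


def sniff_type(path):
    """Guess a type label from the first bytes of the file, or return None."""
    with open(path, "rb") as handle:
        head = handle.read(8)
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    return None


def classify(path):
    """Like sniff_type(), but report 'epub' for zips with the EPUB mimetype entry."""
    label = sniff_type(path)
    if label != "zip":
        return label
    try:
        with zipfile.ZipFile(path, "r") as archive:
            mimetype = archive.read("mimetype")
    except (zipfile.BadZipFile, KeyError, zlib.error, EOFError, OSError):
        # Damaged archive or no mimetype entry: fall back to plain zip.
        return "zip"
    return "epub" if mimetype.strip() == EPUB_MIMETYPE else "zip"
```

The result could be used to restore an extension, with a separate integrity test deciding whether the file is worth keeping.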

Testzip is a method of the zipfile module's ZipFile class. I will see if it gives any different results from catching BadZipfile.

I will be back to the file 'repair' functions as soon as I reconstitute the system files of the 'lost' partition on a new drive.


Most of the available 'recovery' tools are centered on the M$ Office format filetypes, which are of little concern here. It might well be worth a test, though, to see if they can detect badly damaged files instead of just saying 'it's not a CSV or XLS file'.

The problem with partition recovery tools is that they often recover a large number of damaged files. I am a little bit surprised that there is not more of an ecosystem for this type of utility. Not to restore (there are a lot of utils for that), but to restore from the restore!