Jan-16-2021, 03:58 AM
I have run the following code on a sample MP4 video to extract all the images. It works frame by frame, so many thousands of image files are created, lol.
```python
#!/usr/bin/env python
# Importing all necessary libraries
import cv2
import os

# Read the video from specified path
cam = cv2.VideoCapture("/home/********/Downloads/Python_scripts/test1.mp4")

try:
    # creating a folder named data
    if not os.path.exists('data'):
        os.makedirs('data')
# if not created then report the error
except OSError:
    print('Error: Creating directory of data')

# frame counter
currentframe = 0

while True:
    # reading from frame
    ret, frame = cam.read()

    if ret:
        # if video is still left continue creating images
        name = './data/frame' + str(currentframe) + '.jpg'
        print('Creating...' + name)

        # writing the extracted images
        cv2.imwrite(name, frame)

        # increasing counter so that it will
        # show how many frames are created
        currentframe += 1
    else:
        break

# Release all space and windows once done
cam.release()
cv2.destroyAllWindows()
```

But as I require the text, not the images, I have tested the following on just one file:
```python
#!/usr/bin/env python
import subprocess

subprocess.run(["tesseract", "data/frame998.jpg", "stdout"])
```

and called it as follows
python3 subprocess_test.py
and it works perfectly. Now I want to modify the first block of code so that it does NOT write out the image files but simply converts each frame to text. If the text is the same as the last processed text, it moves on to the next frame; otherwise it writes the text out to a new file. This appears to work okay:
```python
#!/usr/bin/env python
# Importing all necessary libraries
import cv2
import os
#import subprocess
import pytesseract

# Read the video from specified path
cam = cv2.VideoCapture("/home/********/Downloads/Python_scripts/test1.mp4")

# Set the text file contents to null
text_file_old = ""

try:
    # creating a folder named data
    if not os.path.exists('data'):
        os.makedirs('data')
# if not created then report the error
except OSError:
    print('Error: Creating directory of data')

# frame counter
currentframe = 0

while True:
    # reading from frame
    ret, frame = cam.read()

    if ret:
        # if video is still left continue converting frames to text
        text_file_new = pytesseract.image_to_string(frame)

        if text_file_new != text_file_old:
            # write contents to a file
            filename = './data/text' + str(currentframe) + '.txt'
            with open(filename, "w") as file1:
                file1.write(text_file_new)
            text_file_old = text_file_new

        # increasing counter so that it will
        # show how many frames have been processed
        currentframe += 1
    else:
        break

# Release all space and windows once done
cam.release()
cv2.destroyAllWindows()
```

Can the writing of the text file be done differently or more efficiently? The script has been running for only about 5 minutes, yet running fdupes already shows a few duplicate files (by content). No doubt if the video showed frame123, then frame124, and then frame123 again, the same text would be written out twice, and there appears to be little I can do about that in the Python code at execution time. Should I just run fdupes afterwards?

Also, some of the text is nearly the same. This is where there are two frames with the same text content, yet the frame images may be slightly different due to image distortions, etc. Can cv2 or pytesseract clean the image?
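One possible way to catch both the repeated and the nearly identical texts inside the script, rather than with fdupes afterwards, is to remember every text already kept and compare each new OCR result against that list with a similarity ratio. The sketch below is only an illustration under stated assumptions: the names normalize and TextDeduplicator, the 0.9 threshold, and the decision to skip empty OCR results are all hypothetical choices, not part of the scripts above. It uses only the standard library (difflib and re).

```python
import difflib
import re


def normalize(text):
    """Collapse runs of whitespace and trim the ends, so that
    spacing-only differences between OCR results disappear."""
    return re.sub(r"\s+", " ", text).strip()


class TextDeduplicator:
    """Remember every text kept so far and reject repeats or near-repeats."""

    def __init__(self, threshold=0.9):
        # Assumed cut-off for "nearly the same"; tune it on real OCR output.
        self.threshold = threshold
        # Normalized texts that have been accepted so far.
        self.kept = []

    def is_new(self, text):
        norm = normalize(text)
        if not norm:
            # Assumption: frames with no recognised text are not worth saving.
            return False
        for old in self.kept:
            matcher = difflib.SequenceMatcher(None, old, norm, autojunk=False)
            # An exact repeat scores 1.0, so it is rejected here as well.
            if matcher.ratio() >= self.threshold:
                return False
        self.kept.append(norm)
        return True


if __name__ == "__main__":
    # Made-up OCR results standing in for consecutive video frames.
    samples = ["Hello world", "Hello  world\n", "Second slide",
               "Hello world", "Hello w0rld"]
    dedup = TextDeduplicator()
    for index, sample in enumerate(samples):
        if dedup.is_new(sample):
            print(index, normalize(sample))
```

In the loop above, the test `if text_file_new != text_file_old:` could then become `if dedup.is_new(text_file_new):`, with `dedup = TextDeduplicator()` created once before the loop. Comparing against every kept text gets slower as the list grows, which may or may not matter for a video with few distinct screens. On the image side, converting each frame to grayscale with `cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)` before passing it to `pytesseract.image_to_string` might reduce the variation between near-identical frames, though that is untested here.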