Extracting all text from a video

jehoshua · Jan-16-2021, 03:58 AM

I ran the following code on a sample MP4 video to extract all of its images. It works frame by frame, so many thousands of image files are created, lol.

#!/usr/bin/env python

# Import the necessary libraries
import cv2
import os

# Open the video at the specified path
cam = cv2.VideoCapture("/home/********/Downloads/Python_scripts/test1.mp4")

try:
    # Create a folder named data if it does not already exist
    if not os.path.exists('data'):
        os.makedirs('data')
except OSError:
    # Report the failure if the folder could not be created
    print('Error: Creating directory of data')

# Index of the frame currently being processed
currentframe = 0

while True:
    # Read the next frame; ret is False once the video is exhausted
    ret, frame = cam.read()

    if ret:
        # Frames remain, so keep writing images
        name = './data/frame' + str(currentframe) + '.jpg'
        print('Creating...' + name)

        # Write the extracted image to disk
        cv2.imwrite(name, frame)

        # Advance the counter used to name the output files
        currentframe += 1
    else:
        break

# Release the capture and close any OpenCV windows once done
cam.release()
cv2.destroyAllWindows()

But as I need the text, not the images, I tested the following on just one file:

#!/usr/bin/env python

import subprocess
subprocess.run(["tesseract", "data/frame998.jpg",  "stdout"])

and called it as follows

python3 subprocess_test.py

and it works perfectly. Now I wanted to modify the first block of code so that it does NOT write out the image files but simply converts each frame to text. If the text is the same as the last processed text, it moves on to the next frame; otherwise it writes the text out to a file. This appears to work okay:

#!/usr/bin/env python

# Import the necessary libraries
import cv2
import os
import pytesseract

# Open the video at the specified path
cam = cv2.VideoCapture("/home/********/Downloads/Python_scripts/test1.mp4")

# Text from the last frame that was written out (empty to start with)
text_file_old = ""

try:
    # Create a folder named data if it does not already exist
    if not os.path.exists('data'):
        os.makedirs('data')
except OSError:
    # Report the failure if the folder could not be created
    print('Error: Creating directory of data')

# Index of the frame currently being processed
currentframe = 0

while True:
    # Read the next frame; ret is False once the video is exhausted
    ret, frame = cam.read()

    if ret:
        # Run OCR on the frame directly, without saving an image first
        text_file_new = pytesseract.image_to_string(frame)

        # Only write a file when the text differs from the previous one
        if text_file_new != text_file_old:
            filename = './data/text' + str(currentframe) + '.txt'
            with open(filename, "w") as file1:
                file1.write(text_file_new)

            text_file_old = text_file_new

        # Advance the counter used to name the output files
        currentframe += 1
    else:
        break

# Release the capture and close any OpenCV windows once done
cam.release()
cv2.destroyAllWindows()

Can the text file be written differently or more efficiently? The script has been running for only about 5 minutes, yet fdupes already shows a few files with duplicate contents. No doubt if the video shows frame123, then frame124, and then frame123 again, there is little I can do about it in the Python code, since it only compares each frame against the previous one. Should I just run fdupes afterwards?

Also, some of the text is nearly the same. This happens where two frames have the same text content, yet the frame images differ slightly due to image distortions, etc. Can cv2 or pytesseract clean up the image?
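
One way to catch both exact and near-duplicate text inside the script, rather than only comparing against the previous frame, would be to check each OCR result against everything already saved. This is a minimal sketch using only the standard library; the helper names `normalize` and `is_new_text` and the 0.9 similarity threshold are assumptions for illustration and would need tuning on real output.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace so spacing differences are ignored."""
    return " ".join(text.split())


def is_new_text(text, seen, threshold=0.9):
    """Return True and remember the text if it is not close to any earlier text.

    `seen` is a list of normalized texts already written out.
    `threshold` is an assumed cut-off: ratios at or above it count as duplicates.
    """
    cleaned = normalize(text)
    if not cleaned:
        # Frames with no recognisable text are skipped entirely
        return False
    for earlier in seen:
        # ratio() is 1.0 for identical strings and lower as they diverge
        if SequenceMatcher(None, earlier, cleaned).ratio() >= threshold:
            return False
    seen.append(cleaned)
    return True
```

In the script above this would mean creating `seen_texts = []` before the loop and testing `if is_new_text(text_file_new, seen_texts):` instead of comparing with `text_file_old`. Each frame is compared with every saved text, so this gets slower as more distinct texts accumulate.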

wavic · Jan-16-2021, 08:08 AM

Perhaps one more operation on each image could reduce the processing time. The text contrasts with the rest of the image (or it should, if you want to read it at all), and that could be used to our benefit as an intermediate step.
I haven't played with OpenCV, but this is the first thing I found.

I hope it gives you some direction on how to reduce the processing time and lower the errors (duplicates):

https://realpython.com/python-opencv-color-spaces/

I think you can use gamma correction and pass the result to the image segmentation process for even more contrast.
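
As a rough illustration of the gamma-correction idea, here is a minimal sketch. It assumes OpenCV (`cv2`) and NumPy are installed. The helper names `gamma_table` and `preprocess_for_ocr`, the default gamma of 1.5, and the use of Otsu thresholding as the contrast step are all assumptions for illustration, not something tested on the video in question.

```python
def gamma_table(gamma):
    """Build a 256-entry lookup table that maps 8-bit values through a gamma curve."""
    # With this formula, gamma > 1 brightens mid-tones and gamma < 1 darkens them
    return [round(((i / 255.0) ** (1.0 / gamma)) * 255) for i in range(256)]


def preprocess_for_ocr(frame, gamma=1.5):
    """Return a black-and-white version of a BGR frame (hypothetical helper)."""
    # Imported here so gamma_table() can be tried without OpenCV installed
    import cv2
    import numpy as np

    # Drop the colour information; only text/background contrast matters
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

    # Apply the gamma curve through the lookup table
    table = np.array(gamma_table(gamma), dtype=np.uint8)
    adjusted = cv2.LUT(gray, table)

    # Otsu's method chooses the black/white threshold automatically
    _, binary = cv2.threshold(adjusted, 0, 255,
                              cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return binary
```

In the script from the first post, the OCR call would then become `pytesseract.image_to_string(preprocess_for_ocr(frame))`. Whether this actually reduces the near-duplicates depends on the video, so the gamma value is worth experimenting with.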

jehoshua · Nov-14-2021, 09:54 PM

Thanks, I only just saw your reply. :)
