I am trying to develop a strategy for automated classification of pages in a large PDF file, and I have come up with an approach using Python with cv2 and NumPy. Basically, I convert the PDF pages to separate image files and, using these libraries, compare each one against a template for a given document type. If a page matches the template for Doc Type A, it gets that label, and we move on to the next image.
I am trying to gauge the processing time and burden of this solution. If I have roughly 1,000 to 1,500 separate image files to process, and some 20 different templates to compare each one against, is this a feasible strategy? Will it take forever? Once a given page is labeled, it is excluded from further processing (each image file gets only one classification).
Any advice would be most appreciated.