I am trying to develop a strategy for automated classification of pages in a large PDF file, and I have come up with an approach using Python with cv2 and NumPy. Basically, I convert the PDF pages to separate image files and, using these libraries, compare each one against a template for a given document type. If a page matches the template for Doc Type A, it gets that label, and we move on to the next image.
I am trying to gauge the processing time and burden of this solution. If I have roughly 1,000 to 1,500 separate image files to process, and some 20 different templates to compare each one against, is this a feasible strategy? Will it take forever? Once a given page is labeled, it is excluded from further processing (each image file gets only one classification).
Any advice would be most appreciated.