Oct-08-2023, 09:21 PM
Hi,
Sorry for not spotting the error; I didn't expect it there and overlooked it.
Unfortunately I don't know how to deal with a grainy scan. Testing "clean" color bitmaps worked fine.
Perhaps it is possible to count RGB values that are nearly the same as one color. I will think about it tomorrow...
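A minimal sketch of that idea, not a tested solution for real scans: drop the low bits of each channel so that slightly different shades fall into the same bucket, then count the buckets instead of exact colors. The function name `primary_color_ratio_quantized` and the parameter `bits` are assumptions made up for this example.

```python
import numpy as np


def primary_color_ratio_quantized(rgb, bits=4):
    """Return the share of pixels in the most common color bucket.

    rgb  : array of shape (n, 3) with 8-bit channel values
    bits : number of high bits kept per channel (hypothetical parameter);
           bits=4 merges 16 neighbouring shades per channel into one bucket
    """
    rgb = np.asarray(rgb, dtype=np.uint8).reshape(-1, 3)
    if len(rgb) == 0:
        return 0.0
    # Keep only the high bits; widen to uint32 so the packing cannot overflow.
    q = (rgb >> (8 - bits)).astype(np.uint32)
    # Pack the three reduced channels into one integer key per pixel.
    key = (q[:, 0] << (2 * bits)) | (q[:, 1] << bits) | q[:, 2]
    _, counts = np.unique(key, return_counts=True)
    return counts.max() / len(key)


# Synthetic "grainy" page: 90 almost-white pixels with noise, 10 black pixels.
rng = np.random.default_rng(0)
paper = rng.integers(250, 256, size=(90, 3), dtype=np.uint8)
ink = np.zeros((10, 3), dtype=np.uint8)
page = np.vstack([paper, ink])
print(primary_color_ratio_quantized(page, bits=4))
```

One limitation: shades that straddle a bucket border (for example 239 and 240 with bits=4) still end up in different buckets, so a smaller `bits` value may be needed for very noisy scans.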
I tested different values for x and compared them with using every pixel. For some test scans the results (primary color ratio, filled ratio) were nearly the same.
(Would it be better to test a page several times and choose the pixels randomly, to get a more reliable result?)
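The random-sampling question could be tried out roughly like this. It is only a sketch: the function name `sampled_primary_color_ratio` and the parameters `sample_size`, `repeats` and `seed` are assumptions for illustration. It draws several random samples (with replacement) and returns the mean ratio together with the spread between runs, so a large spread would indicate that the sample is too small.

```python
import numpy as np


def sampled_primary_color_ratio(rgb, sample_size=10000, repeats=5, seed=None):
    """Estimate the primary color ratio from several random pixel samples.

    Returns (mean_ratio, std_ratio) over `repeats` independent samples.
    """
    rgb = np.asarray(rgb, dtype=np.uint8).reshape(-1, 3)
    if len(rgb) == 0:
        return 0.0, 0.0
    rng = np.random.default_rng(seed)
    # Pack every pixel into one 24-bit value; widen first to avoid uint8 overflow.
    wide = rgb.astype(np.uint32)
    b24 = (wide[:, 0] << 16) | (wide[:, 1] << 8) | wide[:, 2]
    n = min(sample_size, len(b24))
    ratios = []
    for _ in range(repeats):
        # Random pixel indices, drawn with replacement.
        idx = rng.integers(0, len(b24), size=n)
        _, counts = np.unique(b24[idx], return_counts=True)
        ratios.append(counts.max() / n)
    return float(np.mean(ratios)), float(np.std(ratios))


# Synthetic page: 80 % white background, 20 % black "writing".
page = np.vstack([
    np.full((80000, 3), 255, dtype=np.uint8),
    np.zeros((20000, 3), dtype=np.uint8),
])
mean_ratio, spread = sampled_primary_color_ratio(page, sample_size=5000, seed=0)
print(mean_ratio, spread)
```

Compared with taking every xth pixel, random indices avoid accidentally lining up with regular patterns on the page (ruled lines, table grids), at the cost of a result that varies slightly from run to run unless a seed is fixed.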
import time

import numpy as np
from pdf2image import convert_from_path


def primary_color_ratio(pdf_name, x):
    """Return the ratio of sampled pixels that have the most common ("background") color."""
    # render the PDF at 300 dpi and use the first page
    pages = convert_from_path(pdf_name, 300)
    # convert the PIL image directly (no temporary BMP file needed)
    image_rgb = np.array(pages[0].convert("RGB"))
    # examine just every xth pixel; widen to uint32 so the
    # multiplication below cannot overflow the 8-bit channels
    rgb = image_rgb.reshape(-1, 3)[::x].astype(np.uint32)
    # get 24-bit color
    b24 = rgb[:, 0] * 65536 + rgb[:, 1] * 256 + rgb[:, 2]
    _, counts = np.unique(b24, return_counts=True)
    return max(counts) / max(len(b24), 1)


# examine every xth pixel
x = 100
# "filled" is the content that can be read by humans (for example: writing)
# "userdef_image_filled_ratio_whole_page": the ratio at which an image is considered filled
userdef_image_filled_ratio_whole_page = 0.2
userdef_image_filled_ratio_x = userdef_image_filled_ratio_whole_page / x  # currently unused
pdf_name = "t2.pdf"

startTime = time.time()
primColorRatio = primary_color_ratio(pdf_name, x)
print("primColorRatio = " + str(primColorRatio))
filled_ratio = 1 - primColorRatio
endTime = time.time()
print("filled_ratio = " + str(filled_ratio))
print(endTime - startTime)

if filled_ratio >= userdef_image_filled_ratio_whole_page:
    print("The image is filled.")
else:
    print("The image is empty.")