Count image's colors very fast - Printable Version

Python Forum (https://python-forum.io), General Coding Help
Thread: Count image's colors very fast (/thread-41547.html)

RE: Count image's colors very fast - flash77 - Feb-06-2024

Hi, thanks for your answers. And thanks, Pedroski55, for your code example! I would like to have a more friendly tone - I'm not so experienced and I'm trying to get better...

I thought examining pixels would work for both (text and pictures). You both are more experienced than I am. I don't know whether a solution based on text recognition is reliable, and I don't know how to detect pictures. That is why I thought "messing around with pixels" would be a good idea; I'm just not that experienced...

I will try Pedroski55's example too, just to get more experience and to compare both approaches. Even if my effort turns out to have been in vain, it was good for gaining more programming experience.

Please let me know what you think of my code (if you are still interested). Here is what I did in the meantime:

The function getcolors is limited to 256 colors, otherwise None is returned. Because a scan can contain more than 256 colors, I quantize the image of the scan to 2 colors. Then I use getcolors to count the number of pixels (one color is the "filled" content and the other is the "background" content). I tested it by creating a 10 x 10 pixel BMP with various numbers of "filled" pixels. All counted colors were correct. I will test the speed of counting pixels tomorrow using a DIN A4 sheet.

    from PIL import Image


    def count_colors_2(image, maxcolors=2):
        # reducing colors
        image_reduced = image.quantize(colors=maxcolors, method=None, kmeans=0, palette=None)
        # getcolors returns a list of (count, palette index) tuples
        colors = image_reduced.getcolors(maxcolors)
        count_background_pixel = colors[0][0]
        count_filled_pixel = colors[1][0]
        return count_background_pixel, count_filled_pixel, colors


    image = Image.open("25.bmp")
    count_background_pixel, count_filled_pixel, colors = count_colors_2(image)
    print("count_background_pixel: " + str(count_background_pixel))
    print("count_filled_pixel: " + str(count_filled_pixel))
    print(colors)

RE: Count image's colors very fast - Pedroski55 - Feb-07-2024

Quote: The function getcolors is limited to 256 colors, otherwise None is returned.

No, I think the upper limit is 16 million.

To get images from a PDF: well, of course, many people have wanted to do this, and you are not the first person to try. Look here.

RE: Count image's colors very fast - deanhystad - Feb-07-2024

Quote: The function getcolors is limited to 256 colors, otherwise None is returned.

256 is the default limit. You can set it much higher. But as Pedroski55 pointed out, PDF files are not pictures of pages, they are structured documents. A PDF document knows if a page is blank or not. It doesn't have to perform any image analysis; it just has to look at the information for that page. There is no image analysis algorithm that will be faster than asking the PDF document if a page contains images or text. All the image analysis work was done when the PDF file was created.

Here's a tutorial: https://geekflare.com/extract-text-links-images-from-pdf-using-python/

RE: Count image's colors very fast - Pedroski55 - Feb-08-2024

@deanhystad: Nice, clear link! Thanks!

Apropos colours on computers: 24-bit colour gives 16.7 million colours (2 to the power of 24).

Quote: Nearly all computers and displays over the last five to ten years come standard with support for at least 16-bit color, with newer computers supporting 24-bit and 32-bit color. Is there a difference between the different levels of color? The short answer is yes. All three color bit depths use red, blue and green as standard colors, but it's the number of color combinations and the alpha channel that make the difference. Whether you are viewing pictures, watching a video, or playing a video game, a higher color depth is more visually appealing.

RE: Count image's colors very fast - flash77 - Feb-08-2024

Dear Pedroski55, dear deanhystad!

Thank you for your excellent answers! Unfortunately, it wasn't at all clear to me that PDF documents already "know" whether they contain text, for example, and that you just need to ask. I thought you had to convert the PDF into an image and analyze it. That's why I asked about it in a previous post and worked on a solution together with deanhystad. Only later did I mention that it was actually about PDF files. I would like to apologize for the confusion and the unnecessary work. But some good came of it: I gained more programming experience.

I would also like to apologize to Pedroski55! The only thing that bothered me was the wording: "messing around with pixels".

Thank you for this nice forum with its very helpful members!

@deanhystad: Thanks a lot for the link!! (I will replace the image analysis in my code with the link's content...)

Have a nice evening, Pedroski55 and deanhystad!!

RE: Count image's colors very fast - deanhystad - Feb-08-2024

Quote: Unfortunately, it wasn't at all clear to me that PDF documents already "know" whether they contain text, for example.

It is always a good idea to start each project with research. I enjoy watching other people work, but I love using other people's work.

RE: Count image's colors very fast - kumaransh - Feb-26-2024

It's great that you're exploring ways to count image colors quickly. It seems like you're facing an issue with the "colors_count_list" being NoneType.
This could be due to various reasons, such as an error in the conversion process or an issue with the OpenCV2 library. One suggestion would be to double-check the image conversion process to ensure that the BMP file is being generated correctly from the PDF. Additionally, make sure that the OpenCV2 library is installed correctly and that you're importing it properly in your code. And hey, if you're looking to optimize your images further, you could try using compress jpeg to reduce file sizes without compromising quality.

RE: Count image's colors very fast - flash77 - Mar-05-2024

Dear deanhystad,

I just wanted to let you know that our work on counting the pixels in an image to determine the majority of pixels of the same color (PrimaryColorRatio) was not in vain.

First, I went through the information contained in the link you gave me. Now I can recognize text or images in "clean", non-scanned PDFs. The function text_existing detects text, and image_existing detects images. Here is the code for it:

    import os

    # extract text
    from PyPDF2 import PdfReader

    # extract images
    import fitz


    def text_existing(path):
        obj = os.scandir(path)
        for entry in obj:
            if entry.name.endswith(".pdf"):
                reader = PdfReader(path + entry.name)
                text = reader.pages[0].extract_text()
                # a page without text yields an empty or whitespace-only string
                if text.strip():
                    print(text)
                else:
                    # page without text detected
                    print("page contains no text")
        obj.close()


    def image_existing(path):
        obj = os.scandir(path)
        for entry in obj:
            if entry.name.endswith(".pdf"):
                doc = fitz.open(path + entry.name)
                page = doc.load_page(0)
                image_xref = page.get_images()
                # get_images returns an empty list if the page has no images
                if not image_xref:
                    print("No image detected.")
                else:
                    # get xref value of the image
                    xref_value = image_xref[0][0]
                    if xref_value > 0:
                        print("Image detected.")
        obj.close()


    text_existing("D:/Daten/aktuell/detect_blank_pages/")
    image_existing("D:/Daten/aktuell/detect_blank_pages/")

My original situation was (and still is) that I want to examine PDFs created by scanning A4 paper with software (naps2). With the code above, these PDFs are recognized as containing an image.

Now the efforts with the PrimaryColorRatio come to fruition: I count the pixels of the same color and look at the largest proportion. For me it's all about white A4 paper that has some kind of print or writing on it. I extracted the image of the scanned page from the PDF and binarized it according to instructions, so I only have white and black pixels that can be counted.

    FilledRatio = 1 - PrimaryColorRatio

This allows me to determine whether a scanned page is filled or empty (by comparing the "filled pixels" to a threshold). I completely discarded my previous idea of quantization - I was able to achieve much better results with binarization. I tried to binarize the image myself with numpy.where (but it didn't work :-))

Now I can detect blank pages within scanned pages...

The next step will be to incorporate this function into my program with multiprocessing. In the past I have already received help from you with the "Hangerfinder", which I use to digitize Super8 films. Multiprocessing was also used there. I'll follow that approach.

Despite the misunderstanding, there was a sensible solution to my problem with scanned pages; it wasn't for nothing... And I was able to learn more things related to Python.

Thank you for your patient, clever support!!

    from PIL import Image
    from pathlib import WindowsPath
    import os
    import numpy as np
    from io import BytesIO

    # extract images from pdf
    import fitz


    def examine_pdf_from_scanned_paper(path):
        # is image from pdf filled, or is it empty?
        # a FilledRatio < filledThreshold is recognized as empty
        filledThreshold = 0.001
        obj = os.scandir(path)
        for entry in obj:
            if entry.name.endswith(".pdf"):
                # extract image from single page pdf
                doc = fitz.open(path + "/" + str(entry.name))
                page = doc.load_page(0)
                image_xref = page.get_images()
                # get xref value of the image
                xref_value = image_xref[0][0]
                img_dictionary = doc.extract_image(xref_value)
                # get the actual image binary data
                img_binary = img_dictionary["image"]
                # create a BytesIO object to work with the image bytes
                image_io = BytesIO(img_binary)
                # this image was in the pdf
                # open the image using Pillow
                image = Image.open(image_io)
                # grayscale image
                im_gray = image.convert("L")
                # create numpy array
                im_gray = np.array(im_gray)
                # gray values above threshold shall be white, under threshold black
                maxval = 255
                threshold = 200
                im_bin = (im_gray > threshold) * maxval
                image_bin = Image.fromarray(np.uint8(im_bin))
                image_bin.save("binarized.png")
                # count the pixels of the binarized image directly; it has a
                # single channel, so no reshape to RGB triples is needed
                _, counts = np.unique(im_bin, return_counts=True)
                PrimaryColorRatio = counts.max() / max(im_bin.size, 1)
                FilledRatio = 1 - PrimaryColorRatio
                print("PrimaryColorRatio = ", PrimaryColorRatio)
                print("FilledRatio = ", FilledRatio)
                if FilledRatio > filledThreshold:
                    print("image filled")
                else:
                    print("image empty")
        obj.close()


    examine_pdf_from_scanned_paper("D:/Daten/aktuell/detect_blank_pages_gut/")

RE: Count image's colors very fast - deanhystad - Mar-05-2024

You should look at pathlib as a replacement for os.scandir. Instead of this:

    import os

    def text_existing(path):
        obj = os.scandir(path)
        for entry in obj:
            if entry.name.endswith(".pdf"):

You could do this:

    from pathlib import Path

    def text_existing(path):
        for entry in Path(path).glob("*.pdf"):
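
To round off the pathlib suggestion above, here is a minimal sketch of how the directory loop could look as a complete, runnable piece. The helper name find_pdfs and the file names in the demonstration are hypothetical and not from the thread; the sketch only lists the PDF files and leaves the text or image checks to the functions posted earlier.

```python
from pathlib import Path
import tempfile


def find_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by name."""
    # Path.glob yields Path objects, so the folder and file name
    # never have to be joined by string concatenation.
    return sorted(Path(folder).glob("*.pdf"))


if __name__ == "__main__":
    # Demonstration in a throwaway directory (hypothetical file names).
    with tempfile.TemporaryDirectory() as tmp:
        for name in ("page1.pdf", "page2.pdf", "notes.txt"):
            (Path(tmp) / name).touch()
        for pdf in find_pdfs(tmp):
            print(pdf.name)
```

Each returned entry is a full path, so it should be possible to pass it straight to PdfReader or fitz.open without building the path by hand, and there is no scandir handle left to close.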