identify not white pixels in bmp

flash77 · Oct-05-2023, 07:52 PM

Dear deanhystad,

thanks again for your outstanding answer!!

I was thinking about speeding up the examination with multiprocessing (as in our hangerfinder project) and also by examining only every xth pixel...

I won't have time to experiment with this before the weekend...

I wish you a pleasant evening.

Thanks a lot!!

flash77 · Oct-07-2023, 04:58 PM

Dear deanhystad, thank you very much for your help!!
Here is what I made from your information.
I scanned a DIN A4 page (with part of the "hangerfinder" code) and am now a bit surprised by the values.
Am I making a mistake in line 29?
Unfortunately my scanned PDF is too large to attach.
Best regards,
flash77

import numpy as np
from pdf2image import convert_from_path
from PIL import Image
import time


def primary_color_ratio(pdf_name):
    """Return ratio of pixels that are the "background" color."""
    pages = convert_from_path(pdf_name, 300)
    pages[0].save("1.bmp", "BMP")
    image_rgb = Image.open("1.bmp").convert('RGB')
    #examine just every xth pixel
    rgb = np.array(image_rgb).reshape(-1, 3)[::x]
    #get 24Bit Color
    b24 = rgb[:, 0] * 65535 + rgb[:, 1] * 256 + rgb[:, 2]
    _, counts = np.unique(b24, return_counts=True)
    return max(counts) / max(len(b24), 1)

#examine every xth pixel
x = 100
#"filled" is the content that can be read by humans (for example: writing)
#"userdef_image_filled_ratio_whole_page": the ratio at which an image is considered filled
userdef_image_filled_ratio_whole_page = 0.2
userdef_image_filled_ratio_x = userdef_image_filled_ratio_whole_page / x
pdf_name = "t.pdf"
startTime = time.time()
primColorRatio = primary_color_ratio(pdf_name)
print("primColorRatio = " + str(primColorRatio))
filled_ratio = 1 - primColorRatio
endTime = time.time()
print("filled_ratio = " + str(filled_ratio))
print(endTime - startTime)
if filled_ratio >= userdef_image_filled_ratio_whole_page:
    print("The image is filled.")
else:
    print("The image is empty.")

**deanhystad** · Oct-07-2023, 07:01 PM

What are the surprising values?

I scanned a PDF that has a picture and a bunch of text. White makes up 46% of the pixels.

flash77 · Oct-07-2023, 07:48 PM

Hi,
primColorRatio = 0.08355574985044406
filled_ratio = 0.916444250149556

I scanned the attached text...

**deanhystad** · Oct-08-2023, 03:33 AM

You failed the test. I intentionally left an obvious error. Cannot believe you didn't spot it.

b24 = rgb[:, 0] * 65535 + rgb[:, 1] * 256 + rgb[:, 2]

should be
b24 = rgb[:, 0] * 65536 + rgb[:, 1] * 256 + rgb[:, 2]
I don't know how much difference that will make, but it really confused me for a while when I made a test image that was all red, but the primary color wasn't red.
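To see why the factor matters, here is a minimal sketch. It packs two different colors with both factors. `pack_rgb` and its `red_factor` parameter are names made up for this illustration and are not part of the posted code. The cast to a wider integer type is an extra precaution, so the arithmetic does not depend on how NumPy promotes 8-bit values.

```python
import numpy as np


def pack_rgb(rgb, red_factor=65536):
    """Pack rows of (R, G, B) into one integer per pixel (hypothetical helper)."""
    # Widen first so the multiplication cannot overflow 8-bit channel values.
    rgb = np.asarray(rgb, dtype=np.uint32)
    return rgb[:, 0] * red_factor + rgb[:, 1] * 256 + rgb[:, 2]


# Two clearly different colors: a very dark red and pure cyan.
colors = np.array([[1, 0, 0], [0, 255, 255]], dtype=np.uint8)

wrong = pack_rgb(colors, red_factor=65535)
right = pack_rgb(colors, red_factor=65536)

print(wrong)  # both colors collapse to the same value, 65535
print(right)  # the two colors stay distinct
```

With 65535 the green and blue channels can reach into the range used by red, so different colors may be counted as one. With 65536 (2 to the power of 16) each channel gets its own 8 bits.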

The text file isn't very useful for testing. It is not an image or a pdf, and if I use the text to create an image or a pdf it will probably not be the same image or pdf that you created. Fonts have a lot of different colors. That is how they can have smooth looking edges. I made an image of a PDF page that was all text, and it had thousands of colors even though it was only black text on a white background. Black was the primary color, eking out white by 2%, but together they only made up 60% of the colors.

flash77 · Oct-08-2023, 09:21 PM

Hi,

sorry for not spotting the error, I didn't expect it there and overlooked it.

Unfortunately I don't know how to deal with a grainy scan. Testing "clean" color bitmaps worked fine.
Perhaps it is possible to count rgb values which are nearly the same. I will try to think about it tomorrow...
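One possible way to count RGB values that are nearly the same, sketched under assumptions: treat every pixel whose channels all lie within a tolerance of the most frequent exact color as part of the primary color. The function name `near_primary_ratio` and the default `tolerance=30` are hypothetical choices for illustration and would need tuning on real scans.

```python
import numpy as np


def near_primary_ratio(rgb, tolerance=30):
    """Ratio of pixels within `tolerance` (per channel) of the most common color."""
    # Signed integers so that differences between channels cannot wrap around.
    rgb = np.asarray(rgb, dtype=np.int32).reshape(-1, 3)
    if len(rgb) == 0:
        return 0.0
    # Find the most frequent exact color.
    colors, counts = np.unique(rgb, axis=0, return_counts=True)
    primary = colors[np.argmax(counts)]
    # A pixel is "close" if its largest channel difference is within tolerance.
    close = np.abs(rgb - primary).max(axis=1) <= tolerance
    return float(close.mean())


# Synthetic "grainy" page: 50 pure white, 40 slightly off-white, 10 black pixels.
page = np.array(
    [[255, 255, 255]] * 50 + [[250, 252, 249]] * 40 + [[0, 0, 0]] * 10,
    dtype=np.uint8,
)
near_ratio = near_primary_ratio(page)
print(near_ratio)  # 0.9 here, while exact matching would only find 0.5
```

The tolerance decides how much scanner noise is folded into the background, so a value that is too large would also swallow faint writing.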

For some test scans I compared different values of x against using every pixel: the results (primary color ratio, filled ratio) were nearly the same.

(Would it be good to test a page several times and choose the pixels randomly, to get a better result?)
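A rough sketch of the random-sampling idea, not a tested recommendation: draw a random subset of pixels, compute the primary color ratio on it, and repeat with different seeds to see how much the estimate moves. `sampled_primary_ratio`, `sample_size` and `seed` are names invented for this example, and the page is synthetic (80% white, 20% black) rather than a real scan.

```python
import numpy as np


def sampled_primary_ratio(rgb, sample_size, seed=None):
    """Estimate the primary color ratio from randomly chosen pixels."""
    rgb = np.asarray(rgb).reshape(-1, 3)
    if len(rgb) == 0:
        return 0.0
    rng = np.random.default_rng(seed)
    n = min(sample_size, len(rgb))
    # Pick n distinct pixel positions at random.
    idx = rng.choice(len(rgb), size=n, replace=False)
    _, counts = np.unique(rgb[idx], axis=0, return_counts=True)
    return float(counts.max() / n)


# Synthetic page: 8,000 white pixels followed by 2,000 black pixels.
page = np.zeros((10_000, 3), dtype=np.uint8)
page[:8_000] = 255

# Repeat the estimate with different seeds to see the spread.
estimates = [sampled_primary_ratio(page, 2_000, seed=s) for s in range(5)]
print(estimates)
```

If the estimates agree closely, one pass is probably enough. Random positions also avoid the case where a fixed step of x happens to line up with a regular pattern on the page.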

import numpy as np
from pdf2image import convert_from_path
from PIL import Image
import time


def primary_color_ratio(pdf_name):
    """Return ratio of pixels that are the "background" color."""
    pages = convert_from_path(pdf_name, 300)
    pages[0].save("bmp.bmp", "BMP")  # save() returns None, so there is nothing to assign
    image_rgb = np.array(Image.open("bmp.bmp").convert('RGB'))
    #examine just every xth pixel
    rgb = np.array(image_rgb).reshape(-1, 3)[::x]
    #get 24Bit Color
    b24 = rgb[:, 0] * 65536 + rgb[:, 1] * 256 + rgb[:, 2]
    _, counts = np.unique(b24, return_counts=True)
    return max(counts) / max(len(b24), 1)

#examine every xth pixel
x = 100
#"filled" is the content that can be read by humans (for example: writing)
#"userdef_image_filled_ratio_whole_page": the ratio at which an image is considered filled
userdef_image_filled_ratio_whole_page = 0.2
userdef_image_filled_ratio_x = userdef_image_filled_ratio_whole_page / x
pdf_name = "t2.pdf"
startTime = time.time()
primColorRatio = primary_color_ratio(pdf_name)
print("primColorRatio = " + str(primColorRatio))
filled_ratio = 1 - primColorRatio
endTime = time.time()
print("filled_ratio = " + str(filled_ratio))
print(endTime - startTime)
if filled_ratio >= userdef_image_filled_ratio_whole_page:
    print("The image is filled.")
else:
    print("The image is empty.")

flash77 · (This post was last modified: Oct-28-2023, 11:24 AM by flash77.)

Dear deanhystad,

I was a bit confused and tried a lot, because the detection of the right primary color ratio wasn't working properly...
(And, among other things, I tried to use Counter instead of numpy.unique - but I wasn't able to get it working.)
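For reference, a minimal sketch of how `collections.Counter` could stand in for `numpy.unique` here. One assumption about what went wrong: rows of a NumPy array are not hashable, so each pixel has to be turned into a tuple before counting. The function name `primary_color_ratio_counter` is made up for this example.

```python
from collections import Counter


def primary_color_ratio_counter(pixels):
    """Ratio of the most common color, counted with Counter instead of numpy.unique."""
    # Counter needs hashable items, so convert every pixel to a tuple.
    pixels = [tuple(p) for p in pixels]
    if not pixels:
        return 0.0
    _, most_common_count = Counter(pixels).most_common(1)[0]
    return most_common_count / len(pixels)


# 98 white pixels and 2 colored pixels, given as plain lists.
pixels = [[255, 255, 255]] * 98 + [[255, 0, 0], [0, 0, 255]]
counter_ratio = primary_color_ratio_counter(pixels)
print(counter_ratio)  # 0.98
```

For a NumPy array shaped (N, 3), passing `rgb.tolist()` gives rows that convert cleanly. On a full 300 dpi page this is likely to be much slower than `numpy.unique`.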

I'm sorry for my slightly confused posting...

What do you think of my new idea to use:

# reducing colors
    image = img.quantize(colors=2, method=None, kmeans=0, palette=None)

to reduce the colors?

My idea is to use it like a filter to get the right primary color ratio.

Testing the code with a "clean" white 10 x 10 pixel bmp-file, which contains 2 colored pixels:
The result is that primColorRatio is 0.98.
filled_ratio is about 0.02.
And the image is considered as filled.
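The clean-bitmap test can also be set up in memory, as a small sketch. It assumes that counting the palette indices of the quantized image is equivalent to converting back to RGB and packing the colors, which should hold because a 2-color palette image stores one index per pixel. `two_color_primary_ratio` is a hypothetical name, and the test image is generated here instead of being read from a file.

```python
import numpy as np
from PIL import Image


def two_color_primary_ratio(img):
    """Quantize to 2 colors and return the share of the more common one."""
    # quantize() returns a palette ("P") image with at most 2 colors.
    reduced = img.convert("RGB").quantize(colors=2)
    # For a palette image the array holds one palette index per pixel.
    indices = np.asarray(reduced).ravel()
    if indices.size == 0:
        return 0.0
    counts = np.bincount(indices, minlength=2)
    return float(counts.max() / indices.size)


# White 10 x 10 image with 2 dark pixels.
page = Image.new("RGB", (10, 10), "white")
page.putpixel((2, 3), (0, 0, 0))
page.putpixel((7, 5), (0, 0, 0))

ratio = two_color_primary_ratio(page)
print("primColorRatio =", ratio)
print("filled_ratio =", 1 - ratio)
```

Working on the indices skips the RGB conversion and the 24-bit packing step. How the two palette colors are chosen depends on the quantization method, so results on grainy scans may differ from this clean case.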

Then I tested the code with a scanned white sheet of paper, which contained only black writing.
The result is that primColorRatio is 0.912429823754084.
filled_ratio is about 0.08757017624591601.
And the image is considered as filled.

Is this a suitable way to perform primaryColorRatio determination?
I'm pretty confident...

If that is the case, then the next step I will take is to examine every xth pixel.

Best regards,

flash77

import numpy as np
from pdf2image import convert_from_path
from PIL import Image
import time


def primary_color_ratio(pdf_name):
    """Return ratio of pixels that are the "background" color."""
    pages = convert_from_path(pdf_name, 300)
    # save pdf to bmp
    pages[0].save("bmpImage.bmp", "BMP")  # save() returns None, so there is nothing to assign
    # open Image
    img = Image.open(r"bmpImage.bmp")
    # reducing colors
    image = img.quantize(colors=2, method=None, kmeans=0, palette=None)
    # convert to rgb
    image_rgb = image.convert('RGB')
    # I will do it later: examine just every xth pixel
    rgb = np.array(image_rgb).reshape(-1, 3)#[::x]
    #get 24Bit Color
    b24 = rgb[:, 0] * 65536 + rgb[:, 1] * 256 + rgb[:, 2]
    _, counts = np.unique(b24, return_counts=True)
    return max(counts) / max(len(b24), 1)


#"filled" is the content that can be read by humans (for example: writing)
#"userdef_image_filled_ratio_whole_page": the ratio at which an image is considered filled
pdf_name = "t2.pdf"
userdef_image_filled_ratio_whole_page = 0.02
startTime = time.time()
primColorRatio = primary_color_ratio(pdf_name)
print("primColorRatio = " + str(primColorRatio))
filled_ratio = 1 - primColorRatio
endTime = time.time()
print("filled_ratio = " + str(filled_ratio))
print("elapsed time: ", (endTime - startTime))
if filled_ratio >= userdef_image_filled_ratio_whole_page:
    print("The image is filled.")
else:
    print("The image is empty.")

flash77 · (This post was last modified: Nov-11-2023, 04:30 PM by flash77.)

Dear community,

using quantization (reducing to 2 colors) could perhaps be a way to go (but I'm not sure whether it is the right way).

I had to switch some lines; this should be better:

"filled_ratio" describes the ratio of writing, images... to the whole image.

A scanned white paper with black writing had the following output:

primColorRatio = 0.912429823754084
filled_ratio = 0.08757017624591601
elapsed time: 4.87112021446228
The image is filled.

A scanned image is filled approximately two thirds with green and one third with white.
In this case green is the primary color.
Output:

primColorRatio = 0.7054545454545454
filled_ratio = 0.29454545454545455
elapsed time: 4.151195764541626

import numpy as np
from pdf2image import convert_from_path
from PIL import Image
import time


def primary_color_ratio(pdf_name):
    """Return ratio of pixels that are the "background" color."""
    pages = convert_from_path(pdf_name, 300)
    # save pdf to bmp
    pages[0].save("bmpImage.bmp", "BMP")  # save() returns None, so there is nothing to assign
    # open Image
    img = Image.open(r"bmpImage.bmp")
    # reducing colors
    image_reduced = img.quantize(colors=2, method=None, kmeans=0, palette=None)
    image_reduced.show()
    # convert to rgb
    image_rgb = image_reduced.convert('RGB')
    # I will do it later: examine just every xth pixel
    rgb = np.array(image_rgb).reshape(-1, 3)  # [::x]
    # get 24Bit Color
    b24 = rgb[:, 0] * 65536 + rgb[:, 1] * 256 + rgb[:, 2]
    _, numberOfColors = np.unique(b24, return_counts=True)
    return max(numberOfColors) / max(len(b24), 1)


# "filled" is the content that can be read by humans (for example: writing)
# "userdef_image_filled_ratio_whole_page": the ratio at which an image is considered filled
pdf_name = "grün.pdf"
userdef_image_filled_ratio_whole_page = 0.02
startTime = time.time()
primColorRatio = primary_color_ratio(pdf_name)
print("primColorRatio = " + str(primColorRatio))
filled_ratio = 1 - primColorRatio
endTime = time.time()
print("filled_ratio = " + str(filled_ratio))
print("elapsed time: ", (endTime - startTime))
if filled_ratio >= userdef_image_filled_ratio_whole_page:
    print("The image is filled.")
else:
    print("The image is empty.")

I would be happy to receive feedback...

Greetings, flash77
