Python Forum

Dear community,

I found some code on Stack Overflow that is supposed to count an image's colors very quickly.
https://stackoverflow.com/questions/7139...n-an-image

I'm trying to get the second function working:

def count_colors_2(cv_img: np.array) -> list: # no need to give colors

The situation is the following:

I have a PDF file ("t2.pdf"), which I convert to the BMP file "bmpImage.bmp" (line 23).

Then I open the image with OpenCV (line 25).

I don't know why "colors_count_list" is None (line 14).

Here is my attempt:

import time
import numpy as np
from PIL import Image
from pdf2image import convert_from_path
import cv2

colors_count_list = []


def count_colors_2(cv_image: np.array) -> list:  # no need to give colors
    pil_image = Image.fromarray(cv_image)
    colors_count_list = pil_image.getcolors()
    print('count_colors time elapsed: {:.10f}s'.format(time.time() - start_time))
    for count, c_bgr in colors_count_list:
        print('\tcolor {} appeared {} times'.format(c_bgr, count))
    return colors_count_list


if __name__ == '__main__':
    start_time = time.time()
    # save pdf to bmp
    pages = convert_from_path("t2.pdf", 300)
    pages[0].save("bmpImage.bmp", "BMP")
    # Open image using openCV2
    opencv_image = cv2.imread("bmpImage.bmp")
    colors_count_list = count_colors_2(opencv_image)
    print(colors_count_list)

Error:
Traceback (most recent call last):
  File "D:\Daten\aktuell\testOpenCVColorCount\main.py", line 26, in <module>
    colors_count_list = count_colors_2(opencv_image)
  File "D:\Daten\aktuell\testOpenCVColorCount\main.py", line 14, in count_colors_2
    for count, c_bgr in colors_count_list:
TypeError: 'NoneType' object is not iterable

Process finished with exit code 1

Please be so kind and help me...

Many thanks...

From the PIL Image documentation

Quote: Image.getcolors(maxcolors=256)
Returns a list of colors used in this image.

The colors will be in the image’s mode. For example, an RGB image will return a tuple of (red, green, blue) color values, and a P image will return the index of the color in the palette.

PARAMETERS:
maxcolors – Maximum number of colors. If this number is exceeded, this method returns None. The default limit is 256 colors.

Your image must have more than 256 colors. Worked fine when I passed an image with 5 colors.
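
To make the maxcolors behaviour concrete, here is a minimal sketch. The helper name count_colors and the synthetic test image are my own assumptions, not part of the original code. It relies on the fact that an image cannot contain more distinct colors than it has pixels, so passing width * height as the limit means getcolors() should never return None.

```python
from PIL import Image


def count_colors(img: Image.Image) -> list:
    """Return [(count, color), ...] for every distinct color in img.

    Hypothetical helper: the limit is set to the pixel count, which is an
    upper bound on the number of distinct colors.
    """
    width, height = img.size
    return img.getcolors(maxcolors=width * height)


# Synthetic 20x20 RGB image with 300 distinct colors (more than the default 256).
img = Image.new("RGB", (20, 20))
for y in range(20):
    for x in range(20):
        img.putpixel((x, y), (x, y % 15, 0))

print(img.getcolors())         # None, because 300 colors exceed the default limit
print(len(count_colors(img)))  # 300
```

For a 300 DPI scan the list can get very long, so this only answers "how many colors", not "is it fast enough".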

Your code doesn't work for me either!

Using just PIL's Image class, you can get what you want:

from PIL import Image

img = '/home/pedro/Pictures/demeter2.jpeg'  # multi-coloured harvest scene
img2 = '/home/pedro/Pictures/Greek-flag.jpg'  # blue and white

im = Image.open(img2).convert("L")
im1 = im.getcolors()  # gives output
im = Image.open(img2).convert("RGB")
im1 = im.getcolors()  # no output (None)
im = Image.open(img2).convert("CMYK")
im1 = im.getcolors()  # no output (None)
im = Image.open(img2).convert("P")
im1 = im.getcolors()  # gives different output to "L"
im = Image.open(img).convert("P")
im1 = im.getcolors()

The last output for img:

Output:
[(123, 0), (2542, 11), (3208, 12), (1034, 13), (331, 14), (1, 15), (14, 16), (1989, 17), (7476, 18), (3983, 19), (2996, 20), (165, 21), (16, 23), (80, 24), (356, 25), (279, 26), (25, 27), (1, 31), (41, 46), (1000, 47), (2177, 48), (2679, 49), (534, 50), (2, 51), (57, 52), (2015, 53), (10738, 54), (19666, 55), (23753, 56), (3704, 57), (68, 59), (2636, 60), (20338, 61), (65101, 62), (14642, 63), (15, 66), (2996, 67), (31908, 68), (11880, 69), (1, 73), (242, 74), (265, 75), (3, 82), (3, 83), (4, 84), (19, 88), (153, 89), (253, 90), (926, 91), (1338, 92), (298, 93), (72, 95), (355, 96), (8437, 97), (36334, 98), (11585, 99), (1, 101), (13, 102), (12112, 103), (119076, 104), (106189, 105), (217, 109), (28158, 110), (64745, 111), (23, 116), (215, 117), (5, 125), (1, 127), (1, 130), (22, 131), (32, 132), (19, 133), (21, 134), (2, 135), (2, 137), (21, 138), (362, 139), (13002, 140), (16431, 141), (55, 145), (30329, 146), (84806, 147), (1285, 152), (11728, 153), (6, 174), (7, 175), (4, 176), (2, 181), (732, 182), (4137, 183), (350, 188), (4514, 189), (3, 217), (1, 218), (1, 219), (1, 224), (12, 225)]

len(im1)
97

If you just do this, you get nothing (None):

im = Image.open(img)
im1 = im.getcolors()
im1

I don't know what the parameter maxcolors=256 is supposed to do. I tried much bigger numbers and still got nothing.

Why .convert("P") is needed is also a mystery to me!
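
A possible explanation, sketched under my own assumptions (the helper name palette_color_counts and the test image are made up): a "P" image stores palette indices, and a palette has at most 256 entries, so getcolors() with the default limit can always return a list. The second number in each pair is then a palette index, not a color, and can be mapped back to RGB through getpalette().

```python
from PIL import Image


def palette_color_counts(img: Image.Image) -> list:
    """Return [(count, (r, g, b)), ...] after reducing img to a palette.

    Hypothetical helper: convert("P") maps the image onto at most 256
    palette entries, so getcolors() stays within its default limit.
    """
    paletted = img.convert("P")
    palette = paletted.getpalette()  # flat list: [r0, g0, b0, r1, g1, b1, ...]
    counts = paletted.getcolors()    # [(count, palette_index), ...]
    return [
        (count, tuple(palette[3 * index:3 * index + 3]))
        for count, index in counts
    ]


# Synthetic RGB image with 300 distinct colors.
img = Image.new("RGB", (20, 20))
for y in range(20):
    for x in range(20):
        img.putpixel((x, y), (x * 12, (y % 15) * 17, 0))

print(img.getcolors())  # None: too many colors for the default limit
for count, rgb in palette_color_counts(img)[:5]:
    print(count, rgb)
```

Note that the conversion changes the image: colors are snapped to the nearest palette entry, so the counts describe the reduced image, not the original.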

Dear deanhystad, dear Pedroski55,

thanks a lot for your answers!

In the meantime I read the information about maxcolors too...

Because I have to analyze pictures with lots of colors I will go back to the solution in thread "identify not white pixels in bmp", post #18.

I will have to analyze scanned pages (DIN A4) to find empty pages and will use multiprocessing later.

Is there a way to reduce the processing time besides multiprocessing?

Many thanks...

My list of the best ways to improve speed, in order of their impact:
1. Efficient algorithm. (I've seen results thousands of times faster from a better algorithm.)
2. Minimize the amount of Python code. (Use external libraries. Up to hundreds of times faster than code written entirely in Python.)
3. Multiprocessing. (Typically 1.5 to 3 times faster if you use 2 to 4 cores.)
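
As an illustration of the first two points for the empty-page case, here is a rough sketch rather than a tested solution. The function names and both thresholds (white_threshold, min_ratio) are assumptions you would have to tune on real scans. The idea is that you do not need a color list at all: one vectorized NumPy comparison on the grayscale image gives the share of near-white pixels.

```python
import numpy as np
from PIL import Image


def background_ratio(img: Image.Image, white_threshold: int = 200) -> float:
    """Return the fraction of pixels at least as bright as white_threshold."""
    gray = np.asarray(img.convert("L"))  # 2-D uint8 array, 0 = black, 255 = white
    return float((gray >= white_threshold).mean())


def looks_blank(img: Image.Image, min_ratio: float = 0.995) -> bool:
    """Treat a page as blank when nearly all pixels are near-white.

    min_ratio is a guessed tolerance for scanner noise and specks.
    """
    return background_ratio(img) >= min_ratio


# Demo on synthetic pages: one white, one half covered in black.
white_page = Image.new("RGB", (100, 100), "white")
half_page = white_page.copy()
half_page.paste((0, 0, 0), (0, 0, 50, 100))

print(background_ratio(white_page), looks_blank(white_page))
print(background_ratio(half_page), looks_blank(half_page))
```

All the per-pixel work happens inside NumPy and PIL, so the Python part is only a few calls per page.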

To count the number of colours in an image, which can be as high as about 16.7 million (256 × 256 × 256 for 8-bit RGB), use ImageMagick from the command line for a quick result.

This image, img3, is quite big and has a lot of colours, but fewer than 250,000:

From the command line, bash shell:

Quote:identify -format %k /home/pedro/Downloads/damage_back_left_edge.jpg

The above command returns 243704:

Quote:pedro@pedro-HP:~$ identify -format %k /home/pedro/Downloads/damage_back_left_edge.jpg
243704pedro@pedro-HP:~$

Pass that number to PIL as maxcolors, or round up to, say, 250,000:

img3 = '/home/pedro/Downloads/damage_back_left_edge.jpg'
im = Image.open(img3).convert("RGB")
im1 = im.getcolors(maxcolors=250000)
len(im1)

Output:
243704
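
If you would rather not look up the number with ImageMagick first, the count can also be done with NumPy alone. This is a sketch with a made-up helper name, count_unique_colors. It packs each RGB pixel into one 24-bit integer, the same trick used elsewhere in this thread, and counts the distinct values.

```python
import numpy as np
from PIL import Image


def count_unique_colors(img: Image.Image) -> int:
    """Count distinct RGB colors without needing a maxcolors guess."""
    # Widen to uint32 so the shifts below cannot overflow 8-bit values.
    rgb = np.asarray(img.convert("RGB"), dtype=np.uint32).reshape(-1, 3)
    packed = (rgb[:, 0] << 16) | (rgb[:, 1] << 8) | rgb[:, 2]
    return int(np.unique(packed).size)


# Synthetic image with 300 distinct colors for a quick check.
img = Image.new("RGB", (20, 20))
for y in range(20):
    for x in range(20):
        img.putpixel((x, y), (x, y % 15, 0))

print(count_unique_colors(img))  # 300
```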

Hi deanhystad,
hi Pedroski55,

thanks a lot for your answers!

I experimented a bit...

There is the function that deanhystad helped me a lot with ("primary_color_ratio()"): it takes 1.634 seconds to run.

There is the function that I wrote ("pdf_to_image_array()"): it takes 1.736 seconds to run.

There is the function that I found online ("count_colors_2()"): it takes 0.012 seconds to run.

Could you please advise me on what I should do?

My goal is:
Convert pdfs to bmps, detect empty pages in a (very) short time.
When this works, it should be optimized with multiprocessing (I got very good help within this forum at this topic already).

import numpy as np
from pdf2image import convert_from_path
from PIL import Image
import time


def primary_color_ratio(pdf_name):
    """Return ratio of pixels that are the "background" color."""
    pages = convert_from_path(pdf_name, 300)
    # save pdf to bmp
    pages[0].save("bmpImage.bmp", "BMP")  # save() returns None, so there is nothing to assign
    # open Image
    img = Image.open(r"bmpImage.bmp")
    # reducing colors
    image_reduced = img.quantize(colors=2, method=None, kmeans=0, palette=None)
    #image_reduced.show()
    # convert to rgb
    image_rgb = image_reduced.convert('RGB')
    # I will do it later: examine just every xth pixel
    # widen from uint8 first so the 24-bit packing below cannot overflow
    rgb = np.array(image_rgb, dtype=np.uint32).reshape(-1, 3)#[::x]
    # get 24Bit Color
    b24 = rgb[:, 0] * 65536 + rgb[:, 1] * 256 + rgb[:, 2]
    _, numberOfColors = np.unique(b24, return_counts=True)
    return max(numberOfColors) / max(len(b24), 1)


def pdf_to_image_array(pdf_name):
    pages = convert_from_path(pdf_name, 300)
    # save pdf to bmp
    pages[0].save("bmpImage.bmp", "BMP")  # save() returns None, so there is nothing to assign
    # open Image
    img = Image.open("bmpImage.bmp")
    # reducing colors
    image_reduced = img.quantize(colors=2, method=None, kmeans=0, palette=None)
    image_array = np.array(image_reduced)
    return image_array


def count_colors_2(image_array) -> list:  # no need to give colors
    pil_image = Image.fromarray(image_array)
    colors_count_list = pil_image.getcolors(2)
    for count, c_bgr in colors_count_list:
        print('\tcolor {} appeared {} times'.format(c_bgr, count))
    return colors_count_list


pdf_name = "t2.pdf"
image_array = pdf_to_image_array(pdf_name)
start_time = time.time()
count_colors_2(image_array)
print('count_colors_2 time elapsed: {:.10f}s'.format(time.time() - start_time))
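
A side note on comparing these timings: as far as the posted code shows, primary_color_ratio() and pdf_to_image_array() both include rendering the PDF at 300 DPI and writing the BMP, while the timer for count_colors_2() starts only after that work is done. Timing each stage separately makes the numbers comparable. Below is a minimal sketch; the helper timed and the stand-in workloads are assumptions, and in the real script the with-blocks would wrap convert_from_path(), quantize() and getcolors().

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label: str, results: dict):
    """Store the wall-clock duration of the with-block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}

with timed("convert", timings):
    time.sleep(0.05)       # stand-in for the PDF-to-image conversion

with timed("count", timings):
    sum(range(1000))       # stand-in for the color counting

for label, seconds in timings.items():
    print('{} time elapsed: {:.10f}s'.format(label, seconds))
```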

I wouldn't expect pil_image.getcolors(2) to take very long to find 3 colors and return None.

Code source: someone has always done these things before!

I missed the part where you only want to find blank pages. Sorry.

If a page with no text counts as "a blank page" (though it could still contain an image, I suppose), then this will save all that messing around with pixels!

import fitz

# check whether the page has text or not.
def check_page(page):
    text = page.get_text()
    return len(text.strip()) == 0

path2infile = "/home/pedro/pdfs/pdfs/doctor_visits_with_blank_pages.pdf" # 5 pages, 2 pages no text
path2outfile = "/home/pedro/pdfs/pdfs/doctor_visits_no_blank_pages.pdf" # ends up with 3 pages

input_pdf = fitz.open(path2infile)
output_pdf = fitz.open()

for pgno in range(input_pdf.page_count):
    page = input_pdf[pgno]
    if not check_page(page):
        output_pdf.insert_pdf(input_pdf, from_page=pgno, to_page=pgno)

output_pdf.save(path2outfile)
input_pdf.close()
output_pdf.close()

You can add another function to check for images, if no text is found!

But, if all pages are numbered, that is text!
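
A rough sketch of that extra image check follows. The function name page_is_blank is made up, and it only assumes the page object offers get_text() and get_images(), as PyMuPDF pages do. Be aware that for scanned PDFs each page is usually stored as one big image, so this check would likely report no scanned page as blank, and the pixel-based approach would still be needed there.

```python
def page_is_blank(page) -> bool:
    """Return True when a page has neither text nor embedded images.

    page is expected to behave like a PyMuPDF page: get_text() returns a
    string and get_images() returns a list of image entries.
    """
    has_text = bool(page.get_text().strip())
    has_images = len(page.get_images()) > 0
    return not (has_text or has_images)
```

In the loop above it would replace check_page(page), so a page is only dropped when both checks come back empty.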

Doh!

In my defense, this did start as a previous thread titled: identify not white pixels in bmp. The PDF came later. Still, Doh!
