Posts: 148
Threads: 34
Joined: May 2020
Hi,
thanks for your answers.
And thanks, Pedroski55, for your code example!
I would like to have a more friendly tone - I'm not so experienced and I'm trying to get better...
I thought examining pixels would work for both (text and pictures).
You both are more experienced than I am.
I don't know if a solution, which is based on text recognition, is reliable.
And I don't know how to detect pictures.
I thought therefore "messing around with pixels" would be a good idea.
That's because I'm not so experienced...
I will try Pedroski55's example too, just to get more experience and to compare both things.
Even if the effort was for nothing, it was good for gaining more programming experience.
Please let me know what you think of my code (if you are still interested):
Here is what I did in the meantime:
The function getcolors is limited to 256 colors, otherwise None is returned.
Because a scan can contain more than 256 colors, I quantize the scanned image to 2 colors.
Then I use getcolors to count the number of pixels of each color (one color is the "filled" content, the other is the "background" content).
I tested it creating a 10 x 10 pixel bmp with various amounts of "filled" pixels.
All counted colors were correct.
I will test the speed of counting pixels tomorrow using a DIN A4 - sheet.
from PIL import Image


def count_colors_2(image, maxcolors=2):
    # Reduce the image to at most `maxcolors` colors, then count the pixels of each.
    image_reduced = image.quantize(colors=maxcolors)
    colors = image_reduced.getcolors(maxcolors)  # list of (count, palette index)
    count_background_pixel = colors[0][0]
    # A uniform (e.g. blank) image yields a single color, so guard the second entry.
    count_filled_pixel = colors[1][0] if len(colors) > 1 else 0
    return count_background_pixel, count_filled_pixel, colors


image = Image.open("25.bmp")
count_background_pixel, count_filled_pixel, colors = count_colors_2(image)
print("count_background_pixel: " + str(count_background_pixel))
print("count_filled_pixel: " + str(count_filled_pixel))
print(colors)
Posts: 1,093
Threads: 143
Joined: Jul 2017
Quote:The function getcolors is limited to 256 colors, otherwise None is returned.
No, I think the upper limit is 16 million.
As for getting images from a PDF: of course, many people have wanted to do this; you are not the first person to try.
Look here.
Posts: 6,783
Threads: 20
Joined: Feb 2020
Feb-07-2024, 05:14 PM
(This post was last modified: Feb-07-2024, 05:14 PM by deanhystad.)
Quote:The function getcolors is limited to 256 colors, otherwise None is returned.
256 is the default limit. You can set it much higher.
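To illustrate the default limit, here is a minimal sketch. The 32 x 32 test image is made up for the demonstration (every pixel gets its own color), and only Pillow's getcolors with its maxcolors parameter is used.

```python
from PIL import Image

# Synthetic 32 x 32 RGB image in which every pixel has a unique color
# (1024 colors, well above the default limit of 256).
width, height = 32, 32
pixels = bytes(
    channel
    for y in range(height)
    for x in range(width)
    for channel in (x * 8, y * 8, 0)
)
image = Image.frombytes("RGB", (width, height), pixels)

# With the default maxcolors=256 the image has too many colors.
default_result = image.getcolors()

# Raising the limit to the pixel count can never be exceeded.
full_result = image.getcolors(maxcolors=width * height)

print(default_result)
print(len(full_result))
```

Passing the total pixel count as maxcolors is a simple way to make sure the call returns a list, at the cost of more memory for images with many colors.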
But as Pedroski55 pointed out, PDF files are not pictures of pages, they are structured documents. A PDF document knows if a page is blank or not. It doesn't have to perform any image analysis; it just has to look at the information for that page. There is no image analysis algorithm that will be faster than asking the PDF document if a page contains images or text. All the image analysis work was done when the PDF file was created.
Here's a tutorial:
https://geekflare.com/extract-text-links...ng-python/
Posts: 1,093
Threads: 143
Joined: Jul 2017
@deanhystad: Nice, clear link! Thanks!
Apropos colours on computers: 24-bit colour gives 16.7 million colours (2 to the power of 24).
Quote:Nearly all computers and displays over the last five to ten years come standard with support for at least 16-bit color, with newer computers supporting 24-bit and 32-bit color. Is there a difference between the different levels of color? The short answer is yes. All three color bit depths use red, blue and green as standard colors, but it's the number of color combinations and the alpha channel that make the difference. Whether you are viewing pictures, watching a video, or playing a video game, a higher color depth is more visually appealing.
16-bit color
With 16-bit color, also called High color, computers and monitors can display as many as 65,536 colors, which is adequate for most uses. However, graphic intensive video games and higher resolution video can benefit from and use the higher color depths.
24-bit color
Using 24-bit color, also called True color, computers and monitors can display as many as 16,777,216 different color combinations.
32-bit color
Like 24-bit color, 32-bit color supports 16,777,216 colors, but it also has an alpha channel, so it can create more convincing gradients, shadows, and transparencies. With the alpha channel, 32-bit color supports 4,294,967,296 color combinations.
As you increase the support for more colors, more memory is required. However, almost all computers today include video cards with enough memory to support 32-bit colors at most resolutions. Older computer and video cards may only be able to support up to 16-bit color.
Can my eyes tell a difference?
Most users cannot tell much of a difference between 16-bit and 32-bit. However, if you are using a program with gradients, shadows, transparency, or other visual effects that require multiple colors you may notice a difference.
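The numbers quoted above are just powers of two, which a few lines of Python can confirm (the dictionary name color_counts is only for this example):

```python
# Number of distinct values representable at common color depths (2 ** bits).
color_counts = {bits: 2 ** bits for bits in (16, 24, 32)}

for bits, count in color_counts.items():
    print(f"{bits}-bit: {count:,}")
```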
Posts: 148
Threads: 34
Joined: May 2020
Feb-08-2024, 06:18 PM
(This post was last modified: Feb-08-2024, 06:27 PM by flash77.)
Dear Pedroski55,
dear deanhystad!
Thank you for your excellent answers!
Unfortunately, it wasn't at all clear to me that PDF documents already "know" whether they contain text, for example.
And that you just need to ask. I thought you had to convert the PDF into an image and analyze it.
That's why I just asked about it in a previous post and worked on a solution together with Deanhystad.
Later I mentioned that it was actually about PDF files.
I would like to apologize for the confusion and unnecessary work on this.
But some good came of it too: I gained more programming experience.
I would also like to apologize to Pedroski55!
The only thing that bothered me was the wording: "messing around with pixels"
Thank you for this nice forum with the very helpful members!
@deanhystad: Thanks a lot for the link!!
(I will replace the image analysis in my code with the link's content...)
Have a nice evening, Pedroski55 and deanhystad!!
Posts: 6,783
Threads: 20
Joined: Feb 2020
Quote:Unfortunately, it wasn't at all clear to me that PDF documents already "know" whether they contain text, for example.
It is always a good idea to start each project with research. I enjoy watching other people work, but I love using other people's work.
Posts: 1
Threads: 0
Joined: Feb 2024
Feb-26-2024, 05:14 PM
(This post was last modified: Feb-26-2024, 05:14 PM by kumaransh.)
It's great that you're exploring ways to count image colors quickly.
It seems like you're facing an issue with the "colors_count_list" being NoneType. This could be due to various reasons, such as an error in the conversion process or an issue with the OpenCV2 library.
One suggestion would be to double-check the image conversion process to ensure that the BMP file is being generated correctly from the PDF. Additionally, make sure that the OpenCV2 library is installed correctly and that you're importing it properly in your code.
And hey, if you're looking to optimize your images further, you could try using compress jpeg to reduce file sizes without compromising quality.
Posts: 148
Threads: 34
Joined: May 2020
Dear deanhystad,
I just wanted to let you know that our work on counting the pixels in an image to determine the majority of pixels of the same color (PrimaryColorRatio) was not in vain.
First, I went through the information contained in the link you gave me.
Now I can recognize text or images on "clean", non-scanned PDFs.
The function text_existing detects text, and image_existing detects images.
Here is the code for it:
import os
from PyPDF2 import PdfReader
import fitz  # PyMuPDF


def text_existing(path):
    # Report whether the first page of each PDF in `path` contains extractable text.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.name.endswith(".pdf"):
                reader = PdfReader(entry.path)
                text = reader.pages[0].extract_text()
                if text and text.strip():
                    print(text)
                else:
                    print("page contains no text")


def image_existing(path):
    # Report whether the first page of each PDF in `path` references an image.
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.name.endswith(".pdf"):
                doc = fitz.open(entry.path)
                page = doc.load_page(0)
                # get_images() returns one entry per image; an empty list means no image.
                if page.get_images():
                    print("Image detected.")
                else:
                    print("No image detected.")
                doc.close()


text_existing("D:/Daten/aktuell/detect_blank_pages/")
image_existing("D:/Daten/aktuell/detect_blank_pages/")
My original situation was (and still is) that I want to examine PDFs that are created by scanning A4 paper with software (naps2).
The code above recognizes these PDFs as containing an image.
This is where the work on the PrimaryColorRatio pays off:
I count the pixels of each color and look at the largest proportion.
For me it's all about white A4 paper that has some kind of print or writing on it.
I extracted the image of the scanned page from the PDF and binarized it according to instructions,
so I only have white and black pixels that can be counted.
FilledRatio = 1 - PrimaryColorRatio
This allows me to determine whether a scanned page is filled or empty (by comparing the "filled pixels" to a threshold).
I completely discarded my previous idea of quantization - I was able to achieve much better results with binarization.
I tried to binarize the image myself with numpy.where (but it didn't work :-))
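For reference, here is a minimal sketch of how binarizing with numpy.where could look. The helper name filled_ratio and the 10 x 10 test array are made up for this example, and the threshold of 200 is only an assumed starting value. The sketch counts dark pixels directly, which matches 1 - PrimaryColorRatio only as long as white is the majority color.

```python
import numpy as np


def filled_ratio(gray, threshold=200):
    """Share of pixels at or below `threshold` in a grayscale array (hypothetical helper)."""
    # Pixels brighter than the threshold become white (255), all others black (0).
    binary = np.where(gray > threshold, 255, 0).astype(np.uint8)
    dark_pixels = np.count_nonzero(binary == 0)
    return dark_pixels / max(binary.size, 1)


# Made-up 10 x 10 "page": white background with 5 dark pixels.
page = np.full((10, 10), 255, dtype=np.uint8)
page[0, :5] = 30

ratio = filled_ratio(page)
print(ratio)
```

The ratio can then be compared against a small threshold to decide whether a page counts as filled.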
Now I can detect blank pages within scanned pages...
The next step will be to incorporate this function into my program using multiprocessing.
In the past I have already received help from you with the "Hangerfinder", which I use to digitize Super8 films.
Multiprocessing was used there as well, so I'll follow that approach.
Despite the misunderstanding, a sensible solution to my problem with scanned pages came out of it, so the work wasn't for nothing...
And I was able to learn more things related to Python.
Thank you for your patient, clever support!!
from io import BytesIO
import os

import fitz  # PyMuPDF
import numpy as np
from PIL import Image


def examine_pdf_from_scanned_paper(path):
    filledThreshold = 0.001  # share of non-majority pixels above which a page counts as filled
    maxval = 255
    threshold = 200  # gray values above this become white, the rest black
    with os.scandir(path) as entries:
        for entry in entries:
            if not entry.name.endswith(".pdf"):
                continue
            doc = fitz.open(entry.path)
            page = doc.load_page(0)
            image_xref = page.get_images()
            if not image_xref:
                print("no image on first page")
                doc.close()
                continue
            # Extract the first image of the page (the scan) as raw bytes.
            xref_value = image_xref[0][0]
            img_dictionary = doc.extract_image(xref_value)
            image = Image.open(BytesIO(img_dictionary["image"]))
            doc.close()
            # Binarize: convert to grayscale, then threshold to pure black and white.
            im_gray = np.array(image.convert("L"))
            im_bin = np.uint8((im_gray > threshold) * maxval)
            Image.fromarray(im_bin).save("binarized.png")
            # Count black and white pixels directly on the single-channel array.
            # (Reshaping it to (-1, 3) would treat three neighboring pixels as one RGB value.)
            _, counts = np.unique(im_bin, return_counts=True)
            PrimaryColorRatio = counts.max() / max(im_bin.size, 1)
            FilledRatio = 1 - PrimaryColorRatio
            print("PrimaryColorRatio = ", PrimaryColorRatio)
            print("FilledRatio = ", FilledRatio)
            if FilledRatio > filledThreshold:
                print("image filled")
            else:
                print("image empty")


examine_pdf_from_scanned_paper("D:/Daten/aktuell/detect_blank_pages_gut/")
Posts: 6,783
Threads: 20
Joined: Feb 2020
You should look at pathlib as a replacement for os.scandir. Instead of this:
import os


def text_existing(path):
    obj = os.scandir(path)
    for entry in obj:
        if entry.name.endswith(".pdf"):
            ...
You could do this:
from pathlib import Path


def text_existing(path):
    for entry in Path(path).glob("*.pdf"):
        ...
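A small self-contained sketch of the pathlib variant, using a temporary folder with made-up file names so it can be run anywhere (list_pdfs is a hypothetical helper name, not part of the code above):

```python
import tempfile
from pathlib import Path


def list_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by name."""
    return sorted(Path(folder).glob("*.pdf"))


with tempfile.TemporaryDirectory() as folder:
    # Create two empty "PDF" files and one unrelated file.
    for name in ("a.pdf", "b.pdf", "notes.txt"):
        (Path(folder) / name).touch()
    found = [entry.name for entry in list_pdfs(folder)]

print(found)
```

Each entry returned by glob is already a full Path, so it can be passed to a PDF reader without joining the folder and file name by hand.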