pdf2image, poppler and paths - Printable Version

+- Python Forum (https://python-forum.io)
+--- Thread: pdf2image, poppler and paths (/thread-37451.html)

pdf2image, poppler and paths - jehoshua - Jun-11-2022

I need to run some tests on converting PDFs to images. The article at https://medium.com/towards-data-science/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052 has been quite helpful.

I have pdf2image, poppler, poppler-utils, etc installed. I have used the code at https://gist.github.com/akash-ch2812/1e2c0991105d0ed2f0fa2cadbee00362/raw/4c840fcd8ee5ec8e492f968e25d80af84ff21df0/PDF_to_Image.py

from pdf2image import convert_from_path

pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"  
    page.save(image_name, "JPEG")
    i = i+1

$ python3 PDF_to_Image.py Tests_20220530.pdf

Error:
Traceback (most recent call last):
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/********/Downloads/OCR/PDF_to_Image.py", line 4, in <module>
    pages = convert_from_path(pdfs, 350)
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'provide path to pdf file': No such file or directory.

I have tried various ways of specifying the path and/or filename at line 3, all to no avail. I'm not sure how to determine whether 'poppler' is in the PATH; however, 'locate' shows it is installed.

I have looked through the "pdf2image" issues, but none of them solved the problem.

Does this script simply need the "poppler" path, and how do I find that? Also, surely the code can be modified so that a parameter is passed to specify the path/directory. The PDF is in the same directory as the script, so I assume the script is failing because it doesn't know where "poppler" is found.
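
On checking whether poppler is reachable: pdf2image runs poppler's command-line tools (pdfinfo and pdftoppm among them), so one way to check is to look for those on PATH. Below is a minimal sketch using only the standard library; the helper name find_poppler_tools is made up for illustration. Note that the "Couldn't open file" message in the traceback above appears to come from poppler itself, which would suggest poppler was found and only the PDF path was wrong.

```python
import shutil


def find_poppler_tools(names=("pdfinfo", "pdftoppm")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report where each tool was found, if anywhere.
for tool, location in find_poppler_tools().items():
    print(f"{tool}: {location or 'not found on PATH'}")
```

If a tool shows as not found, the traceback above shows that convert_from_path passes a poppler_path argument along, so pointing that at the directory holding the binaries may help.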

RE: pdf2image, poppler and paths - Larz60+ - Jun-11-2022

The last line of the error:

Error:
I/O Error: Couldn't open file 'provide path to pdf file': No such file or directory.

refers to line 3 of your code. This needs to be the file name of a PDF file.

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

Quote: refers to line 3 of your code. This needs to be the file name of a PDF file.

Thanks; I knew it was line 3, but I just wasn't specifying it correctly. Here is what I tried:

pdfs = r"~/Downloads/OCR/Tests_20220530.pdf" didn't work
pdfs = r"/home/********/Downloads/OCR/Tests_20220530.pdf" worked
pdfs = r"Tests_20220530.pdf" worked
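
The first form most likely fails because "~" is expanded by the shell, not by Python or poppler, so a path string containing a literal "~" is passed through unchanged. A minimal sketch of expanding it explicitly with the standard library (os.path.expanduser or Path.expanduser); the file name is the one from this thread:

```python
import os
from pathlib import Path

raw = "~/Downloads/OCR/Tests_20220530.pdf"

# expanduser() replaces the leading "~" with the current user's home directory.
expanded = os.path.expanduser(raw)
expanded_with_pathlib = Path(raw).expanduser()

print(expanded)
print(expanded_with_pathlib)
```

Either result can then be handed to convert_from_path in place of the hard-coded /home/... path.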

As I wanted to have the filename as an argument, I tried this:

from pdf2image import convert_from_path
import sys

# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))

# Print all arguments
print ('Argument List:', str(sys.argv))

# Print arguments one by one
print ('First argument:',  str(sys.argv[0]))
print ('Second argument:',  str(sys.argv[1]))

filename = sys.argv[1]
pdfs = r"(filename)"
pages = convert_from_path(pdfs, 350)

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1

$ python3 PDF_to_Image1.py Tests_20220530.pdf

Error:
Total number of arguments: 2
Argument List: ['PDF_to_Image1.py', 'Tests_20220530.pdf']
First argument: PDF_to_Image1.py
Second argument: Tests_20220530.pdf
Traceback (most recent call last):
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/********/Downloads/OCR/PDF_to_Image1.py", line 17, in <module>
    pages = convert_from_path(pdfs, 350)
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file '(filename)': No such file or directory.

RE: pdf2image, poppler and paths - DPaul - Jun-11-2022

Seems that it still is a 'file not found' problem.
Maybe try this:

import os

path_to_pdf = os.path.join(r'c:\data', 'pdfdir', 'pdfsubdir', 'mypdf.pdf')
Now you can refer to your file through the variable path_to_pdf.

Paul

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

(Jun-11-2022, 05:56 AM)DPaul Wrote: Seems that it still is a 'file not found' problem.

It worked when I hard coded the path and filename.

(Jun-11-2022, 05:56 AM)DPaul Wrote: Maybe try this:

import os

path_to_pdf = os.path.join(r'c:\data', 'pdfdir', 'pdfsubdir', 'mypdf.pdf')
Now you can refer to your file through the variable path_to_pdf.

Thanks; I couldn't get that to work, but it prompted me to investigate what the "r" in

pdfs = r"provide path to pdf file"

was used for. I found a good article at https://www.codespeedy.com/how-does-carriage-return-work-in-python/, then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as

from pdf2image import convert_from_path
import sys

# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))

# Print all arguments
print ('Argument List:', str(sys.argv))

# Print arguments one by one
print ('First argument:',  str(sys.argv[0]))
print ('Second argument:',  str(sys.argv[1]))

filename = sys.argv[1]
pdfs = filename
pages = convert_from_path(pdfs, 350)

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
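
The earlier version most likely failed because r"(filename)" is a string literal containing the text (filename), not the value of the variable, so removing the quotes (rather than the r) is what fixed it. Below is a slightly more defensive sketch of the same script, assuming the standard-library argparse module for the argument handling. The names parse_args, convert and main, and the --dpi option, are illustrative and not part of pdf2image.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line; exit with a usage message if the PDF is missing."""
    parser = argparse.ArgumentParser(description="Convert a PDF to JPEG pages.")
    parser.add_argument("pdf", type=Path, help="path to the PDF file")
    parser.add_argument("--dpi", type=int, default=350, help="render resolution")
    args = parser.parse_args(argv)
    if not args.pdf.is_file():
        parser.error(f"no such file: {args.pdf}")
    return args


def convert(pdf, dpi):
    """Save each page of the PDF as Page_<n>.jpg in the current directory."""
    # Imported here so the argument handling can be tried without poppler.
    from pdf2image import convert_from_path

    pages = convert_from_path(str(pdf), dpi=dpi)
    for index, page in enumerate(pages, start=1):
        page.save(f"Page_{index}.jpg", "JPEG")


def main(argv=None):
    args = parse_args(argv)
    convert(args.pdf, args.dpi)

# At the bottom of a real script you would call: main()
```

With main() called at the end of the file, it would be run as: python3 PDF_to_Image1.py Tests_20220530.pdf --dpi 300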

RE: pdf2image, poppler and paths - Larz60+ - Jun-11-2022

Assume for illustration that you had the following relative file structure for your project:

├── data
│   ├── csv
│   ├── PDF
│   └── tmp
├── docs
├── src
└── venv
    └── ...

Note: this class will create the directories if they don't already
exist. It is non-destructive, and will leave existing paths intact.

Then a class to access any part of that path would look like:

import os
from pathlib import Path


class MyPaths:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        HomePath = Path(".")

        rootpath = HomePath / ".."

        self.datapath = rootpath / "data"
        self.datapath.mkdir(exist_ok=True)

        self.csvpath = self.datapath / "csv"
        self.csvpath.mkdir(exist_ok=True)

        self.docpath = rootpath / "docs"
        self.docpath.mkdir(exist_ok=True)

        self.jsonpath = self.datapath / "json"
        self.jsonpath.mkdir(exist_ok=True)

        self.htmlpath = self.datapath / "html"
        self.htmlpath.mkdir(exist_ok=True)

        self.pdfpath = self.datapath / 'PDF'
        self.pdfpath.mkdir(exist_ok=True)

        self.tmppath = self.datapath / "tmp"
        self.tmppath.mkdir(exist_ok=True)


if __name__ == "__main__":
    MyPaths()
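
One thing to be aware of: the os.chdir call in __init__ changes the working directory of whatever script imports the class, which also changes where relative output files such as Page_1.jpg end up. Below is a possible variant that builds absolute paths from the file's own location instead. ProjectPaths and its root parameter are hypothetical names, and only three of the directories are shown to keep it short.

```python
from pathlib import Path


class ProjectPaths:
    """Hypothetical variant of MyPaths that leaves the working directory alone."""

    def __init__(self, root=None):
        # Default root: the parent of the directory holding this file (above src/).
        if root is None:
            root = Path(__file__).resolve().parent.parent
        self.rootpath = Path(root)

        self.datapath = self.rootpath / "data"
        self.pdfpath = self.datapath / "PDF"
        self.tmppath = self.datapath / "tmp"

        # Create anything missing; existing directories are left untouched.
        for path in (self.datapath, self.pdfpath, self.tmppath):
            path.mkdir(parents=True, exist_ok=True)
```

The optional root argument is there mainly so the class can be pointed at a scratch directory when trying it out.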

Now, to access MyPaths from your main script:

from pdf2image import convert_from_path
from MyPaths import MyPaths
import sys

mpath = MyPaths()
# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))
 
# Print all arguments
print ('Argument List:', str(sys.argv))
 
# Print arguments one by one
print ('First argument:',  str(sys.argv[0]))
print ('Second argument:',  str(sys.argv[1]))
 
filename = sys.argv[1]
pdfs = mpath.pdfpath / filename
pages = convert_from_path(pdfs, 350)
 
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1

Should do the trick.

RE: pdf2image, poppler and paths - snippsat - Jun-11-2022

jehoshua Wrote:

from pdf2image import convert_from_path
 
pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)
 
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"  
    page.save(image_name, "JPEG")
    i = i+1

This loop can be written better, in a style often called more Pythonic ✨

from pdf2image import convert_from_path

file_name = r'G:\div_code\Cartoon.pdf'
pages = convert_from_path(file_name, dpi=350)
for index,page in enumerate(pages, start=1):
    page.save(f'Page{index}.jpg')

jehoshua Wrote:then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as

The r is not the problem; it's used so that escape characters do not mess up the folder name when you use a single \ in path names on Windows.

>>> folder = 'C:\test'
>>> print(folder)
C:	est
>>> # Fix
>>> folder = r'C:\test'
>>> print(folder)
C:\test

So in the first case \t (an escape character) is interpreted as a tab,
and the folder can no longer be read.
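
The same rules explain the earlier '(filename)' error. A short sketch (the variable names are just for illustration): the r prefix only changes how backslashes are read, while putting a variable name inside quotes makes it plain text rather than a reference to the variable.

```python
tab_path = 'C:\test'        # "\t" is read as a single tab character
raw_path = r'C:\test'       # raw string: the backslash is kept as-is
escaped_path = 'C:\\test'   # doubling the backslash has the same effect

print(len(tab_path), len(raw_path))   # 6 7
print(raw_path == escaped_path)       # True

filename = "Tests_20220530.pdf"
literal = r"(filename)"     # the ten characters "(filename)", not the variable
print(literal == filename)            # False

# Without backslashes in the text, the r prefix makes no difference at all.
print(r"Tests_20220530.pdf" == filename)   # True
```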

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

(Jun-11-2022, 09:10 AM)Larz60+ Wrote: Assume for illustration that you had the following relative file structure for your project:

├── data
│   ├── csv
│   ├── PDF
│   └── tmp
├── docs
├── src
└── venv
    └── ...

Note: this class will create the directories if they don't already
exist. It is non-destructive, and will leave existing paths intact.

then a class to access any part of that path would look like {snip}

Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called MyPaths.py, is it simply a matter of including

from MyPaths import MyPaths

or do I have to set up something like a library?

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

(Jun-11-2022, 10:34 AM)snippsat Wrote: This loop can be written better, in a style often called more Pythonic

Thank you.

(Jun-11-2022, 10:34 AM)snippsat Wrote: The r is not the problem; it's used so that escape characters do not mess up the folder name when you use a single \ in path names on Windows.

Thanks for explaining and for the example. But do I need to code like that if I only use Linux? Also, I'm leaning more towards passing arguments rather than hard-coding the filename (I have over 2,600 PDF files).
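
On the Linux question: POSIX paths use forward slashes, so the backslash problem rarely comes up there, and the r prefix is harmless either way. For a large batch of files, here is a possible sketch that converts every PDF in a folder rather than taking one name at a time. The names find_pdfs and convert_all are hypothetical helpers, and the one-folder-per-PDF output layout is an assumption, not something from this thread.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return every *.pdf file directly inside folder, sorted by name."""
    # On Linux the pattern is case-sensitive, so "*.pdf" will not match ".PDF".
    return sorted(Path(folder).glob("*.pdf"))


def convert_all(folder, dpi=350):
    """Convert each PDF in folder into JPEGs stored in a folder named after it."""
    # Imported here so find_pdfs can be tried without pdf2image/poppler.
    from pdf2image import convert_from_path

    for pdf in find_pdfs(folder):
        out_dir = pdf.with_suffix("")   # e.g. .../Tests_20220530
        out_dir.mkdir(exist_ok=True)
        pages = convert_from_path(str(pdf), dpi=dpi)
        for index, page in enumerate(pages, start=1):
            page.save(out_dir / f"Page_{index}.jpg", "JPEG")
```

Keeping each PDF's pages in their own folder avoids every file overwriting the same Page_1.jpg.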

As an aside, I came across a tool called ocrmypdf. It converted a PDF that was 'unsearchable' and added the necessary text layer to make it searchable. I see it is written in Python.

RE: pdf2image, poppler and paths - Larz60+ - Jun-12-2022

jehoshua Wrote: Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called MyPaths.py, is it simply a matter of including: [from MyPaths import MyPaths]
or do I have to set up something like a library?

Almost that simple.
Here's a new project example you can try (this is for Linux; Windows may be somewhat different):

1. Create a directory named whatever your project name is
2. cd to that directory
3. Create a virtual environment with [python -m venv venv]
4. Activate the virtual environment with [. ./venv/bin/activate] (for Linux; Windows is different)
5. Make a directory named [src]
6. Copy MyPaths.py to the new src directory
7. [cd src]
8. Run MyPaths.py with the command [python MyPaths.py]

All paths will be created, ready to use. You will also have a virtual environment for your project.

Note: You can eliminate steps 7 and 8, as this will happen the first time MyPaths is imported into any of your scripts. I haven't tested outside of a virtual environment because, frankly, I always use a virtual environment for any new project, but this should work without one.
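
On the question of whether a library has to be set up: in a normal run, Python puts the directory of the script being executed at the front of sys.path, so a MyPaths.py sitting next to the script is importable with no packaging. Below is a small sketch of the same mechanism using a throwaway module; demo_paths and DemoPaths are made-up names for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Write a tiny stand-in module into the folder.
    (Path(folder) / "demo_paths.py").write_text(
        "class DemoPaths:\n    name = 'demo'\n"
    )
    # Running a script from this folder would put it on sys.path automatically;
    # here that is done by hand to imitate it.
    sys.path.insert(0, folder)
    try:
        importlib.invalidate_caches()
        module = importlib.import_module("demo_paths")
        print(module.DemoPaths.name)
    finally:
        sys.path.remove(folder)
```

Scripts that live in a different directory from MyPaths.py would need that directory added to the search path some other way, for example through the PYTHONPATH environment variable.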