Python Forum
pdf2image, poppler and paths
#1
I need to run some tests on converting PDFs to images. The article at https://medium.com/towards-data-science/...670ee38052 has been quite helpful.

I have pdf2image, poppler, poppler-utils, etc. installed. I have used the code at https://gist.github.com/akash-ch2812/1e2...o_Image.py

from pdf2image import convert_from_path

pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"  
    page.save(image_name, "JPEG")
    i = i+1     
Error:
$ python3 PDF_to_Image.py Tests_20220530.pdf
Traceback (most recent call last):
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/********/Downloads/OCR/PDF_to_Image.py", line 4, in <module>
    pages = convert_from_path(pdfs, 350)
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'provide path to pdf file': No such file or directory.
I have tried various ways of specifying the path and/or filename at line 3, all to no avail. I'm not sure how to determine whether 'poppler' is in the PATH; however, 'locate' shows it is installed.

I have looked through the issues for "pdf2image" and did not find this problem solved.

Does this script simply need the "poppler" path, and how do I find that? Also, surely the code can be modified so that a parameter is passed to specify the path/directories. The PDF is in the same path as the script, so I assume the script is failing because it doesn't know where "poppler" is found.
Reply
#2
The last line of the error:
Error:
I/O Error: Couldn't open file 'provide path to pdf file': No such file or directory.
refers to line 3 of your code. This needs to be the file name of an actual PDF file.
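To separate the two suspects (Poppler not being on the PATH versus the PDF path being wrong), a quick check like the sketch below may help. It assumes the Poppler command-line tools are named `pdfinfo` and `pdftoppm`, which is what poppler-utils normally installs; the helper names `poppler_tools` and `pdf_exists` are made up for this example.

```python
import shutil
from pathlib import Path


def poppler_tools():
    """Map each Poppler tool name to its location on PATH (None if not found)."""
    return {tool: shutil.which(tool) for tool in ("pdfinfo", "pdftoppm")}


def pdf_exists(pdf_path):
    """Return True only if pdf_path points at an existing regular file."""
    return Path(pdf_path).is_file()


# A value of None means that tool is not visible on PATH.
print(poppler_tools())

# The placeholder text from the gist is not a real file, so this prints False.
print(pdf_exists("provide path to pdf file"))
```

If both tools show a path and the second check prints False, the problem is the file name rather than Poppler.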
jehoshua likes this post
Reply
#3
Quote:refers to line 3 of your code. This needs to be a file name of a pdf file.

Thanks; I knew it was line 3, but it just wasn't specified correctly. Here is what I tried:

pdfs = r"~/Downloads/OCR/Tests_20220530.pdf" didn't work
pdfs = r"/home/********/Downloads/OCR/Tests_20220530.pdf" worked
pdfs = r"Tests_20220530.pdf" worked
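A likely reason the first form fails: `~` is normally expanded by the shell, not by Python, so a string starting with `~` is passed along as a literal directory name. A minimal sketch of expanding it explicitly with the standard library (`pathlib.Path.expanduser` would be an alternative):

```python
import os

# The tilde form from above, exactly as typed into the script.
raw = "~/Downloads/OCR/Tests_20220530.pdf"

# os.path.expanduser replaces a leading ~ with the current user's home directory.
expanded = os.path.expanduser(raw)

print(raw)
print(expanded)
```

The expanded string is what would then be handed to convert_from_path.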

As I wanted to have the filename as an argument, I tried this:

from pdf2image import convert_from_path
import sys

# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))

# Print all arguments
print ('Argument List:', str(sys.argv))

# Print arguments one by one
print ('First argument:',  str(sys.argv[0]))
print ('Second argument:',  str(sys.argv[1]))

filename = sys.argv[1]
pdfs = r"(filename)"
pages = convert_from_path(pdfs, 350)

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
$ python3 PDF_to_Image1.py Tests_20220530.pdf

Error:
Total number of arguments: 2
Argument List: ['PDF_to_Image1.py', 'Tests_20220530.pdf']
First argument: PDF_to_Image1.py
Second argument: Tests_20220530.pdf
Traceback (most recent call last):
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
    raise ValueError
ValueError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/********/Downloads/OCR/PDF_to_Image1.py", line 17, in <module>
    pages = convert_from_path(pdfs, 350)
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
    page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
  File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
    raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file '(filename)': No such file or directory.
Reply
#4
Seems that it still is a 'file not found' problem.
Maybe try this:

import os

path_to_pdf = os.path.join(r'c:\data', 'pdfdir', 'pdfsubdir', 'mypdf.pdf')
The variable path_to_pdf then holds the full path to your file, joined with the right separator for your OS.
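For what it's worth, the error message in the previous post names the file as '(filename)', which suggests the variable was never substituted: anything inside quotes is literal text, with or without the r prefix. A small sketch of the difference, using the file name from the thread:

```python
# The value that arrived on the command line in the earlier post.
filename = "Tests_20220530.pdf"

# Quoting the name produces the ten characters "(filename)", not the variable's value.
literal = r"(filename)"

# Using the variable directly passes the real file name along.
pdfs = filename

print(literal)
print(pdfs)
```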
Paul
jehoshua likes this post
It is more important to do the right thing, than to do the thing right.(P.Drucker)
Better is the enemy of good. (Montesquieu) = French version for 'kiss'.
Reply
#5
(Jun-11-2022, 05:56 AM)DPaul Wrote: Seems that it still is a 'file not found' problem.

It worked when I hard coded the path and filename.

(Jun-11-2022, 05:56 AM)DPaul Wrote: Maybe try this:

import os

path_to_pdf = os.path.join(r'c:\data', 'pdfdir', 'pdfsubdir', 'mypdf.pdf')
The variable path_to_pdf then holds the full path to your file, joined with the right separator for your OS.

Thanks; I couldn't get that to work, but it prompted me to investigate what the "r" in

pdfs = r"provide path to pdf file"
was used for. I found a good article at https://www.codespeedy.com/how-does-carr...in-python/, then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as:

from pdf2image import convert_from_path
import sys

# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))

# Print all arguments
print ('Argument List:', str(sys.argv))

# Print arguments one by one
print ('First argument:',  str(sys.argv[0]))
print ('Second argument:',  str(sys.argv[1]))

filename = sys.argv[1]
pdfs = filename
pages = convert_from_path(pdfs, 350)

i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
Reply
#6
Assume for illustration that you had the following relative file structure for your project:

├── data
│   ├── csv
│   ├── PDF
│   └── tmp
├── docs
├── src
└── venv
    └── ...

Note: the class below creates the directories if they don't already
exist. It is non-destructive and will leave existing paths intact.

A class to access any part of that structure would look like:
import os
from pathlib import Path


class MyPaths:
    def __init__(self):
        os.chdir(os.path.abspath(os.path.dirname(__file__)))

        HomePath = Path(".")

        rootpath = HomePath / ".."

        self.datapath = rootpath / "data"
        self.datapath.mkdir(exist_ok=True)

        self.csvpath = self.datapath / "csv"
        self.csvpath.mkdir(exist_ok=True)

        self.docpath = rootpath / "docs"
        self.docpath.mkdir(exist_ok=True)

        self.jsonpath = self.datapath / "json"
        self.jsonpath.mkdir(exist_ok=True)

        self.htmlpath = self.datapath / "html"
        self.htmlpath.mkdir(exist_ok=True)

        self.pdfpath = self.datapath / 'PDF'
        self.pdfpath.mkdir(exist_ok=True)

        self.tmppath = self.datapath / "tmp"
        self.tmppath.mkdir(exist_ok=True)


if __name__ == "__main__":
    MyPaths()
Now, to access it from your main script:

from pdf2image import convert_from_path
from MyPaths import MyPaths
import sys

mpath = MyPaths()
# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))
 
# Print all arguments
print ('Argument List:', str(sys.argv))
 
# Print arguments one by one
print ('First argument:',  str(sys.argv[0]))
print ('Second argument:',  str(sys.argv[1]))
 
filename = sys.argv[1]
pdfs = mpath.pdfpath / filename
pages = convert_from_path(pdfs, 350)
 
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
Should do the trick.
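One thing to be aware of: MyPaths calls os.chdir, so any relative file name given on the command line is afterwards resolved against the src directory. If that matters, a variant that leaves the working directory alone could look like the sketch below. `ProjectPaths` and its `root` parameter are hypothetical names for this example, and only three of the directories are shown.

```python
import tempfile
from pathlib import Path


class ProjectPaths:
    """Build project directories from an absolute root without calling os.chdir."""

    def __init__(self, root=None):
        # Default root: the parent of the directory holding this file (the parent of src).
        self.rootpath = Path(root) if root else Path(__file__).resolve().parent.parent
        self.datapath = self.rootpath / "data"
        self.pdfpath = self.datapath / "PDF"
        self.tmppath = self.datapath / "tmp"
        # parents=True creates missing parents; exist_ok=True leaves existing ones alone.
        for path in (self.datapath, self.pdfpath, self.tmppath):
            path.mkdir(parents=True, exist_ok=True)


# Try it against a throwaway directory so nothing in a real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    paths = ProjectPaths(root=tmp)
    created = paths.pdfpath.is_dir()

print(created)
```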
jehoshua likes this post
Reply
#7
jehoshua Wrote:
from pdf2image import convert_from_path
 
pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)
 
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"  
    page.save(image_name, "JPEG")
    i = i+1  
This loop can be written better, in a style often called more Pythonic ✨
from pdf2image import convert_from_path

file_name = r'G:\div_code\Cartoon.pdf'
pages = convert_from_path(file_name, dpi=350)
for index,page in enumerate(pages, start=1):
    page.save(f'Page{index}.jpg') 
jehoshua Wrote:then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as
The r is not the problem. It is used so that escape characters do not mess up the folder name when you use a single \ in path names on Windows.
>>> folder = 'C:\test'
>>> print(folder)
C:	est
>>> # Fix
>>> folder = r'C:\test'
>>> print(folder)
C:\test
In the first case, \t is interpreted as the escape character for Tab,
so the folder can no longer be read.
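The same comparison as a script, extended with a Linux-style path for reference. Paths written with forward slashes contain no backslashes, so the r prefix makes no difference there; the `/home/user/...` path is a made-up example.

```python
# Without the r prefix, \t inside the literal becomes a single Tab character.
plain = 'C:\test'
raw = r'C:\test'

print(len(plain))  # one character shorter, because \t collapsed into a Tab
print(len(raw))
print('\t' in plain, '\t' in raw)

# A POSIX-style path has no backslashes, so both spellings are the same string.
posix_plain = '/home/user/Tests_20220530.pdf'
posix_raw = r'/home/user/Tests_20220530.pdf'
print(posix_plain == posix_raw)
```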
jehoshua likes this post
Reply
#8
(Jun-11-2022, 09:10 AM)Larz60+ Wrote: Assume for illustration that you had the following relative file structure for your project:

├── data
│   ├── csv
│   ├── PDF
│   └── tmp
├── docs
├── src
└── venv
    └── ...

Note: this class will create the directories if they don't already
exist. It is non-destructive, and will leave existing paths intact.

then a class to access any part of that path would look like {snip}

Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called MyPaths.py, is it simply a matter of including

from MyPaths import MyPaths
or do I have to set up something like a library?
Reply
#9
(Jun-11-2022, 10:34 AM)snippsat Wrote: This loop can be written better,often called more pythonic

Thank you.

(Jun-11-2022, 10:34 AM)snippsat Wrote: The r is not the problem. It is used so that escape characters do not mess up the folder name when you use a single \ in path names on Windows.

Thanks for the explanation and the example. But do I need raw strings if I only use Linux? Also, I'm leaning towards passing the filename as an argument rather than hard-coding it, since I have over 2,600 PDF files.
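For a batch that size, one option (a sketch, not tested against a real collection) is to let argparse accept any number of PDF paths and to prefix each image with the PDF's name, so pages from different files don't overwrite each other. The names `build_parser`, `image_name` and `convert_all` are made up for this example; `convert_from_path` is the same pdf2image call used earlier in the thread.

```python
import argparse
from pathlib import Path


def build_parser():
    """Command line: one or more PDF files plus an optional --dpi setting."""
    parser = argparse.ArgumentParser(description="Convert PDF pages to JPEG images.")
    parser.add_argument("pdfs", nargs="+", type=Path, help="one or more PDF files")
    parser.add_argument("--dpi", type=int, default=350, help="render resolution")
    return parser


def image_name(pdf_path, index):
    """Name each image after its source PDF so different PDFs don't collide."""
    return f"{pdf_path.stem}_Page_{index}.jpg"


def convert_all(pdf_paths, dpi):
    # Imported here so the argument handling can be tried without pdf2image installed.
    from pdf2image import convert_from_path

    for pdf_path in pdf_paths:
        if not pdf_path.is_file():
            print(f"Skipping {pdf_path}: not a file")
            continue
        pages = convert_from_path(str(pdf_path), dpi=dpi)
        for index, page in enumerate(pages, start=1):
            page.save(image_name(pdf_path, index), "JPEG")


# Parse a sample command line; a real script would call parse_args() with no
# arguments and then convert_all(args.pdfs, args.dpi).
args = build_parser().parse_args(["Tests_20220530.pdf", "--dpi", "200"])
print(args.pdfs, args.dpi)
print(image_name(args.pdfs[0], 1))
```

With the last three lines swapped for a real parse_args() and convert_all call, the shell can supply the file list, for example by passing *.pdf as the argument.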

As an aside, I came across a tool called ocrmypdf. It took a PDF that was 'unsearchable' and added the text layer needed to make it searchable. I see it is written in Python.
Reply
#10
jehoshua Wrote: Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called MyPaths.py, is it simply a matter of including: [from MyPaths import MyPaths]
or do I have to set up something like a library?
Almost that simple:
Here's a new project example you can try (on Linux; Windows may be somewhat different):
  1. Create a directory named after your project.
  2. cd to that directory.
  3. Create a virtual environment with [python -m venv venv].
  4. Activate the virtual environment with [. ./venv/bin/activate] (on Linux; Windows is different).
  5. Make a directory named [src].
  6. Copy MyPaths.py to the new src directory.
  7. [cd src]
  8. Run MyPaths.py with the command [python MyPaths.py].
  9. All paths will be created, ready to use. You will also have a virtual environment for your project.

Note: You can skip steps 7 and 8, as the directories will be created the first time MyPaths is imported and instantiated in any of your scripts. I haven't tested this outside of a virtual environment since, frankly, I always use one for any new project, but it should work without.
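On the "library" part of the question: the import works as long as the directory holding the module is on sys.path, and Python puts the directory of the script being run there automatically. That is why scripts sitting next to MyPaths.py in src can import it without any packaging. The sketch below illustrates the mechanism with a throwaway module; `demo_paths` and `DemoPaths` are stand-in names, not part of the project above.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny stand-in module into a directory Python does not know about yet.
    (Path(tmp) / "demo_paths.py").write_text("class DemoPaths:\n    name = 'demo'\n")

    # Putting that directory on sys.path is what makes the import succeed.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    module = importlib.import_module("demo_paths")
    sys.path.remove(tmp)

print(module.DemoPaths.name)
```

For scripts that live in a different directory, the same idea applies: either add src to sys.path (or PYTHONPATH), or package the module properly.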
jehoshua likes this post
Reply

