Posts: 122
Threads: 24
Joined: Dec 2017
Jun-11-2022, 02:21 AM
(This post was last modified: Jun-11-2022, 02:37 AM by Larz60+.
Edit Reason: fixed error tags
)
I need to run some tests on converting PDFs to images. The article at https://medium.com/towards-data-science/...670ee38052 has been quite helpful.
I have pdf2image, poppler, poppler-utils, etc. installed. I have used the code at https://gist.github.com/akash-ch2812/1e2...o_Image.py
from pdf2image import convert_from_path
pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
Error:
$ python3 PDF_to_Image.py Tests_20220530.pdf
Traceback (most recent call last):
File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
raise ValueError
ValueError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/********/Downloads/OCR/PDF_to_Image.py", line 4, in <module>
pages = convert_from_path(pdfs, 350)
File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file 'provide path to pdf file': No such file or directory.
So I have tried various ways of specifying the path and/or filename at line 3, all to no avail. I'm not sure how to determine whether 'poppler' is in the PATH; however, 'locate' shows it is installed.
I have looked through the issues for "pdf2image", but none of them solved the problem.
Does this script simply need the "poppler" path, and how do I find that? Also, surely the code can be modified so that a parameter is passed to specify paths/directories. The PDF is in the same directory as the script, so I assume the script is failing because it doesn't know where "poppler" is found.
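On the PATH question: one way to check from Python is `shutil.which`, which returns the full path of an executable found on the PATH, or `None` if it isn't there. A minimal sketch, assuming the Poppler command-line tools of interest are `pdfinfo` and `pdftoppm`; the helper name `find_poppler_tools` is made up for illustration:

```python
import shutil


def find_poppler_tools(tools=("pdfinfo", "pdftoppm")):
    """Map each tool name to its full path on PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


# Report what was found; None means the tool is not on the PATH.
for tool, location in find_poppler_tools().items():
    print(f"{tool}: {location}")
```

Note that the final line of the traceback above complains about the file 'provide path to pdf file', which suggests the Poppler tool itself was found and ran; the thing it could not open was the PDF.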
Posts: 12,022
Threads: 484
Joined: Sep 2016
The last line of the error, "I/O Error: Couldn't open file 'provide path to pdf file': No such file or directory.",
refers to line 3 of your code. This needs to be a file name of a pdf file.
Posts: 122
Threads: 24
Joined: Dec 2017
Quote:refers to line 3 of your code. This needs to be a file name of a pdf file.
Thanks; I knew it was line 3, but it just wasn't specified correctly. I tried the following:
pdfs = r"~/Downloads/OCR/Tests_20220530.pdf" didn't work
pdfs = r"/home/********/Downloads/OCR/Tests_20220530.pdf" worked
pdfs = r"Tests_20220530.pdf" worked
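A likely reason the `~` version failed: the tilde is normally expanded by the shell, not by Python, so a string starting with `~` is passed along unchanged. A small sketch using `Path.expanduser` from the standard library; the helper name `resolve_pdf_path` is only for illustration:

```python
from pathlib import Path


def resolve_pdf_path(raw):
    """Expand a leading ~ to the user's home directory; other paths are unchanged."""
    return Path(raw).expanduser()


# The tilde form becomes an absolute path under the home directory.
print(resolve_pdf_path("~/Downloads/OCR/Tests_20220530.pdf"))
# A plain relative name is left as it is.
print(resolve_pdf_path("Tests_20220530.pdf"))
```

`os.path.expanduser` does the same job if you prefer to stay with plain strings.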
As I wanted to pass the filename as an argument, I tried this:
from pdf2image import convert_from_path
import sys
# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))
# Print all arguments
print ('Argument List:', str(sys.argv))
# Print arguments one by one
print ('First argument:', str(sys.argv[0]))
print ('Second argument:', str(sys.argv[1]))
filename = sys.argv[1]
pdfs = r"(filename)"
pages = convert_from_path(pdfs, 350)
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
$ python3 PDF_to_Image1.py Tests_20220530.pdf
Error:
Total number of arguments: 2
Argument List: ['PDF_to_Image1.py', 'Tests_20220530.pdf']
First argument: PDF_to_Image1.py
Second argument: Tests_20220530.pdf
Traceback (most recent call last):
File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 479, in pdfinfo_from_path
raise ValueError
ValueError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/********/Downloads/OCR/PDF_to_Image1.py", line 17, in <module>
pages = convert_from_path(pdfs, 350)
File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 98, in convert_from_path
page_count = pdfinfo_from_path(pdf_path, userpw, poppler_path=poppler_path)["Pages"]
File "/home/********/.local/lib/python3.9/site-packages/pdf2image/pdf2image.py", line 488, in pdfinfo_from_path
raise PDFPageCountError(
pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file '(filename)': No such file or directory.
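The last line of that traceback shows that pdf2image was handed the literal text '(filename)'. Anything inside quotes is just text, with or without the r prefix; the variable is only used when it is written without quotes. A tiny illustration, where the hard-coded `filename` stands in for `sys.argv[1]`:

```python
filename = "Tests_20220530.pdf"  # stand-in for sys.argv[1]

literal = r"(filename)"   # the ten characters "(filename)", not the variable
from_variable = filename  # the value the variable actually holds

print(literal)
print(from_variable)
```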
Posts: 741
Threads: 122
Joined: Dec 2017
It seems that it is still a 'file not found' problem.
Maybe try this:
import os
path_to_pdf = os.path.join(r'c:\data', 'pdfdir', 'pdfsubdir', 'mypdf.pdf')
Now the file can be located through the variable path_to_pdf, provided it exists at that location.
Paul
Posts: 122
Threads: 24
Joined: Dec 2017
Jun-11-2022, 07:08 AM
(This post was last modified: Jun-11-2022, 07:08 AM by jehoshua.)
(Jun-11-2022, 05:56 AM)DPaul Wrote: Seems that it still is a 'file not found' problem.
It worked when I hard coded the path and filename.
(Jun-11-2022, 05:56 AM)DPaul Wrote: Maybe try this:
import os
path_to_pdf = os.path.join(r'c:\data', 'pdfdir', 'pdfsubdir', 'mypdf.pdf')
Now your file is always found by using the variable path_to_pdf
Thanks; I couldn't get that to work, but it prompted me to investigate what the "r" in
pdfs = r"provide path to pdf file" is used for. I found a good article at https://www.codespeedy.com/how-does-carr...in-python/ and then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as:
from pdf2image import convert_from_path
import sys
# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))
# Print all arguments
print ('Argument List:', str(sys.argv))
# Print arguments one by one
print ('First argument:', str(sys.argv[0]))
print ('Second argument:', str(sys.argv[1]))
filename = sys.argv[1]
pdfs = filename
pages = convert_from_path(pdfs, 350)
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
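The script above fails with an IndexError if it is run without an argument. One possible alternative is the standard `argparse` module, which prints a usage message instead. This is only a sketch: the function names `parse_args` and `main` and the `--dpi` option are assumptions for illustration, and the default of 350 simply mirrors the value used above.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the command line: one PDF path plus an optional --dpi value."""
    parser = argparse.ArgumentParser(description="Convert PDF pages to JPEG images.")
    parser.add_argument("pdf", type=Path, help="path to the PDF file")
    parser.add_argument("--dpi", type=int, default=350, help="render resolution")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    pdf = args.pdf.expanduser()
    if not pdf.is_file():
        # Fail early with a readable message instead of a long traceback.
        raise SystemExit(f"No such PDF file: {pdf}")
    # Imported here so the argument handling can be tried without pdf2image.
    from pdf2image import convert_from_path
    pages = convert_from_path(str(pdf), dpi=args.dpi)
    for index, page in enumerate(pages, start=1):
        page.save(f"Page_{index}.jpg", "JPEG")
```

To use it as a script, call main() at the bottom of the file and run it as before, for example: python3 PDF_to_Image1.py Tests_20220530.pdf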
Posts: 12,022
Threads: 484
Joined: Sep 2016
Jun-11-2022, 09:10 AM
(This post was last modified: Jun-11-2022, 09:31 AM by Larz60+.)
Assume for illustration that you had the following relative file structure for your project:
├── data
│ ├── csv
│ ├── PDF
│ └── tmp
├── docs
├── src
└── venv
└── ...
Note: this class will create the directories if they don't already
exist. It is non-destructive, and will leave existing paths intact.
A class to access any part of that path would then look like:
import os
from pathlib import Path
class MyPaths:
    def __init__(self):
        # Work relative to the directory that contains this file (src)
        os.chdir(os.path.abspath(os.path.dirname(__file__)))
        HomePath = Path(".")
        rootpath = HomePath / ".."
        self.datapath = rootpath / "data"
        self.datapath.mkdir(exist_ok=True)
        self.csvpath = self.datapath / "csv"
        self.csvpath.mkdir(exist_ok=True)
        self.docpath = rootpath / "docs"
        self.docpath.mkdir(exist_ok=True)
        self.jsonpath = self.datapath / "json"
        self.jsonpath.mkdir(exist_ok=True)
        self.htmlpath = self.datapath / "html"
        self.htmlpath.mkdir(exist_ok=True)
        self.pdfpath = self.datapath / 'PDF'
        self.pdfpath.mkdir(exist_ok=True)
        self.tmppath = self.datapath / "tmp"
        self.tmppath.mkdir(exist_ok=True)
if __name__ == "__main__":
    MyPaths()
Now, to access it from your main script:
from pdf2image import convert_from_path
from MyPaths import MyPaths
import sys
mpath = MyPaths()
# Print total number of arguments
print ('Total number of arguments:', format(len(sys.argv)))
# Print all arguments
print ('Argument List:', str(sys.argv))
# Print arguments one by one
print ('First argument:', str(sys.argv[0]))
print ('Second argument:', str(sys.argv[1]))
filename = sys.argv[1]
pdfs = mpath.pdfpath / filename
pages = convert_from_path(pdfs, 350)
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
Should do the trick.
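One thing to be aware of with the class above: the os.chdir call changes the working directory to the folder holding MyPaths.py, so relative names such as "Page_1.jpg" end up being saved there. If that is not wanted, a variant could take the project root as a parameter and leave the working directory alone. A rough sketch, where `ProjectPaths` and `demo` are hypothetical names and only a few of the directories are shown:

```python
import tempfile
from pathlib import Path


class ProjectPaths:
    """Hypothetical variant of MyPaths that takes the project root explicitly."""

    def __init__(self, root):
        self.root = Path(root).resolve()
        self.datapath = self.root / "data"
        self.pdfpath = self.datapath / "PDF"
        self.tmppath = self.datapath / "tmp"
        # Non-destructive: existing directories are left as they are.
        for path in (self.datapath, self.pdfpath, self.tmppath):
            path.mkdir(parents=True, exist_ok=True)


def demo():
    """Build the tree inside a throwaway directory and report on data/PDF."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = ProjectPaths(tmp)
        return paths.pdfpath.is_dir(), paths.pdfpath.name
```

From a script inside src, the root could be given as Path(__file__).resolve().parent.parent, assuming the layout shown above.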
Posts: 7,312
Threads: 123
Joined: Sep 2016
Jun-11-2022, 10:34 AM
(This post was last modified: Jun-11-2022, 10:37 AM by snippsat.)
jehoshua Wrote:
from pdf2image import convert_from_path
pdfs = r"provide path to pdf file"
pages = convert_from_path(pdfs, 350)
i = 1
for page in pages:
    image_name = "Page_" + str(i) + ".jpg"
    page.save(image_name, "JPEG")
    i = i+1
This loop can be written more cleanly, in a style often called more Pythonic ✨
from pdf2image import convert_from_path
file_name = r'G:\div_code\Cartoon.pdf'
pages = convert_from_path(file_name, dpi=350)
for index, page in enumerate(pages, start=1):
    page.save(f'Page{index}.jpg')
jehoshua Wrote:then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as
The r is not the problem. It is used so that escape characters do not mess up folder names when a single \ is used in path names on Windows.
>>> folder = 'C:\test'
>>> print(folder)
C:      est
>>> # Fix
>>> folder = r'C:\test'
>>> print(folder)
C:\test
In the first case, \t is interpreted as an escape character (a tab), so the folder can no longer be found.
Posts: 122
Threads: 24
Joined: Dec 2017
(Jun-11-2022, 09:10 AM)Larz60+ Wrote: Assume for illustration that you had the following relative file structure for your project:
├── data
│ ├── csv
│ ├── PDF
│ └── tmp
├── docs
├── src
└── venv
└── ...
Note: this class will create the directories if they don't already
exist. It is non-destructive, and will leave existing paths intact.
then a class to access any part of that path would look like {snip}
Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called myPath.py, is it simply a matter of including
from MyPaths import MyPaths or do I have to set up something like a library?
Posts: 122
Threads: 24
Joined: Dec 2017
(Jun-11-2022, 10:34 AM)snippsat Wrote: This loop can be written better,often called more pythonic
Thank you.
(Jun-11-2022, 10:34 AM)snippsat Wrote: The r is not the problem. It is used so that escape characters do not mess up folder names when a single \ is used in path names on Windows.
Thanks for explaining, and for the example. But do I need to code like that if I only use Linux? Also, I'm leaning more towards passing the filename as an argument rather than hard-coding it (I have over 2,600 PDF files).
As an aside, I came across a tool called ocrmypdf. It converted a PDF that was 'unsearchable' and added the necessary text layer to make it searchable. I see it is written in Python.
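On the Linux question: the r prefix only changes how backslashes in a string literal are read, so for a path made of forward slashes it makes no difference, and a value that arrives through sys.argv is never processed for escapes at all. A short check, where the /home/user path is made up for illustration:

```python
plain = 'C:\test'    # \t is read as one tab character
raw = r'C:\test'     # backslash and t are kept as two characters
print(len(plain), len(raw))  # 6 7

posix = '/home/user/test.pdf'
posix_raw = r'/home/user/test.pdf'
print(posix == posix_raw)  # True: no backslashes, so r changes nothing
```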
Posts: 12,022
Threads: 484
Joined: Sep 2016
Jun-12-2022, 01:56 PM
(This post was last modified: Jun-12-2022, 01:56 PM by Larz60+.)
jehoshua Wrote:Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called myPath.py , is it simply a matter of including: [from MyPaths import MyPaths]
or do I have to setup something like a library ?
Almost that simple.
Here's a new project example you can try (this is for Linux; Windows may be somewhat different):
1. Create a directory named whatever your project name is
2. cd to that directory
3. Create a virtual environment with [python -m venv venv]
4. Activate the virtual environment with (for Linux; Windows is different) [. ./venv/bin/activate]
5. Make a directory named [src]
6. Copy MyPaths.py to the new src directory
7. [cd src]
8. Run MyPaths.py with the command [python MyPaths.py]
9. All paths will be created, ready to use. You will also have a virtual environment for your project
Note: You can eliminate steps 7 and 8, as this will happen the first time MyPaths is imported into any of your scripts. I haven't tested outside of a virtual environment as, frankly, I always use a virtual environment for any new project, but this should work without one.
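Some background on why the plain import works: Python searches the directory of the script being run first, so a MyPaths.py sitting next to the script needs no library setup. The file name has to match the module name exactly, including case on Linux, so a file called myPath.py would not satisfy "from MyPaths import MyPaths". For scripts living in another directory, the folder holding the module has to be on sys.path (or PYTHONPATH). A sketch of that last case, with `import_from_dir`, `demo` and the module name `demo_paths_module` all invented for illustration:

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_from_dir(module_name, directory):
    """Import module_name from directory by putting that directory on sys.path."""
    directory = str(Path(directory).resolve())
    if directory not in sys.path:
        sys.path.insert(0, directory)
    # Make sure a freshly written module file is noticed by the import system.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


def demo():
    """Write a tiny module into a temporary directory and import it from there."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "demo_paths_module.py").write_text("GREETING = 'hello'\n")
        module = import_from_dir("demo_paths_module", tmp)
        return module.GREETING
```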