pdf2image, poppler and paths
Python Forum (https://python-forum.io), Python Coding, General Coding Help, Thread: pdf2image, poppler and paths (/thread-37451.html)
pdf2image, poppler and paths - jehoshua - Jun-11-2022

I need to run some tests on converting PDFs to images. The article at https://medium.com/towards-data-science/extracting-text-from-scanned-pdf-using-pytesseract-open-cv-cd670ee38052 has been quite helpful. I have pdf2image, poppler, poppler-utils, etc. installed.

I have used the code at https://gist.github.com/akash-ch2812/1e2c0991105d0ed2f0fa2cadbee00362/raw/4c840fcd8ee5ec8e492f968e25d80af84ff21df0/PDF_to_Image.py

    from pdf2image import convert_from_path

    pdfs = r"provide path to pdf file"
    pages = convert_from_path(pdfs, 350)

    i = 1
    for page in pages:
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")
        i = i+1

I have tried various ways of specifying the path and/or filename at line 3, all to no avail. I'm not sure how to determine whether 'poppler' is in the PATH, however a 'locate' shows it is installed. I have looked through the issues for "pdf2image" and the problem is not solved there.

Does this script simply need the "poppler" path, and how do I find that? Also, surely the code can be modified so that a parameter is passed to specify the path/directories. The PDF is in the same path as the script, so I assume the script is failing because it doesn't know where "poppler" is found.

RE: pdf2image, poppler and paths - Larz60+ - Jun-11-2022

The last line of the error refers to line 3 of your code. This needs to be a file name of a pdf file.
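On the question of how to tell whether poppler is on the PATH: pdf2image runs poppler's command-line tools (pdftoppm and pdftocairo), so one way to check is to look those names up with the standard library's shutil.which. This is only a minimal sketch, and the helper name find_poppler_tools is made up for illustration.

```python
import shutil


def find_poppler_tools(names=("pdftoppm", "pdftocairo")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


# Report what was found; nothing here depends on pdf2image being installed.
for tool, location in find_poppler_tools().items():
    print(f"{tool}: {location or 'not found on PATH'}")
```

If the tools live in a directory that is not on PATH, convert_from_path also accepts a poppler_path argument naming that directory.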
RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

Quote: refers to line 3 of your code. This needs to be a file name of a pdf file.

Thanks; I knew it was line 3, it just wasn't specified correctly. The following:

    pdfs = r"~/Downloads/OCR/Tests_20220530.pdf"

didn't work.

    pdfs = r"/home/********/Downloads/OCR/Tests_20220530.pdf"

worked.

    pdfs = r"Tests_20220530.pdf"

worked.

As I wanted to have the filename as an argument, I tried this:

    from pdf2image import convert_from_path
    import sys

    # Print total number of arguments
    print('Total number of arguments:', format(len(sys.argv)))

    # Print all arguments
    print('Argument List:', str(sys.argv))

    # Print arguments one by one
    print('First argument:', str(sys.argv[0]))
    print('Second argument:', str(sys.argv[1]))

    filename = sys.argv[1]
    pdfs = r"(filename)"
    pages = convert_from_path(pdfs, 350)

    i = 1
    for page in pages:
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")
        i = i+1

    $ python3 PDF_to_Image1.py Tests_20220530.pdf
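Two details in the post above are worth spelling out. Python does not expand ~ in a path string by itself (the shell normally does that), which would explain why the "~/Downloads/..." version failed while the full /home/... path worked. And r"(filename)" is the literal text (filename), not the value of the variable. A small sketch of both points follows; resolve_pdf_path is a hypothetical helper name, not part of pdf2image.

```python
from pathlib import Path


def resolve_pdf_path(arg):
    """Expand a leading ~ and return an absolute path for the given argument."""
    return Path(arg).expanduser().resolve()


filename = "Tests_20220530.pdf"

# A string literal never looks inside a variable, raw or not.
print(r"(filename)")  # the literal characters between the quotes
print(filename)       # the value of the variable

# "~" becomes the current user's home directory.
print(resolve_pdf_path("~/Downloads/OCR/Tests_20220530.pdf"))
```

In the script, the result of resolve_pdf_path(sys.argv[1]) could then be passed to convert_from_path in place of the hard-coded string.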
RE: pdf2image, poppler and paths - DPaul - Jun-11-2022

Seems that it still is a 'file not found' problem. Maybe try this:

    import os
    path_to_pdf = os.path.join('c:\data' , 'pdfdir', 'pdfsubdir','mypdf.pdf')

Now your file is always found by using the variable path_to_pdf.

Paul

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

(Jun-11-2022, 05:56 AM)DPaul Wrote: Seems that it still is a 'file not found' problem.

It worked when I hard coded the path and filename.

(Jun-11-2022, 05:56 AM)DPaul Wrote: Maybe try this:

Thanks; I couldn't get that to work, however it caused me to investigate what that "r" in

    pdfs = r"provide path to pdf file"

was used for. Found a good article at https://www.codespeedy.com/how-does-carriage-return-work-in-python/ , then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as:

    from pdf2image import convert_from_path
    import sys

    # Print total number of arguments
    print('Total number of arguments:', format(len(sys.argv)))

    # Print all arguments
    print('Argument List:', str(sys.argv))

    # Print arguments one by one
    print('First argument:', str(sys.argv[0]))
    print('Second argument:', str(sys.argv[1]))

    filename = sys.argv[1]
    pdfs = filename
    pages = convert_from_path(pdfs, 350)

    i = 1
    for page in pages:
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")
        i = i+1

RE: pdf2image, poppler and paths - Larz60+ - Jun-11-2022

Assume for illustration that you had the following relative file structure for your project:

    ├── data
    │   ├── csv
    │   └── tmp
    ├── docs
    ├── src
    └── venv
        └── ...

Note: this class will create the directories if they don't already exist. It is non-destructive, and will leave existing paths intact.

Then a class to access any part of that path would look like:

    import os
    from pathlib import Path


    class MyPaths:
        def __init__(self):
            # Work relative to the directory that contains this file
            os.chdir(os.path.abspath(os.path.dirname(__file__)))
            HomePath = Path(".")
            rootpath = HomePath / ".."

            self.datapath = rootpath / "data"
            self.datapath.mkdir(exist_ok=True)

            self.csvpath = self.datapath / "csv"
            self.csvpath.mkdir(exist_ok=True)

            self.docpath = rootpath / "docs"
            self.docpath.mkdir(exist_ok=True)

            self.jsonpath = self.datapath / "json"
            self.jsonpath.mkdir(exist_ok=True)

            self.htmlpath = self.datapath / "html"
            self.htmlpath.mkdir(exist_ok=True)

            self.pdfpath = self.datapath / 'PDF'
            self.pdfpath.mkdir(exist_ok=True)

            self.tmppath = self.datapath / "tmp"
            self.tmppath.mkdir(exist_ok=True)


    if __name__ == "__main__":
        MyPaths()

Now to access from your main script:

    from pdf2image import convert_from_path
    from MyPaths import MyPaths
    import sys

    mpath = MyPaths()

    # Print total number of arguments
    print('Total number of arguments:', format(len(sys.argv)))

    # Print all arguments
    print('Argument List:', str(sys.argv))

    # Print arguments one by one
    print('First argument:', str(sys.argv[0]))
    print('Second argument:', str(sys.argv[1]))

    filename = sys.argv[1]
    pdfs = mpath.pdfpath / filename
    pages = convert_from_path(pdfs, 350)

    i = 1
    for page in pages:
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")
        i = i+1

Should do the trick.

RE: pdf2image, poppler and paths - snippsat - Jun-11-2022

jehoshua Wrote:

    from pdf2image import convert_from_path

    pdfs = r"provide path to pdf file"
    pages = convert_from_path(pdfs, 350)

    i = 1
    for page in pages:
        image_name = "Page_" + str(i) + ".jpg"
        page.save(image_name, "JPEG")
        i = i+1

This loop can be written better, often called more pythonic ✨

    from pdf2image import convert_from_path

    file_name = r'G:\div_code\Cartoon.pdf'
    pages = convert_from_path(file_name, dpi=350)
    for index, page in enumerate(pages, start=1):
        page.save(f'Page{index}.jpg')

jehoshua Wrote: then realised I didn't need that "r" at all. As I needed the PDF filename to be an argument, it now works okay as

The r is not the problem. It is used so that escape characters do not mess up the folder name when a single \ is used in path names on Windows.

    >>> folder = 'C:\test'
    >>> print(folder)
    C:      est
    >>> # Fix
    >>> folder = r'C:\test'
    >>> print(folder)
    C:\test

So in the first case \t is interpreted as an escape character (Tab), and the folder can not be read.

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

(Jun-11-2022, 09:10 AM)Larz60+ Wrote: Assume for illustration that you had the following relative file structure for your project:

Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called myPath.py , is it simply a matter of including

    from MyPaths import MyPaths

or do I have to set up something like a library?

RE: pdf2image, poppler and paths - jehoshua - Jun-11-2022

(Jun-11-2022, 10:34 AM)snippsat Wrote: This loop can be written better,often called more pythonic

Thank you.

(Jun-11-2022, 10:34 AM)snippsat Wrote: The r is not the problem

Thanks for explaining and for the example. But do I need to code like that if I only use Linux? Also, I'm leaning more towards passing arguments rather than hard coding the filename (I have over 2,600 PDF files).

As an aside, I came across a tool called ocrmypdf. It converted a PDF that was 'unsearchable' and added the necessary text layer to make it searchable. I see it is written in Python.

RE: pdf2image, poppler and paths - Larz60+ - Jun-12-2022

jehoshua Wrote: Thanks. I like the file structure and the class methods. If I wanted to use the class in other Python scripts and the file was called myPath.py , is it simply a matter of including: [from MyPaths import MyPaths]

Almost that simple. Here's a new project example you can try (Linux; Windows may be somewhat different):
Note: You can eliminate steps 7 and 8, as this will happen the first time MyPaths is imported into any of your scripts. I haven't tested outside of a virtual environment as, frankly, I always use a virtual environment for any new project, but this should work without.
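For the case raised earlier in the thread (over 2,600 PDF files, with file names passed as arguments rather than hard coded), one possible shape is sketched below. Everything apart from convert_from_path is an assumption made for illustration: the function names, the --dpi and --out options, and the output naming scheme, which puts the PDF's name in each image name so pages from different files do not overwrite each other.

```python
import argparse
from pathlib import Path


def page_image_name(pdf_path, page_number):
    """Build a per-page output name from the PDF's stem and the page number."""
    return f"{Path(pdf_path).stem}_page_{page_number}.jpg"


def convert_pdf(pdf_path, out_dir, dpi=350):
    """Convert one PDF to JPEG files in out_dir and return the paths written."""
    # Imported here so the helper above can be used without pdf2image installed.
    from pdf2image import convert_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, page in enumerate(convert_from_path(pdf_path, dpi=dpi), start=1):
        target = out_dir / page_image_name(pdf_path, number)
        page.save(target, "JPEG")
        written.append(target)
    return written


def main(argv=None):
    """Parse the command line and convert every PDF that was named on it."""
    parser = argparse.ArgumentParser(description="Convert PDFs to JPEG pages.")
    parser.add_argument("pdfs", nargs="+", type=Path, help="PDF files to convert")
    parser.add_argument("--dpi", type=int, default=350, help="render resolution")
    parser.add_argument("--out", type=Path, default=Path("."), help="output directory")
    args = parser.parse_args(argv)
    for pdf in args.pdfs:
        for image in convert_pdf(pdf, args.out, dpi=args.dpi):
            print(image)
```

Saved as a script with a final main() call, this could be run as python3 script.py *.pdf --out images, letting the shell expand the file list.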