Python Forum
Python: AttributeError: 'PageObject' object has no attribute 'extract_images'
#1
Hello, I am trying to run OCR on some PDFs that contain images, but I get this error:

Traceback (most recent call last):
  File "E:\Carte\BB\17 - Site Leadership\alte\Ionel Balauta\Aryeht\Task 1 - Traduce tot site-ul\Doar Google Web\Andreea\Meditatii\2023\OCR.py", line 31, in <module>
    image = page.extract_images()[0]["obj"]
AttributeError: 'PageObject' object has no attribute 'extract_images'
This is the code:

import os
import PyPDF2
import pytesseract
from PIL import Image
from pdf2image import convert_from_path

# Path to the folder containing PDF files
input_folder = "d:/doc/doc"

# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"

# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]

# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")

    # Extract images from PDF and perform OCR on each image
    images = []
    with open(pdf_path, "rb") as file:
        pdf_reader = PyPDF2.PdfFileReader(file)

        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            image = page.extract_images()[0]["obj"]
            images.append(Image.frombytes("RGB", image.size, image.data))

    # Perform OCR on images and extract text
    text = ""
    for image in images:
        text += pytesseract.image_to_string(image)

    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)

print("Conversion complete!")
Can someone fix my code so it works?
#2
When you have problems, read the documentation.

https://pypdf2.readthedocs.io/en/latest/...bject.html

PageObject does not have an extract_images() method. It has an images property (accessed as page.images, without parentheses) that returns a list of the images on the page. This was a change made in September 2022. If you are using PyPDF2 2.12.0 or newer, use the images property instead of the extract_images() method.
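As a rough illustration of the property-based approach, here is a minimal sketch. It assumes a PyPDF2 2.x release that provides PdfReader and the images property, and that each image object exposes name and data attributes holding the file name and the encoded image bytes. The helper names iter_embedded_images and ocr_embedded_images are made up for this example, and Pillow may not be able to decode every embedded image format.

```python
import io


def iter_embedded_images(reader):
    """Yield (page_number, image_name, raw_bytes) for each image embedded in a PDF.

    `reader` is expected to be a PyPDF2.PdfReader; any object exposing `.pages`
    whose items expose `.images` works as well.
    """
    for page_number, page in enumerate(reader.pages, start=1):
        # `images` is a property, not a method, so there are no parentheses.
        for embedded in page.images:
            yield page_number, embedded.name, embedded.data


def ocr_embedded_images(pdf_path, lang="eng"):
    """OCR every embedded image in one PDF and return the combined text."""
    # Third-party imports stay local so the helper above has no dependencies.
    import pytesseract
    from PIL import Image
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    chunks = []
    for _page_number, _name, data in iter_embedded_images(reader):
        # `data` is an encoded image file (PNG, JPEG, ...), so Pillow decodes it
        # with Image.open(); Image.frombytes() expects raw pixel data instead.
        image = Image.open(io.BytesIO(data))
        chunks.append(pytesseract.image_to_string(image, lang=lang))
    return "\n".join(chunks)
```

Calling ocr_embedded_images(pdf_path) inside the loop over files would replace the extract_images() block from the original script. Pages without embedded images simply contribute no text.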
#3
import os
import pytesseract
from pdf2image import convert_from_path

# Path to the folder containing PDF files
input_folder = "d:/doc/doc"

# Path to the folder where text files will be saved
output_folder = "d:/doc/doc"

# Path to the Tesseract OCR executable (change if necessary)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Get a list of all PDF files in the input folder
files = [f for f in os.listdir(input_folder) if f.endswith(".pdf")]

# Loop through each PDF file and convert it to text using OCR
for file in files:
    pdf_path = os.path.join(input_folder, file)
    txt_path = os.path.join(output_folder, os.path.splitext(file)[0] + ".txt")

    # Convert PDF pages to images
    images = convert_from_path(pdf_path)

    # Perform OCR on images and extract text
    text = ""
    for image in images:
        # lang='ron' selects Tesseract's Romanian language data, which must be installed
        text += pytesseract.image_to_string(image, lang='ron')

    # Save the extracted text to a text file
    with open(txt_path, "w", encoding="utf-8") as txt_file:
        txt_file.write(text)

print("Conversion complete!")