Oct-10-2019, 08:15 PM
Hi guys!
My script reads a PDF using pytesseract OCR.
My goal is to extract text from various PDF formats, pull certain numbers out of that text, and reuse the data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different formats, they weren't accurate enough.
The main code. I don't fully understand half of what I've done here, but it works:
import io
import csv
import pytesseract
from PIL import Image
from wand.image import Image as wi

pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

imageBlobs = []
for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

recognized_text = []
for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)

My problems:
The output from my pytesseract OCR somehow ends up as a single string, both as an element of the list and as the "text" object. When I write it to .txt or .csv it looks like several lines, but apparently isn't(??). When I load the file or the output object (text) into a pandas DataFrame, it has one column with the entire text extraction inside it, which makes it hard to sort the data.
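To illustrate the shape issue: each page of OCR output is one long string with embedded newlines, so a minimal sketch (using a made-up stand-in string, not my real OCR output) of splitting it into rows before handing it to pandas would be:

```python
# Stand-in for one page of OCR output (hypothetical content)
text = "Invoice 123\nTotal 456\nDate 2019-10-10"

# splitlines() turns the single string into a list of rows,
# which pandas can take as one row per list element
rows = text.splitlines()
print(rows)  # ['Invoice 123', 'Total 456', 'Date 2019-10-10']
```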
The pandas error I get when not using encoding="unicode_escape":
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2

Here's some of the stuff I've tried to work around it:
# Attempting to reduce the amount of data
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})

# Writing the output data to a text file, with unicode_escape because pandas can't read my data otherwise.
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)

The output of the removed_char object is either loads of empty rows between each row containing data, or one long row with \n\n\n\n everywhere.
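Since my actual goal is the numbers, here is a sketch of pulling them out directly with a regex instead of deleting letters with translate. The sample string and the pattern are assumptions for illustration, not my real data:

```python
import re

# Hypothetical OCR text; \d+(?:,\d+)? matches integers and
# comma-decimal numbers as they appear in Norwegian documents
text = "Fakturanr: 1234\nBeløp: 567,89 kr\n"
numbers = re.findall(r'\d+(?:,\d+)?', text)
print(numbers)  # ['1234', '567,89']
```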
I tried to remove the \n doing this:
text.replace('\n', '')
with open("output1.txt", 'r', newline=None) as fd:
    for line in fd:
        line = line.replace("\n", "")

The \n is still everywhere, or huge spaces of lines with no data.
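One thing worth noting about the attempts above: Python strings are immutable, so str.replace returns a new string rather than modifying the original. A minimal sketch (with a made-up string) of keeping the result:

```python
# Hypothetical input string
text = "line one\nline two\n"

# replace() returns a NEW string; it must be assigned to something
cleaned = text.replace("\n", " ").strip()
print(cleaned)  # line one line two
```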
I tried some variations of this using 'rb' and 'wb' but then I get an error saying
TypeError: a bytes-like object is required, not 'str'

Various variations of this one, regardless of what I pass into the reader/writer (apart from unicode_escape):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position

Am I missing some obvious solution to this? I've been stuck for about 25 hours of active research.