Python Forum
Trouble with encoded data (I think) - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Trouble with encoded data (I think) (/thread-21708.html)



Trouble with encoded data (I think) - fishglue - Oct-10-2019

Hi guys!

My script is reading a PDF using pytesseract OCR.
My goal is to extract text from various PDF formats, extract certain numbers from said text and reuse that data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different formats, none of them was accurate enough.

The main code. I don't fully understand half of what I've done here, but it works.
import io
import csv

import pytesseract
from PIL import Image
from wand.image import Image as wi

# Render each page of the PDF to a 200 dpi PNG via Wand/ImageMagick
pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

# Collect the raw PNG bytes for every page
imageBlobs = []
for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

# OCR each page image with the Norwegian language data
recognized_text = []
for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)
My problems:

The output from pytesseract somehow ends up as one long string, both inside the list and as the "text" object. When I write it to .txt or .csv it looks like several lines, but apparently isn't(??). When I load the file (or the text object) into a pandas dataframe, I get a single column with the entire text extraction inside it, which makes it hard to sort the data.
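What's happening is that image_to_string() returns one string per page, with the printed lines separated by embedded '\n' characters. It isn't several lines until you split it yourself. A minimal sketch (the OCR text here is made up for illustration):

```python
# Hypothetical OCR output: one string per page, newlines embedded inside it
ocr_text = "Invoice 2019-10\nTotal: 1 250,00\n\nDue: 31.10.2019\n"

# splitlines() turns the single string into a list of logical lines
lines = ocr_text.splitlines()
print(lines)  # ['Invoice 2019-10', 'Total: 1 250,00', '', 'Due: 31.10.2019']
```

Once you have a real list of lines, each line can become its own row instead of one giant cell.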

The pandas error I get when not using encoding="unicode_escape":
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2
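That ParserError means read_csv found a line with more delimiters than the first line, so it can't guess a consistent column layout from raw OCR text. Since the actual goal is pulling certain numbers out of the text, it may be simpler to skip the CSV round-trip and extract them directly with a regex. A sketch, with a made-up Norwegian invoice snippet standing in for the real OCR output:

```python
import re

# Hypothetical OCR output; the real text comes from pytesseract
ocr_text = "Fakturanr: 10042\nBeløp: 1250,50\nKID: 0012345678"

# Grab integers and comma-decimal numbers straight from the text
numbers = re.findall(r'\d+(?:,\d+)?', ocr_text)
print(numbers)  # ['10042', '1250,50', '0012345678']
```

The resulting list can then be handed to pandas (or anything else) with the structure already in place.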
Here's some of the stuff I've tried to work around it:

# Attempting to reduce the amount of data by stripping letters and a few symbols
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})


# Writing the output data to a text file, with unicode_escape because
# pandas can't read the data otherwise.
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)
The removed_char output is either lots of empty rows between each row of data, or one long row littered with \n\n\n\n everywhere.
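Those empty rows are the lines that contained only letters before translate() stripped them. One way to handle it is to split on newlines and keep only the lines that still contain something (the sample string here is hypothetical):

```python
# Hypothetical post-translate output: data rows separated by blank lines
removed_char = "12,5\n\n\n88\n\n3,4\n"

# Keep only the lines that still have content after stripping whitespace
clean_lines = [ln for ln in removed_char.splitlines() if ln.strip()]
print(clean_lines)  # ['12,5', '88', '3,4']

# Write them back out one value per line, in plain UTF-8
with open('output1.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(clean_lines))
```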

I tried to remove the \n like this (note: str.replace() returns a new string, so the result has to be assigned back, which my first attempt didn't do):
# Strings are immutable; keep the return value
text = text.replace('\n', '')

with open("output1.txt", 'r', newline=None) as fd:
    stripped = []
    for line in fd:
        stripped.append(line.replace("\n", ""))  # collect the cleaned lines
The \n is still everywhere, or I get huge stretches of blank lines with no data.

I tried some variations of this using 'rb' and 'wb', but then I get an error saying:
TypeError: a bytes-like object is required, not 'str'
And variations of this one, regardless of which encoding I pass to the reader/writer (apart from unicode_escape):
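That TypeError is expected: 'wb' mode only accepts bytes, so a str has to be encoded first, and 'rb' gives back bytes that must be decoded before str methods work on them. A minimal round-trip sketch (the filename and text are just examples):

```python
text = "linje én\nlinje to\n"

# 'wb' wants bytes: encode the str explicitly
with open('output1.bin', 'wb') as f:
    f.write(text.encode('utf-8'))

# 'rb' yields bytes: decode before using str methods like replace()
with open('output1.bin', 'rb') as f:
    restored = f.read().decode('utf-8')

print(restored == text)  # True
```

The UnicodeDecodeError about byte 0xda suggests the file being read isn't actually UTF-8, which is why only unicode_escape appeared to "work"; writing and reading with one explicit encoding avoids the mismatch.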
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position
Am I missing some obvious solution here? I've been stuck for about 25 hours of active research.