Python Forum
Trouble with encoded data (I think)
Hi guys!

My script reads a PDF using pytesseract OCR.
My goal is to extract text from PDFs in various formats, pull certain numbers out of that text, and reuse the data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different formats, none of them was accurate enough.

The main code (I don't fully understand half of what I've done here, but it works):
import io

import pytesseract
from PIL import Image
from wand.image import Image as wi

# Render each PDF page to a 200 dpi PNG via wand/ImageMagick
pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

imageBlobs = []
for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

# OCR each page blob; '--psm 1' = automatic page segmentation with OSD
recognized_text = []
for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)
My problems:

The output from my pytesseract OCR somehow ends up as a single string, both in the list and in the `text` object. When I write it to .txt or .csv it looks like several lines, but apparently it isn't(??). When I load the file or the `text` object into a pandas DataFrame, I get one column with the entire text extraction inside it, which makes the data hard to sort.
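To illustrate what I mean (with a made-up stand-in for the real OCR output): the whole page seems to be one string with '\n' inside it, and splitlines() would turn it into an actual list of rows:

```python
# Made-up stand-in for what pytesseract returns for one page
text = "Faktura 1234\nBel\u00f8p 500,00\n\nForfall 01.01.2020"

print(repr(text))   # one single string, with the '\n' characters visible

# A real list, one element per line, with empty rows dropped
lines = [ln for ln in text.splitlines() if ln.strip()]
print(lines)
```

Passing such a list to pandas (`pd.DataFrame(lines)`) would give one row per OCR line instead of everything in one cell.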

The pandas error I get when not using encoding="unicode_escape":
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2
Here's some of the stuff I've tried to work around it:

# Attempting to reduce the amount of data
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})

# Writing the output data to a text file, with unicode_escape
# because pandas can't read my data otherwise.
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)
The output of the removed_char object is either lots of empty rows between each row of data, or one long row with \n\n\n\n everywhere.

I tried to remove the \n like this (fixed so the results aren't thrown away, since str.replace returns a new string):
text = text.replace('\n', '')

with open("output1.txt", 'r', newline=None) as fd:
    cleaned = []
    for line in fd:
        cleaned.append(line.replace("\n", ""))
The \n characters are still everywhere, or there are huge runs of lines with no data.
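A sketch of what I think should collapse the runs of newlines in one go (a regex instead of replace; the sample string is made up, not my real OCR data):

```python
import re

# Made-up OCR-like output with runs of blank lines in it
text = "Total\n\n\n\n500,00\n\n\nMVA\n125,00\n"

# Squeeze every run of 2+ newlines down to one, then trim the ends.
# Note: re.sub (like str.replace) returns a NEW string, so assign it.
collapsed = re.sub(r'\n{2,}', '\n', text).strip('\n')
print(collapsed)
```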

I tried some variations of this using 'rb' and 'wb', but then I get an error saying
TypeError: a bytes-like object is required, not 'str'
And with various other variations, regardless of what encoding I pass to the reader/writer (apart from unicode_escape):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position
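For the record, a round-trip sketch with one explicit encoding on both sides (utf-8 assumed), which I believe avoids both the bytes-vs-str TypeError and the decode error:

```python
# Write and read in text mode with the SAME explicit encoding;
# text mode takes str (not bytes), so no 'rb'/'wb' needed.
text = "Bel\u00f8p: 500,00\nForfall: 01.01.2020\n"

with open('output1.txt', 'w', encoding='utf-8') as f:
    f.write(text)

with open('output1.txt', 'r', encoding='utf-8') as f:
    roundtrip = f.read()

print(roundtrip == text)   # True
```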
Am I missing some obvious solution to this? I've been stuck for about 25 hours of active research.