Oct-10-2019, 08:15 PM
Hi guys!
My script reads a PDF using pytesseract OCR.
My goal is to extract text from various PDF formats, pull certain numbers out of that text, and reuse the data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different formats, they weren't accurate enough.
The main code. I don't fully understand half of what I've done here, but it works:
import io
import csv
import pytesseract
from PIL import Image
from wand.image import Image as wi

pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

imageBlobs = []
for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

recognized_text = []
for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)

My problems:
The output from my pytesseract OCR somehow ends up as a single string, both as an element of the list and as the "text" object. When I write it to .txt or .csv it looks like several lines, but apparently isn't(??). When I load the file or the output object (text) into a pandas DataFrame, it has one column with the entire text extraction inside it, which makes it hard to sort the data.
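To illustrate the shape issue: each page of OCR output is one long string with embedded newlines, so a minimal sketch (using a made-up stand-in string, not my real OCR output) of splitting it into rows before handing it to pandas would be:

```python
# Stand-in for one page of OCR output (hypothetical content)
text = "Invoice 123\nTotal 456\nDate 2019-10-10"

# splitlines() turns the single string into a list of rows,
# which pandas can take as one row per list element
rows = text.splitlines()
print(rows)  # ['Invoice 123', 'Total 456', 'Date 2019-10-10']
```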
The pandas error I get when not using encoding="unicode_escape":
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2

Here's some of the stuff I've tried to work around it:
# Attempting to reduce the amount of data
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})

# Writing the output data to a text file, with unicode_escape because pandas can't read my data otherwise.
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)

The output of the removed_char object is either loads of empty rows between each row containing data, or one long row with \n\n\n\n everywhere.
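Since my actual goal is the numbers, here is a sketch of pulling them out directly with a regex instead of deleting letters with translate. The sample string and the pattern are assumptions for illustration, not my real data:

```python
import re

# Hypothetical OCR text; \d+(?:,\d+)? matches integers and
# comma-decimal numbers as they appear in Norwegian documents
text = "Fakturanr: 1234\nBeløp: 567,89 kr\n"
numbers = re.findall(r'\d+(?:,\d+)?', text)
print(numbers)  # ['1234', '567,89']
```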
I tried to remove the \n doing this:
text.replace('\n', '')
with open("output1.txt", 'r', newline=None) as fd:
    for line in fd:
        line = line.replace("\n", "")

The \n is still everywhere, or huge spaces of lines with no data.
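One thing worth noting about the attempts above: Python strings are immutable, so str.replace returns a new string rather than modifying the original. A minimal sketch (with a made-up string) of keeping the result:

```python
# Hypothetical input string
text = "line one\nline two\n"

# replace() returns a NEW string; it must be assigned to something
cleaned = text.replace("\n", " ").strip()
print(cleaned)  # line one line two
```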
I tried some variations of this using 'rb' and 'wb' but then I get an error saying
TypeError: a bytes-like object is required, not 'str'

Various variations of this one, regardless of what I pass into the reader/writer (apart from unicode_escape):
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position

Am I missing some obvious solution to this? I've been stuck for about 25 hours of active research.