Trouble with encoded data (I think) - Printable Version

Python Forum (https://python-forum.io)
Forum: Python Coding (https://python-forum.io/forum-7.html)
Forum: General Coding Help (https://python-forum.io/forum-8.html)
Thread: Trouble with encoded data (I think) (/thread-21708.html)
Trouble with encoded data (I think) - fishglue - Oct-10-2019

Hi guys! My script reads a PDF using pytesseract OCR. My goal is to extract text from PDFs in various formats, pull certain numbers out of that text, and reuse the data elsewhere. I tried a few PDF reader modules, but since my PDFs come in several different layouts, they weren't accurate enough.

The main code. I don't fully understand half of what I've done here, but it works:

import io
import csv

import pytesseract
from PIL import Image
from wand.image import Image as wi

# Render each PDF page to a PNG blob via Wand/ImageMagick
pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')

imageBlobs = []
for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

# OCR each page image with Norwegian language data
recognized_text = []
for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)

My problems: the output from my pytesseract OCR somehow ends up being one line, both as the list entry and as the "text" object. When I write it to .txt or .csv it looks like several lines, but in fact is not(??). When I load the file or the output object (text) into a pandas DataFrame, it has one column with the entire text extraction inside it, which makes it hard to sort the data.

The pandas error I get when not using encoding="unicode_escape":

pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2

Here's some of the stuff I've tried to work around it:

# Attempting to reduce the amount of data
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})

# Writing the output data to a text file, with unicode_escape because pandas can't read my data otherwise
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)

The output of the removed_char object is either loads of empty rows between each row containing data, or one long row with loads of \n\n\n\n everywhere.
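A minimal sketch of what is probably happening above (the sample text is made up; real pytesseract output will differ): pytesseract returns each page as ONE string that contains literal '\n' characters, so it prints as several lines while still being a single object, and pandas then sees one giant field.

```python
# page_text is a hypothetical stand-in for one OCR'd page;
# pytesseract.image_to_string() returns the whole page as one string
# with embedded '\n' characters.
page_text = "Faktura 12345\n\nBel\u00f8p: 1 250,00\nDato: 10.10.2019\n"

# splitlines() turns the page into a list with one entry per line;
# filtering out blank entries removes the empty rows between data rows.
lines = [line for line in page_text.splitlines() if line.strip()]
print(lines)  # → ['Faktura 12345', 'Beløp: 1 250,00', 'Dato: 10.10.2019']
```

Collecting `lines` across all pages and passing that list to `pandas.DataFrame` should give one row per OCR line instead of one row holding the whole page, with no unicode_escape round-trip needed.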
I tried to remove the \n by doing this:

text.replace('\n', '')

with open("output1.txt", 'r', newline=None) as fd:
    for line in fd:
        line = line.replace("\n", "")

The \n is still everywhere, or there are huge runs of lines with no data. I tried some variations of this using 'rb' and 'wb', but then I get an error saying:

TypeError: a bytes-like object is required, not 'str'

And various variations of this one, regardless of what I pass into the reader/writer (apart from unicode_escape):

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position

Am I missing some obvious solution to this? Been stuck for about 25 hours of active research.
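For what it's worth, the likely reason the \n survived both attempts above: Python strings are immutable, so str.replace() returns a new string rather than modifying the original, and in both snippets the return value is discarded. A minimal sketch:

```python
# str.replace() never modifies a string in place -- strings are immutable.
text = "123\n\n456\n"
text.replace("\n", "")            # return value discarded; text is unchanged
assert "\n" in text

# Reassigning the result is what was missing; combining it with
# splitlines() also drops the empty lines between data rows:
cleaned = " ".join(line for line in text.splitlines() if line.strip())
print(cleaned)  # → '123 456'
```

The same applies inside the file-reading loop: `line = line.replace("\n", "")` does rebind the loop variable, but nothing is ever done with it afterwards, so the cleaned lines would still need to be collected (e.g. appended to a list) or written back out.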