Oct-10-2019, 08:15 PM
Hi guys!
My script reads PDFs using pytesseract OCR.
My goal is to extract text from PDFs in various formats, pull certain numbers out of that text, and reuse that data elsewhere.
I tried a few PDF reader modules, but since my PDFs come in several different formats, they weren't accurate enough.

The main code. I don't fully understand half of what I've done here, but it works.

```python
import io
import pytesseract
from PIL import Image
from wand.image import Image as wi
import csv

# Render every page of the PDF to a PNG blob.
pdf = wi(filename="asker2.pdf", resolution=200)
pdfImage = pdf.convert('png')
imageBlobs = []
for img in pdfImage.sequence:
    imgPage = wi(image=img)
    imageBlobs.append(imgPage.make_blob('png'))

# Run OCR on each page image and collect the text.
recognized_text = []
for imgBlob in imageBlobs:
    im = Image.open(io.BytesIO(imgBlob))
    text = pytesseract.image_to_string(im, lang='nor', config='--psm 1')
    recognized_text.append(text)
```

My problems:
The output from my pytesseract OCR somehow ends up as one line, both in the list and in the "text" object. When I write it to .txt or .csv it looks like several lines, but apparently it isn't (??). When I load the file or the output object (text) into a pandas DataFrame, I get 1 column with the entire text extraction inside it, which makes it hard to sort the data.

Pandas error I get when not using encoding="unicode_escape":

```text
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 83, saw 2
```

Here's some of the stuff I've tried to work around it:

```python
# Attempting to reduce the amount of data
removed_char = text.translate({ord(i): None for i in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ/ø()'})

# Writing the output data to a text file, with unicode_escape because pandas can't read my data otherwise.
with open('output1.txt', 'w', encoding="unicode_escape", newline='') as f:
    f.write(removed_char)
```

The output of the removed_char object is either loads of empty rows between each row containing data, or one long row with loads of \n\n\n\n everywhere.

I tried to remove the \n doing this:

```python
text.replace('\n', '')
```

```python
with open("output1.txt", 'r', newline=None) as fd:
    for line in fd:
        line = line.replace("\n", "")
```

The \n is still everywhere, or there are huge gaps of lines with no data.

I tried some variations of this using 'rb' and 'wb', but then I get an error saying:

```text
TypeError: a bytes-like object is required, not 'str'
```

And various variations of this one, regardless of what I pass into the reader/writer (apart from unicode_escape):

```text
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xda in position
```

Am I missing some obvious solution to this? I've been stuck for about 25 hours of active research.
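
For anyone reading along, here is a minimal sketch of what seems to be going on with the newlines, assuming the OCR result is an ordinary Python str with embedded newline characters. The sample text and the helper name `ocr_text_to_lines` are made up for illustration and are not part of the script in this post. It shows three things: `str.replace` returns a new string rather than changing the original, `str.splitlines` gives one list item per text line, and the `unicode_escape` codec turns a real newline into a literal backslash followed by "n", which would explain the "one long row" in the written file.

```python
# Hypothetical stand-in for one page of OCR output: a single str with
# embedded newlines and blank lines (not real data from the post).
sample_text = "Faktura 1001\n\nBeløp 250,00\n \nSum 312,50\n"


def ocr_text_to_lines(text):
    """Split a block of text into a list of stripped, non-empty lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Strings are immutable: replace() returns a NEW string, so calling it
# without assigning the result leaves sample_text untouched.
sample_text.replace("\n", "")
unchanged = "\n" in sample_text  # still True

# One list item per text line, blank lines dropped.
lines = ocr_text_to_lines(sample_text)
print(lines)  # -> ['Faktura 1001', 'Beløp 250,00', 'Sum 312,50']

# unicode_escape writes a newline as the two characters "\" and "n",
# and non-ASCII letters such as "ø" as escape sequences.
escaped = "a\nb".encode("unicode_escape")
print(escaped)  # -> b'a\\nb'
```

With the text split into a list like this, each line can become its own row later on, and writing with a regular text encoding such as UTF-8 instead of `unicode_escape` should keep the line breaks as real line breaks. Whether that also clears up the pandas parser error depends on the separators in the actual data, which this sketch does not cover.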