Nov-11-2023, 08:23 AM
(This post was last modified: Nov-11-2023, 08:23 AM by Pedroski55.)
Hi again! I understand my approach may not be useful in this situation.
I am not good with regex!
Just as a test, I did this:
First, I used cropImage(path2jpg) to get the first column, Insumo, and saved it as temp.jpg.
Then I used the function below:
def convert2text(name):
    # only 1 jpg now
    jpgFile = path2tempjpg + 'temp.jpg'
    with open(path2text + name, 'a') as this_text:
        # this works fine
        porText = pytesseract.image_to_string(Image.open(jpgFile), lang='por')
        this_text.write(porText)
    print('removing the jpgs ... ')
    junkjpgs(path2tempjpg)
    print('finished this image ... ')

This gives:
Output:
Insumo
4094 - FECHADURA
ELETRÔNICA
PARA PORTA DE ABRIR - FE
21150 S/ MAÇANETA
4565 - CONTROLE REMOTO
XAC 4000
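Tesseract wraps each item over several lines, so the text still needs stitching back together. Here is a rough sketch of one way to do that. It assumes every new item starts with a numeric code followed by " - " (as in the output above), which may not hold for every PDF. The names merge_wrapped_lines and ITEM_START are made up for this example.

```python
import re

# Assumption: a new item starts with a numeric code followed by " - ",
# e.g. "4094 - FECHADURA". Everything else is treated as a continuation.
ITEM_START = re.compile(r"^\d+\s+-\s+")


def merge_wrapped_lines(ocr_text):
    """Join OCR lines that belong to the same item into one line per item."""
    items = []
    for line in ocr_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if ITEM_START.match(line):
            items.append(line)  # start of a new item
        elif items:
            items[-1] += " " + line  # continuation of the previous item
        # lines before the first item (the "Insumo" header) are dropped
    return items


sample = """Insumo
4094 - FECHADURA
ELETRÔNICA
PARA PORTA DE ABRIR - FE
21150 S/ MAÇANETA
4565 - CONTROLE REMOTO
XAC 4000"""

for item in merge_wrapped_lines(sample):
    print(item)
```

A continuation line that happens to start with a number and " - " would be mistaken for a new item, so the pattern may need adjusting for other data.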
But like I said, the cropping approach is only useful if all the PDFs have the same layout, because you need the exact coordinates for cropping the image.
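To show what I mean by exact coordinates, here is a minimal sketch of a fixed-box crop with Pillow. This is not my actual cropImage function. The name crop_column and the numbers in INSUMO_BOX are invented for illustration, and you would have to measure the real box on your own scans.

```python
from PIL import Image

# Hypothetical crop box (left, upper, right, lower) in pixels.
# These numbers are placeholders, not measured from any real scan.
INSUMO_BOX = (0, 0, 300, 1000)


def crop_column(page, box=INSUMO_BOX):
    """Return the part of `page` inside `box` as a new image."""
    return page.crop(box)


# A blank stand-in page so the sketch runs without a real scan.
page = Image.new("RGB", (1200, 1000), "white")
column = crop_column(page)
print(column.size)
```

With a real scan you would open the file with Image.open() and save the result with column.save(). If the column sits somewhere else on another PDF, the same box cuts out the wrong region, which is the weak point of this approach.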