Posts: 58
Threads: 20
Joined: Jan 2017
Nov-08-2023, 09:28 PM
(This post was last modified: Nov-08-2023, 09:28 PM by zinho.)
Hi.
I would like exporta data from PDF files, but I need in tabula format.
import io
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import re
#Insumo, Quantidade, Unid., Preço unit., Preço final
def extract_text_from_pdf(pdf_path):
# Convert PDF to image
pages = convert_from_path(pdf_path, 500)
# Extract text from each page using Tesseract OCR
text_data = ''
for page in pages:
text = pytesseract.image_to_string(page)
#text_data += text
text_data += text + '\n'
# Return the text data
return text_data
text = extract_text_from_pdf('1.pdf')
# extract main string
result = re.findall(r'Insumo(.*?)Cond. pagamento', text,re.DOTALL|re.MULTILINE)
rst = list(result)
df = pd.DataFrame(rst)
df.to_excel('output.xlsx')
#print(df)
print("Done!") This is output (I won't this)
Output: [' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']
I want this (think in worksheet in excel)
Output: Insumo Quantidade Unid. Preço unit. Preço final
4094 - FECHADURA ELETRONICA PARA PORTA DE ABRIR - FE 21150 S/ MACANETA 1 un 594,12 594,12
4565 - CONTROLE REMOTO XAC 400 2 un 19,17 38,34
Look here for my PDF file exemplo:
https://drive.google.com/file/d/1QnX4vv8...sp=sharing
Thaks
Posts: 1,091
Threads: 143
Joined: Jul 2017
Do you want all of 1.pdf?
Do you just want the small table in the pdf that has Insumo in the top left column?
Do all these tables always have the same format?
Or do you want all of 1.pdf?
I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.
If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.
If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.
But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!
Posts: 58
Threads: 20
Joined: Jan 2017
Nov-09-2023, 07:11 PM
(This post was last modified: Nov-09-2023, 07:12 PM by zinho.)
(Nov-09-2023, 06:56 PM)Pedroski55 Wrote: Do you want all of 1.pdf?
Do you just want the small table in the pdf that has Insumo in the top left column?
Do all these tables always have the same format?
Or do you want all of 1.pdf?
I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.
If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.
If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.
But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!
I need partial content of the file, look my image as exemplo.
https://drive.google.com/file/d/1ietJDMS...sp=sharing
I need get this five fileds(Insumo, Quantidade, Unid., Preço unit., Preço final) and values of it.
When my code extract data, is mess data, I would like organized as tabular format, to put into spreadsheet
Posts: 1,091
Threads: 143
Joined: Jul 2017
Well, I don't know if this helps you, but you can give tesseract a cropped image of each column you want, by getting the crop coordinates for each column.
Then, when you OCR the image, the text you get will all be in 1 column. I do something similar for my little multi-choice marking system. I have to set up the crop coordinates when I make the answer form, but only 1 time, then I can mark hundreds or thousands of answer forms.
I don't know if all your Ordem de Compra PDFs will have the same format. If they do, that makes life easy! But probably not! This is not worth doing if every PDF is different!
You would need to have 5 sets of crop coordinates for 5 columns. Only Insumo has more than 1 line of text.
But if all the PDFs are different, cropping like this won't help because you would need to get the coordinates for each PDF!
Anyway, maybe this will give you some ideas!
I find the crop coordinates (80, 1100, 1580, 1300) cuts out just the table you want! If you save that image, at least you will OCR a bit faster.
import os
import pdf2image
from PIL import Image, ImageOps
path2pdf = '/home/pedro/pdfs/pdfs/'
path2text = '/home/pedro/temp/pdfs2text/'
path2jpg = '/home/pedro/pdfs/pdf2jpg/'
def junkjpgs(path):
print('Clearing out the folders we use, in case there is anything in there ... ')
pics = os.listdir(path)
if len(pics) == 0:
print('Nothing in ' + path + '\n\n')
return
for file in pics:
os.remove(path + file)
print('ALL files removed from: ' + path + '\n\n')
def splitPDF(aPDF, source, destination, savename):
print(f'Splitting {source + aPDF} to individual jpgs ... ')
# images is a list
images = pdf2image.convert_from_path(source + aPDF)
for i in range(len(images)):
images[i].save(destination + savename+ '_' + str(i+1) + '.jpg', 'JPEG')
#image.save(destination + str(studinr) + '.jpg', 'JPEG')
print(f'{source + aPDF} split to .jpgs and all saved in: {destination}')
def cropImage(path2jpg):
print('running cropImage() ... \n')
print('\n you need crop coordinates for each column \n')
# get 1 of the 1 page pdfs
one_page_jpgs = os.listdir(path2jpg)
one_page_jpgs.sort()
# Opens an image in RGB mode
im = Image.open(path2jpg + one_page_jpgs[0] )
# Size of the image in pixels (size of orginal image)
# Just for information
width, height = im.size
print('This image is: ', width, ' wide and ', height, ' high')
accept = ['yes']
answer = 'X'
while not answer in accept:
print('enter the crop image top left and bottom right coordinates.')
corners = input('Enter your 4 numbers, separated by a space ... ')
# later should be able to use fixed values for the y coordinate because it won't change
# work on that!!
coords = corners.split()
for i in range(len(coords)):
coords[i] = int(coords[i])
# Cropped image of above dimension
# (It will not change orginal image)
im1 = im.crop((coords[0], coords[1], coords[2], coords[3]))
# Shows the image in image viewer
im1.show()
print('Do you accept this image')
answer = input('Enter yes to accept, anything else to try again ')
if answer == 'yes':
print('image crop coordinates saved to tup ... ')
return (coords[0], coords[1], coords[2], coords[3])
def getCoords():
column_coords = []
print('Now we will get the image crop coordinates')
print('How many columns are there?')
cols = input('Just enter a number like 1 or 2 or 3 or 4 or more ... ')
for i in range(int(cols)):
tup = cropImage(path2jpg)
column_coords.append(tup)
return column_coords
if __name__ == '__main__':
filename = input('Enter the name of the PDF file you want to process ... ')
name = filename.split('.')
savename = name[0]
splitPDF(filename, path2pdf, path2jpg, savename)
# get the crop coordinates
coords = getCoords()
for tup in coords:
print('The column coordinates are', tup)
# now use the crop coordinates to make a small image of each column you want and OCR that But this Ordem de Compra did not begin life as an image PDF, it was an Office document!
It would be much easier to get what you want from the original document! Where is that??
Posts: 58
Threads: 20
Joined: Jan 2017
Nov-10-2023, 05:11 PM
(This post was last modified: Nov-10-2023, 05:12 PM by zinho.)
Hi Pedro.
THis is very far from my knowledge of Python3
I was thinking it's more easy use regex with output below, to get data e insert into Excel worksheet.
Output: ' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']
But thank you any way.
Posts: 1,091
Threads: 143
Joined: Jul 2017
Nov-11-2023, 08:23 AM
(This post was last modified: Nov-11-2023, 08:23 AM by Pedroski55.)
Hi again! I understand my approach may not be useful in this situation.
I am not good with regex!
Just as a test, I did this:
First, I used cropImage(path2jpg) to get the first column, Insumo and saved it as temp.jpg
Then I used the function below:
def convert2text(name):
# only 1 jpg now
jpgFile = path2tempjpg + 'temp.jpg'
with open(path2text + name, 'a') as this_text:
# this works fine
porText = pytesseract.image_to_string(Image.open(jpgFile), lang='por')
this_text.write(porText)
print('removing the jpgs ... ')
junkjpgs(path2tempjpg)
print('finished this image ... ') This gives:
Output: Insumo
4094 - FECHADURA
ELETRÔNICA
PARA PORTA DE ABRIR - FE
21150 S/ MAÇANETA
4565 - CONTROLE REMOTO
XAC 4000
But like I said, this is only useful if all the PDFs have the same format, because you need the exact coordinates for cropping the image.
|