Export data from PDF as tabular format

Export data from PDF as tabular format - Printable Version

+- Python Forum (https://python-forum.io)
+-- Forum: Python Coding (https://python-forum.io/forum-7.html)
+--- Forum: General Coding Help (https://python-forum.io/forum-8.html)
+--- Thread: Export data from PDF as tabular format (/thread-41089.html)

Export data from PDF as tabular format - zinho - Nov-08-2023

Hi.

I would like exporta data from PDF files, but I need in tabula format.

import io
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import re

#Insumo, Quantidade, Unid., Preço unit., Preço final

def extract_text_from_pdf(pdf_path):
    # Convert PDF to image
    pages = convert_from_path(pdf_path, 500)
     
    # Extract text from each page using Tesseract OCR
    text_data = ''
    for page in pages:
        text = pytesseract.image_to_string(page)
        #text_data += text 
        text_data += text + '\n'
     
    # Return the text data
    return text_data
 
text = extract_text_from_pdf('1.pdf')
# extract main string
result = re.findall(r'Insumo(.*?)Cond. pagamento', text,re.DOTALL|re.MULTILINE)
rst = list(result)
df = pd.DataFrame(rst)
df.to_excel('output.xlsx')
#print(df)
print("Done!")

This is output (I won't this)

Output:
[' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n   \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']

I want this (think in worksheet in excel)

Output:Insumo	Quantidade	Unid.	Preço unit.	Preço final
4094 - FECHADURA ELETRONICA PARA PORTA DE ABRIR - FE 21150 S/ MACANETA	1	un	594,12	594,12
4565 - CONTROLE REMOTO XAC 400	2	un	19,17	38,34

Look here for my PDF file exemplo:
https://drive.google.com/file/d/1QnX4vv8EeiVtg9zEPENL-ssfuG-MTQvf/view?usp=sharing

Thaks

RE: Export data from PDF as tabular format - Pedroski55 - Nov-09-2023

Do you want all of 1.pdf?

Do you just want the small table in the pdf that has Insumo in the top left column?

Do all these tables always have the same format?

Or do you want all of 1.pdf?

I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.

If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.

If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.

But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!

RE: Export data from PDF as tabular format - zinho - Nov-09-2023

(Nov-09-2023, 06:56 PM)Pedroski55 Wrote: Do you want all of 1.pdf?

Do you just want the small table in the pdf that has Insumo in the top left column?

Do all these tables always have the same format?

Or do you want all of 1.pdf?

I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.

If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.

If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.

But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!

I need partial content of the file, look my image as exemplo.
https://drive.google.com/file/d/1ietJDMSbQ3DpVZFSgb-y9H3aiy-Ex2ZH/view?usp=sharing

I need get this five fileds(Insumo, Quantidade, Unid., Preço unit., Preço final) and values of it.
When my code extract data, is mess data, I would like organized as tabular format, to put into spreadsheet

RE: Export data from PDF as tabular format - Pedroski55 - Nov-10-2023

Well, I don't know if this helps you, but you can give tesseract a cropped image of each column you want, by getting the crop coordinates for each column.

Then, when you OCR the image, the text you get will all be in 1 column. I do something similar for my little multi-choice marking system. I have to set up the crop coordinates when I make the answer form, but only 1 time, then I can mark hundreds or thousands of answer forms.

I don't know if all your Ordem de Compra PDFs will have the same format. If they do, that makes life easy! But probably not! This is not worth doing if every PDF is different!

You would need to have 5 sets of crop coordinates for 5 columns. Only Insumo has more than 1 line of text.

But if all the PDFs are different, cropping like this won't help because you would need to get the coordinates for each PDF! Confused

Anyway, maybe this will give you some ideas!

I find the crop coordinates (80, 1100, 1580, 1300) cuts out just the table you want! If you save that image, at least you will OCR a bit faster.

import os
import pdf2image
from PIL import Image, ImageOps

path2pdf = '/home/pedro/pdfs/pdfs/'
path2text = '/home/pedro/temp/pdfs2text/'
path2jpg = '/home/pedro/pdfs/pdf2jpg/'

def junkjpgs(path):
    print('Clearing out the folders we use, in case there is anything in there ... ')
    pics = os.listdir(path)
    if len(pics) == 0:
        print('Nothing in ' + path + '\n\n')
        return
    for file in pics:
        os.remove(path + file)
    print('ALL files removed from: ' + path + '\n\n')
    
def splitPDF(aPDF, source, destination, savename):
    print(f'Splitting {source + aPDF} to individual jpgs ... ')    
    # images is a list
    images = pdf2image.convert_from_path(source + aPDF)
    for i in range(len(images)):        
        images[i].save(destination + savename+ '_' + str(i+1) + '.jpg', 'JPEG')
        #image.save(destination + str(studinr) + '.jpg', 'JPEG')        
    print(f'{source + aPDF} split to .jpgs and all saved in: {destination}')

def cropImage(path2jpg):
    print('running cropImage() ... \n')
    print('\n you need crop coordinates for each column \n')
    # get 1 of the 1 page pdfs
    one_page_jpgs = os.listdir(path2jpg)
    one_page_jpgs.sort()
    # Opens an image in RGB mode 
    im = Image.open(path2jpg + one_page_jpgs[0] )      
    # Size of the image in pixels (size of orginal image) 
    # Just for information
    width, height = im.size
    print('This image is: ', width, ' wide and ', height, ' high')      
    accept = ['yes']
    answer = 'X'
    while not answer in accept:
        print('enter the crop image top left and bottom right coordinates.')
        corners = input('Enter your 4 numbers, separated by a space ... ')
        # later should be able to use fixed values for the y coordinate because it won't change
        # work on that!!
        coords = corners.split()
        for i in range(len(coords)):
            coords[i] = int(coords[i])
        # Cropped image of above dimension 
        # (It will not change orginal image) 
        im1 = im.crop((coords[0], coords[1], coords[2], coords[3])) 
          
        # Shows the image in image viewer 
        im1.show()
        print('Do you accept this image')
        answer = input('Enter yes to accept, anything else to try again ')
        if answer == 'yes':            
            print('image crop coordinates saved to tup ... ')
            return (coords[0], coords[1], coords[2], coords[3])

def getCoords():
    column_coords = []
    print('Now we will get the image crop coordinates')
    print('How many columns are there?')
    cols = input('Just enter a number like 1 or 2 or 3 or 4 or more ... ')    
    for i in range(int(cols)):         
        tup = cropImage(path2jpg)
        column_coords.append(tup)
    return column_coords

if __name__ == '__main__':
    filename = input('Enter the name of the PDF file you want to process ... ')
    name = filename.split('.')
    savename = name[0]
    splitPDF(filename, path2pdf, path2jpg, savename)
    # get the crop coordinates
    coords = getCoords()
    for tup in coords:
        print('The column coordinates are', tup)
    # now use the crop coordinates to make a small image of each column you want and OCR that

But this Ordem de Compra did not begin life as an image PDF, it was an Office document!

It would be much easier to get what you want from the original document! Where is that??

RE: Export data from PDF as tabular format - zinho - Nov-10-2023

Hi Pedro.
THis is very far from my knowledge of Python3

I was thinking it's more easy use regex with output below, to get data e insert into Excel worksheet.

Output:
' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n   \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']

But thank you any way.

RE: Export data from PDF as tabular format - Pedroski55 - Nov-11-2023

Hi again! I understand my approach may not be useful in this situation.

I am not good with regex!

Just as a test, I did this:

First, I used cropImage(path2jpg) to get the first column, Insumo and saved it as temp.jpg

Then I used the function below:

def convert2text(name):
    # only 1 jpg now
    jpgFile = path2tempjpg + 'temp.jpg'    
    with open(path2text + name, 'a') as this_text:
        # this works fine
        porText = pytesseract.image_to_string(Image.open(jpgFile), lang='por')
        this_text.write(porText)
    print('removing the jpgs ... ')
    junkjpgs(path2tempjpg)
    print('finished this image ... ')

This gives:

Output:Insumo

 

4094 - FECHADURA
ELETRÔNICA

PARA PORTA DE ABRIR - FE
21150 S/ MAÇANETA

 

4565 - CONTROLE REMOTO
XAC 4000

But like I said, this is only useful if all the PDFs have the same format, because you need the exact coordinates for cropping the image.