Export data from PDF as tabular format

zinho · (This post was last modified: Nov-08-2023, 09:28 PM by zinho.)

Hi.

I would like exporta data from PDF files, but I need in tabula format.

import io
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import re

#Insumo, Quantidade, Unid., Preço unit., Preço final

def extract_text_from_pdf(pdf_path):
    # Convert PDF to image
    pages = convert_from_path(pdf_path, 500)
     
    # Extract text from each page using Tesseract OCR
    text_data = ''
    for page in pages:
        text = pytesseract.image_to_string(page)
        #text_data += text 
        text_data += text + '\n'
     
    # Return the text data
    return text_data
 
text = extract_text_from_pdf('1.pdf')
# extract main string
result = re.findall(r'Insumo(.*?)Cond. pagamento', text,re.DOTALL|re.MULTILINE)
rst = list(result)
df = pd.DataFrame(rst)
df.to_excel('output.xlsx')
#print(df)
print("Done!")

This is output (I won't this)

Output:
[' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n   \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']

I want this (think in worksheet in excel)

Output:Insumo	Quantidade	Unid.	Preço unit.	Preço final
4094 - FECHADURA ELETRONICA PARA PORTA DE ABRIR - FE 21150 S/ MACANETA	1	un	594,12	594,12
4565 - CONTROLE REMOTO XAC 400	2	un	19,17	38,34

Look here for my PDF file exemplo:
https://drive.google.com/file/d/1QnX4vv8...sp=sharing

Thaks

Pedroski55 · Nov-09-2023, 06:56 PM

Do you want all of 1.pdf?

Do you just want the small table in the pdf that has Insumo in the top left column?

Do all these tables always have the same format?

Or do you want all of 1.pdf?

I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.

If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.

If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.

But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!

zinho · (This post was last modified: Nov-09-2023, 07:12 PM by zinho.)

(Nov-09-2023, 06:56 PM)Pedroski55 Wrote: Do you want all of 1.pdf?

Do you just want the small table in the pdf that has Insumo in the top left column?

Do all these tables always have the same format?

Or do you want all of 1.pdf?

I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.

If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.

If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.

But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!

I need partial content of the file, look my image as exemplo.
https://drive.google.com/file/d/1ietJDMS...sp=sharing

I need get this five fileds(Insumo, Quantidade, Unid., Preço unit., Preço final) and values of it.
When my code extract data, is mess data, I would like organized as tabular format, to put into spreadsheet

Pedroski55 · Nov-10-2023, 09:07 AM

Well, I don't know if this helps you, but you can give tesseract a cropped image of each column you want, by getting the crop coordinates for each column.

Then, when you OCR the image, the text you get will all be in 1 column. I do something similar for my little multi-choice marking system. I have to set up the crop coordinates when I make the answer form, but only 1 time, then I can mark hundreds or thousands of answer forms.

I don't know if all your Ordem de Compra PDFs will have the same format. If they do, that makes life easy! But probably not! This is not worth doing if every PDF is different!

You would need to have 5 sets of crop coordinates for 5 columns. Only Insumo has more than 1 line of text.

But if all the PDFs are different, cropping like this won't help because you would need to get the coordinates for each PDF! Confused

Anyway, maybe this will give you some ideas!

I find the crop coordinates (80, 1100, 1580, 1300) cuts out just the table you want! If you save that image, at least you will OCR a bit faster.

import os
import pdf2image
from PIL import Image, ImageOps

path2pdf = '/home/pedro/pdfs/pdfs/'
path2text = '/home/pedro/temp/pdfs2text/'
path2jpg = '/home/pedro/pdfs/pdf2jpg/'

def junkjpgs(path):
    print('Clearing out the folders we use, in case there is anything in there ... ')
    pics = os.listdir(path)
    if len(pics) == 0:
        print('Nothing in ' + path + '\n\n')
        return
    for file in pics:
        os.remove(path + file)
    print('ALL files removed from: ' + path + '\n\n')
    
def splitPDF(aPDF, source, destination, savename):
    print(f'Splitting {source + aPDF} to individual jpgs ... ')    
    # images is a list
    images = pdf2image.convert_from_path(source + aPDF)
    for i in range(len(images)):        
        images[i].save(destination + savename+ '_' + str(i+1) + '.jpg', 'JPEG')
        #image.save(destination + str(studinr) + '.jpg', 'JPEG')        
    print(f'{source + aPDF} split to .jpgs and all saved in: {destination}')

def cropImage(path2jpg):
    print('running cropImage() ... \n')
    print('\n you need crop coordinates for each column \n')
    # get 1 of the 1 page pdfs
    one_page_jpgs = os.listdir(path2jpg)
    one_page_jpgs.sort()
    # Opens an image in RGB mode 
    im = Image.open(path2jpg + one_page_jpgs[0] )      
    # Size of the image in pixels (size of orginal image) 
    # Just for information
    width, height = im.size
    print('This image is: ', width, ' wide and ', height, ' high')      
    accept = ['yes']
    answer = 'X'
    while not answer in accept:
        print('enter the crop image top left and bottom right coordinates.')
        corners = input('Enter your 4 numbers, separated by a space ... ')
        # later should be able to use fixed values for the y coordinate because it won't change
        # work on that!!
        coords = corners.split()
        for i in range(len(coords)):
            coords[i] = int(coords[i])
        # Cropped image of above dimension 
        # (It will not change orginal image) 
        im1 = im.crop((coords[0], coords[1], coords[2], coords[3])) 
          
        # Shows the image in image viewer 
        im1.show()
        print('Do you accept this image')
        answer = input('Enter yes to accept, anything else to try again ')
        if answer == 'yes':            
            print('image crop coordinates saved to tup ... ')
            return (coords[0], coords[1], coords[2], coords[3])

def getCoords():
    column_coords = []
    print('Now we will get the image crop coordinates')
    print('How many columns are there?')
    cols = input('Just enter a number like 1 or 2 or 3 or 4 or more ... ')    
    for i in range(int(cols)):         
        tup = cropImage(path2jpg)
        column_coords.append(tup)
    return column_coords

if __name__ == '__main__':
    filename = input('Enter the name of the PDF file you want to process ... ')
    name = filename.split('.')
    savename = name[0]
    splitPDF(filename, path2pdf, path2jpg, savename)
    # get the crop coordinates
    coords = getCoords()
    for tup in coords:
        print('The column coordinates are', tup)
    # now use the crop coordinates to make a small image of each column you want and OCR that

But this Ordem de Compra did not begin life as an image PDF, it was an Office document!

It would be much easier to get what you want from the original document! Where is that??

zinho · (This post was last modified: Nov-10-2023, 05:12 PM by zinho.)

Hi Pedro.
THis is very far from my knowledge of Python3

I was thinking it's more easy use regex with output below, to get data e insert into Excel worksheet.

Output:
' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n   \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']

But thank you any way.

Pedroski55 · (This post was last modified: Nov-11-2023, 08:23 AM by Pedroski55.)

Hi again! I understand my approach may not be useful in this situation.

I am not good with regex!

Just as a test, I did this:

First, I used cropImage(path2jpg) to get the first column, Insumo and saved it as temp.jpg

Then I used the function below:

def convert2text(name):
    # only 1 jpg now
    jpgFile = path2tempjpg + 'temp.jpg'    
    with open(path2text + name, 'a') as this_text:
        # this works fine
        porText = pytesseract.image_to_string(Image.open(jpgFile), lang='por')
        this_text.write(porText)
    print('removing the jpgs ... ')
    junkjpgs(path2tempjpg)
    print('finished this image ... ')

This gives:

Output:Insumo

 

4094 - FECHADURA
ELETRÔNICA

PARA PORTA DE ABRIR - FE
21150 S/ MAÇANETA

 

4565 - CONTROLE REMOTO
XAC 4000

But like I said, this is only useful if all the PDFs have the same format, because you need the exact coordinates for cropping the image.

Possibly Related Threads…
Thread		Author	Replies	Views	Last Post
	[SOLVED] [datetime.strptime] ValueError: time data 'foo' does not match format 'bar'	Winfried	1	1,169	Jan-02-2025, 02:09 AM Last Post: lyly19
	How to properly format rows and columns in excel data from parsed .txt blocks	jh67	7	3,744	Dec-12-2022, 08:22 PM Last Post: jh67
	BCP Export sql data to csv	mg24	2	1,816	Nov-20-2022, 11:45 AM Last Post: Pedroski55
	Issue in changing data format (2 bytes) into a 16 bit data.	GiggsB	11	4,913	Jul-25-2022, 03:19 PM Last Post: deanhystad
	How to keep columns header on excel without change after export data to excel file?	ahmedbarbary	0	1,742	May-03-2022, 05:46 PM Last Post: ahmedbarbary
	Need Help writing data into Excel format	ajitnayak87	8	3,960	Feb-04-2022, 03:00 AM Last Post: Jeff_t
	Set 'Time' format cell when writing data to excel and not 'custom'	limors	3	8,611	Mar-29-2021, 09:36 PM Last Post: Larz60+
	tabula-py, how to preserve a read_pdf() format and export to csv	abcoelho	2	4,816	Mar-24-2021, 08:34 PM Last Post: abcoelho
	ValueError: time data 'None' does not match format '%Y-%m-%dT%H:%M:%S.%f'	rajesh3383	4	21,694	Sep-03-2020, 08:22 PM Last Post: buran
	Issue accessing data from Dictionary/List in the right format	LuisSatch	2	2,976	Jul-25-2020, 06:12 AM Last Post: LuisSatch

Export data from PDF as tabular format

User Panel Messages

Announcements