Python Forum
Export data from PDF as tabular format
Thread Rating:
  • 0 Vote(s) - 0 Average
  • 1
  • 2
  • 3
  • 4
  • 5
Export data from PDF as tabular format
#1
Hi.

I would like exporta data from PDF files, but I need in tabula format.
import io
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import re

#Insumo, Quantidade, Unid., Preço unit., Preço final

def extract_text_from_pdf(pdf_path):
    # Convert PDF to image
    pages = convert_from_path(pdf_path, 500)
     
    # Extract text from each page using Tesseract OCR
    text_data = ''
    for page in pages:
        text = pytesseract.image_to_string(page)
        #text_data += text 
        text_data += text + '\n'
     
    # Return the text data
    return text_data
 
text = extract_text_from_pdf('1.pdf')
# extract main string
result = re.findall(r'Insumo(.*?)Cond. pagamento', text,re.DOTALL|re.MULTILINE)
rst = list(result)
df = pd.DataFrame(rst)
df.to_excel('output.xlsx')
#print(df)
print("Done!")
This is output (I won't this)
Output:
[' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']
I want this (think in worksheet in excel)
Output:
Insumo Quantidade Unid. Preço unit. Preço final 4094 - FECHADURA ELETRONICA PARA PORTA DE ABRIR - FE 21150 S/ MACANETA 1 un 594,12 594,12 4565 - CONTROLE REMOTO XAC 400 2 un 19,17 38,34
Look here for my PDF file exemplo:
https://drive.google.com/file/d/1QnX4vv8...sp=sharing

Thaks
Reply
#2
Do you want all of 1.pdf?

Do you just want the small table in the pdf that has Insumo in the top left column?

Do all these tables always have the same format?

Or do you want all of 1.pdf?

I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.

If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.

If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.

But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!
Reply
#3
(Nov-09-2023, 06:56 PM)Pedroski55 Wrote: Do you want all of 1.pdf?

Do you just want the small table in the pdf that has Insumo in the top left column?

Do all these tables always have the same format?

Or do you want all of 1.pdf?

I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard.

If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing.

If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel.

But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes!

I need partial content of the file, look my image as exemplo.
https://drive.google.com/file/d/1ietJDMS...sp=sharing

I need get this five fileds(Insumo, Quantidade, Unid., Preço unit., Preço final) and values of it.
When my code extract data, is mess data, I would like organized as tabular format, to put into spreadsheet
Reply
#4
Well, I don't know if this helps you, but you can give tesseract a cropped image of each column you want, by getting the crop coordinates for each column.

Then, when you OCR the image, the text you get will all be in 1 column. I do something similar for my little multi-choice marking system. I have to set up the crop coordinates when I make the answer form, but only 1 time, then I can mark hundreds or thousands of answer forms.

I don't know if all your Ordem de Compra PDFs will have the same format. If they do, that makes life easy! But probably not! This is not worth doing if every PDF is different!

You would need to have 5 sets of crop coordinates for 5 columns. Only Insumo has more than 1 line of text.

But if all the PDFs are different, cropping like this won't help because you would need to get the coordinates for each PDF! Confused

Anyway, maybe this will give you some ideas!

I find the crop coordinates (80, 1100, 1580, 1300) cuts out just the table you want! If you save that image, at least you will OCR a bit faster.

import os
import pdf2image
from PIL import Image, ImageOps

path2pdf = '/home/pedro/pdfs/pdfs/'
path2text = '/home/pedro/temp/pdfs2text/'
path2jpg = '/home/pedro/pdfs/pdf2jpg/'

def junkjpgs(path):
    print('Clearing out the folders we use, in case there is anything in there ... ')
    pics = os.listdir(path)
    if len(pics) == 0:
        print('Nothing in ' + path + '\n\n')
        return
    for file in pics:
        os.remove(path + file)
    print('ALL files removed from: ' + path + '\n\n')
    
def splitPDF(aPDF, source, destination, savename):
    print(f'Splitting {source + aPDF} to individual jpgs ... ')    
    # images is a list
    images = pdf2image.convert_from_path(source + aPDF)
    for i in range(len(images)):        
        images[i].save(destination + savename+ '_' + str(i+1) + '.jpg', 'JPEG')
        #image.save(destination + str(studinr) + '.jpg', 'JPEG')        
    print(f'{source + aPDF} split to .jpgs and all saved in: {destination}')

def cropImage(path2jpg):
    print('running cropImage() ... \n')
    print('\n you need crop coordinates for each column \n')
    # get 1 of the 1 page pdfs
    one_page_jpgs = os.listdir(path2jpg)
    one_page_jpgs.sort()
    # Opens an image in RGB mode 
    im = Image.open(path2jpg + one_page_jpgs[0] )      
    # Size of the image in pixels (size of orginal image) 
    # Just for information
    width, height = im.size
    print('This image is: ', width, ' wide and ', height, ' high')      
    accept = ['yes']
    answer = 'X'
    while not answer in accept:
        print('enter the crop image top left and bottom right coordinates.')
        corners = input('Enter your 4 numbers, separated by a space ... ')
        # later should be able to use fixed values for the y coordinate because it won't change
        # work on that!!
        coords = corners.split()
        for i in range(len(coords)):
            coords[i] = int(coords[i])
        # Cropped image of above dimension 
        # (It will not change orginal image) 
        im1 = im.crop((coords[0], coords[1], coords[2], coords[3])) 
          
        # Shows the image in image viewer 
        im1.show()
        print('Do you accept this image')
        answer = input('Enter yes to accept, anything else to try again ')
        if answer == 'yes':            
            print('image crop coordinates saved to tup ... ')
            return (coords[0], coords[1], coords[2], coords[3])

def getCoords():
    column_coords = []
    print('Now we will get the image crop coordinates')
    print('How many columns are there?')
    cols = input('Just enter a number like 1 or 2 or 3 or 4 or more ... ')    
    for i in range(int(cols)):         
        tup = cropImage(path2jpg)
        column_coords.append(tup)
    return column_coords

if __name__ == '__main__':
    filename = input('Enter the name of the PDF file you want to process ... ')
    name = filename.split('.')
    savename = name[0]
    splitPDF(filename, path2pdf, path2jpg, savename)
    # get the crop coordinates
    coords = getCoords()
    for tup in coords:
        print('The column coordinates are', tup)
    # now use the crop coordinates to make a small image of each column you want and OCR that
But this Ordem de Compra did not begin life as an image PDF, it was an Office document!

It would be much easier to get what you want from the original document! Where is that??
zinho likes this post
Reply
#5
Hi Pedro.
THis is very far from my knowledge of Python3

I was thinking it's more easy use regex with output below, to get data e insert into Excel worksheet.

Output:
' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']
But thank you any way.
Reply
#6
Hi again! I understand my approach may not be useful in this situation.

I am not good with regex!

Just as a test, I did this:

First, I used cropImage(path2jpg) to get the first column, Insumo and saved it as temp.jpg

Then I used the function below:

def convert2text(name):
    # only 1 jpg now
    jpgFile = path2tempjpg + 'temp.jpg'    
    with open(path2text + name, 'a') as this_text:
        # this works fine
        porText = pytesseract.image_to_string(Image.open(jpgFile), lang='por')
        this_text.write(porText)
    print('removing the jpgs ... ')
    junkjpgs(path2tempjpg)
    print('finished this image ... ')
This gives:

Output:
Insumo 4094 - FECHADURA ELETRÔNICA PARA PORTA DE ABRIR - FE 21150 S/ MAÇANETA 4565 - CONTROLE REMOTO XAC 4000
But like I said, this is only useful if all the PDFs have the same format, because you need the exact coordinates for cropping the image.
Reply


Possibly Related Threads…
Thread Author Replies Views Last Post
  How to properly format rows and columns in excel data from parsed .txt blocks jh67 7 1,903 Dec-12-2022, 08:22 PM
Last Post: jh67
  BCP Export sql data to csv mg24 2 1,080 Nov-20-2022, 11:45 AM
Last Post: Pedroski55
  Issue in changing data format (2 bytes) into a 16 bit data. GiggsB 11 2,680 Jul-25-2022, 03:19 PM
Last Post: deanhystad
  How to keep columns header on excel without change after export data to excel file? ahmedbarbary 0 1,173 May-03-2022, 05:46 PM
Last Post: ahmedbarbary
  Need Help writing data into Excel format ajitnayak87 8 2,553 Feb-04-2022, 03:00 AM
Last Post: Jeff_t
Smile Set 'Time' format cell when writing data to excel and not 'custom' limors 3 6,324 Mar-29-2021, 09:36 PM
Last Post: Larz60+
  tabula-py, how to preserve a read_pdf() format and export to csv abcoelho 2 3,339 Mar-24-2021, 08:34 PM
Last Post: abcoelho
  ValueError: time data 'None' does not match format '%Y-%m-%dT%H:%M:%S.%f' rajesh3383 4 14,639 Sep-03-2020, 08:22 PM
Last Post: buran
  Issue accessing data from Dictionary/List in the right format LuisSatch 2 2,231 Jul-25-2020, 06:12 AM
Last Post: LuisSatch
  getting error ValueError: time data '' does not match format '%H:%M' srisrinu 2 5,619 Apr-09-2020, 11:12 AM
Last Post: srisrinu

Forum Jump:

User Panel Messages

Announcements
Announcement #1 8/1/2020
Announcement #2 8/2/2020
Announcement #3 8/6/2020