Export data from PDF as tabular format - Printable Version +- Python Forum (https://python-forum.io) +-- Forum: Python Coding (https://python-forum.io/forum-7.html) +--- Forum: General Coding Help (https://python-forum.io/forum-8.html) +--- Thread: Export data from PDF as tabular format (/thread-41089.html) |
Export data from PDF as tabular format - zinho - Nov-08-2023 Hi. I would like exporta data from PDF files, but I need in tabula format. import io import pytesseract from pdf2image import convert_from_path import pandas as pd import re #Insumo, Quantidade, Unid., Preço unit., Preço final def extract_text_from_pdf(pdf_path): # Convert PDF to image pages = convert_from_path(pdf_path, 500) # Extract text from each page using Tesseract OCR text_data = '' for page in pages: text = pytesseract.image_to_string(page) #text_data += text text_data += text + '\n' # Return the text data return text_data text = extract_text_from_pdf('1.pdf') # extract main string result = re.findall(r'Insumo(.*?)Cond. pagamento', text,re.DOTALL|re.MULTILINE) rst = list(result) df = pd.DataFrame(rst) df.to_excel('output.xlsx') #print(df) print("Done!")This is output (I won't this) I want this (think in worksheet in excel) Look here for my PDF file exemplo:https://drive.google.com/file/d/1QnX4vv8EeiVtg9zEPENL-ssfuG-MTQvf/view?usp=sharing Thaks RE: Export data from PDF as tabular format - Pedroski55 - Nov-09-2023 Do you want all of 1.pdf? Do you just want the small table in the pdf that has Insumo in the top left column? Do all these tables always have the same format? Or do you want all of 1.pdf? I don't have Portuguese as a language in Tesseract, so I ran on English. I get most of the text, but putting it all neatly in a table will be hard. If all the pdfs have the same format, you can use Image to crop out each table and just work on 1 table at a time. But, because you have multilines in the column Insumo, it will get confusing. If all the tables always have the same dimensions, then you can use Image to crop out each column, extract the text and write to Excel. But I think the pdfs maybe have different page layouts and the tables therein will probably have different sizes! RE: Export data from PDF as tabular format - zinho - Nov-09-2023 (Nov-09-2023, 06:56 PM)Pedroski55 Wrote: Do you want all of 1.pdf? I need partial content of the file, look my image as exemplo. https://drive.google.com/file/d/1ietJDMSbQ3DpVZFSgb-y9H3aiy-Ex2ZH/view?usp=sharing I need get this five fileds(Insumo, Quantidade, Unid., Preço unit., Preço final) and values of it. When my code extract data, is mess data, I would like organized as tabular format, to put into spreadsheet RE: Export data from PDF as tabular format - Pedroski55 - Nov-10-2023 Well, I don't know if this helps you, but you can give tesseract a cropped image of each column you want, by getting the crop coordinates for each column. Then, when you OCR the image, the text you get will all be in 1 column. I do something similar for my little multi-choice marking system. I have to set up the crop coordinates when I make the answer form, but only 1 time, then I can mark hundreds or thousands of answer forms. I don't know if all your Ordem de Compra PDFs will have the same format. If they do, that makes life easy! But probably not! This is not worth doing if every PDF is different! You would need to have 5 sets of crop coordinates for 5 columns. Only Insumo has more than 1 line of text. But if all the PDFs are different, cropping like this won't help because you would need to get the coordinates for each PDF! Anyway, maybe this will give you some ideas! I find the crop coordinates (80, 1100, 1580, 1300) cuts out just the table you want! If you save that image, at least you will OCR a bit faster. import os import pdf2image from PIL import Image, ImageOps path2pdf = '/home/pedro/pdfs/pdfs/' path2text = '/home/pedro/temp/pdfs2text/' path2jpg = '/home/pedro/pdfs/pdf2jpg/' def junkjpgs(path): print('Clearing out the folders we use, in case there is anything in there ... ') pics = os.listdir(path) if len(pics) == 0: print('Nothing in ' + path + '\n\n') return for file in pics: os.remove(path + file) print('ALL files removed from: ' + path + '\n\n') def splitPDF(aPDF, source, destination, savename): print(f'Splitting {source + aPDF} to individual jpgs ... ') # images is a list images = pdf2image.convert_from_path(source + aPDF) for i in range(len(images)): images[i].save(destination + savename+ '_' + str(i+1) + '.jpg', 'JPEG') #image.save(destination + str(studinr) + '.jpg', 'JPEG') print(f'{source + aPDF} split to .jpgs and all saved in: {destination}') def cropImage(path2jpg): print('running cropImage() ... \n') print('\n you need crop coordinates for each column \n') # get 1 of the 1 page pdfs one_page_jpgs = os.listdir(path2jpg) one_page_jpgs.sort() # Opens an image in RGB mode im = Image.open(path2jpg + one_page_jpgs[0] ) # Size of the image in pixels (size of orginal image) # Just for information width, height = im.size print('This image is: ', width, ' wide and ', height, ' high') accept = ['yes'] answer = 'X' while not answer in accept: print('enter the crop image top left and bottom right coordinates.') corners = input('Enter your 4 numbers, separated by a space ... ') # later should be able to use fixed values for the y coordinate because it won't change # work on that!! coords = corners.split() for i in range(len(coords)): coords[i] = int(coords[i]) # Cropped image of above dimension # (It will not change orginal image) im1 = im.crop((coords[0], coords[1], coords[2], coords[3])) # Shows the image in image viewer im1.show() print('Do you accept this image') answer = input('Enter yes to accept, anything else to try again ') if answer == 'yes': print('image crop coordinates saved to tup ... ') return (coords[0], coords[1], coords[2], coords[3]) def getCoords(): column_coords = [] print('Now we will get the image crop coordinates') print('How many columns are there?') cols = input('Just enter a number like 1 or 2 or 3 or 4 or more ... ') for i in range(int(cols)): tup = cropImage(path2jpg) column_coords.append(tup) return column_coords if __name__ == '__main__': filename = input('Enter the name of the PDF file you want to process ... ') name = filename.split('.') savename = name[0] splitPDF(filename, path2pdf, path2jpg, savename) # get the crop coordinates coords = getCoords() for tup in coords: print('The column coordinates are', tup) # now use the crop coordinates to make a small image of each column you want and OCR thatBut this Ordem de Compra did not begin life as an image PDF, it was an Office document! It would be much easier to get what you want from the original document! Where is that?? RE: Export data from PDF as tabular format - zinho - Nov-10-2023 Hi Pedro. THis is very far from my knowledge of Python3 I was thinking it's more easy use regex with output below, to get data e insert into Excel worksheet. But thank you any way.
RE: Export data from PDF as tabular format - Pedroski55 - Nov-11-2023 Hi again! I understand my approach may not be useful in this situation. I am not good with regex! Just as a test, I did this: First, I used cropImage(path2jpg) to get the first column, Insumo and saved it as temp.jpg Then I used the function below: def convert2text(name): # only 1 jpg now jpgFile = path2tempjpg + 'temp.jpg' with open(path2text + name, 'a') as this_text: # this works fine porText = pytesseract.image_to_string(Image.open(jpgFile), lang='por') this_text.write(porText) print('removing the jpgs ... ') junkjpgs(path2tempjpg) print('finished this image ... ')This gives: But like I said, this is only useful if all the PDFs have the same format, because you need the exact coordinates for cropping the image.
|