Hi.
I would like to export data from PDF files, but I need it in tabular format.
import pytesseract
from pdf2image import convert_from_path
import pandas as pd
import re

# Insumo, Quantidade, Unid., Preço unit., Preço final

def extract_text_from_pdf(pdf_path):
    # Convert each PDF page to an image (500 dpi)
    pages = convert_from_path(pdf_path, 500)
    # Extract text from each page using Tesseract OCR
    text_data = ''
    for page in pages:
        text = pytesseract.image_to_string(page)
        text_data += text + '\n'
    # Return the text data
    return text_data

text = extract_text_from_pdf('1.pdf')

# Extract the main block between the "Insumo" header and "Cond. pagamento"
result = re.findall(r'Insumo(.*?)Cond. pagamento', text, re.DOTALL | re.MULTILINE)
rst = list(result)
df = pd.DataFrame(rst)
df.to_excel('output.xlsx')
#print(df)
print("Done!")

This is the output I get (I don't want this):
Output:[' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']
I want this instead (think of a worksheet in Excel):
Output:
Insumo | Quantidade | Unid. | Preço unit. | Preço final
4094 - FECHADURA ELETRONICA PARA PORTA DE ABRIR - FE 21150 S/ MACANETA | 1 | un | 594,12 | 594,12
4565 - CONTROLE REMOTO XAC 400 | 2 | un | 19,17 | 38,34
Here is my example PDF file: https://drive.google.com/file/d/1QnX4vv8...sp=sharing
Thanks
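To make the goal concrete, here is a rough sketch of how the OCR text above could be grouped into one row per item. It assumes that every item starts with a line of the form "code - name" and that the lines following it, up to the next item, continue the description. The names parse_items, to_float, insumo and preco_unit are made up for this illustration. The quantity and unit do not appear in the OCR text shown above, so this sketch cannot recover them.

```python
import re

# Assumption: an item starts with "<digits> - " at the beginning of a line.
ITEM_START = re.compile(r"^\d+ - ")
# A number with a decimal comma, e.g. "594,1200" or "1.234,56".
DECIMAL = re.compile(r"\d[\d.]*,\d+")


def to_float(text):
    """Convert a Brazilian-formatted number such as '594,1200' to a float."""
    return float(text.replace(".", "").replace(",", "."))


def parse_items(ocr_text):
    """Group OCR lines into one dict per item (hypothetical helper)."""
    rows = []
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines produced by the OCR
        if ITEM_START.match(line):
            # The first decimal number on the item line is taken as the unit price.
            m = DECIMAL.search(line)
            if m:
                desc, price = line[:m.start()].strip(), to_float(m.group())
            else:
                desc, price = line, None
            rows.append({"insumo": desc, "preco_unit": price})
        elif rows:
            # Any other line continues the description of the current item.
            rows[-1]["insumo"] += " " + line
        # Lines before the first item (the header row) are ignored.
    return rows


sample = (
    " Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n"
    "4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\n"
    "ELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n"
    "21150 S/ MACANETA\n"
    "4565 - CONTROLE REMOTO 19,1700 26/10/2023\n"
    "XAC 4000\n"
)

rows = parse_items(sample)
for row in rows:
    print(row)
```

The resulting list of dicts could be passed to pd.DataFrame(rows) and then written with to_excel, as in the script above. Getting the quantity and unit columns would probably need better OCR of those table cells first, since they are missing from the text Tesseract returned.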