Python Forum
Export data from PDF as tabular format
#1
Hi.

I would like to export data from PDF files, but I need it in tabular format.
import re

import pandas as pd
import pytesseract
from pdf2image import convert_from_path

# Target columns: Insumo, Quantidade, Unid., Preço unit., Preço final

def extract_text_from_pdf(pdf_path):
    # Render each PDF page as an image at 500 DPI
    pages = convert_from_path(pdf_path, 500)

    # Run Tesseract OCR on each page, separating pages with a newline
    text_data = ''
    for page in pages:
        text_data += pytesseract.image_to_string(page) + '\n'

    return text_data
 
text = extract_text_from_pdf('1.pdf')

# Capture everything between the "Insumo" header and "Cond. pagamento"
result = re.findall(r'Insumo(.*?)Cond\. pagamento', text, re.DOTALL)

# findall returns a list of strings, so this DataFrame holds one block of text per row
df = pd.DataFrame(result)
df.to_excel('output.xlsx')
print("Done!")
This is the output (I don't want this):
Output:
[' Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n \n\n21150 S/ MACANETA\n4565 - CONTROLE REMOTO 19,1700 26/10/2023\nXAC 4000\n\n \n\n \n\n \n\n \n\n \n\n \n\n \n\n']
I want this (think of a worksheet in Excel):
Output:
Insumo | Quantidade | Unid. | Preço unit. | Preço final
4094 - FECHADURA ELETRONICA PARA PORTA DE ABRIR - FE 21150 S/ MACANETA | 1 | un | 594,12 | 594,12
4565 - CONTROLE REMOTO XAC 400 | 2 | un | 19,17 | 38,34
Here is my example PDF file:
https://drive.google.com/file/d/1QnX4vv8...sp=sharing

Thanks
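
For illustration, here is a rough sketch of one way to turn the OCR text shown above into one row per item. It rests on assumptions that are not confirmed by the PDF itself: that every item starts with a line like "<code> - <name>", that the lines following it continue the description, and that the first comma-decimal number on the item line is the unit price. The names `rows_from_ocr_text` and `to_float` are hypothetical helpers, not part of any library. Note that the quantity and unit values do not appear in the OCR output at all, so this sketch cannot recover them.

```python
import re

# Assumption: an item line starts with "<code> - ", e.g. "4094 - FECHADURA"
ITEM_START = re.compile(r'^\d+ - ')
# Assumption: numbers use a decimal comma, e.g. "594,1200" or "0,00"
NUMBER = re.compile(r'\d+,\d+')


def to_float(value):
    # Assumed convention: "." is a thousands separator, "," is the decimal mark
    return float(value.replace('.', '').replace(',', '.'))


def rows_from_ocr_text(block):
    rows = []
    for line in block.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank lines OCR leaves between fragments
        if ITEM_START.match(line):
            first = NUMBER.search(line)
            # The description is whatever precedes the first number on the line
            description = line[:first.start()].strip() if first else line
            price = to_float(first.group()) if first else None
            rows.append({'Insumo': description, 'Preço unit.': price})
        elif rows:
            # Any other line is treated as a continuation of the last item
            rows[-1]['Insumo'] += ' ' + line
        # Lines before the first item (the header row) are ignored
    return rows


# Sample taken from the OCR output in this post
sample = (
    " Quantidade|Unid. |Solicitagao Prego unit.| Desc(R$)} Desc(%) %Acr/|Preco final Dt. entrega\n"
    "4094 - FECHADURA 594,1200 0,00 0,00 0,00|594, 12 26/10/2023\n"
    "\nELETRONICA\nPARA PORTA DE ABRIR - FE\n\n \n\n"
    "21150 S/ MACANETA\n"
    "4565 - CONTROLE REMOTO 19,1700 26/10/2023\n"
    "XAC 4000\n"
)

rows = rows_from_ocr_text(sample)
for row in rows:
    print(row)
```

The resulting list of dictionaries could be passed to `pd.DataFrame(rows)` and then written with `to_excel`, which would give one spreadsheet row per item instead of a single cell of text. The missing quantity and unit columns would probably need a different extraction step, since plain OCR text appears to drop them.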


Messages In This Thread
Export data from PDF as tabular format - by zinho - Nov-08-2023, 09:28 PM
