Python Forum

Extracting table and table name from PDF
Hi All,

I am new to Python development. I am working on a project where we need to extract tables, along with their table names, from a PDF file.
I am currently using the camelot library in Python to extract the tables from the PDF, but I am not sure how to extract the table names.
Is there a library or approach for this? Any help is much appreciated.
I haven't used this one, but looks promising: https://blog.chezo.uno/tabula-py-extract...7acfa5f302
(Mar-03-2020, 09:08 AM) Larz60+ wrote: I haven't used this one, but looks promising: https://blog.chezo.uno/tabula-py-extract...7acfa5f302

This library rarely extracts the table headings, and it sometimes picks up content from the ordinary body text as well.
Revisiting the thread for possible answers.

I am working on a project where we need to extract tables and their table names from a (complex) PDF file. I have tried TATR, YOLO-based models and many packages such as tabula, camelot, tablecv and more, with no success. Any support would be appreciated.
PDFs are tricky customers!

pdfminer is pretty good for extracting stuff from PDFs.

I find pymupdf is very good for getting stuff from PDFs.

If you provide a sample PDF, we can try to get what you want. The PDF used below, a.pdf, is from another question here. The extracted text is weird, which has something to do with how the PDF stores its data internally. I have not yet been able to get the proper Cyrillic text, although the PDF opens and displays fine. Somehow the PDF viewer knows how to translate this weird text to Russian! If you know how, please tell me!

When I just want to get a few pages out of a PDF, I use PyPDF2.
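For the "few pages" case, something along these lines should work. It is only a sketch and assumes PyPDF2 2.x or later, where the classes are called PdfReader and PdfWriter. The function names valid_page_indices and copy_pages, and the file names in the usage line, are made up for illustration.

```python
def valid_page_indices(requested, page_count):
    """Keep only the zero-based indices that exist in a PDF with page_count pages."""
    return [i for i in requested if 0 <= i < page_count]


def copy_pages(src_path, dst_path, requested):
    """Write the requested pages of src_path into a new PDF at dst_path."""
    # Imported here so the helper above works without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for i in valid_page_indices(requested, len(reader.pages)):
        writer.add_page(reader.pages[i])
    with open(dst_path, "wb") as fh:
        writer.write(fh)
```

Usage would be something like copy_pages('big.pdf', 'few_pages.pdf', [0, 2, 5]). Indices that do not exist in the source PDF are silently skipped.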

import pandas as pd
import pymupdf  # aka fitz

# from a recent question
# Russian-language PDF, but the extracted text is weird, don't know why yet
# only 1 page and 1 table
path2pdf = '/home/pedro/Downloads/a.pdf'
savepath = 'temp/'  # this folder must already exist
# open the PDF
doc = pymupdf.open(path2pdf)
# get info
num_pages = doc.page_count  # here only 1 page
# get any single page by number, starting with zero
page1 = doc[0]  # doc.pages() would give a generator instead
# collect the tables from every page, not just the last one
tables = []
for page in doc:
    tabs = page.find_tables()  # a pymupdf.table.TableFinder object
    tables.extend(tabs.tables)
# a table can be put in a DataFrame, and from there exported to Excel
# but this text is weird, don't know why, YET!
df = pd.DataFrame(tables[0].extract())
# export with index, or set index=False for no index in the result
df.to_excel(savepath + "weird_russian.xlsx", index=False)
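As for the table name itself, I don't know of a library call that returns it directly. One heuristic you could try is to take the text block sitting just above each table's bounding box. The sketch below assumes the title is printed above the table (captions are sometimes below, which this misses) and within a made-up cut-off of 40 points. The names title_above, tables_with_titles and max_gap are mine, not part of pymupdf.

```python
def title_above(table_bbox, blocks, max_gap=40.0):
    """Return the text of the closest block that ends above the table, or None.

    table_bbox: (x0, y0, x1, y1) of the table, y growing downwards.
    blocks: tuples starting with (x0, y0, x1, y1, text), which is the
            layout page.get_text("blocks") returns in pymupdf.
    max_gap: hypothetical cut-off, in points, between block and table.
    """
    tx0, ty0, tx1, _ = table_bbox
    best_text, best_gap = None, None
    for x0, y0, x1, y1, text, *_ in blocks:
        gap = ty0 - y1  # distance from the block's bottom to the table's top
        overlaps = x0 < tx1 and x1 > tx0  # shares some horizontal extent
        if 0 <= gap <= max_gap and overlaps and text.strip():
            if best_gap is None or gap < best_gap:
                best_text, best_gap = text.strip(), gap
    return best_text


def tables_with_titles(pdf_path):
    """Yield (page_number, guessed_title, rows) for every table found."""
    import pymupdf  # imported lazily, only this function needs it

    doc = pymupdf.open(pdf_path)
    for page in doc:
        blocks = page.get_text("blocks")
        for tab in page.find_tables().tables:
            yield page.number, title_above(tab.bbox, blocks), tab.extract()
```

You would loop over tables_with_titles(path2pdf) and check by eye whether the guessed titles make sense for your PDFs. Expect to tune max_gap, and the guess will be None when nothing sits close enough above the table.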