Hi All,
I am new to Python development. I am working on a project where we need to extract the tables, together with their table names, from a PDF file.
I am currently using the camelot library in Python to extract the tables from the PDF, but I am not sure how to extract the table names.
Is there a library or approach that can do this? Any help is much appreciated.
Revisiting the thread for possible answers.
I am working on a project where we need to extract the tables, together with their table names, from a PDF file (a complex PDF). I have tried TATR, YOLO-based models, and many packages such as tabula, camelot, tablecv and more, without success. Any support would be appreciated.
PDFs are tricky customers!
pdfminer is pretty good for extracting stuff from PDFs.
I find pymupdf is very good for getting stuff from PDFs.
If you provide a sample PDF, we can try to get what you want. This PDF, a.pdf, is from another question here. The extracted text is weird, which is something to do with how the PDF stores data internally, but I have not yet been able to get the proper Cyrillic text, although the PDF opens and displays fine. Somehow the PDF viewer knows how to translate this weird text to Russian! If you know how, please tell me!
If I just want to get a few pages out of a PDF, I use PyPDF2.
import pandas as pd
import pymupdf # aka fitz
# from a recent question
# Russian language but extracted text is weird don't know why yet
# only 1 page and 1 table
path2pdf = '/home/pedro/Downloads/a.pdf'
savepath = 'temp/'
# get the PDF
doc = pymupdf.open(path2pdf)
# get info
num_pages = doc.page_count # here only 1 page
# get any page by number, starting with zero
page1 = doc[0] # doc.pages() returns a generator, indexing returns the Page itself
# find the tables on each page of the PDF
for page in doc:
    tabs = page.find_tables()
    # tabs is a TableFinder, e.g. <pymupdf.table.TableFinder object at 0x7ad03373a770>
# this PDF has only 1 page, so tabs now holds the tables from that page
# from a dataframe you can export to Excel
# but this text is weird, don't know why, YET!
df = pd.DataFrame(tabs[0].extract())
# export with index, or set index=False for no index in the result
df.to_excel(savepath + "weird_russian.xlsx", index=False)
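As for the table names: one approach that might work is to take each table's bounding box and look for the nearest text block just above it. Below is a minimal sketch, not a tested solution for your PDF. It assumes the title sits directly above the table and overlaps it horizontally. The names find_table_title, titles_for_pdf and max_gap are made up for this example; page.get_text("blocks"), page.find_tables() and table.bbox are the PyMuPDF calls it relies on.

```python
def find_table_title(table_bbox, text_blocks, max_gap=40.0):
    """Return the text of the closest text block that ends above the table, or None.

    table_bbox  : (x0, y0, x1, y1) of the table, y grows downwards as in PyMuPDF
    text_blocks : tuples shaped like page.get_text("blocks") items:
                  (x0, y0, x1, y1, text, block_no, block_type)
    max_gap     : hypothetical threshold (in points) for how far above the
                  table a title may sit; tune it for your documents
    """
    tx0, ty0, tx1, ty1 = table_bbox
    best_text, best_gap = None, None
    for x0, y0, x1, y1, text, *_ in text_blocks:
        text = text.strip()
        if not text:
            continue
        gap = ty0 - y1                      # distance from block bottom to table top
        overlaps = x0 < tx1 and x1 > tx0    # block shares some horizontal range with the table
        if overlaps and 0 <= gap <= max_gap:
            if best_gap is None or gap < best_gap:
                best_text, best_gap = text, gap
    return best_text


def titles_for_pdf(path):
    """Collect (page number, guessed title, table rows) for every table in the PDF."""
    import pymupdf  # imported here so find_table_title can be used without it

    results = []
    with pymupdf.open(path) as doc:
        for page in doc:
            # keep text blocks only (block_type 0), skip image blocks
            blocks = [b for b in page.get_text("blocks") if b[6] == 0]
            for table in page.find_tables().tables:
                title = find_table_title(table.bbox, blocks)
                results.append((page.number, title, table.extract()))
    return results
```

You would call titles_for_pdf(path2pdf) and check the guessed titles against the PDF by eye. If the titles sit below the tables, or a long way above them, the gap test needs changing, and for scanned PDFs with no text layer this will not find anything.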