Python Forum
Extracting table and table name from PDF
#1
Hi All,

I am new to Python development. I am working on a project where we need to extract tables and their names from a PDF file.
I am currently using the camelot library to extract the tables from the PDF, but I am not sure how to extract the table names.
Is there a library or approach for this? Any help is much appreciated.
Reply
#2
I haven't used this one, but looks promising: https://blog.chezo.uno/tabula-py-extract...7acfa5f302
Reply
#3
(Mar-03-2020, 09:08 AM)Larz60+ Wrote: I haven't used this one, but looks promising: https://blog.chezo.uno/tabula-py-extract...7acfa5f302

This library rarely extracts the table headings, and it sometimes picks up content from the surrounding body text as well.
Reply
#4
Revisiting the thread for possible answers.

I am working on a project where we need to extract tables and their names from a complex PDF file. I have tried TATR, YOLO-based models, and many packages such as tabula, camelot, and tablecv, with no success. Any help would be appreciated.
Reply
#5
PDFs are tricky customers!

pdfminer is pretty good for extracting stuff from PDFs.

I find pymupdf is very good for getting stuff from PDFs.

If you provide a sample PDF, we can try to get what you want. This PDF, a.pdf, is from another question here. The extracted text comes out garbled, which has something to do with how the PDF stores its text internally. I have not yet been able to recover the proper Cyrillic text, although the PDF opens and displays fine, so the PDF viewer evidently knows how to map this text to Russian. If you know how, please tell me!

If I just want to pull a few pages out of a PDF, I use PyPDF2.
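
For the "few pages" case, a minimal sketch might look like the one below. The helper names select_indices and copy_pages are made up for illustration, and the code assumes PyPDF2 2.x or later (the newer pypdf package exposes the same PdfReader and PdfWriter names).

```python
def select_indices(wanted, page_count):
    """Turn 1-based page numbers into 0-based indices, dropping out-of-range ones."""
    return [n - 1 for n in wanted if 1 <= n <= page_count]


def copy_pages(src_path, dst_path, wanted):
    """Write the pages listed in `wanted` (1-based) from src_path to dst_path."""
    # imported here so select_indices can be used without the library installed
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in select_indices(wanted, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(dst_path, "wb") as handle:
        writer.write(handle)
```

Calling copy_pages('in.pdf', 'out.pdf', [1, 3]) would copy the first and third pages; both file names are placeholders.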

import pandas as pd
import pymupdf  # formerly imported as fitz

# from a recent question
# Russian-language PDF, but the extracted text is garbled, don't know why yet
# only 1 page and 1 table
path2pdf = '/home/pedro/Downloads/a.pdf'
savepath = 'temp/'
# open the PDF
doc = pymupdf.open(path2pdf)
# get info
num_pages = doc.page_count  # here only 1 page
# get any page by number, starting with zero
page1 = doc[0]  # doc.pages(0) would return a generator, indexing returns the page
# collect all tables in the PDF, page by page
tables = []
for page in doc:
    tabs = page.find_tables()  # a pymupdf.table.TableFinder object
    tables.extend(tabs.tables)
# extract() returns a list of rows, which pandas turns into a DataFrame
# from the DataFrame you can export to Excel
# but this text is still garbled, don't know why, YET!
df = pd.DataFrame(tables[0].extract())
# export to Excel; index=False leaves the row index out of the result
df.to_excel(savepath + "weird_russian.xlsx", index=False)
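
As for the table name itself, I don't know of a library call that returns it directly. One thing that might work is a position-based guess: take the bounding box of each table and look for the nearest text block that ends just above it. The sketch below assumes the caption sits above the table and within max_gap points of it, which will not hold for every PDF. The names find_caption, tables_with_captions and max_gap are made up for this example.

```python
def find_caption(table_bbox, text_blocks, max_gap=40.0):
    """Return the text of the closest block ending above the table, or None.

    table_bbox is (x0, y0, x1, y1); text_blocks holds (x0, y0, x1, y1, text).
    Coordinates follow pymupdf, where y grows downwards.
    """
    x0, top, x1, _ = table_bbox
    best_text, best_gap = None, max_gap
    for bx0, by0, bx1, by1, text in text_blocks:
        gap = top - by1                    # distance from block bottom to table top
        overlaps = bx0 < x1 and bx1 > x0   # block shares horizontal extent with table
        if 0 <= gap <= best_gap and overlaps and text.strip():
            best_text, best_gap = text.strip(), gap
    return best_text


def tables_with_captions(pdf_path):
    """Return a list of (guessed caption, table rows) pairs for every table found."""
    import pymupdf  # imported here so find_caption can be tried without pymupdf

    results = []
    with pymupdf.open(pdf_path) as doc:
        for page in doc:
            # each block is (x0, y0, x1, y1, text, block_no, block_type); type 0 is text
            blocks = [b[:5] for b in page.get_text("blocks") if b[6] == 0]
            for table in page.find_tables().tables:
                results.append((find_caption(table.bbox, blocks), table.extract()))
    return results
```

If the captions in your PDF sit below the tables, the gap would have to be measured from the bottom of the table instead, and max_gap will probably need tuning per document.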
Reply

