Python Forum
Extracting tables and text above the table from a PDF to CSV
#1
Hi

I have a PDF file from which I need to extract all the tables, along with the text above each table, and write the results to a CSV file. Using tabula, I have managed to extract the tables, but I am not sure how to extract the text above them. I need to extract the Perf factor, whose values are Accuracy and Time, and also the text below the Perf factor, which is the 'Description'. I have attached a sample input PDF file. My original PDF file has 100+ tables.

Attachment: Input.pdf (Size: 52.29 KB)

I have tried the following code to extract the tables:

# Import the required module
import tabula
# Read the PDF file; read_pdf returns a list of DataFrames, so [0] is only the first table
df = tabula.read_pdf("Input.pdf", pages='all')[0]
# Convert the tables on all pages of the PDF into a CSV file
tabula.convert_into("Input.pdf", "Output.csv", output_format="csv", pages='all')
print(df)
The expected output CSV should look like this:
Output:
Perf factor Accuracy
Description Accuracy of participants
Perf factor attributes Value
Category Football
Participants 11
Ballots Completed 1
Ballots Terminated 4
Perf factor Time
Description Total time taken
Perf factor attributes Value
Category Cricket
Participants 10
Ballots Completed 4
Ballots Terminated 9
Details of the software I use:
Python 3.9.13
Anaconda Navigator, Spyder

I was wondering whether I should convert the PDF to text in order to extract the text, or whether there is a better way. Any help would be much appreciated. Thanks in advance.
#2
Perhaps pypdf (https://pypi.org/project/pypdf/) will help.
It is the new PyPDF2, and it can extract text as well as split off pages.
I haven't needed to use it lately, but I used the older version often.
#3
Unfortunately, converting the PDF to text creates spacing issues, and I am not able to extract the keywords as required. Kindly help with a better solution to move forward.
#4
(Jan-16-2023, 09:55 AM)DivAsh Wrote: The expected output CSV should look like as shown below:

Your PDF file doesn't contain 'Description', yet you expect it to appear automagically when converting from PDF to CSV?

You have two tasks here: read the text from the PDF, and process it so that it matches your expectations. Side note: if I save a file as CSV, I always prefer to save it (as the name suggests) in comma-separated values format.

With pypdf, suggested by Larz60+, it is easy to get the text, and manipulating it shouldn't be a heavy task (remove the numbering and empty lines; add Description):

from pypdf import PdfReader

reader = PdfReader('Input.pdf')
page = reader.pages[0]
text = page.extract_text()
print(text)
Output:
2.1 Perf factor Accuracy
Accuracy of participants
Perf factor attributes Value
Category Football
Participants 11
Ballots Completed 1
Ballots Terminated 4
2.2 Perf factor Time
Total time taken
Perf factor attributes Value
Category Cricket
Participants 10
Ballots Completed 4
Ballots Terminated 9
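The manipulation step (remove numbering and empty lines, add Description) could look roughly like the sketch below. It is only a sketch under assumptions: SAMPLE_TEXT stands in for what extract_text() returns and assumes one item per line, the line directly after a "Perf factor" heading is taken to be the description, and the last word of every other line is taken to be the value. The names text_to_rows and rows_to_csv are made up for this example and are not part of pypdf.

```python
import csv
import io
import re

# Stand-in for page.extract_text(); the one-item-per-line layout is an assumption.
SAMPLE_TEXT = """2.1 Perf factor Accuracy
Accuracy of participants
Perf factor attributes Value
Category Football
Participants 11
Ballots Completed 1
Ballots Terminated 4

2.2 Perf factor Time
Total time taken
Perf factor attributes Value
Category Cricket
Participants 10
Ballots Completed 4
Ballots Terminated 9
"""

# Section numbering such as "2.1" at the start of a line (at least one dot required,
# so plain values like "11" are never mistaken for numbering).
NUMBERING = re.compile(r"^\d+(?:\.\d+)+\s*")


def text_to_rows(text):
    """Turn extracted text into (key, value) rows (hypothetical helper)."""
    rows = []
    expect_description = False
    for raw in text.splitlines():
        line = " ".join(raw.split())      # collapse runs of whitespace
        line = NUMBERING.sub("", line)    # drop leading section numbers
        if not line:                      # skip empty lines
            continue
        if expect_description:
            # Assumption: the line right after "Perf factor ..." is the description.
            rows.append(("Description", line))
            expect_description = False
        elif line.startswith("Perf factor attributes"):
            rows.append(("Perf factor attributes",
                         line[len("Perf factor attributes"):].strip()))
        elif line.startswith("Perf factor"):
            rows.append(("Perf factor", line[len("Perf factor"):].strip()))
            expect_description = True
        else:
            # Assumption: the last word is the value, everything before it is the key.
            key, _, value = line.rpartition(" ")
            rows.append((key, value))
    return rows


def rows_to_csv(rows):
    """Serialize rows as comma-separated text (hypothetical helper)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_to_rows(SAMPLE_TEXT)
csv_text = rows_to_csv(rows)
print(csv_text)
```

For the real file, the sample string would be replaced by something like "\n".join(page.extract_text() for page in reader.pages), and csv_text written to Output.csv. With 100+ tables the rules will probably need adjusting to whatever the actual extracted text looks like.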