Extracting tables and text above the table from a PDF to CSV

DivAsh · (This post was last modified: Jan-16-2023, 09:55 AM by DivAsh.)

Hi

I have a PDF file from which I need to extract all the tables, along with the text above each table, and write the results to a CSV file. Using tabula I have managed to extract the tables, but I am not sure how to extract the text above them. I need the 'Perf factor', whose values are Accuracy and Time, and the text below the Perf factor line, which is the 'Description'. I have attached a sample input PDF file. My original PDF file has 100+ tables.

Input.pdf (Size: 52.29 KB / Downloads: 7)

I have tried the following code to extract the tables

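# Import the required module
import tabula

# read_pdf returns a list of DataFrames, one per detected table;
# [0] keeps only the first table
df = tabula.read_pdf("Input.pdf", pages="all")[0]

# Convert the tables on all pages of the PDF into a single CSV file
tabula.convert_into("Input.pdf", "Output.csv", output_format="csv", pages="all")
print(df)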

The expected output CSV should look like this:

Output:
Perf factor	Accuracy
Description	Accuracy of participants
Perf factor attributes	Value
Category	Football
Participants	11
Ballots Completed	1
Ballots Terminated	4
Perf factor	Time
Description	Total time taken
Perf factor attributes	Value
Category	Cricket
Participants	10
Ballots Completed	4
Ballots Terminated	9

Software I am using:
Python 3.9.13
Anaconda Navigator, Spyder

I was wondering whether I should convert the PDF to text to extract the text, or whether there is a better way. Any help would be much appreciated. Thanks in advance.

**Larz60+** · Jan-16-2023, 07:58 PM

Perhaps https://pypi.org/project/pypdf/ will help.
This is the successor to PyPDF2; it extracts text and can also split off pages.
I haven't needed to use it lately, but I used the older version often.

DivAsh · Jan-18-2023, 04:23 AM

Unfortunately, converting the PDF to text introduces spacing issues, and I am not able to extract the keywords as required. Could you suggest a better way to move forward?

**perfringo** · (This post was last modified: Jan-18-2023, 07:39 AM by perfringo.)

(Jan-16-2023, 09:55 AM)DivAsh Wrote: The expected output CSV should look like this:

Your PDF file doesn't contain the word 'Description', yet you expect it to appear automagically when converting from PDF to CSV?

You have two tasks here: read the text from the PDF, and process it so that it matches your expectations. Side note: if I save a file as CSV, I always prefer to save it (as the name suggests) in comma-separated values format.

With pypdf, suggested by Larz60+, it is easy to get the text, and manipulating it shouldn't be a heavy task (remove the numbering and empty lines; add 'Description'):

from pypdf import PdfReader

reader = PdfReader('Input.pdf')
page = reader.pages[0]
text = page.extract_text()
print(text)

Output:
2.1 Perf factor  Accuracy
Accuracy of participants

Perf factor attributes  Value
Category  Football
Participants  11
Ballots Completed  1
Ballots Terminated  4

2.2 Perf factor  Time
Total time  taken

Perf factor  attributes  Value
Category   Cricket
Participants   10
Ballots Completed    4
Ballots Terminated  9
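
The manipulation step could be sketched roughly as below. This is only one possible approach, not a tested solution for the full 100+ table file. It assumes that every section starts with a numbered 'Perf factor' line, that the line right after it is the description, and that every other line ends in a single-word value. The names `text_to_rows`, `rows_to_csv` and `SAMPLE_TEXT` are made up for this illustration; `SAMPLE_TEXT` is a shortened copy of the extracted text shown above.

```python
import csv
import io
import re

# Leading section numbering such as "2.1 " or "3.10.2 "
NUMBERING = re.compile(r"^\d+(?:\.\d+)*\s+")
HEADING_LABEL = "Perf factor"

# Shortened copy of the text extracted by pypdf (irregular spacing kept)
SAMPLE_TEXT = """2.1 Perf factor  Accuracy
Accuracy of participants

Perf factor attributes  Value
Category  Football
Participants  11

2.2 Perf factor  Time
Total time  taken

Perf factor  attributes  Value
Category   Cricket
Ballots Completed    4
"""


def text_to_rows(text):
    """Turn extracted text into (label, value) pairs."""
    rows = []
    expect_description = False
    for raw_line in text.splitlines():
        # Collapse runs of whitespace so spacing differences do not matter
        line = " ".join(raw_line.split())
        if not line:
            continue  # drop empty lines
        stripped, numbered = NUMBERING.subn("", line)
        if numbered and stripped.startswith(HEADING_LABEL + " "):
            # Numbered heading: "Perf factor <value>"
            rows.append((HEADING_LABEL, stripped[len(HEADING_LABEL) + 1:]))
            expect_description = True
            continue
        if expect_description:
            # Assumption: the line right after a heading is the description
            rows.append(("Description", stripped))
            expect_description = False
            continue
        # Assumption: the value is the last word, the label is the rest
        label, _, value = stripped.rpartition(" ")
        rows.append((label, value))
    return rows


def rows_to_csv(rows):
    """Render the pairs as comma-separated text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


rows = text_to_rows(SAMPLE_TEXT)
print(rows_to_csv(rows), end="")
```

To process the real file, the text of every page in `reader.pages` would be passed to `text_to_rows` instead of `SAMPLE_TEXT`. Values that contain spaces would break the last-word assumption and would need different handling.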
