PDF Extract using CSV values

atomxkai · Jan-11-2022, 07:03 PM

Hello, I need help reading the page values from a CSV file with multiple rows instead of entering them manually.

Thank you.

from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'document.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')

pdf = PdfFileReader(pdf_file_path)

pdfWriter = PdfFileWriter()

# these values are entered manually
# how can I read them from a CSV file with multiple rows instead?
setpage = 21
startpage = 524
endpage = 570

for page_num in range(startpage,endpage):
    pdfWriter.addPage(pdf.getPage(page_num))

with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage }, 'wb') as f:
    pdfWriter.write(f)  # the with block closes the file, so no f.close() is needed

BashBedlam · Jan-11-2022, 07:30 PM

Assuming that your csv file looks something like this:

Set Page, Start Page, End Page
2, 4, 8
2, 10, 14
2, 16, 20

Then this will do what you're asking:

from PyPDF2 import PdfFileReader, PdfFileWriter
 
pdf_file_path = 'document.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
 
pdf = PdfFileReader(pdf_file_path)
 
pdfWriter = PdfFileWriter()
 
with open ('page values.csv', 'r') as page_values_file :
	page_values_file.readline () # dump the header

	for line in page_values_file :
		page_values = line.strip ().split (',')
		setpage = int (page_values [0])
		startpage = int (page_values [1])
		endpage = int (page_values [2])

		for page_num in range(startpage,endpage):
			pdfWriter.addPage(pdf.getPage(page_num))
 
		with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage }, 'wb') as f:
			pdfWriter.write(f)
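
As an alternative to splitting each line by hand, the standard `csv` module can do the parsing. This is only a minimal sketch under the same assumption about the file layout as above (a header row, then three integer columns); `read_page_ranges` is a made-up helper name, and the `io.StringIO` object stands in for the real `'page values.csv'` file.

```python
import csv
import io

def read_page_ranges(lines):
    """Return (setpage, startpage, endpage) tuples from CSV lines with a header row."""
    # skipinitialspace drops the blank after each comma, as in "2, 4, 8"
    reader = csv.reader(lines, skipinitialspace=True)
    next(reader, None)  # skip the header row
    ranges = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        setpage, startpage, endpage = (int(value) for value in row[:3])
        ranges.append((setpage, startpage, endpage))
    return ranges

# Stand-in for: open('page values.csv', newline='')
sample = io.StringIO("Set Page, Start Page, End Page\n2, 4, 8\n2, 10, 14\n2, 16, 20\n")
print(read_page_ranges(sample))  # [(2, 4, 8), (2, 10, 14), (2, 16, 20)]
```

The returned tuples can then drive the same page loop as in the script above.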

atomxkai · Jan-12-2022, 06:14 PM

Thank you so much, BashBedlam! It works! It helped a lot and saved my day!

I have some comments on the results.

Results:
document_subset_1 -> was not exported
document_subset_2 -> exported as per the values from the CSV
document_subset_3 -> exported as per the values from the CSV, but its first pages also include the pages from subset_2
document_subset_4 -> same as subset_3, including the subset_2 pages
...
So the good thing is that I could just use the last subset PDF file, which contains the complete extraction, and manually extract subset_1.

Almost perfect!

Here is an example of my CSV values, from a PDF of 300+ pages:

setpage	startpage	endpage
1	0	5
2	17	22
3	54	59
4	67	72
5	82	87
5	87	92
5	92	97
6	109	114
7	122	127
8	183	188
9	208	213
9	213	218
10	222	227

setpage - I grouped the ranges into sets because some of them are continuous.

BashBedlam · (This post was last modified: Jan-12-2022, 06:42 PM by BashBedlam.)

First off, there's no page zero so your first entry should start with a one. Secondly, I may have misunderstood your intended outcome. Try this and see if it's more what you had in mind.

from PyPDF2 import PdfFileReader, PdfFileWriter
  
pdf_file_path = 'document.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
  
pdf = PdfFileReader(pdf_file_path)
  
with open ('page values.csv', 'r') as page_values_file :
	page_values_file.readline () # dump the header
 
	for line in page_values_file :
		page_values = line.strip ().split (',')
		setpage = int (page_values [0])
		startpage = int (page_values [1])
		endpage = int (page_values [2])


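		# start a fresh writer for each row so earlier pages don't carry over
		pdfWriter = PdfFileWriter()
		for page_num in range(startpage,endpage):
			pdfWriter.addPage(pdf.getPage(page_num))
		# write the file once, after all pages of this row have been added
		with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage }, 'wb') as f:
			pdfWriter.write(f)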
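
One caveat on page numbering that is worth checking: the numbers a PDF viewer shows start at one, but as far as I know `getPage` takes a zero-based index, so index 0 is the first page. If the CSV holds viewer page numbers with an inclusive end page (an assumption about the data, not something stated in this thread), a conversion along these lines may help. `viewer_pages_to_indexes` is a hypothetical helper name.

```python
def viewer_pages_to_indexes(first_page, last_page):
    """Convert 1-based, inclusive viewer page numbers to zero-based indexes."""
    if first_page < 1 or last_page < first_page:
        raise ValueError(f"bad page range: {first_page}-{last_page}")
    # viewer page 1 is index 0; range() excludes its end, so last_page stays as is
    return range(first_page - 1, last_page)

print(list(viewer_pages_to_indexes(1, 5)))  # [0, 1, 2, 3, 4]
```

The result can be used in place of `range(startpage,endpage)` in the loop above.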

atomxkai · (This post was last modified: Jan-12-2022, 09:15 PM by atomxkai.)

(Jan-12-2022, 06:41 PM)BashBedlam Wrote: First off, there's no page zero so your first entry should start with a one. Secondly, I may have misunderstood your intended outcome. Try this and see if it's more what you had in mind.

This second update works well and follows the CSV values in each record row. Although it may not handle the three records that share a setpage the way the previous script did, once I changed the setpage values to be sequential it extracted the PDF pages for each row as expected.
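
To keep rows that share a setpage together in one output file, the page ranges could be grouped before writing. The following is a rough sketch, not tested against a real PDF: `group_pages_by_set` and `write_subsets` are made-up names, and the writer part assumes the same `PdfFileReader`/`PdfFileWriter` calls used earlier in this thread.

```python
from collections import defaultdict

def group_pages_by_set(rows):
    """Map each setpage to the page indexes from all of its rows, in row order."""
    pages_by_set = defaultdict(list)
    for setpage, startpage, endpage in rows:
        pages_by_set[setpage].extend(range(startpage, endpage))
    return dict(pages_by_set)

rows = [(5, 82, 87), (5, 87, 92), (5, 92, 97), (6, 109, 114)]
grouped = group_pages_by_set(rows)
print(len(grouped[5]), grouped[5][0], grouped[5][-1])  # 15 82 96
print(grouped[6])  # [109, 110, 111, 112, 113]

def write_subsets(pdf, grouped, file_base_name):
    """Write one PDF per setpage; pdf is an already opened PdfFileReader."""
    # imported here so the grouping above can be tried without PyPDF2 installed
    from PyPDF2 import PdfFileWriter
    for setpage, page_nums in grouped.items():
        writer = PdfFileWriter()  # fresh writer for every set
        for page_num in page_nums:
            writer.addPage(pdf.getPage(page_num))
        with open(f"{file_base_name}_subset_{setpage}.pdf", "wb") as f:
            writer.write(f)
```

With this, the three rows for set 5 end up in a single `document_subset_5.pdf` instead of overwriting one another.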

I'm still learning how to call these functions and how to group them under the for loop and the with statement.

I think I have two choices now. With the first code, the pages are already merged in the last PDF file it outputs, so I only need to insert the first set. With the second code, I can export each PDF file exactly as per the CSV record values.

My remaining challenge is how to extract the first CSV record. I tried setting startpage = 1, but it looks like it still didn't export.
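
One possible cause, and this is only a guess since the real file isn't shown here: if the CSV has no header line, the `readline()` call that dumps the header would throw away the first record instead. A sketch that skips a row only when its first field is not a number could look like this; `read_rows` is a hypothetical name and the `io.StringIO` objects stand in for the file.

```python
import csv
import io

def read_rows(lines, delimiter=","):
    """Yield integer rows, skipping any row whose first field is not a number."""
    for row in csv.reader(lines, delimiter=delimiter, skipinitialspace=True):
        if not row or not row[0].strip().isdigit():
            continue  # header or blank line
        yield tuple(int(value) for value in row[:3])

with_header = io.StringIO("setpage,startpage,endpage\n1,0,5\n2,17,22\n")
without_header = io.StringIO("1,0,5\n2,17,22\n")
print(list(read_rows(with_header)))     # [(1, 0, 5), (2, 17, 22)]
print(list(read_rows(without_header)))  # [(1, 0, 5), (2, 17, 22)]
```

Passing `delimiter="\t"` would cover a tab-separated file as well.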

Thanks again, BashBedlam! I really appreciate it.

Pedroski55 · Jan-13-2022, 12:20 PM

I often need to cut bits out of textbook pdfs.

Normally I just note down the start page and finish page and enter them by hand.

I never thought about making a csv, because I only want Unit 3 or Lesson 5.

If you just want one part of your csv data, loop over the data and give yourself the choice of which section you want.

If you try this in your shell, just replace the paths with your own.

def myApp():
    #! /usr/bin/python3
    # this program will take a pdf and extract a range of connected pages

    from PyPDF2 import PdfFileWriter, PdfFileReader
    import os, csv

    print('enter the path to the pdf you want to get pages from ... ')
    path2PDF = input('something like /home/pedro/Latin/ (don\'t forget the last /) ...  ')
    path2Extracts = '/home/pedro/pdfExtractedPages/'
    path2CSV = '/home/pedro/pdfs/'

    files = os.listdir(path2PDF)
    pdfs = []
    for f in files:
        if f.endswith('.pdf'):
            pdfs.append(f)
    # ask once, then list the available PDFs
    print('Which PDF do you want to extract pages from?')
    for f in pdfs:
        print(f)

    myPDF = input('Copy and paste 1 of the PDF names here ... ')
    # read the pdf
    pdf = PdfFileReader(path2PDF + myPDF)   
    pages = pdf.getNumPages()
    print('This pdf has ' + str(pages) + ' pages')

    # get the csv with the page details for extraction
    print('What pages do you want to get? They are in a CSV file.')
    csv_files = os.listdir(path2CSV)
    csvs = []
    for f in csv_files:
        if f.endswith('.csv'):
            csvs.append(f)
    # ask once, then list the available CSV files
    print('Which CSV file do you need?')
    for f in csvs:
        print(f)
    myCSV = input('Copy and paste 1 of the CSV names here ... ')

    # get the data from csv
    with open(path2CSV + myCSV) as infile:
        # read the csv file in
        answers = csv.reader(infile)
        # csv.reader is annoying, it's gone if you have to repeat, so read to a data list first
        data = []
        for row in answers:
            data.append(row)

    # get the base name for saving the PDFs
    name = myPDF.split('.')
    bookTitle = name[0]

    # a function to make the excerpts
    def makePDF(alist):
        start = int(alist[1])
        end = int(alist[2])
        label = alist[0]
        pdf_writer = PdfFileWriter()
        for page in range(start, end):        
            pdf_writer.addPage(pdf.getPage(page))
        output_filename = f'{bookTitle}_{label}.pdf'
        with open(path2Extracts + output_filename, 'wb') as out:
            pdf_writer.write(out)
        print(f'Created: {output_filename} and saved in', path2Extracts)

    for i in range(1, len(data)):
        makePDF(data[i])
        
    print('Pages extracted, pdfs made and saved in ', path2Extracts)
    print('All done!')
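
It may also be worth checking each CSV row against the page count before extracting, since a range that runs past the end of the PDF will fail partway through. A minimal sketch follows; `check_range` is a made-up helper, and the page count of 300 is a placeholder for what `pdf.getNumPages()` returns in the script above.

```python
def check_range(start, end, num_pages):
    """Return True when range(start, end) stays inside a PDF with num_pages pages."""
    return 0 <= start < end <= num_pages

num_pages = 300  # hypothetical; in the script above this would be pdf.getNumPages()
rows = [("1", "0", "5"), ("10", "222", "227"), ("11", "298", "305")]
for label, start, end in rows:
    ok = check_range(int(start), int(end), num_pages)
    print(label, "ok" if ok else "out of range")
```

Rows that fail the check could be skipped or reported before `makePDF` is called.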
