Posts: 30
Threads: 8
Joined: Feb 2021
Hello, I need help reading multiple sets of values from a CSV file instead of entering them manually.
Thank you.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'document.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()

# these values are manual input
# how to read a csv file with multiple values instead of manual input?
setpage = 21
startpage = 524
endpage = 570

for page_num in range(startpage, endpage):
    pdfWriter.addPage(pdf.getPage(page_num))

# the with block closes the file, so no explicit f.close() is needed
with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage}, 'wb') as f:
    pdfWriter.write(f)
Posts: 379
Threads: 2
Joined: Jan 2021
Assuming that your csv file looks something like this:
Set Page, Start Page, End Page
2, 4, 8
2, 10, 14
2, 16, 20
Then this will do what you're asking:
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'document.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)
pdfWriter = PdfFileWriter()

with open('page values.csv', 'r') as page_values_file:
    page_values_file.readline()  # dump the header
    for line in page_values_file:
        page_values = line.strip().split(',')
        setpage = int(page_values[0])
        startpage = int(page_values[1])
        endpage = int(page_values[2])
        for page_num in range(startpage, endpage):
            pdfWriter.addPage(pdf.getPage(page_num))
        with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage}, 'wb') as f:
            pdfWriter.write(f)
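As a variation on the manual split(',') above, the standard library's csv module can do the parsing. This is only a sketch: the function name read_page_ranges is made up for illustration, and it assumes the file has one header row followed by three integer columns, as in the sample. An in-memory file stands in for 'page values.csv' so the snippet runs on its own.

```python
import csv
import io

def read_page_ranges(csv_file):
    """Return a list of (setpage, startpage, endpage) integer tuples.

    Hypothetical helper: assumes one header row, then three integer columns.
    """
    # skipinitialspace drops the blank after each comma ("2, 4, 8")
    reader = csv.reader(csv_file, skipinitialspace=True)
    next(reader, None)  # skip the header row
    ranges = []
    for row in reader:
        if not row:  # ignore blank lines
            continue
        setpage, startpage, endpage = (int(value) for value in row[:3])
        ranges.append((setpage, startpage, endpage))
    return ranges

# Demo on the sample data from this post
sample = io.StringIO("Set Page, Start Page, End Page\n2, 4, 8\n2, 10, 14\n2, 16, 20\n")
print(read_page_ranges(sample))
```

With a real file, the same function could be called inside `with open('page values.csv', newline='') as f:`.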
Thank you so much, BashBedlam! It works! It helps a lot and saved the day!
I have some comments on the results.
Results:
document_subset_1 -> was not exported
document_subset_2 -> exported as per the values from the csv
document_subset_3 -> exported as per the values from the csv, but its first pages also include the pages from subset_2
document_subset_4 -> same as subset_3, including subset_2
...
So the good thing is that I can just use the last subset PDF file, which is the complete extraction, and manually extract subset_1.
Almost perfect!
Here is an example of my csv values, taken from a PDF of 300+ pages:
setpage  startpage  endpage
1        0          5
2        17         22
3        54         59
4        67         72
5        82         87
5        87         92
5        92         97
6        109        114
7        122        127
8        183        188
9        208        213
9        213        218
10       222        227
setpage - I grouped the rows into sets because some of the page ranges are continuous.
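If the goal is one PDF per set, the rows that share a setpage can be collected before any PDF is written. The following is a sketch under stated assumptions: group_pages_by_set is a hypothetical helper name, the first line is taken to be a header, and the parser accepts either commas or whitespace between values because the listing above does not show which separator the file uses. The ranges are expanded with range(startpage, endpage), the same way the scripts in this thread do (end value excluded).

```python
from collections import defaultdict

def group_pages_by_set(lines):
    """Map each setpage to the list of page numbers from all of its rows.

    Hypothetical helper: the first line is treated as a header.
    """
    pages_by_set = defaultdict(list)
    for line in lines[1:]:  # skip the header line
        fields = line.replace(',', ' ').split()
        if len(fields) < 3:
            continue  # ignore blank or incomplete rows
        setpage, startpage, endpage = (int(field) for field in fields[:3])
        pages_by_set[setpage].extend(range(startpage, endpage))
    return dict(pages_by_set)

rows = [
    "setpage startpage endpage",
    "4 67 72",
    "5 82 87",
    "5 87 92",
    "5 92 97",
]
grouped = group_pages_by_set(rows)
print(grouped[5])  # one list holding pages 82 through 96
```

Each list could then be passed to a fresh PdfFileWriter, one writer per set, so that no output file picks up pages from another set.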
Jan-12-2022, 06:41 PM
(This post was last modified: Jan-12-2022, 06:42 PM by BashBedlam.)
First off, there's no page zero so your first entry should start with a one. Secondly, I may have misunderstood your intended outcome. Try this and see if it's more what you had in mind.
from PyPDF2 import PdfFileReader, PdfFileWriter

pdf_file_path = 'document.pdf'
file_base_name = pdf_file_path.replace('.pdf', '')
pdf = PdfFileReader(pdf_file_path)

with open('page values.csv', 'r') as page_values_file:
    page_values_file.readline()  # dump the header
    for line in page_values_file:
        page_values = line.strip().split(',')
        setpage = int(page_values[0])
        startpage = int(page_values[1])
        endpage = int(page_values[2])
        pdfWriter = PdfFileWriter()  # a fresh writer for every row
        for page_num in range(startpage, endpage):
            pdfWriter.addPage(pdf.getPage(page_num))
        with open("%(n)s_subset_%(b)s.pdf" % {'n': format(file_base_name), 'b': setpage}, 'wb') as f:
            pdfWriter.write(f)
Jan-12-2022, 09:15 PM
(This post was last modified: Jan-12-2022, 09:15 PM by atomxkai.)
This second version works well: it exports the pages exactly as given in each CSV record row. It might not handle the three records that share a setpage the way the previous script did, but when I change the setpage values to be sequential it extracts the PDF pages for each record row.
I'm still learning how to call these functions and how to group them under the for loop and the with statement.
I think I have two choices now. With the first script, the pages are already merged in the last PDF file it outputs, so I only need to insert the first set. With the second script, I can export each PDF file exactly as per the CSV record values.
My remaining challenge is extracting the first CSV record: I tried setting startpage = 1, but it still didn't export.
Thanks again, BashBedlam! I really appreciate it.
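One possible cause of the missing first record, offered as a guess rather than a diagnosis: if the CSV file has no header row, the readline() call that dumps the header throws away the first data row instead. The sketch below only skips the first line when its first field is not a number. data_lines is a hypothetical helper name, and it accepts commas or whitespace as separators. Note also that, as far as I can tell from the PyPDF2 documentation, getPage() numbers pages from zero, so a startpage of 0 should be valid there.

```python
def data_lines(lines):
    """Return the lines that hold data, dropping a header only if one exists.

    Hypothetical helper: a first line whose first field is not an integer
    is treated as a header; blank lines are dropped.
    """
    lines = [line for line in lines if line.strip()]
    if lines:
        fields = lines[0].replace(',', ' ').split()
        if not fields or not fields[0].isdigit():
            lines = lines[1:]  # the first line looks like a header
    return lines

# With a header, the header is dropped; without one, the first row survives
print(data_lines(["setpage startpage endpage", "1 0 5", "2 17 22"]))
print(data_lines(["1 0 5", "2 17 22"]))
```

In the scripts above, this would replace the unconditional readline() call, for example by looping over data_lines(page_values_file.readlines()).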
Posts: 1,094
Threads: 143
Joined: Jul 2017
I often need to cut bits out of textbook pdfs.
Normally I just note down the start page and finish page and enter them by hand.
I never thought about making a csv, because I only want Unit 3 or Lesson 5.
If you only want one part of your csv data, loop over the rows and give yourself the choice of which section to extract.
If you try this in your shell, just change the paths to your own.
#! /usr/bin/python3
# this program will take a pdf and extract a range of connected pages
from PyPDF2 import PdfFileWriter, PdfFileReader
import os, csv

def myApp():
    print('enter the path to the pdf you want to get pages from ... ')
    path2PDF = input('something like /home/pedro/Latin/ (don\'t forget the last /) ... ')
    path2Extracts = '/home/pedro/pdfExtractedPages/'
    path2CSV = '/home/pedro/pdfs/'
    files = os.listdir(path2PDF)
    pdfs = []
    for f in files:
        if f.endswith('.pdf'):
            pdfs.append(f)
    # ask once, then list the candidates
    print('Which PDF do you want to extract pages from?')
    for f in pdfs:
        print(f)
    myPDF = input('Copy and paste 1 of the PDF names here ... ')
    # read the pdf
    pdf = PdfFileReader(path2PDF + myPDF)
    pages = pdf.getNumPages()
    print('This pdf has ' + str(pages) + ' pages')
    # get the csv with the page details for extraction
    print('What pages do you want to get? They are in a CSV file.')
    csv_files = os.listdir(path2CSV)
    csvs = []
    for f in csv_files:
        if f.endswith('.csv'):
            csvs.append(f)
    print('Which CSV file do you need?')
    for f in csvs:
        print(f)
    myCSV = input('Copy and paste 1 of the CSV names here ... ')
    # get the data from csv
    with open(path2CSV + myCSV) as infile:
        # read the csv file in
        answers = csv.reader(infile)
        # csv.reader is annoying, it's gone if you have to repeat, so read to a data list first
        data = []
        for row in answers:
            data.append(row)
    # get the base name for saving the PDFs
    name = myPDF.split('.')
    bookTitle = name[0]

    # a function to make the excerpts
    def makePDF(alist):
        start = int(alist[1])
        end = int(alist[2])
        label = alist[0]
        pdf_writer = PdfFileWriter()
        for page in range(start, end):
            pdf_writer.addPage(pdf.getPage(page))
        output_filename = f'{bookTitle}_{label}.pdf'
        with open(path2Extracts + output_filename, 'wb') as out:
            pdf_writer.write(out)
        print(f'Created: {output_filename} and saved in', path2Extracts)

    # start at 1 to skip the header row
    for i in range(1, len(data)):
        makePDF(data[i])
    print('Pages extracted, pdfs made and saved in ', path2Extracts)
    print('All done!')
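The comment about csv.reader being "gone" after one pass can be seen in a few lines. This is a minimal sketch that uses an in-memory file in place of a real CSV; the row values are made up for illustration. list() does the same job as the append loop above.

```python
import csv
import io

# Made-up sample rows standing in for a real CSV file
infile = io.StringIO("label,start,end\nUnit3,40,55\nLesson5,70,78\n")
answers = csv.reader(infile)

data = list(answers)      # read every row into a list once
leftover = list(answers)  # the reader is now exhausted

print(len(data))   # 3 rows, header included
print(leftover)    # [] because nothing is left to read
```

Once the rows are in a list, they can be indexed or looped over as many times as needed.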