How to load all pages of the pdf file to the program

alicenguyen · Jun-28-2022, 06:51 AM

I'm working on code to summarize text using BERT. I am stuck at the step "loading all pages of the PDF file into the program". My code below loads only one page. Please help me with instructions. I am a new coder.

from PyPDF2 import PdfFileReader

f = open('/content/Example.pdf', 'rb')
pdf = PdfFileReader(f)
page = pdf.getPage(6)  # index 6 only, i.e. the seventh page
text = page.extractText()

***snippsat*** · Jun-28-2022, 09:31 AM

You should say which library you use; by the look of it, it's PyPDF2.
It's a common task, so if you search you will find different solutions.
If I do a quick test, this works.

import PyPDF2

file_pdf = 'sample.pdf'
with open(file_pdf, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    for page in range(reader.numPages):
        p = reader.getPage(page)
        print(p.extract_text())
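
Since the goal is to feed the text to a summarizer rather than print it, the same loop can collect every page into one string. This is only a sketch under assumptions: `extract_all_text` and `read_pdf_text` are made-up helper names, and it assumes a PyPDF2 version (2.x or later) that provides `PdfReader`, `reader.pages` and `page.extract_text()`.

```python
def extract_all_text(reader) -> str:
    """Return the text of every page in `reader`, one page per block.

    `reader` is any object with a `.pages` sequence whose items have an
    `extract_text()` method (a PyPDF2 PdfReader is assumed here).
    """
    parts = []
    for page in reader.pages:
        # Guard against an extractor handing back None for an empty page.
        parts.append(page.extract_text() or "")
    # A newline between pages keeps the last word of one page
    # from running into the first word of the next.
    return "\n".join(parts)


def read_pdf_text(path: str) -> str:
    """Hypothetical convenience wrapper: open `path` and extract all pages."""
    import PyPDF2  # imported here so the helper above works without PyPDF2

    with open(path, "rb") as f:
        return extract_all_text(PyPDF2.PdfReader(f))
```

Calling `read_pdf_text('sample.pdf')` should then give one string for the whole document, which can be passed on to the summarization step.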

alicenguyen · Jun-28-2022, 09:41 AM

Yes, I use PyPDF2.
I will try your recommended code. Thanks.

alicenguyen

It is successful now, but the output does not look right. Could you review it and give me your advice?

Output:
Streaming output truncated to the last 5000 lines.


o



i



n



t



 



t

... etc.
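
For output like the sample above, where every character sits on its own line and a line holding a single space marks a word gap, the text can be glued back together after extraction. This is a rough workaround sketch, not a fix for the extraction itself; `collapse_single_char_lines` is a made-up name, and it assumes the word gaps really do arrive as lone-space lines.

```python
def collapse_single_char_lines(text: str) -> str:
    """Rejoin text in which every character ended up on its own line.

    Text is only changed when *every* non-empty line is exactly one
    character long; anything else is returned untouched.
    """
    non_empty = [line for line in text.splitlines() if line != ""]
    if not non_empty or any(len(line) != 1 for line in non_empty):
        return text  # not the one-character-per-line pattern
    # Lines holding a single space survive the filter above,
    # so word boundaries are preserved in the joined result.
    return "".join(non_empty)
```

Normal text passes through unchanged, so the function can be applied to the text of every page without checking first.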

***snippsat*** · (This post was last modified: Jun-28-2022, 10:09 AM by snippsat.)

Don't post the whole output if it's hundreds of lines long.
I have shortened it.
Are you using my code with one PDF file?
Here is the file sample.pdf I tested with.

alicenguyen · Jun-29-2022, 02:49 AM

I used that code but with another file. Your file is OK.
See the attached file.


test.pdf (Size: 272.01 KB / Downloads: 172)

alicenguyen · Jul-01-2022, 02:43 AM

(Jun-29-2022, 02:49 AM)alicenguyen Wrote: I used that code but with another file. Your file is OK.
See the attached file.

Could you help with the answer?

Pedroski55 · Jul-01-2022, 06:01 AM

I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.

PyPDF2 is good, but it can't get some text, for reasons I don't know or understand.

# importing required modules 
import pdfplumber

path2pdf = '/home/pedro/pdfExtractedPages/The_Knights_Tale_Modern_English.pdf'    
path2text = '/home/pedro/temp/The_Knights_Tale_Middle_English.txt'
"""
>>> test = enumerate(pages, 1) # 1 starts counting at 1, set 0 to count from zero
>>> test
<enumerate object at 0x7f586adb5880>
>>> for t in test:
	print(t)

	
(1, <Page:1>)
(2, <Page:2>)
(3, <Page:3>)
"""
text_pages = []

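# open the PDF named in path2pdf (the original used an undefined name, pdf_file)
with pdfplumber.open(path2pdf) as pdf:
    # pdf.pages is a list of Page objects; take the text of each one in turn
    for page in pdf.pages:
        # extract_text() may give back None for a page without text in some versions
        text = page.extract_text() or ''
        text_pages.append(text)

print('text_pages is now', len(text_pages), 'long. Now joining the list to a string ...')
# join with a newline so the last word of one page does not run into the next page
textstring = '\n'.join(text_pages)

with open(path2text, 'w', encoding='utf-8') as tf:
    tf.write(textstring)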

Gets the bits other modules can't reach! Worked for me, thank you snippsat!
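
Since one library may cope with a file that another one scrambles, a simple way to choose is to run several extractors on the same file and keep the least fragmented result. The sketch below is only an illustration: `fragmentation_score` and `best_extraction` are made-up names, and the score (the share of non-empty lines that hold a single character) is an assumed heuristic, not something any of these libraries provide.

```python
from typing import Callable, Iterable, Optional


def fragmentation_score(text: str) -> float:
    """Fraction of non-blank lines that contain just one character."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return 1.0  # nothing extracted is treated as the worst case
    return sum(1 for line in lines if len(line) == 1) / len(lines)


def best_extraction(
    path: str,
    extractors: Iterable[Callable[[str], Optional[str]]],
) -> str:
    """Run each extractor on `path` and return the least fragmented text."""
    best_text, best_score = "", 2.0  # 2.0 is above any real score
    for extract in extractors:
        try:
            text = extract(path) or ""
        except Exception:
            # This extractor could not handle the file; try the next one.
            continue
        score = fragmentation_score(text)
        if score < best_score:
            best_text, best_score = text, score
    return best_text
```

Each extractor is just a function that takes a path and returns a string, so the PyPDF2 loop and the pdfplumber loop from this thread could each be wrapped in one and passed in as a list.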

**buran** · Jul-01-2022, 07:06 AM

You can try using https://pypi.org/project/camelot-py/

alicenguyen · Jul-04-2022, 03:14 AM

(Jul-01-2022, 06:01 AM)Pedroski55 Wrote: I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.

PyPDF2 is good, but it can't get some text, for reasons I don't know or understand.

Thanks for sharing the code. The result is better now.
