Python Forum
How to load all pages of the pdf file to the program
#1
I'm working on code to summarize text using BERT. I'm stuck at the step of loading all pages of the PDF file into the program. My code below loads only one page. Please help me with instructions. I am a new coder.
f = open('/content/Example.pdf', 'rb')
pdf = PdfFileReader(f)
page = pdf.getPage(6)
text = page.extractText()
Reply
#2
You should say which library you are using; by the look of it, it's PyPDF2.
This is a common task, so if you search you will find several solutions.
A quick test shows that this works:
import PyPDF2

file_pdf = 'sample.pdf'
with open(file_pdf, mode='rb') as f:
    # PdfReader and reader.pages replace the older PdfFileReader/numPages/getPage names
    reader = PyPDF2.PdfReader(f)
    for page in reader.pages:
        print(page.extract_text())
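If you need the text as one string to pass on to a summarizer rather than printing it, the same loop can be wrapped in a function. This is only a sketch: it assumes PyPDF2 2.0 or later (where `PdfReader` and `reader.pages` are available), and the names `collect_page_texts` and `read_pdf_text` are made up for illustration.

```python
from typing import Iterable, List


def collect_page_texts(pages: Iterable) -> List[str]:
    """Return the extracted text of every page, in order.

    `pages` is any iterable of objects that have an extract_text() method,
    for example reader.pages from PyPDF2.
    """
    texts = []
    for page in pages:
        # Guard against a page that yields no text (None or empty string).
        texts.append(page.extract_text() or "")
    return texts


def read_pdf_text(path: str) -> str:
    """Load all pages of a PDF into one string (assumes PyPDF2 >= 2.0)."""
    # Imported here so collect_page_texts() has no dependency of its own.
    from PyPDF2 import PdfReader

    with open(path, "rb") as f:
        reader = PdfReader(f)
        # Join pages with a newline so words at page borders don't merge.
        return "\n".join(collect_page_texts(reader.pages))
```

Calling `read_pdf_text('/content/Example.pdf')` would then give the text of every page instead of only page 6.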
Reply
#3
Yes, I use PyPDF2.
I will try your recommended code. Thanks.
Reply
#4
It works now, but the output does not look good. Could you review it and give me your advice?
Output:
Streaming output truncated to the last 5000 lines. o i n t t ..... ect
Reply
#5
Don't post the whole output if it is hundreds of lines long.
I have shortened it.
Are you using my code with just one PDF file?
Here is the file sample.pdf that I tested with.
Reply
#6
I used that code but with another file. Your file is OK.
See the attached file.

.pdf   test.pdf (Size: 272.01 KB / Downloads: 93)
Reply
#7
(Jun-29-2022, 02:49 AM)alicenguyen Wrote: I used that code but with another file. Your file is OK.
See the attached file.

Could you help with the answer?
Reply
#8
I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.

PyPDF2 is good, but it can't extract some text, for reasons I don't know or understand.

# importing required modules 
import pdfplumber

path2pdf = '/home/pedro/pdfExtractedPages/The_Knights_Tale_Modern_English.pdf'    
path2text = '/home/pedro/temp/The_Knights_Tale_Middle_English.txt'
"""
>>> test = enumerate(pages, 1) # 1 starts counting at 1, set 0 to count from zero
>>> test
<enumerate object at 0x7f586adb5880>
>>> for t in test:
	print(t)

	
(1, <Page:1>)
(2, <Page:2>)
(3, <Page:3>)
"""
text_pages = []

# open the PDF given by path2pdf (the original used an undefined name, pdf_file)
with pdfplumber.open(path2pdf) as pdf:
    pages = pdf.pages
    for page in pages:
        # fall back to '' in case a page yields no text
        text_pages.append(page.extract_text() or '')

print('text_pages is now', len(text_pages), 'long. Now joining the list to a string ... ')
textstring = ''.join(text_pages)

with open(path2text, 'w') as tf:
    tf.write(textstring)
Gets the bits other modules can't reach! Worked for me, thank you snippsat!
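To compare two extractors on the same file, one rough check is how many of the tokens are single characters; output like the "o i n t t" in post #4 would score close to 1. This is only a heuristic sketch, and `single_char_ratio` is a name made up here, not part of PyPDF2 or pdfplumber.

```python
def single_char_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are one character long."""
    tokens = text.split()
    if not tokens:
        # No tokens at all: nothing to judge, report 0.0
        return 0.0
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens)
```

Running it on the text from each library for the same PDF and keeping the result with the lower ratio is one way to choose; where to draw the line between "fine" and "fragmented" is a judgment call for your documents.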
Reply
#9
You can try using https://pypi.org/project/camelot-py/
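Note that Camelot is aimed at tables rather than running text, so it may help only if the content of the PDF is tabular. A minimal sketch, assuming camelot-py is installed; the helper names `to_pages_arg` and `read_all_tables` are made up for illustration.

```python
from typing import Iterable, List


def to_pages_arg(page_numbers: Iterable[int]) -> str:
    """Build the comma-separated, 1-based page string that Camelot accepts."""
    return ",".join(str(n) for n in page_numbers)


def read_all_tables(path: str, pages: str = "all") -> List:
    """Return one pandas DataFrame per table that Camelot detects."""
    # Lazy import: camelot is only needed when this function is called.
    import camelot

    tables = camelot.read_pdf(path, pages=pages)
    return [table.df for table in tables]
```

By default `camelot.read_pdf` looks only at page 1, so passing `pages="all"` (or a string such as the one built by `to_pages_arg([1, 3, 4])`) is what makes it cover the whole file.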

Reply
#10
(Jul-01-2022, 06:01 AM)Pedroski55 Wrote: I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.

PyPDF2 is good, but it can't extract some text, for reasons I don't know or understand.

Thanks for providing the code. The result is better now.
Reply

