I'm pretty sure this code is based on something snippsat wrote some time ago, too clever for me.
PyPDF2 is good, but it can't get some of the text, for reasons I don't know or understand.
# importing required modules
import pdfplumber
path2pdf = '/home/pedro/pdfExtractedPages/The_Knights_Tale_Modern_English.pdf'
path2text = '/home/pedro/temp/The_Knights_Tale_Middle_English.txt'
"""
>>> test = enumerate(pages, 1) # 1 starts counting at 1, set 0 to count from zero
>>> test
<enumerate object at 0x7f586adb5880>
>>> for t in test:
...     print(t)
...
(1, <Page:1>)
(2, <Page:2>)
(3, <Page:3>)
"""
text_pages = []
with pdfplumber.open(path2pdf) as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        # depending on the pdfplumber version, a page with no text may give None
        text_pages.append(text or '')
print('text_pages is now', len(text_pages), 'long. Now joining the list to a string ... ')
# join with a newline so the end of one page doesn't run into the start of the next
textstring = '\n'.join(text_pages)
with open(path2text, 'w', encoding='utf-8') as tf:
    tf.write(textstring)
Gets the bits other modules can't reach! Worked for me, thank you snippsat!
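In case it helps anyone, here is a rough sketch of how the enumerate idea from the notes above could be used to label each page in the output. The names join_pages and pdf_to_text and the '--- page N ---' marker are my own inventions for illustration, not part of pdfplumber or of snippsat's code. I'm also assuming extract_text() may give back None for a page with no text, so that case is treated as an empty page.

```python
def join_pages(page_texts, start=1):
    """Join per-page text into one string, with a marker line before each page.

    join_pages and the marker format are hypothetical, made up for this sketch.
    """
    chunks = []
    # enumerate(..., start) numbers the pages from `start` instead of 0
    for number, text in enumerate(page_texts, start):
        # treat None (no extractable text) as an empty page
        chunks.append(f'--- page {number} ---\n{text or ""}')
    return '\n'.join(chunks)


def pdf_to_text(path2pdf, path2text):
    """Extract the text of every page in path2pdf and write it to path2text."""
    import pdfplumber  # imported here so join_pages works without it installed

    with pdfplumber.open(path2pdf) as pdf:
        page_texts = [page.extract_text() for page in pdf.pages]
    with open(path2text, 'w', encoding='utf-8') as tf:
        tf.write(join_pages(page_texts))
    return len(page_texts)
```

Calling pdf_to_text(path2pdf, path2text) with the two paths from the top of the post should write the same text as before, just with a marker line in front of each page, which makes it easier to see which page a missing bit came from.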