Posts: 30
Threads: 8
Joined: Feb 2021
Jan-07-2022, 06:56 PM
(This post was last modified: Jan-08-2022, 12:17 AM by atomxkai.)
Hello,
I'm trying to use this script to search for a word or text in a PDF and output the matching text and its page number.
Please note: I'm using Windows.
import PyPDF2
import re
pdfFileObj=open(r'C:\python\document.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
pages_text=[]
words_start_pos={}
words={}
searchwords=['Earth']
with open('Results.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Page Number", "Search"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])
Problem:
The results are not showing.
https://imgur.com/a/yAD97mN
Posts: 170
Threads: 43
Joined: May 2019
When you run the logic, are your variables being populated?
For example, you have print(page); does that print out a number?
What if you add a print(word) right below the for statement to see whether it contains anything?
Since you are looping through those search words and page numbers, if those are blank, you won't get any results.
Posts: 379
Threads: 2
Joined: Jan 2021
Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.
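One way to check whether a PDF falls into the image-only category is to look at how much text each page actually yields. The sketch below is a hypothetical helper, not part of PyPDF2 or any other library: `pages_without_text` takes a list of already-extracted page strings (one entry per page) and reports the pages that came back empty. It assumes that, depending on the library and version, a page with no text may come back as an empty string or as `None`.

```python
def pages_without_text(page_texts):
    """Return 1-based page numbers whose extracted text is empty.

    page_texts is assumed to be a list with one entry per page, holding
    whatever the extraction call returned (a string, or None).
    """
    empty_pages = []
    for page_nr, text in enumerate(page_texts, 1):
        # Treat None and whitespace-only strings the same way.
        if not text or not text.strip():
            empty_pages.append(page_nr)
    return empty_pages


if __name__ == "__main__":
    sample = ["Earth is the third planet.", "", None, "   \n"]
    print(pages_without_text(sample))  # prints [2, 3, 4]
```

If every page shows up in the result, the PDF most likely holds scanned images rather than text, and only OCR would help.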
Posts: 30
Threads: 8
Joined: Feb 2021
Jan-08-2022, 12:26 AM
(This post was last modified: Jan-08-2022, 12:26 AM by atomxkai.)
(Jan-07-2022, 11:21 PM)BashBedlam Wrote: Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.
I learned that it works more often on Mac than on Windows? But I will try a different PDF. Thanks.
Posts: 30
Threads: 8
Joined: Feb 2021
(Jan-07-2022, 07:19 PM)cubangt Wrote: When you run the logic, are your variables being populated?
For example, you have print(page); does that print out a number?
What if you add a print(word) right below the for statement to see whether it contains anything?
Since you are looping through those search words and page numbers, if those are blank, you won't get any results.
I tried adding print(word), but there was no output, only the header. It seems I'll have to try a different PDF.
Posts: 7,320
Threads: 123
Joined: Sep 2016
Jan-08-2022, 10:09 AM
(This post was last modified: Jan-08-2022, 10:09 AM by snippsat.)
pdfplumber may be a better tool for this, and it is more up to date;
it's been 5-6 years since PyPDF2 was updated, and it has stuff that is non-Pythonic, like CamelCase🐫 usage everywhere.
So I can write a quick test for this task, using this sample pdf.
import pdfplumber
pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                  f'at index <{content.index(search_word)}>')
Output:
<end> found at page number <2> at index <349>
Posts: 379
Threads: 2
Joined: Jan 2021
@snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?
Posts: 7,320
Threads: 123
Joined: Sep 2016
(Jan-08-2022, 03:07 PM)BashBedlam Wrote: @snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?
You can split up the content, then loop over that list.
Example:
import pdfplumber
pdf_file = "sample.pdf"
search_word = 'more'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text().split()
        for word in content:
            if search_word in word:
                print(f'<{search_word}> found on page {page_nr}')
Output:
<more> found on page 1
<more> found on page 1
<more> found on page 1
.....
<more> found on page 2
E.g. collect the hits in a list and count the words found.
import pdfplumber
pdf_file = "sample.pdf"
search_word = 'more'
lst = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages):
        content = pg.extract_text().split()
        for word in content:
            if search_word in word:
                lst.append(search_word)
print(f'Search word <{search_word}> found {len(lst)} times in {pdf_file}')
Output:
Search word <more> found 40 times in sample.pdf
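A possible variation on the counting idea, as a rough sketch only: the split-and-loop approach counts each word that contains the search string, so "moreover" counts as a hit for "more", and the match is case-sensitive. The hypothetical helper `count_per_page` below works on a list of already-extracted page strings (assumed to come from something like `extract_text()`), ignores case, and can optionally require whole-word matches. Unlike the loop above, it counts every occurrence, not every containing word.

```python
import re
from collections import Counter


def count_per_page(page_texts, search_word, whole_word=False):
    """Count case-insensitive occurrences of search_word on each page.

    Returns a Counter mapping 1-based page numbers to hit counts.
    Pages with no extracted text (None or "") are skipped.
    """
    pattern = re.escape(search_word)
    if whole_word:
        # \b keeps "more" from matching inside "moreover".
        pattern = rf"\b{pattern}\b"
    regex = re.compile(pattern, re.IGNORECASE)

    counts = Counter()
    for page_nr, text in enumerate(page_texts, 1):
        hits = len(regex.findall(text or ""))
        if hits:
            counts[page_nr] = hits
    return counts


if __name__ == "__main__":
    pages = ["More and more, moreover.", None, "No hits here", "more"]
    print(dict(count_per_page(pages, "more")))                   # {1: 3, 4: 1}
    print(dict(count_per_page(pages, "more", whole_word=True)))  # {1: 2, 4: 1}
```

Summing the values of the returned Counter gives the total for the whole file, which corresponds to the `len(lst)` figure in the example above.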
Posts: 30
Threads: 8
Joined: Feb 2021
(Jan-08-2022, 10:09 AM)snippsat Wrote: pdfplumber may be a better tool for this and more updated, [...]
THIS actually works!!! Awesome, thank you so much. Genius. Can I use this?
I'm still hoping I can fix the original code, though.
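On the original PyPDF2 script: one likely reason it writes only the header is a case mismatch. The page text is lowercased before searching, but the search word 'Earth' has a capital E, so `re.finditer` can never find it. Below is a minimal sketch of the search-and-write step with that fixed. `find_matches` and `write_results` are hypothetical helper names, and `pages_text` is assumed to be the list of strings the script builds with `extractText()`. If the PDF has no extractable text at all, this will still produce only the header.

```python
import csv
import io
import re


def find_matches(pages_text, searchwords):
    """Yield (page_number, matched_text) for every case-insensitive hit."""
    for word in searchwords:
        # re.escape treats the search word literally; IGNORECASE replaces
        # the .lower() call, which could never match a capitalized word.
        regex = re.compile(re.escape(word), re.IGNORECASE)
        for page_index, text in enumerate(pages_text):
            for match in regex.finditer(text or ""):
                yield page_index + 1, match.group()


def write_results(pages_text, searchwords, out_file):
    """Write the header and one CSV row per match to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["Page Number", "Search"])
    writer.writerows(find_matches(pages_text, searchwords))


if __name__ == "__main__":
    pages = ["The Earth orbits the Sun.", "Nothing here.", "earth and EARTH"]
    buffer = io.StringIO()
    write_results(pages, ["Earth"], buffer)
    print(buffer.getvalue())
```

With a real file, the same call would be wrapped in `with open('Results.csv', 'w', newline='') as f:` and passed `f` instead of the in-memory buffer.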