Search text in PDF and output its page number.

Search text in PDF and output its page number. - atomxkai - Jan-07-2022

Hello,

I'm trying to use this script to search for a word or phrase in a PDF and output the matching text along with its page number.

Please note: I'm using Windows.

import PyPDF2
import re
 
pdfFileObj=open(r'C:\python\document.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
 
pages_text=[]
words_start_pos={}
words={}
 
searchwords=['Earth']
 
with open('Results.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Page Number", "Search"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])

Problem:
No results are shown.

https://imgur.com/a/yAD97mN

RE: Search text in PDF and output its page number. - cubangt - Jan-07-2022

When you run the code, are your variables being populated?
For example, you have print(page); does that print out a number?
What if you add a print(word) right below the for statement to see whether it contains anything?

Since you are looping through those search words and page numbers, if those are blank, then you won't get any results.

RE: Search text in PDF and output its page number. - BashBedlam - Jan-07-2022

Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.
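
A rough way to check this before reaching for OCR is to see which pages yield any text at all. The sketch below is only an illustration, not something posted in the thread: `pages_without_text` and `extract_page_texts` are made-up helper names, and the reader calls are the same PyPDF2 1.x ones used in the original script (later PyPDF2 releases renamed them, so they may need adjusting).

```python
def pages_without_text(page_texts):
    """Return the 1-based numbers of pages whose extracted text is empty or None."""
    return [
        page_nr
        for page_nr, text in enumerate(page_texts, 1)
        if not (text or "").strip()
    ]


def extract_page_texts(pdf_path):
    """Extract the text of every page, using the PyPDF2 1.x API from the original post."""
    import PyPDF2  # third-party; imported here so the helper above has no dependency

    with open(pdf_path, "rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


# Hypothetical usage (the path is the one from the original post):
# texts = extract_page_texts(r"C:\python\document.pdf")
# print("Pages with no extractable text:", pages_without_text(texts))
```

If every page shows up in that list, the PDF is probably made of scanned images, and no text search will find anything without OCR.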

RE: Search text in PDF and output its page number. - atomxkai - Jan-08-2022

(Jan-07-2022, 11:21 PM)BashBedlam Wrote: Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.

I learned that it mostly works on Mac rather than on Windows? But I will try a different PDF. Thanks.

RE: Search text in PDF and output its page number. - atomxkai - Jan-08-2022

(Jan-07-2022, 07:19 PM)cubangt Wrote: When you run the code, are your variables being populated?
For example, you have print(page); does that print out a number?
What if you add a print(word) right below the for statement to see whether it contains anything?

Since you are looping through those search words and page numbers, if those are blank, then you won't get any results.

I tried adding print(word), but there was no output, only the header. It seems I'll have to try a different PDF.

RE: Search text in PDF and output its page number. - snippsat - Jan-08-2022

pdfplumber may be a better tool for this, and it is more up to date.
It's been 5-6 years since PyPDF2 was updated, and it has stuff that is non-Pythonic, like CamelCase🐫 usage everywhere.

So I can write a quick test for this task, using this sample PDF.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                    f'at index <{content.index(search_word)}>')

Output:
<end> found at page number <2> at index <349>

RE: Search text in PDF and output its page number. - BashBedlam - Jan-08-2022

@snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?

RE: Search text in PDF and output its page number. - snippsat - Jan-08-2022

(Jan-08-2022, 03:07 PM)BashBedlam Wrote: @snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?

You can split the content into words and then loop over that list.
Example:

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'more'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text().split()
        for word in content:
            if search_word in word:
                print(f'<{search_word}> found on page {page_nr}')

Output:
<more> found on page 1
<more> found on page 1
<more> found on page 1
.....
<more> found on page 2

E.g. collect the matches in a list and count the words found.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'more'
lst = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages):
        content = pg.extract_text().split()
        for word in content:
            if search_word in word:
                lst.append(search_word)

print(f'Search word <{search_word}> found {len(lst)} times in {pdf_file}')

Output:
Search word <more> found 40 times in sample.pdf
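
One thing to keep in mind with the split approach is that `search_word in word` also counts longer words such as "anymore", and it does not say where on the page each hit is. As a rough alternative, `re.finditer` can report the position of every occurrence. This is just a sketch under assumptions: `find_occurrences` and its `whole_word` flag are names invented here, and the function works on a list of already extracted page texts (for example what `pg.extract_text()` returns for each page).

```python
import re


def find_occurrences(page_texts, search_word, whole_word=False):
    """Return a (page_number, index) tuple for every occurrence of search_word."""
    pattern = re.escape(search_word)  # treat the search word as literal text
    if whole_word:
        pattern = rf"\b{pattern}\b"  # skip hits inside longer words
    hits = []
    for page_nr, text in enumerate(page_texts, 1):
        # "text or ''" guards against pages where no text was extracted
        for match in re.finditer(pattern, text or ""):
            hits.append((page_nr, match.start()))
    return hits


# Hypothetical usage with pdfplumber:
# with pdfplumber.open("sample.pdf") as pdf:
#     texts = [pg.extract_text() for pg in pdf.pages]
# for page_nr, index in find_occurrences(texts, "more"):
#     print(f"<more> found on page {page_nr} at index {index}")
```

The number of hits is then just `len(find_occurrences(texts, "more"))`, so the same helper covers both the listing and the counting case.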

RE: Search text in PDF and output its page number. - atomxkai - Jan-10-2022

(Jan-08-2022, 10:09 AM)snippsat Wrote: pdfplumber may be a better tool for this, and it is more up to date.
It's been 5-6 years since PyPDF2 was updated, and it has stuff that is non-Pythonic, like CamelCase🐫 usage everywhere.

So I can write a quick test for this task, using this sample PDF.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                    f'at index <{content.index(search_word)}>')
Output:
<end> found at page number <2> at index <349>

THIS actually works!!! Awesome, thank you so much. Genius. Can I use this?

Still hoping I can fix the original code, though.
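
One likely reason the original script writes only the header: it searches for 'Earth' (capital E) inside `pages_text[page].lower()`, and lowercased text can never contain a capital letter, so `re.finditer` finds nothing even when the PDF has a proper text layer. A minimal sketch of a fix, assuming the page texts have already been extracted into a list, is to match case-insensitively instead of lowercasing. `find_words` and `write_results` are names made up for this example.

```python
import csv
import re


def find_words(pages_text, searchwords):
    """Return (page_number, matched_text) rows, ignoring case when matching."""
    rows = []
    for word in searchwords:
        # re.escape keeps characters like "." in a search word literal
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        for page_nr, text in enumerate(pages_text, 1):
            for match in pattern.finditer(text or ""):
                rows.append((page_nr, match.group()))
    return rows


def write_results(rows, out_file):
    """Write the header and the result rows as CSV to an open text file."""
    writer = csv.writer(out_file)
    writer.writerow(["Page Number", "Search"])
    writer.writerows(rows)


# Hypothetical usage with the reader from the original post:
# pages_text = [pdfReader.getPage(i).extractText() for i in range(number_of_pages)]
# with open('Results.csv', 'w', newline='') as f:
#     write_results(find_words(pages_text, ['Earth']), f)
```

Extracting each page once, before looping over the search words, also avoids appending the same pages to `pages_text` again for every additional word.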