Search text in PDF and output its page number.

atomxkai · (This post was last modified: Jan-08-2022, 12:17 AM by atomxkai.)

Hello,

I'm trying to use this script to search a word or text in a PDF and will output the text and its page number.

Please note: I'm using Windows.

import PyPDF2
import re
 
pdfFileObj=open(r'C:\python\document.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
 
pages_text=[]
words_start_pos={}
words={}
 
searchwords=['Earth']
 
with open('Results.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Page Number", "Search"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])

Problem:
No results are shown.

https://imgur.com/a/yAD97mN

cubangt · Jan-07-2022, 07:19 PM

When you run the logic, are your variables being populated?
For example, you have print(page); does that print out a number?
What if you add a print(word) right below the for statement to see whether it contains anything?

Since you are looping through those search words and page numbers, if those are blank, then you won't get any results.

BashBedlam · Jan-07-2022, 11:21 PM

Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.
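
BashBedlam's point can be checked before searching. The following is a minimal sketch, not part of the original posts: `pages_without_text` is a hypothetical helper that reports which pages yielded no text at all, and `extract_page_texts` reuses the PyPDF2 calls from the first post (`PdfFileReader`, `numPages`, `getPage`, `extractText`), which assumes the same older PyPDF2 release used there. If every page is reported, the PDF is probably made of scanned images and would need OCR.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of every page, using the PyPDF2 calls from the first post."""
    import PyPDF2  # imported here so the helper below works without PyPDF2 installed

    with open(pdf_path, mode="rb") as pdf_file:
        reader = PyPDF2.PdfFileReader(pdf_file)
        return [reader.getPage(i).extractText() for i in range(reader.numPages)]


def pages_without_text(page_texts):
    """Hypothetical helper: 1-based numbers of pages whose extracted text is empty."""
    return [nr for nr, text in enumerate(page_texts, 1) if not (text or "").strip()]


# Example with hand-written page texts instead of a real PDF:
print(pages_without_text(["Some text", "", "   \n"]))  # prints [2, 3]
```

With a real file this would be called as `pages_without_text(extract_page_texts(r'C:\python\document.pdf'))`.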

atomxkai · (This post was last modified: Jan-08-2022, 12:26 AM by atomxkai.)

(Jan-07-2022, 11:21 PM)BashBedlam Wrote: Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.

I learned that it works mostly on Mac rather than on Windows? But I will try a different PDF. Thanks.

atomxkai · Jan-08-2022, 12:28 AM

(Jan-07-2022, 07:19 PM)cubangt Wrote: When you run the logic, are your variables being populated?
For example, you have print(page); does that print out a number?
What if you add a print(word) right below the for statement to see whether it contains anything?

Since you are looping through those search words and page numbers, if those are blank, then you won't get any results.

I tried adding print(word), but there was no output, only the header. It seems I'll have to try a different PDF.

snippsat · (This post was last modified: Jan-08-2022, 10:09 AM by snippsat.)

pdfplumber may be a better tool for this and is more up to date;
it's been 5-6 years since PyPDF2 was updated, and it has stuff that is unpythonic, like CamelCase🐫 usage everywhere.

So I can write a quick test for this task, using this sample pdf.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                    f'at index <{content.index(search_word)}>')

Output:
<end> found at page number <2> at index <349>

BashBedlam · Jan-08-2022, 03:07 PM

@snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?

snippsat · Jan-08-2022, 05:23 PM

(Jan-08-2022, 03:07 PM)BashBedlam Wrote: @snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?

You can split the content into words and then loop over that list.
Example:

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'more'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text().split()
        for word in content:
            if search_word in word:
                print(f'<{search_word}> found on page {page_nr}')

Output:
<more> found on page 1
<more> found on page 1
<more> found on page 1
.....
<more> found on page 2

E.g. collect the matches in a list and count the words found.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'more'
lst = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages):
        content = pg.extract_text().split()
        for word in content:
            if search_word in word:
                lst.append(search_word)

print(f'Search word <{search_word}> found {len(lst)} times in {pdf_file}')

Output:
Search word <more> found 40 times in sample.pdf
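
For BashBedlam's question, another option is to keep each page as one string and let `re.finditer` report every match together with its index. This is a sketch rather than snippsat's method: `find_occurrences` and `search_pdf` are hypothetical helpers, the match is case-insensitive, and every occurrence of the substring is counted, so the totals can differ from the word-split approach when a single word contains the search term more than once.

```python
import re


def find_occurrences(page_texts, search_word):
    """Yield (page_nr, index) for every case-insensitive match of search_word."""
    pattern = re.compile(re.escape(search_word), re.IGNORECASE)
    for page_nr, text in enumerate(page_texts, 1):
        # Treat a page with no extractable text as an empty string.
        for match in pattern.finditer(text or ""):
            yield page_nr, match.start()


def search_pdf(pdf_file, search_word):
    """Run find_occurrences over the pages of a PDF opened with pdfplumber."""
    import pdfplumber  # imported here so find_occurrences can be tried without it

    with pdfplumber.open(pdf_file) as pdf:
        texts = [pg.extract_text() for pg in pdf.pages]
    return list(find_occurrences(texts, search_word))


# Example with hand-written page texts instead of a real PDF:
for page_nr, index in find_occurrences(["more and More", "some more"], "more"):
    print(f"<more> found on page {page_nr} at index {index}")
```

With the sample file from the posts, `search_pdf("sample.pdf", "more")` would return a list of `(page_nr, index)` pairs.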

atomxkai · Jan-10-2022, 02:13 AM

(Jan-08-2022, 10:09 AM)snippsat Wrote: pdfplumber may be a better tool for this and is more up to date;
it's been 5-6 years since PyPDF2 was updated, and it has stuff that is unpythonic, like CamelCase🐫 usage everywhere.

So I can write a quick test for this task, using this sample pdf.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                    f'at index <{content.index(search_word)}>')
Output:
<end> found at page number <2> at index <349>

This actually works! Awesome, thank you so much. Genius. Can I use this?

I'm still hoping I can fix the original code, though.
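
On the original script: one thing that stops it from ever matching, whatever the PDF, is that the page text is lowercased with `.lower()` while the search word `'Earth'` keeps its capital letter, so `re.finditer` finds nothing and only the header row is written. The sketch below is one possible fix, not a tested replacement: `search_pages` and `write_results` are hypothetical names, the match is made case-insensitive with `re.IGNORECASE`, and `pages_text` is assumed to be the list of page strings the script already builds with `extractText()`. It does not help with image-only PDFs.

```python
import csv
import re


def search_pages(pages_text, searchwords):
    """Return (page_number, matched_text) rows for every hit of every search word."""
    rows = []
    for word in searchwords:
        # re.escape keeps characters such as '.' literal; IGNORECASE replaces .lower().
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        for page, text in enumerate(pages_text):
            for match in pattern.finditer(text or ""):
                rows.append((page + 1, match.group()))
    return rows


def write_results(rows, csv_path="Results.csv"):
    """Write the rows to a CSV file with the same header as the original script."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Page Number", "Search"])
        writer.writerows(rows)


# The original comparison never matches: capitalised pattern, lowercased text.
print(list(re.finditer("Earth", "The Earth is round".lower())))  # prints []

# Example with hand-written page texts instead of a real PDF:
print(search_pages(["The Earth is round. earth!", "Mars"], ["Earth"]))
```

Extracting each page's text once, before looping over the search words, also avoids the original script appending the same pages to `pages_text` again for every additional word.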
