Python Forum
Search text in PDF and output its page number.
#1
Hello,

I'm trying to use this script to search for a word or phrase in a PDF and output the matched text and its page number.

Please note: I'm using Windows.

import PyPDF2
import re
 
pdfFileObj=open(r'C:\python\document.pdf',mode='rb')
pdfReader=PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
 
pages_text=[]
words_start_pos={}
words={}
 
searchwords=['Earth']
 
with open('Results.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Page Number", "Search"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page]=[dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page]=[pages_text[page][value:value+len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0,len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page+1, words[page][i]))
                    print(page, words[page][i])
Problem:
No results are showing.

Screenshot: https://imgur.com/a/yAD97mN
#2
When you run the logic, are your variables being populated?
For example, you have print(page); does that print a number?
What if you add a print(word) right below the for statement to see whether it contains anything?

Since you are looping through the search words and page numbers, you won't get any results if those are blank.
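
One thing worth checking while you trace those variables: the script lowercases each page's text before searching, but the search word 'Earth' starts with a capital letter, so re.finditer may never find a match and only the header row gets written. Below is a minimal sketch of a case-insensitive version. It assumes the page texts have already been extracted into a list, and write_matches and its parameters are made-up names for illustration, not part of PyPDF2.

```python
import csv
import io
import re
from typing import Iterable, List, Optional, TextIO


def write_matches(page_texts: List[Optional[str]],
                  searchwords: Iterable[str],
                  out: TextIO) -> int:
    """Write one CSV row per match and return the number of matches."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Page Number", "Search"])
    rows = 0
    for word in searchwords:
        # IGNORECASE lets 'Earth' match 'earth' and 'EARTH' without
        # lowercasing the page text; re.escape treats the word literally.
        pattern = re.compile(re.escape(word), re.IGNORECASE)
        for page_nr, text in enumerate(page_texts, 1):
            # 'text or ""' skips pages where nothing was extracted.
            for match in pattern.finditer(text or ""):
                writer.writerow([page_nr, match.group()])
                rows += 1
    return rows


if __name__ == "__main__":
    # Stand-in page texts; in the real script these would come from the PDF.
    buffer = io.StringIO()
    found = write_matches(["Planet Earth", "earth and EARTH"], ["Earth"], buffer)
    print(found)
    print(buffer.getvalue())
```

To write to a file instead of a buffer, pass a file opened with open('Results.csv', 'w', newline='') as the out argument.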
#3
Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.
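
A rough way to tell whether a PDF falls into that image-only category is to look at what the extraction call returns for each page: if it is empty for every page, there is probably no text layer to search. The sketch below assumes you have already collected each page's extracted text in a list, and pages_without_text is a hypothetical helper name.

```python
from typing import List, Optional


def pages_without_text(page_texts: List[Optional[str]]) -> List[int]:
    """Return 1-based page numbers whose extracted text is empty or missing."""
    empty_pages = []
    for page_nr, text in enumerate(page_texts, 1):
        # Treat None, '' and whitespace-only strings as "nothing extracted".
        if not text or not text.strip():
            empty_pages.append(page_nr)
    return empty_pages


if __name__ == "__main__":
    # Stand-in data; in practice this list would come from the PDF library.
    sample = ["The Earth is round.", "", None, "   \n"]
    print(pages_without_text(sample))  # [2, 3, 4]
```

If every page shows up in the result, the search code is probably not the problem and OCR would be the next thing to try.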
#4
(Jan-07-2022, 11:21 PM)BashBedlam Wrote: Try a different PDF. Your code works as expected for me with some PDFs but not others. Some PDFs are made entirely of images of words so only OCR would be able to retrieve those.

I read that it works better on Mac than on Windows? But I will try a different PDF. Thanks.
#6
(Jan-07-2022, 07:19 PM)cubangt Wrote: When you run the logic, are your variables being populated?
For example, you have print(page); does that print a number?
What if you add a print(word) right below the for statement to see whether it contains anything?

Since you are looping through the search words and page numbers, you won't get any results if those are blank.

I tried adding print(word), but there is no output, only the header. I'll try a different PDF.
#7
pdfplumber may be a better tool for this, and it is more up to date.
It's been 5-6 years since PyPDF2 was updated, and it has un-Pythonic conventions such as CamelCase🐫 usage everywhere.

Here is a quick test for this task, using a sample PDF.
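import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    # Number pages from 1 so the output matches what a PDF viewer shows.
    for page_nr, pg in enumerate(pdf.pages, 1):
        # Fall back to '' in case no text could be extracted from the page.
        content = pg.extract_text() or ''
        if search_word in content:
            # index() gives the position of the first occurrence on the page.
            print(f'<{search_word}> found at page number <{page_nr}> '
                  f'at index <{content.index(search_word)}>')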
Output:
<end> found at page number <2> at index <349>
#8
@snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?
#9
(Jan-08-2022, 03:07 PM)BashBedlam Wrote: @snippsat I liked your post. Just out of curiosity, how would you find a second or third occurrence of the same word on the same page?
You can split the content into words and then loop over that list.
Example:
import pdfplumber

pdf_file = "sample.pdf"
search_word = 'more'
with pdfplumber.open(pdf_file) as pdf:
    for page_nr, pg in enumerate(pdf.pages, 1):
        # Fall back to '' in case no text could be extracted from the page.
        content = (pg.extract_text() or '').split()
        for word in content:
            # Substring test, so 'moreover' also counts as a hit for 'more'.
            if search_word in word:
                print(f'<{search_word}> found on page {page_nr}')
Output:
<more> found on page 1
<more> found on page 1
<more> found on page 1
.....
<more> found on page 2
Or, e.g., collect the matches in a list and count the words found.
import pdfplumber

pdf_file = "sample.pdf"
search_word = 'more'
lst = []
with pdfplumber.open(pdf_file) as pdf:
    for pg in pdf.pages:
        # Fall back to '' in case no text could be extracted from the page.
        content = (pg.extract_text() or '').split()
        for word in content:
            if search_word in word:
                # One list entry per matching word, across all pages.
                lst.append(search_word)

print(f'Search word <{search_word}> found {len(lst)} times in {pdf_file}')
Output:
Search word <more> found 40 times in sample.pdf
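
If the position of each occurrence matters too, not just the count, one option is re.finditer, which yields every match on a page together with its character offset. This is only a sketch on plain strings: find_occurrences is a made-up helper name, and the page texts are assumed to have been extracted already (for example with extract_text()).

```python
import re
from typing import List, Tuple


def find_occurrences(page_texts: List[str], search_word: str) -> List[Tuple[int, int]]:
    """Return (page number, character index) for every occurrence of search_word."""
    # re.escape makes sure the search word is matched literally.
    pattern = re.compile(re.escape(search_word))
    hits = []
    for page_nr, content in enumerate(page_texts, 1):
        for match in pattern.finditer(content):
            hits.append((page_nr, match.start()))
    return hits


if __name__ == "__main__":
    # Stand-in page texts instead of a real PDF.
    pages = ["more and more", "no hits here", "once more"]
    for page_nr, index in find_occurrences(pages, "more"):
        print(f"<more> found on page {page_nr} at index {index}")
```

Like the split() version, this matches substrings, so 'moreover' would still count as a hit for 'more'.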
#10
(Jan-08-2022, 10:09 AM)snippsat Wrote: pdfplumber may be a better tool for this, and it is more up to date.
It's been 5-6 years since PyPDF2 was updated, and it has un-Pythonic conventions such as CamelCase🐫 usage everywhere.

Here is a quick test for this task, using a sample PDF.
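import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf:
    # Number pages from 1 so the output matches what a PDF viewer shows.
    for page_nr, pg in enumerate(pdf.pages, 1):
        # Fall back to '' in case no text could be extracted from the page.
        content = pg.extract_text() or ''
        if search_word in content:
            # index() gives the position of the first occurrence on the page.
            print(f'<{search_word}> found at page number <{page_nr}> '
                  f'at index <{content.index(search_word)}>')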
Output:
<end> found at page number <2> at index <349>


This actually works! Awesome, thank you so much. Genius. Can I use this?

I'm still hoping I can fix the original code, though.