Python Forum
Search text in PDF and output its page number.
Hello,

I'm trying to use this script to search for a word or phrase in a PDF and output each match along with its page number.

Please note: I'm using Windows.

import PyPDF2
import re

pdfFileObj = open(r'C:\python\document.pdf', mode='rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
number_of_pages = pdfReader.numPages

pages_text = []
words_start_pos = {}
words = {}

searchwords = ['Earth']

with open('Results.csv', 'w') as f:
    f.write('{0},{1}\n'.format("Page Number", "Search"))
    for word in searchwords:
        for page in range(number_of_pages):
            print(page)
            pages_text.append(pdfReader.getPage(page).extractText())
            words_start_pos[page] = [dwg.start() for dwg in re.finditer(word, pages_text[page].lower())]
            words[page] = [pages_text[page][value:value + len(word)] for value in words_start_pos[page]]
        for page in words:
            for i in range(0, len(words[page])):
                if str(words[page][i]) != 'nan':
                    f.write('{0},{1}\n'.format(page + 1, words[page][i]))
                    print(page, words[page][i])
Problem:
No results are showing.

https://imgur.com/a/yAD97mN
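
One likely cause, judging only from the script in this post: each page's text is lowercased before searching, but the search word 'Earth' keeps its capital E, so re.finditer can never match it. Below is a minimal sketch of a case-insensitive search. It uses plain strings in place of the extracted PDF pages so it runs without a PDF; the helper name find_word_pages and the sample page texts are hypothetical and not part of the original script.

```python
import re


def find_word_pages(pages_text, word):
    """Return (page_number, matched_text) pairs, with pages numbered from 1."""
    # re.escape makes the word match literally; re.IGNORECASE makes
    # 'Earth' match 'earth', 'EARTH', and so on.
    pattern = re.compile(re.escape(word), re.IGNORECASE)
    results = []
    for index, text in enumerate(pages_text):
        for match in pattern.finditer(text):
            results.append((index + 1, match.group()))
    return results


# Stand-in for the text extracted from each PDF page (hypothetical data).
sample_pages = ["The Earth orbits the Sun.", "No match here.", "earth and Earth"]
print(find_word_pages(sample_pages, "Earth"))
# → [(1, 'Earth'), (3, 'earth'), (3, 'Earth')]
```

To use this with the script in this post, the list of page texts could be built once, before the loop over search words, from the same getPage(page).extractText() calls. Building it once also avoids a second problem: the original appends to pages_text inside the word loop, so with more than one search word pages_text[page] would still point at the text collected for the first word. If the list still yields no matches, it may be worth printing one page's extracted text, since a PDF made of scanned images has no text layer to extract.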


Messages In This Thread
Search text in PDF and output its page number. - by atomxkai - Jan-07-2022, 06:56 PM
