Python Forum

Full Version: Search text in PDF and output its page number.
Can pdfplumber search part of a word then print results with the whole word?

example:

search word: Page (

Output:
search word: Page (1) found on page 1
search word: Page (2) found on page 2
search word: Page (3) found on page 3
...
(Jan-21-2022, 03:51 AM)atomxkai Wrote: Can pdfplumber search part of a word then print results with the whole word?
That part is up to you, since pdfplumber just returns plain text.
For this task you can use a regex.
E.g. the search pattern r"\bpage\s\d+\b" will find page 1, page 2 or page 50.
It matches the word page, then \s (a whitespace character), then \d (a digit) followed by + (the preceding \d repeated one or more times); \b marks a word boundary on each side.
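A quick way to see what that pattern picks up, without opening a PDF, is to run it on a plain string. The sample text below is made up for illustration, and the re.IGNORECASE variant is an optional extra that is not part of the pattern above.

```python
import re

# Same pattern as above: word boundary, "page", one whitespace
# character, one or more digits, word boundary.
pattern = re.compile(r"\bpage\s\d+\b")

# Made-up sample text, for illustration only.
sample = "See page 1 and page 50, but not pages 7, page7 or Page 3."

matches = pattern.findall(sample)
print(matches)

# Optional variant (an assumption, not in the pattern above):
# ignore case so that "Page 3" is picked up too.
pattern_ci = re.compile(r"\bpage\s\d+\b", re.IGNORECASE)
matches_ci = pattern_ci.findall(sample)
print(matches_ci)
```

Note that the pattern is case-sensitive as written, so a capitalised "Page" is skipped unless you pass re.IGNORECASE.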
Example.
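import re

import pdfplumber

pdf_file = "sample.pdf"
# "page", one whitespace character, then one or more digits, as a whole word
pattern = re.compile(r"\bpage\s\d+\b")
with pdfplumber.open(pdf_file) as pdf:
    for page_nr, pg in enumerate(pdf.pages, 1):
        # extract_text() may give None or an empty string for a page without text
        content = pg.extract_text() or ""
        for match in pattern.finditer(content):
            # matched text, page number, offset of this match in the page text
            print(match.group(), page_nr, match.start())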
Output:
page 2 1 568
page 1 2 39
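For the original question (searching for a fragment such as Page ( and printing the whole token it belongs to), one possible approach is to escape the fragment with re.escape and let \S* extend the match to the non-whitespace characters around it. This is only a sketch: find_whole_words is a made-up helper name, and the hard-coded page texts stand in for what pg.extract_text() would return.

```python
import re


def find_whole_words(page_texts, fragment):
    """Yield (whole_match, page_number) for every hit of fragment.

    page_texts: iterable of strings (or None), one per page; a
    hypothetical stand-in for the results of pg.extract_text().
    """
    # re.escape makes characters like "(" literal; the surrounding
    # non-whitespace quantifiers grow the match to the full token.
    pattern = re.compile(r"\S*" + re.escape(fragment) + r"\S*")
    for page_nr, content in enumerate(page_texts, 1):
        # Treat a page without text as an empty string.
        for match in pattern.finditer(content or ""):
            yield match.group(), page_nr


# Made-up page texts, for illustration only.
pages = ["Intro text Page (1) more text", "Other text Page (2)", None]
results = list(find_whole_words(pages, "Page ("))
for word, page_nr in results:
    print(f"search word: {word} found on page {page_nr}")
```

With pdfplumber you could pass (pg.extract_text() for pg in pdf.pages) as page_texts inside the with block.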