Search text in PDF and output its page number.

***snippsat*** · (This post was last modified: Jan-10-2022, 05:51 AM by snippsat.)

(Jan-10-2022, 02:13 AM)atomxkai Wrote: THIS Actually Works!!! awesome thank you so much. genius. can i use this?

Sure, you can use it.

(Jan-10-2022, 02:13 AM)atomxkai Wrote: still hoping if i can fix the original code though.

I would not bother with this, given the date of the author's last commit.

Quote:The last commit was from 2018, there are 87 open PRs and 263 open issues.
It seems as if the project is dead.

I could probably fix your code, but as you can see, there may be issues that cause problems, since the project seems dead.

atomxkai · Jan-10-2022, 10:25 AM

I understand. I think pdfplumber is the next thing to use. Hopefully this one will stay longer. Thanks!

atomxkai · Jan-10-2022, 10:43 AM

with open('Output.csv', 'w') as pdf:
    pdf.write('{0},{1}\n'.format("Page Number", "Search Word"))
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                f'at index <{content.index(search_word)}>')

If I may ask one more thing: I'm now trying to write these results to a CSV file with their values. I'm still learning Python syntax.

I would really appreciate your help. Thanks.

***snippsat*** · Jan-10-2022, 01:56 PM

(Jan-10-2022, 10:43 AM)atomxkai Wrote: If I can still ask, I'm trying to print these results now in CSV format with values. I'm still trying to understand and learning about syntax in python.

Here are a couple of examples.

This one saves the same result that we printed out before.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf, open('result.txt', 'w') as f_out:
    for page_nr, pg in enumerate(pdf.pages, 1):
        # some pdfplumber versions return None for a page with no text
        content = pg.extract_text() or ''
        if search_word in content:
            # end each result with a newline so several matches stay on separate lines
            f_out.write(
                f'<{search_word}> found at page number <{page_nr}> '
                f'at index <{content.index(search_word)}>\n'
            )

Output:
<end> found at page number <2> at index <349>
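
Note that content.index(search_word) reports only the first match on a page. If every offset is wanted, a minimal sketch could loop with str.find instead. This is not part of the original answer, and find_all_offsets is a hypothetical helper name.

```python
def find_all_offsets(content, search_word):
    """Return the start index of every occurrence of search_word in content."""
    offsets = []
    if not search_word:
        return offsets
    start = content.find(search_word)
    while start != -1:
        offsets.append(start)
        # continue just past the current match start,
        # so overlapping occurrences are reported as well
        start = content.find(search_word, start + 1)
    return offsets


# Made-up text standing in for one page's extract_text() result
page_text = "The end. In the end, it ends."
print(find_all_offsets(page_text, "end"))
```

The returned list could be written in place of the single index in the f-string above.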

This one uses the csv module.
It writes a header row with the results underneath.

import pdfplumber
import csv

pdf_file = "sample.pdf"
search_word = 'end'
header = ['search_word', 'page_nummer']
with pdfplumber.open(pdf_file) as pdf, open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    # write the header once, before the loop, so it is not repeated for every match
    writer.writerow(header)
    for page_nr, pg in enumerate(pdf.pages, 1):
        # some pdfplumber versions return None for a page with no text
        content = pg.extract_text() or ''
        if search_word in content:
            writer.writerow([search_word, page_nr])

Output:
search_word,page_nummer
end,2

atomxkai · (This post was last modified: Jan-11-2022, 06:19 PM by atomxkai.)

This is just awesome! It works, thank you so much, snippsat!

I haven't tried the first example yet, but I will let you know.

May I ask a few more questions here to refine the script?

Also, what I learned here is that I can open the PDF and open the output file for writing in a single with statement. That is what was bugging me before.
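
For reference, that one-line form is ordinary Python: a single with statement accepts several context managers separated by commas and closes all of them when the block ends. A minimal sketch using two plain text files follows. The file names and copy_upper are made up for illustration.

```python
import os
import tempfile


def copy_upper(src_path, dst_path):
    """Read src_path and write its upper-cased content to dst_path."""
    # Both files are opened by one with statement and both are closed
    # automatically when the block exits, even if an error occurs.
    with open(src_path) as f_in, open(dst_path, 'w') as f_out:
        for line in f_in:
            f_out.write(line.upper())


# Small demonstration inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as f:
        f.write('hello\nworld\n')
    copy_upper(src, dst)
    with open(dst) as f:
        result = f.read()

print(result)
```

The pdfplumber scripts in this thread use the same pattern, with pdfplumber.open(...) as the first context manager and open(...) as the second.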

atomxkai · Jan-14-2022, 10:27 PM

Hi, it is working, but I edited some lines and I'm trying to print an index for each row, and I cannot seem to do it. Thanks!

import pdfplumber
import csv
 
pdf_file = "sample.pdf"
search_word = 'end'
header = ['Index','Search_word','Page_number']

with pdfplumber.open(pdf_file) as pdf,open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    pages = pdf.pages
    writer.writerow(header) #change
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            recordValues = [search_word]
            for recordIndex, countValue in enumerate(recordValues,start=1):
                writer.writerow([recordIndex, search_word, page_nr])

***snippsat*** · (This post was last modified: Jan-15-2022, 12:59 AM by snippsat.)

(Jan-14-2022, 10:27 PM)atomxkai Wrote: Hi, it is working but I edit some lines and I'm trying to print the index for each row and I cannot seem to do it. Thanks!

You should add some print() calls to see what is going on and to understand the code.
There are no rows at this point: content = pg.extract_text() returns each page as one whole string.
PDF doesn't have a concept of lines of text (or any higher order collection of characters).

If you want rows or lines, you have to split the content at \n.
I guess you could do something like this, if that is what you mean?

import pdfplumber
import csv

pdf_file = "sample.pdf"
search_word = 'end'
header = ['Index','Search_word','Page_number']

with pdfplumber.open(pdf_file) as pdf,open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    pages = pdf.pages
    writer.writerow(header) #change
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text().split('\n')
        for index, line in enumerate(content, 1):
            print(index, line)

Output:
1  A Simple PDF File
2  This is a small demonstration .pdf file - 
3  just for use in the Virtual Mechanics tutorials. More text. And more 
4  text. And more text. And more text. And more text. 
5  And more text. And more text. And more text. And more text. And more 
6  text. And more text. Boring, zzzzz. And more text. And more text. And 
7  more text. And more text. And more text. And more text. And more text. 
8  And more text. And more text. 
9  And more text. And more text. And more text. And more text. And more 
10  text. And more text. And more text. Even more. Continued on page 2 ...
1  Simple PDF File 2 
2  ...continued from page 1. Yet more text. And more text. And more text. 
3  And more text. And more text. And more text. And more text. And more 
4  text. Oh, how boring typing this stuff. But not as boring as watching 
5  paint dry. And more text. And more text. And more text. And more text. 
6  Boring.  More, a little more text. The end, and just as well.
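
Building on the split at \n, one way to report the line on which a word appears is to search each split line. This is only a sketch: find_in_lines is a hypothetical helper, and it takes the page texts as plain strings so it can be tried without a PDF. With pdfplumber you would pass the result of pg.extract_text() for each page.

```python
def find_in_lines(page_texts, search_word):
    """Yield (page_nr, line_nr, line) for each line containing search_word."""
    for page_nr, content in enumerate(page_texts, 1):
        # treat a page without extractable text as empty
        for line_nr, line in enumerate((content or '').split('\n'), 1):
            if search_word in line:
                yield page_nr, line_nr, line.strip()


# Two made-up "pages" standing in for extract_text() results
pages = [
    "A Simple PDF File\nMore text. Boring, zzzzz.",
    "Simple PDF File 2\nThe end, and just as well.",
]
for hit in find_in_lines(pages, 'end'):
    print(hit)
```

One caveat with this approach: a phrase that wraps across two extracted lines will not be found, because each line is searched on its own.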

atomxkai · Jan-18-2022, 11:11 PM

Thanks for the sample script.

But I wanted to print output like this.

A PDF file with 50 pages
search_word = 'page (2)'

print output:

index, search_word, page number
1, page (2), 8
2, page (2), 12
3, page (2), 27

Thank you.

***snippsat*** · (This post was last modified: Jan-19-2022, 12:17 AM by snippsat.)

What you describe is what my code does in post #7.

import pdfplumber

pdf_file = "sample.pdf"
search_word = 'text'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                    f'at index <{content.index(search_word)}>')

Output:
<text> found at page number <1> at index <119>
<text> found at page number <2> at index <56>

So you can rearrange the f-string in print() to get the output you want.

atomxkai · Jan-21-2022, 03:45 AM

I was able to print a running index. Thanks again, snippsat.

    index = 1
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
...
        if search_word in content:
            print(index, search_word, page_nr)
            writer.writerow([index, search_word, page_nr])
            index = index + 1
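
The same running index can also be produced with enumerate() instead of a manual counter. The sketch below is an assumption about how the pieces could fit together, not the poster's full script: matching_pages and write_hits are hypothetical names, and the page texts are plain strings so the example runs without a PDF.

```python
import csv
import io


def matching_pages(page_texts, search_word):
    """Yield the 1-based page number of every page containing search_word."""
    for page_nr, content in enumerate(page_texts, 1):
        # skip pages with no extractable text (None or empty string)
        if content and search_word in content:
            yield page_nr


def write_hits(page_texts, search_word, f_out):
    """Write Index, Search_word, Page_number rows as CSV to f_out."""
    writer = csv.writer(f_out)
    writer.writerow(['Index', 'Search_word', 'Page_number'])
    # enumerate() numbers the hits, replacing the manual index counter
    for index, page_nr in enumerate(matching_pages(page_texts, search_word), 1):
        writer.writerow([index, search_word, page_nr])


# Made-up page texts standing in for pg.extract_text() results
texts = ['intro', 'see page (2)', '', 'page (2) again']
buf = io.StringIO()
write_hits(texts, 'page (2)', buf)
print(buf.getvalue())
```

With pdfplumber, the page texts would come from (pg.extract_text() for pg in pdf.pages), and f_out would be a file opened with newline='' as in the earlier posts.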
