Python Forum
Search text in PDF and output its page number.
#11
(Jan-10-2022, 02:13 AM)atomxkai Wrote: THIS Actually Works!!! awesome thank you so much. genius. can i use this?
Sure, you can use it.

(Jan-10-2022, 02:13 AM)atomxkai Wrote: still hoping if i can fix the original code though.
I would not bother with this; see the author's last commit.
Quote:The last commit was from 2018, there are 87 open PRs and 263 open issues.
It seems as if the project is dead.
I could probably fix your code, but as you can see, a project that seems dead can have issues that cause problems.
#12
I understand. I think pdfplumber is the next thing to use. Hopefully this one will stay longer. Thanks!
#13
with open('Output.csv', 'w') as pdf:
    pdf.write('{0},{1}\n'.format("Page Number", "Search Word"))
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                f'at index <{content.index(search_word)}>')
If I can still ask, I'm now trying to write these results in CSV format with values. I'm still trying to understand and learn Python syntax.

Would really appreciate your help. Thanks.
#14
(Jan-10-2022, 10:43 AM)atomxkai Wrote: If I can still ask, I'm now trying to write these results in CSV format with values. I'm still trying to understand and learn Python syntax.
Here are a couple of examples.

This one saves the same result that we printed before.
import pdfplumber

pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf, open('result.txt', 'w') as f_out:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        # A page with no extractable text may give None in some pdfplumber versions
        content = pg.extract_text() or ''
        if search_word in content:
            # End each record with a newline so several matches stay on separate lines
            f_out.write(
                f'<{search_word}> found at page number <{page_nr}> '
                f'at index <{content.index(search_word)}>\n'
            )
Output:
<end> found at page number <2> at index <349>
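Note that content.index(search_word) reports only the first match on a page. If every position is wanted, a small helper along these lines could work. find_all is a hypothetical name, not part of pdfplumber, and sample_text is a made-up string standing in for the text of one page.

```python
def find_all(text, word):
    """Return the start index of every non-overlapping occurrence of word in text."""
    positions = []
    if not word:
        return positions
    start = text.find(word)
    while start != -1:
        positions.append(start)
        # Continue searching just past the match that was found
        start = text.find(word, start + len(word))
    return positions


# Stand-in for the string returned by pg.extract_text()
sample_text = "The end is near. This is the end."
print(find_all(sample_text, "end"))
```

The loop relies on str.find returning -1 when nothing more is found, so no exception handling is needed, unlike str.index.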
Here is one using the csv module.
It writes a header row with the results under it.
import pdfplumber
import csv

pdf_file = "sample.pdf"
search_word = 'end'
header = ['search_word', 'page_number']
with pdfplumber.open(pdf_file) as pdf, open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    # Write the header once, before the loop, so it is not repeated for every match
    writer.writerow(header)
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        # A page with no extractable text may give None in some pdfplumber versions
        content = pg.extract_text() or ''
        if search_word in content:
            writer.writerow([search_word, page_nr])
Output:
search_word,page_number
end,2
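As an alternative sketch, csv.DictWriter from the standard library writes the header through writeheader() and then takes one dict per row, which keeps the column names and values together. The rows list and the field names here are illustrative only, and an in-memory buffer stands in for the CSV file.

```python
import csv
import io

# Illustrative result rows; in the real script these would come from the page loop
rows = [
    {"search_word": "end", "page_number": 2},
]

f_out = io.StringIO()  # stands in for open('result1.csv', 'w', newline='')
writer = csv.DictWriter(f_out, fieldnames=["search_word", "page_number"])
writer.writeheader()    # header row is written exactly once
writer.writerows(rows)  # one CSV line per dict

print(f_out.getvalue())
```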
#15
This is just awesome! This works, thank you so much snippsat!

I haven't tried the first example yet but will let you know.

Wondering if I can still ask a few more questions here to refine the script?

Also, what I learned here is that I can open the PDF and the output file in a single with statement. This is what was bugging me before.
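For reference, this is plain Python rather than something specific to pdfplumber: a with statement can list several context managers separated by commas, and each one is closed when the block ends. A minimal sketch with two ordinary text files; the file names and the temporary directory are made up for the illustration.

```python
import os
import tempfile

# Hypothetical file names inside a throwaway directory
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "source.txt")
dst_path = os.path.join(workdir, "result.txt")

with open(src_path, "w") as f:
    f.write("first line\nthe end\n")

# Open the input and the output in one with statement
with open(src_path) as f_in, open(dst_path, "w") as f_out:
    for line_nr, line in enumerate(f_in, 1):
        if "end" in line:
            f_out.write(f"{line_nr},{line}")

# Both files are closed once the block is left
print(f_in.closed, f_out.closed)
```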
#16
Hi, it is working, but I edited some lines and am trying to print an index for each row, and I cannot seem to do it. Thanks!

import pdfplumber
import csv
 
pdf_file = "sample.pdf"
search_word = 'end'
header = ['Index','Search_word','Page_number']

with pdfplumber.open(pdf_file) as pdf,open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    pages = pdf.pages
    writer.writerow(header) #change
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            recordValues = [search_word]
            for recordIndex, countValue in enumerate(recordValues,start=1):
                writer.writerow([recordIndex, search_word, page_nr])
#17
(Jan-14-2022, 10:27 PM)atomxkai Wrote: Hi, it is working, but I edited some lines and am trying to print an index for each row, and I cannot seem to do it. Thanks!
You should add some print() calls to see what is going on and understand the code.
Right now there are no rows: content = pg.extract_text() gives each page as one whole string.
A PDF doesn't have a concept of lines of text (or any higher-order collection of characters).

If you want rows or lines, you have to split the content at \n.
I guess you could do something like this, if that is what you mean?
import pdfplumber
import csv

pdf_file = "sample.pdf"
search_word = 'end'
header = ['Index','Search_word','Page_number']

with pdfplumber.open(pdf_file) as pdf,open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    pages = pdf.pages
    writer.writerow(header) #change
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text().split('\n')
        for index, line in enumerate(content, 1):
            print(index, line)
Output:
1 A Simple PDF File
2 This is a small demonstration .pdf file -
3 just for use in the Virtual Mechanics tutorials. More text. And more
4 text. And more text. And more text. And more text.
5 And more text. And more text. And more text. And more text. And more
6 text. And more text. Boring, zzzzz. And more text. And more text. And
7 more text. And more text. And more text. And more text. And more text.
8 And more text. And more text.
9 And more text. And more text. And more text. And more text. And more
10 text. And more text. And more text. Even more. Continued on page 2 ...
1 Simple PDF File 2
2 ...continued from page 1. Yet more text. And more text. And more text.
3 And more text. And more text. And more text. And more text. And more
4 text. Oh, how boring typing this stuff. But not as boring as watching
5 paint dry. And more text. And more text. And more text. And more text.
6 Boring. More, a little more text. The end, and just as well.
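Once the page text is split at \n, each line can be checked on its own. A rough sketch of that idea, using a list of plain strings in place of the text pdfplumber would return for each page; find_in_lines and page_texts are names invented for this example.

```python
def find_in_lines(page_texts, word):
    """Yield (page_nr, line_nr, line) for every line that contains word."""
    for page_nr, text in enumerate(page_texts, 1):
        for line_nr, line in enumerate(text.split("\n"), 1):
            if word in line:
                yield page_nr, line_nr, line


# Stand-ins for the strings returned by pg.extract_text() on two pages
page_texts = [
    "A Simple PDF File\nMore text. And more",
    "Simple PDF File 2\nThe end, and just as well.",
]
for page_nr, line_nr, line in find_in_lines(page_texts, "end"):
    print(page_nr, line_nr, line)
```

Both counters start at 1 through the second argument of enumerate, so the numbers match what a reader sees in the PDF.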
#18
Thanks for the sample script.

But I wanted to print it like this.

PDF file with 50 pages
search_word = 'page (2)'

print output:

index, search_word, page number
1, page (2), 8
2, page (2), 12
3, page (2), 27

thank you.
#19
What you describe is what my code does in post #7.
import pdfplumber

pdf_file = "sample.pdf"
search_word = 'text'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                    f'at index <{content.index(search_word)}>')
Output:
<text> found at page number <1> at index <119>
<text> found at page number <2> at index <56>
So you can rearrange the f-string in print() to get the output you want.
#20
I was able to print a running index. Thanks again snippsat.

    index = 1
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        ...
        if search_word in content:
            print(index, search_word, page_nr)
            writer.writerow([index, search_word, page_nr])
            index = index + 1
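The manual counter can also be replaced by enumerate over the matching pages only. A sketch of that variant: page_texts is a made-up list of strings standing in for the pg.extract_text() results, matching_pages is a hypothetical helper, and the rows go to an in-memory buffer rather than result1.csv.

```python
import csv
import io


def matching_pages(page_texts, word):
    """Yield the 1-based page number of every page whose text contains word."""
    for page_nr, text in enumerate(page_texts, 1):
        if word in text:
            yield page_nr


# Stand-ins for the text of four pages
page_texts = ["intro", "see page (2)", "nothing here", "page (2) again"]
search_word = "page (2)"

f_out = io.StringIO()  # stands in for the opened CSV file
writer = csv.writer(f_out)
writer.writerow(["Index", "Search_word", "Page_number"])
# enumerate numbers only the pages that matched, starting at 1
for index, page_nr in enumerate(matching_pages(page_texts, search_word), 1):
    writer.writerow([index, search_word, page_nr])

print(f_out.getvalue())
```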