Posts: 7,324
Threads: 123
Joined: Sep 2016
Jan-10-2022, 05:51 AM
(This post was last modified: Jan-10-2022, 05:51 AM by snippsat.)
(Jan-10-2022, 02:13 AM)atomxkai Wrote: THIS Actually Works!!! awesome thank you so much. genius. can i use this?
Sure, you can use it.
(Jan-10-2022, 02:13 AM)atomxkai Wrote: still hoping if i can fix the original code though.
I would not bother with this; look at the last commit by the author.
Quote:The last commit was from 2018, there are 87 open PRs and 263 open issues.
It seems as if the project is dead.
I could probably fix your code, but as you see it could keep running into issues, since the project seems dead.
Posts: 30
Threads: 8
Joined: Feb 2021
I understand. I think pdfplumber is the next thing to use. Hopefully this one will stay longer. Thanks!
Posts: 30
Threads: 8
Joined: Feb 2021
with open('Output.csv', 'w') as pdf:
    pdf.write('{0},{1}\n'.format("Page Number", "Search Word"))
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                  f'at index <{content.index(search_word)}>')
If I can still ask, I'm trying to print these results now in CSV format with values. I'm still trying to understand and learning about syntax in python.
Would really appreciate your help. Thanks.
Posts: 7,324
Threads: 123
Joined: Sep 2016
(Jan-10-2022, 10:43 AM)atomxkai Wrote: If I can still ask, I'm trying to print these results now in CSV format with values. I'm still trying to understand and learning about syntax in python.
Here are a couple of examples.
This one saves the same result that we printed before to a text file.
import pdfplumber
pdf_file = "sample.pdf"
search_word = 'end'
with pdfplumber.open(pdf_file) as pdf, open('result.txt', 'w') as f_out:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            # End each result with a newline so several matches stay on separate lines
            f_out.write(
                f'<{search_word}> found at page number <{page_nr}> '
                f'at index <{content.index(search_word)}>\n'
            )
Output: <end> found at page number <2> at index <349>
Here is one using the csv module.
It writes a header row with the result under it.
import pdfplumber
import csv
pdf_file = "sample.pdf"
search_word = 'end'
header = ['search_word', 'page_nummer']
with pdfplumber.open(pdf_file) as pdf, open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    # Write the header once, before the loop, so it is not repeated for every match
    writer.writerow(header)
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            writer.writerow([search_word, page_nr])
Output: search_word,page_nummer
end,2
Posts: 30
Threads: 8
Joined: Feb 2021
Jan-11-2022, 06:19 PM
(This post was last modified: Jan-11-2022, 06:19 PM by atomxkai.)
this is just awesome! this works, thank you so much snippsat!
i haven't tried the first output but will let you know.
wondering if i can still ask few more questions here to refine the script?
also, what i learned here is that i can open the PDF and the output file together in a single with statement. this is what was bugging me before.
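For anyone reading along, here is a minimal sketch of that single with statement holding two context managers. It uses plain text files instead of a PDF so it runs without pdfplumber; the file names and the sample text are made up for the demo.

```python
import os
import tempfile

# Hypothetical file names, created in a temporary directory just for this demo.
with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, "input.txt")
    dst_path = os.path.join(tmp_dir, "output.txt")

    # Create a small input file to read from.
    with open(src_path, "w") as f:
        f.write("first line\nthe end\n")

    # One with statement, two context managers separated by a comma.
    # Both files are closed when the block exits, even if an error occurs.
    with open(src_path) as f_in, open(dst_path, "w") as f_out:
        for line in f_in:
            if "end" in line:
                f_out.write(line)

    # Read the output back to check what was written.
    with open(dst_path) as f:
        result = f.read()

print(result)
```

The comma form behaves like nesting one with block inside another, just with less indentation.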
Posts: 30
Threads: 8
Joined: Feb 2021
Hi, it is working, but I edited some lines and I'm trying to print the index for each row and I cannot seem to do it. Thanks!
import pdfplumber
import csv
pdf_file = "sample.pdf"
search_word = 'end'
header = ['Index','Search_word','Page_number']
with pdfplumber.open(pdf_file) as pdf,open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    pages = pdf.pages
    writer.writerow(header) #change
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            recordValues = [search_word]
            for recordIndex, countValue in enumerate(recordValues,start=1):
                writer.writerow([recordIndex, search_word, page_nr])
Posts: 7,324
Threads: 123
Joined: Sep 2016
Jan-15-2022, 12:58 AM
(This post was last modified: Jan-15-2022, 12:59 AM by snippsat.)
(Jan-14-2022, 10:27 PM)atomxkai Wrote: Hi, it is working but I edit some lines and I'm trying to print the index for each row and I cannot seem to do it. Thanks!
You should add some print() calls to see what is going on and understand the code.
Right now there are no rows: content = pg.extract_text() returns each page as one whole string.
PDF doesn't have a concept of lines of text (or any higher order collection of characters).
If you want rows or lines, you have to split the content at \n.
I guess you could do something like this, if that is what you mean?
import pdfplumber
import csv
pdf_file = "sample.pdf"
search_word = 'end'
header = ['Index','Search_word','Page_number']
with pdfplumber.open(pdf_file) as pdf,open('result1.csv', 'w', newline='') as f_out:
    writer = csv.writer(f_out)
    pages = pdf.pages
    writer.writerow(header) #change
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text().split('\n')
        for index, line in enumerate(content, 1):
            print(index, line)
Output:
1 A Simple PDF File
2 This is a small demonstration .pdf file -
3 just for use in the Virtual Mechanics tutorials. More text. And more
4 text. And more text. And more text. And more text.
5 And more text. And more text. And more text. And more text. And more
6 text. And more text. Boring, zzzzz. And more text. And more text. And
7 more text. And more text. And more text. And more text. And more text.
8 And more text. And more text.
9 And more text. And more text. And more text. And more text. And more
10 text. And more text. And more text. Even more. Continued on page 2 ...
1 Simple PDF File 2
2 ...continued from page 1. Yet more text. And more text. And more text.
3 And more text. And more text. And more text. And more text. And more
4 text. Oh, how boring typing this stuff. But not as boring as watching
5 paint dry. And more text. And more text. And more text. And more text.
6 Boring. More, a little more text. The end, and just as well.
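Building on the split at \n, here is a rough sketch of how the line index could be combined with the word search. It works on a plain list of page strings rather than a real PDF, so find_word_lines and sample_pages are hypothetical names for illustration and not part of pdfplumber.

```python
def find_word_lines(pages_text, search_word):
    """Return (page_nr, line_nr, line) for every line containing search_word.

    pages_text is assumed to be a list with one string per page, for
    example the result of calling pg.extract_text() on each page.
    """
    hits = []
    for page_nr, content in enumerate(pages_text, 1):
        # Treat a missing page text as empty so split() never fails.
        for line_nr, line in enumerate((content or "").split("\n"), 1):
            if search_word in line:
                hits.append((page_nr, line_nr, line))
    return hits


# Shortened stand-in for the text of the two sample pages.
sample_pages = [
    "A Simple PDF File\nAnd more text. Continued on page 2 ...",
    "Simple PDF File 2\nBoring. More, a little more text. The end, and just as well.",
]

hits = find_word_lines(sample_pages, "end")
for page_nr, line_nr, line in hits:
    print(page_nr, line_nr, line)
```

Each tuple could be passed to writer.writerow() instead of print() to get one CSV row per matching line.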
Posts: 30
Threads: 8
Joined: Feb 2021
Thanks for the sample script.
But I wanted to print like this.
PDF files with 50 pages
search_word = 'page (2)'
print output:
index, search_word, page number
1, page (2), 8
2, page (2), 12
3, page (2), 27
thank you.
Posts: 7,324
Threads: 123
Joined: Sep 2016
Jan-19-2022, 12:17 AM
(This post was last modified: Jan-19-2022, 12:17 AM by snippsat.)
What you describe is what my code does in post #7.
import pdfplumber
pdf_file = "sample.pdf"
search_word = 'text'
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for page_nr, pg in enumerate(pages, 1):
        content = pg.extract_text()
        if search_word in content:
            print(f'<{search_word}> found at page number <{page_nr}> '\
                  f'at index <{content.index(search_word)}>')
Output:
<text> found at page number <1> at index <119>
<text> found at page number <2> at index <56>
So you can rearrange the f-string in print() to get the output you want.
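One thing to keep in mind is that content.index(search_word) only reports the first occurrence on a page. If every position is wanted, a small helper along these lines could collect them; find_all is a made-up name for this sketch, and the sample string is just for the demo.

```python
def find_all(content, search_word):
    """Return the start index of every occurrence of search_word in content."""
    positions = []
    start = content.find(search_word)
    while start != -1:
        positions.append(start)
        # Continue one character after the last hit (this also allows overlapping matches).
        start = content.find(search_word, start + 1)
    return positions


positions = find_all("And more text. And more text. Even more.", "text")
print(positions)
```

Unlike index(), str.find() returns -1 instead of raising ValueError when nothing is found, which keeps the loop simple.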
Posts: 30
Threads: 8
Joined: Feb 2021
I was able to run an index print. Thanks again snippsat.
index = 1
for page_nr, pg in enumerate(pages, 1):
    content = pg.extract_text()
    ...
    if search_word in content:
        print(index, search_word, page_nr)
        writer.writerow([index, search_word, page_nr])
        index = index + 1
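The manual counter works fine. As an alternative, here is a sketch that lets enumerate() do the counting by looping over only the matching pages. It takes a list of page strings and writes to an in-memory buffer so it runs without a PDF; write_matches and the sample data are assumptions for illustration.

```python
import csv
import io


def write_matches(pages_text, search_word, f_out):
    """Write one CSV row (index, search_word, page_nr) per matching page."""
    writer = csv.writer(f_out)
    writer.writerow(["Index", "Search_word", "Page_number"])
    # Generator of the page numbers whose text contains the search word.
    matching_pages = (
        page_nr
        for page_nr, content in enumerate(pages_text, 1)
        if search_word in (content or "")
    )
    # The second enumerate() supplies the running index, starting at 1.
    for index, page_nr in enumerate(matching_pages, 1):
        writer.writerow([index, search_word, page_nr])


# In-memory stand-in for the output file and the page texts.
buffer = io.StringIO()
write_matches(
    ["intro", "see page (2)", "nothing", "page (2) again"],
    "page (2)",
    buffer,
)
csv_text = buffer.getvalue()
print(csv_text)
```

With a real PDF, pages_text could be built from pg.extract_text() for each page, and f_out would be the file opened with newline=''.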