Hey,
I want to extract the line in which a specific keyword is found. For text documents this is very simple: loop through the text and print the matching line.
So far I have done it with this:
with open('text.txt','r') as f1, open("keywords.txt") as f2:
    st = set(map(str.rstrip, f2))
    for line in f1:
        if any(word in st for word in line.split()):
            print(line)
So this works great.
But now I need to print the matching line from a PDF file, and I only get the whole text out of it. Here is my code:
import PyPDF4
import re
pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']
pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
pages_text=[]
words_start_pos={}
words={}
with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, pages_text))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
    for line in f1:
        if any(word in st for word in line.split()):
            print(line)
I guess I can see the whole text because of the print(pages_text) call. Also, the last loop with print(line) catches nothing, I think.
Can anyone see what I am doing wrong?
At line 15 (the st = set(map(str.rstrip, pages_text)) line) I think it should be f2 instead of pages_text.
Yeah, right. That's what I had before. But both versions print the whole text inside []. Anyway, this is how it looks corrected. I still don't understand why I get the whole text and not just the line.
Corrected:
import PyPDF4
import re
pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']
pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
pages_text=[]
words_start_pos={}
words={}
with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, f2))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
    for line in f1:
        if any(word in st for word in line.split()):
            print(line)
OK, I have written a new version, because that was way too complicated. With the following code I can get the whole page printed. But again, it doesn't print only the line that matches the keyword.
import PyPDF4
import re
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
print(pages_text)
for line in pages_text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),
I don't get it. When I do this with a text file, it works: I only get the wanted line printed. So what is the difference?
import re
text = open("text.txt", "r")
for line in text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),
If pages_text is a single string instead of a sequence of lines, it won't work, because for ... in pages_text iterates over every character in the string. You could use a StringIO, which works like a file:
import io
for line in io.StringIO(pages_text):
    ...
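To illustrate the point, here is a minimal sketch of the difference. The sample string is made up and only stands in for whatever extractText() returns; it is an assumption, not real PDF output.

```python
import io

# Hypothetical stand-in for the string returned by extractText()
pages_text = "first line\n_akt: 12345\nlast line\n"

# Iterating over a str yields single characters
chars = [c for c in pages_text][:5]

# Iterating over a StringIO yields whole lines, like a file object
lines = [line for line in io.StringIO(pages_text)]

print(chars)
print(lines)
```

Note that each line from StringIO keeps its trailing newline, just as lines read from a text file do.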
Sounds great. I was wondering myself whether the output is one object or a list of lines.
I tried io.StringIO like the following:
import PyPDF4
import re
import io
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)
for line in io.StringIO(pages_text):
    if re.match("(.*)_akt:(.*)", line):
        print(line),
But the result is the same. That's why I commented out the line print(pages_text). But then there is no output.
What happens if you print(repr(line)) for every line, regardless of the regex?
Look at print(repr(pages_text)) (edit: also posted by @Gribouillis).
When I test this I see \n (newline), so split on \n to get lines.
for line in pages_text.split('\n'):
    print(line)
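As a hedged aside, str.splitlines() is a close alternative to split('\n'): it also handles \r\n line endings and does not leave an empty string at the end. The sketch below uses a made-up sample string (an assumption, not the real PDF text) to show the difference.

```python
# Hypothetical sample standing in for extractText() output
pages_text = "Adobe Acrobat PDF Files\r\nsome other line\nPDF files\n"

# split('\n') keeps a stray '\r' and adds a trailing empty string
by_split = pages_text.split('\n')

# splitlines() strips every common line ending
by_splitlines = pages_text.splitlines()

print(repr(by_split))
print(repr(by_splitlines))
```

Whether this matters depends on what the PDF extraction actually returns, which repr() makes visible.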
A test with my document, taking out the lines that start with Adobe:
import PyPDF4
import re
import io
pdfFileObj = open(r'pdf-sample.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
#print(repr(pages_text))
for line in pages_text.split('\n'):
    if line.startswith('Adobe'):
        print(line)
Output:
Adobe Acrobat PDF Files
Adobe® Portable Document Format (PDF) is a universal file format that preserves all
Adobe PDF is an ideal format for electronic document distribution as it overcomes the
OK, looks great so far. I get every line as a string. But now, how can I loop through this? I have this now:
import PyPDF4
import re
import io
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)
for line in io.StringIO(pages_text):
    print(repr(line))
    # if re.match("(.*)_akt:(.*)", line):
    #     print(line),
Shall I loop like this:
...
lines = repr(line)
for x in lines:
    if re.match("(.*)_akt:(.*)", lines):
        print(x),
?
(Nov-27-2018, 12:04 PM) equaliser wrote: Shall I loop like this:
No, it's like I showed in my post: you split on \n to get the text line by line.
So if I use a regex to take out lines that start with PDF, it looks like this.
for line in pages_text.split('\n'):
    if re.match(r"^PDF", line):
        print(line)
Output:
PDF files
PDF files always display
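Putting the thread together, a keyword search over the extracted text could look roughly like the sketch below. The helper name matching_lines and the sample text are hypothetical; only the search words come from the original post. It uses a substring test, so a keyword matches anywhere in a line, unlike the first text-file snippet, which only matches whole whitespace-separated words.

```python
def matching_lines(text, keywords):
    """Return the lines of text that contain at least one keyword."""
    return [
        line
        for line in text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


# Hypothetical page text; a real run would pass pageObj.extractText() instead
sample_text = "Report 2018\n_aktz: 4711/18\nno match here\nAZ 0815\n"
searchwords = ['_aktz:', 'AZ']

hits = matching_lines(sample_text, searchwords)
for line in hits:
    print(line)
```

This only works if the extracted text really contains newlines between lines, so checking it with print(repr(pages_text)) first is still worthwhile.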