Posts: 8
Threads: 2
Joined: Nov 2018
Hey,
I want to extract the line in which a specific keyword is found. For text documents this is very simple: loop through the text and print the matching line.
So far I have done it with this:
with open('text.txt','r') as f1, open("keywords.txt") as f2:
    st = set(map(str.rstrip, f2))
    for line in f1:
        if any(word in st for word in line.split()):
            print(line)
So this works great.
But now I need to print the line out of a PDF file, and I only get the whole text out of it. Here is my code:
import PyPDF4
import re
pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']
pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
pages_text=[]
words_start_pos={}
words={}
with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, pages_text))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
    for line in f1:
        if any(word in st for word in line.split()):
            print(line)
I guess I see the whole text because of the print(pages_text) call. Also, the last loop with print(line) catches nothing, I think.
Can anyone see what I am doing wrong?
Posts: 4,801
Threads: 77
Joined: Jan 2018
At line 15, st = set(map(str.rstrip, pages_text)), I think it should be f2 instead of pages_text.
Posts: 8
Threads: 2
Joined: Nov 2018
Yeah, right. That's what I had before. But both versions print the whole text inside []. Below is how it looks corrected. I still don't understand why I get the whole text, not just the line.
Corrected:
import PyPDF4
import re
pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']
pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
pages_text=[]
words_start_pos={}
words={}
with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, f2))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
    for line in f1:
        if any(word in st for word in line.split()):
            print(line)
Posts: 8
Threads: 2
Joined: Nov 2018
OK, I have written a new version, because that was way too complicated. With the following code I can get the whole page printed. But again, it doesn't print only the line that matches the keyword.
import PyPDF4
import re
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
print(pages_text)
for line in pages_text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),
I don't get it. When I do this with a text file, it works: I only get the wanted line printed. So what is the difference?
import re
text = open("text.txt", "r")
for line in text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),
Posts: 4,801
Threads: 77
Joined: Jan 2018
Nov-27-2018, 11:18 AM
(This post was last modified: Nov-27-2018, 11:18 AM by Gribouillis.)
If pages_text is a single string instead of a sequence of lines, it won't work, because for ... in pages_text iterates over every character in the string. You could use a StringIO, which works like a file:
import io
for line in io.StringIO(pages_text):
    ...
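To illustrate the difference, here is a minimal sketch that uses a made-up string in place of the real extractText() result (the sample text and the names chars and lines are assumptions, not taken from the PDF in question):

```python
import io

# Made-up stand-in for what extractText() might return:
# one single string that contains newline characters.
pages_text = "first line\nnumber _akt: 123\nlast line\n"

# Iterating over a str yields single characters.
chars = [item for item in pages_text]

# Iterating over a StringIO yields lines, each keeping its trailing newline.
lines = [item for item in io.StringIO(pages_text)]

print(chars[:5])
print(lines)
```

A regex like "(.*)_akt:(.*)" can never match a single character, which would explain why the loop over the plain string prints nothing.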
Posts: 8
Threads: 2
Joined: Nov 2018
Sounds great. I was wondering myself whether the output is one object or a list of lines.
I tried io.StringIO like the following:
import PyPDF4
import re
import io
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)
for line in io.StringIO(pages_text):
    if re.match("(.*)_akt:(.*)", line):
        print(line),
But the result is the same. That's why I commented out the line print(pages_text). But then there is no output.
Posts: 4,801
Threads: 77
Joined: Jan 2018
What happens if you print(repr(line)) for every line regardless of the regex?
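A minimal sketch of that check, assuming a made-up string in place of the real extractText() result (the sample text and the name shown are only for illustration):

```python
# Made-up stand-in for the extracted page text; a real PDF will differ.
sample_text = "Invoice 42 \n_akt: 2018/123\n\nTotal\t10"

# repr() shows each line exactly as Python sees it, including
# trailing spaces, tabs, and empty lines that print() would hide.
shown = [repr(line) for line in sample_text.split("\n")]
for item in shown:
    print(item)
```

If the keyword never shows up in any of the printed lines, the problem is in the extracted text rather than in the regex.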
Posts: 7,324
Threads: 123
Joined: Sep 2016
Nov-27-2018, 11:57 AM
(This post was last modified: Nov-27-2018, 11:57 AM by snippsat.)
Look at print(repr(pages_text)) (edit: also posted by @Gribouillis).
When I test this I see \n (newline).
So split on \n to get lines.
for line in pages_text.split('\n'):
    print(line)
Testing with my document, taking out the lines that start with Adobe:
import PyPDF4
import re
import io
pdfFileObj = open(r'pdf-sample.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
#print(repr(pages_text))
for line in pages_text.split('\n'):
    if line.startswith('Adobe'):
        print(line)
Output:
Adobe Acrobat PDF Files
Adobe® Portable Document Format (PDF) is a universal file format that preserves all
Adobe PDF is an ideal format for electronic document distribution as it overcomes the
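The same idea can be combined with the keyword list from the first post. This is only a sketch under assumptions: lines_with_keywords is a hypothetical helper name, the sample text is made up, and splitlines() is swapped in for split('\n') so other line endings are handled too.

```python
def lines_with_keywords(text, keywords):
    """Return the lines of text that contain at least one of the keywords."""
    matches = []
    for line in text.splitlines():
        # Substring test, so a keyword is found even when it is
        # glued to other characters on the same line.
        if any(keyword in line for keyword in keywords):
            matches.append(line)
    return matches


# Made-up stand-in for the string returned by extractText().
sample_text = "Page header\n_aktz: 12 O 345/18\nsome other text\nAZ 99/2018\n"

print(lines_with_keywords(sample_text, ["_aktz:", "AZ"]))
```

With a real PDF you would pass the extracted page text instead of sample_text; how cleanly the text splits into lines depends on the document.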
Posts: 8
Threads: 2
Joined: Nov 2018
Nov-27-2018, 12:04 PM
(This post was last modified: Nov-27-2018, 12:05 PM by equaliser.)
OK, looks great so far. I get every line as a string. But now, how can I loop through this? I have this now:
import PyPDF4
import re
import io
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)
for line in io.StringIO(pages_text):
    print(repr(line))
    # if re.match("(.*)_akt:(.*)", line):
    #     print(line),
Shall I loop like this?
...
lines = repr(line)
for x in lines:
    if re.match("(.*)_akt:(.*)", lines):
        print(x),
Posts: 7,324
Threads: 123
Joined: Sep 2016
Nov-27-2018, 12:45 PM
(This post was last modified: Nov-27-2018, 12:45 PM by snippsat.)
(Nov-27-2018, 12:04 PM)equaliser Wrote: Shall I loop like this:
No, it's like I showed in my post: you split on \n to go line by line.
So if I use regex to take out the lines that start with PDF, it looks like this.
for line in pages_text.split('\n'):
    if re.match(r"^PDF", line):
        print(line)
Output:
PDF files
PDF files always display
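Note that re.match only matches at the beginning of the string, so it suits "starts with" tests, while re.search finds the pattern anywhere in the line. A small sketch with made-up sample lines (the list contents and variable names are assumptions):

```python
import re

# Made-up lines standing in for pages_text.split('\n').
sample_lines = ["PDF files always display", "see _akt: 123", "a PDF in the middle"]

# re.match anchors at the start of each line.
starts_with_pdf = [line for line in sample_lines if re.match(r"PDF", line)]

# re.search looks for the pattern anywhere in the line.
contains_pdf = [line for line in sample_lines if re.search(r"PDF", line)]
contains_akt = [line for line in sample_lines if re.search(r"_akt:", line)]

print(starts_with_pdf)
print(contains_pdf)
print(contains_akt)
```

For a keyword such as _akt: that can appear in the middle of a line, re.search with just the keyword is usually simpler than wrapping it in (.*) groups.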