Python Forum
Extract Line from PDF
#1
Hey,

I want to extract the line in which a specific keyword is found. For text files this is simple: I loop through the file and print the matching lines.

So far I have done it like this:

with open('text.txt','r') as f1, open("keywords.txt") as f2:
    st = set(map(str.rstrip, f2))
    for line in f1:
        if any(word in st for word in  line.split()):
            print(line)
So this works great.

But now I need to print the matching line from a PDF file, and I only get the whole text out of it. Here is my code:

import PyPDF4
import re

pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']

pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]
words_start_pos={}
words={}

with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, pages_text))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
        for line in f1:
            if any(word in st for word in  line.split()):
                print(line)
I guess I see the whole text because of print(pages_text). The last loop with print(line) catches nothing, I think.

Can anyone see what I am doing wrong?
Reply
#2
At line 15 I think it should be f2 instead of pages_text.
Reply
#3
Yeah, right. That's what I had before. But both versions print the whole text inside []. Anyway, this is what it looks like corrected. I don't understand why I get the whole text and not just the line.

Corrected:
import PyPDF4
import re
 
pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']
 
pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
 
pages_text=[]
words_start_pos={}
words={}
 
with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, f2))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
        for line in f1:
            if any(word in st for word in  line.split()):
                print(line)
Reply
#4
OK, I have written a new version, because that was way too complicated. With the following code I can get the whole page printed. But again it doesn't print only the line that matches the keyword.

import PyPDF4
import re

pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
print(pages_text)

for line in pages_text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),
I don't get it. When I do this with a text file, it works: I only get the wanted line printed. So what is the difference?

import re

text = open("text.txt", "r")

for line in text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),
Reply
#5
If pages_text is a single string instead of a sequence of lines, it won't work, because for ... in pages_text iterates over every character in the string. You could use a StringIO, which works like a file:
import io
for line in io.StringIO(pages_text):
    ...
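To make the difference concrete, here is a small sketch. The two-line sample string is made up and only stands in for whatever extractText() really returns for your PDF:

```python
import io

# Hypothetical stand-in for the string returned by extractText()
pages_text = "first line\n_akt: 12345\n"

# Iterating over a str yields single characters, so a per-"line"
# regex only ever sees one character at a time
chars = [item for item in pages_text]

# Wrapping the str in StringIO yields whole lines, like a file object
lines = [item for item in io.StringIO(pages_text)]

print(chars[:5])
print(lines)
```

Note that the lines coming out of StringIO keep their trailing newline, just like lines read from a real file.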
Reply
#6
Sounds great. I was wondering myself whether the output is one object or a list of lines.

I tried io.StringIO like the following:

import PyPDF4
import re
import io

pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)

for line in io.StringIO(pages_text):
    if re.match("(.*)_akt:(.*)", line):
        print(line),
But the result is the same. That's why I commented out the print(pages_text) line. But then there is no output.
Reply
#7
What happens if you print(repr(line)) for every line regardless of the regex?
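As a rough illustration of why repr() helps, here is a sketch with an invented string (the content is an assumption; your PDF will give something different). repr() makes trailing spaces and stray characters such as \r visible, which a plain print() hides:

```python
# Hypothetical extracted text; real PDF output may differ
pages_text = "Header \n_akt: 12345\r\nfooter"

# repr() shows each line exactly as Python stores it, including
# trailing whitespace and control characters
reprs = [repr(line) for line in pages_text.split("\n")]

for r in reprs:
    print(r)
```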
Reply
#8
Look at print(repr(pages_text)) (edit: also posted by @Gribouillis).
When I test this I see \n (newline).
So split on \n to get lines.
for line in pages_text.split('\n'):
    print(line)
A test with my document, taking out the lines that start with Adobe:
import PyPDF4
import re
import io

pdfFileObj = open(r'pdf-sample.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
#print(repr(pages_text))

for line in pages_text.split('\n'):
    if line.startswith('Adobe'):
        print(line)
Output:
Adobe Acrobat PDF Files Adobe® Portable Document Format (PDF) is a universal file format that preserves all Adobe PDF is an ideal format for electronic document distribution as it overcomes the
Reply
#9
OK, looks great so far. I get every line as a string. But now, how can I loop through this? I have this now:

import PyPDF4
import re
import io
 
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)

for line in io.StringIO(pages_text):
    print(repr(line))
#    if re.match("(.*)_akt:(.*)", line):
#        print(line),
Shall I loop like this:

...
    lines = repr(line)
    for x in lines:
        if re.match("(.*)_akt:(.*)", lines):
            print(x),
?
Reply
#10
(Nov-27-2018, 12:04 PM)equaliser Wrote: Shall I loop like this:
No, it's like I showed in my post: you split on \n to get it line by line.
So if I use a regex to take out the lines that start with PDF, it looks like this:
for line in pages_text.split('\n'):
    if re.match(r"^PDF", line):
        print(line)
Output:
PDF files PDF files always display
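Putting it together with the keyword list from the first post, something along these lines might work. This is only a sketch: matching_lines is a made-up helper name, the sample text is invented in place of pageObj.extractText(), and it uses a plain substring test rather than a regex or the word-in-set check from the text-file version:

```python
def matching_lines(text, keywords):
    """Return the lines of `text` that contain at least one keyword."""
    return [
        line
        for line in text.splitlines()
        if any(keyword in line for keyword in keywords)
    ]


# Hypothetical sample standing in for pageObj.extractText()
sample_text = "Invoice 2018\n_aktz: 4711\nTotal: 10 EUR\nAZ 0815"
searchwords = ["_aktz:", "AZ"]

for line in matching_lines(sample_text, searchwords):
    print(line)
```

For a whole document you would call the helper once per page. Be aware that a substring test also hits words that merely contain a keyword, so a short one like AZ may match more than you want.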
Reply

