Extract Line from PDF

equaliser · Nov-23-2018, 01:38 PM

Hey,

I want to extract the line in which a specific keyword is found. For text documents this is simple: loop through the text and print the matching line.

So far I have done it like this:

with open('text.txt','r') as f1, open("keywords.txt") as f2:
    st = set(map(str.rstrip, f2))
    for line in f1:
        if any(word in st for word in  line.split()):
            print(line)

So this works great.

But now I need to print the line from a PDF file, and I only get the whole text out of it. Here is my code:

import PyPDF4
import re

pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']

pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages

pages_text=[]
words_start_pos={}
words={}

with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, pages_text))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
        for line in f1:
            if any(word in st for word in  line.split()):
                print(line)

I guess I see the whole text because of the print(pages_text) call. The last loop with print(line) catches nothing, I think.

Can anyone see what I am doing wrong?

**Gribouillis** · Nov-24-2018, 07:34 AM

At line 15 I think it should be f2 instead of pages_text.

equaliser · Nov-27-2018, 09:26 AM

Yeah, right. That's what I had before. But both versions print the whole text inside []. Anyway, this is how it looks corrected. I still don't understand why I get the whole text and not just the line.

Corrected:

import PyPDF4
import re
 
pdfFileObj=open(r'PDFFILE.pdf',mode='rb')
searchwords=['_aktz:', 'AZ']
 
pdfReader=PyPDF4.PdfFileReader(pdfFileObj)
number_of_pages=pdfReader.numPages
 
pages_text=[]
words_start_pos={}
words={}
 
with pdfFileObj as f1, open('keywords.txt') as f2:
    st = set(map(str.rstrip, f2))
    for word in f1:
        for page in range(number_of_pages):
            pages_text.append(pdfReader.getPage(page).extractText())
            print(pages_text)
        for line in f1:
            if any(word in st for word in  line.split()):
                print(line)

equaliser · Nov-27-2018, 10:36 AM

OK, I have written a new version, because the old one was way too complicated. With the following code I can get the whole page printed. But again it doesn't print only the line that matches the keyword.

import PyPDF4
import re

pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
print(pages_text)

for line in pages_text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),

I don't get it. When I do this with a text file, it works: I only get the wanted line printed. So what is the difference?

import re

text = open("text.txt", "r")

for line in text:
    if re.match("(.*)_akt:(.*)", line):
        print(line),

**Gribouillis** · (This post was last modified: Nov-27-2018, 11:18 AM by Gribouillis.)

If pages_text is a single string instead of a sequence of lines, it won't work, because for ... in pages_text iterates over every character in the string. You could use a StringIO, which works like a file:

import io
for line in io.StringIO(pages_text):
    ...
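
To make the difference concrete, here is a minimal sketch. The sample string is made up and only stands in for whatever extractText() returns; it is not taken from the PDF in question. Iterating over a str yields single characters, so the regex never sees a whole line, while io.StringIO yields lines the way a file object does.

```python
import io
import re

# Hypothetical stand-in for the string returned by extractText().
pages_text = "first line\nnumber _akt: 123\nlast line\n"

pattern = re.compile(r"(.*)_akt:(.*)")

# Iterating a str yields one character at a time, so nothing can match.
char_hits = [piece for piece in pages_text if pattern.match(piece)]

# Wrapping the string in StringIO yields whole lines, like a file object.
line_hits = [
    line.rstrip("\n")
    for line in io.StringIO(pages_text)
    if pattern.match(line)
]

print(char_hits)
print(line_hits)
```

This only helps if the extracted text really contains newline characters; if it does not, StringIO will hand back the whole text as one "line".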

equaliser · Nov-27-2018, 11:32 AM

Sounds great. I was wondering myself if the output is one object, or it is a list of lines.

I tried io.StringIO like the following:

import PyPDF4
import re
import io

pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)

for line in io.StringIO(pages_text):
    if re.match("(.*)_akt:(.*)", line):
        print(line),

But the result is the same. That's why I commented out the line print(pages_text). But then there is no output.

**Gribouillis** · Nov-27-2018, 11:45 AM

What happens if you print(repr(line)) for every line regardless of the regex?
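
As an illustration of why repr() helps here: it shows the characters that print() renders invisibly, such as \n and \r. The sample string below is invented for the example and is not the actual PDF content.

```python
# Hypothetical sample; printed normally it just looks like two lines.
sample = "AZ 12\n_akt: 34\r\n"

# repr() spells out the line-ending characters instead of rendering them.
shown = repr(sample)
print(shown)
```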

***snippsat*** · (This post was last modified: Nov-27-2018, 11:57 AM by snippsat.)

Look at print(repr(pages_text)) (edit: also posted by @Gribouillis).
When I test this I see \n (newline), so split on \n to get lines.

for line in pages_text.split('\n'):
    print(line)

A test with my document, taking out the lines that start with Adobe:

import PyPDF4
import re
import io

pdfFileObj = open(r'pdf-sample.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
#print(repr(pages_text))

for line in pages_text.split('\n'):
    if line.startswith('Adobe'):
        print(line)

Output:
Adobe Acrobat PDF Files
Adobe® Portable Document Format (PDF) is a universal file format that preserves all
Adobe PDF is an ideal format for electronic document distribution as it overcomes the
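
A small variation on the same idea, as a sketch only: if the extracted text happens to contain \r\n line endings, split('\n') leaves a trailing \r on each line, whereas str.splitlines() strips both kinds. The sample string is an assumption made for the example.

```python
# Hypothetical extracted text with mixed line endings.
pages_text = "Adobe PDF\r\nother line\nAdobe Acrobat"

# split('\n') leaves a trailing '\r' on lines that ended with '\r\n'.
by_split = pages_text.split("\n")

# splitlines() removes '\n' and '\r\n' (and other line boundaries).
by_splitlines = pages_text.splitlines()

matches = [line for line in by_splitlines if line.startswith("Adobe")]
print(by_split)
print(matches)
```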

equaliser · (This post was last modified: Nov-27-2018, 12:05 PM by equaliser.)

OK, looks great so far. I get every line as a string. But now, how can I loop through this? I have this now:

import PyPDF4
import re
import io
 
pdfFileObj = open(r'PDFfile.pdf', 'rb')
pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
pageObj = pdfReader.getPage(0)
pages_text = pageObj.extractText()
# print(pages_text)

for line in io.StringIO(pages_text):
    print(repr(line))
#    if re.match("(.*)_akt:(.*)", line):
#        print(line),

Shall I loop like this:

...
    lines = repr(line)
    for x in lines:
        if re.match("(.*)_akt:(.*)", lines):
            print(x),

?

***snippsat*** · (This post was last modified: Nov-27-2018, 12:45 PM by snippsat.)

(Nov-27-2018, 12:04 PM)equaliser Wrote: Shall I loop like this:

No, it's like I showed in my post: you split on \n to get the text line by line.
So if I use a regex to take out the lines that start with PDF, it looks like this.

for line in pages_text.split('\n'):
    if re.match(r"^PDF", line):
        print(line)

Output:
PDF files
PDF files always display
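
Putting the pieces of this thread together, one possible sketch for the original question (print only the lines containing a keyword, across all pages) could look like this. The function names matching_lines and matching_lines_in_pdf are made up for the example, and the sample string stands in for a real page. The PyPDF4 calls are the same ones used earlier in the thread. It assumes extractText() returns text with usable line breaks, which depends on the PDF.

```python
import re


def matching_lines(text, keywords):
    """Return the lines of text that contain any of the keywords."""
    if not keywords:
        return []
    # Escape the keywords so characters like ':' are taken literally.
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    return [line for line in text.splitlines() if pattern.search(line)]


def matching_lines_in_pdf(path, keywords):
    """Collect matching lines from every page (requires PyPDF4)."""
    import PyPDF4  # imported here so matching_lines works without it

    hits = []
    with open(path, "rb") as pdf_file:
        reader = PyPDF4.PdfFileReader(pdf_file)
        for page_number in range(reader.numPages):
            page_text = reader.getPage(page_number).extractText()
            hits.extend(matching_lines(page_text, keywords))
    return hits


# Usage with a made-up string in place of a real page:
sample = "Header\nRef _aktz: 2018/11\nBody text\nAZ 4711\n"
print(matching_lines(sample, ["_aktz:", "AZ"]))
```

Note that this is a plain substring search, so a keyword like AZ would also match inside a longer word. For a real file, the call would be something like matching_lines_in_pdf('PDFfile.pdf', ['_aktz:', 'AZ']).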
