Python Forum

Hi everyone,

I'm working on a regular expression to extract text from a PDF and I wonder whether I can tidy up the code a bit. I'm using pdfplumber to open the file in Python and, upon doing so, Python prints line numbers (the standalone integers such as 23 and 24 below) that mess up the text I want to extract. For example:

23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28

The RE is:

result = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+\d+\D+')
matches = result.findall(text)
for match in matches:
    print(match)

And the output text is:

Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25

12.00 to 13.00 - with short
27 shredded jeans with holes.

In the output text above, 25 and 27 are line numbers, which I want to get rid of. Is it possible to stop Python from printing these line numbers?

Thanks!

You could use capturing groups:

import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

result = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))

Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.


12.00 to 13.00 - with short
 shredded jeans with holes.
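
The groups approach above drops exactly one number per match and leaves the space that followed it. A possible alternative, sketched here under two assumptions that the thread does not confirm, is to strip the line numbers first and match afterwards. The assumptions: a line number is a run of digits at the start of a line, followed by a space or the end of the line, and the descriptions themselves contain no digits. The names LINE_NUMBER, ENTRY and strip_line_numbers are made up for this illustration.

```python
import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

# Assumption: a PDF line number is a run of digits at the start of a line,
# followed by a space or by the end of the line. Times such as '10.00' are
# left alone because the digits are followed by a dot.
LINE_NUMBER = re.compile(r'^\d+(?: |$)', re.MULTILINE)

# Same time-range prefix as in the thread; \D+ assumes the description
# contains no digits once the line numbers are gone.
ENTRY = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+')


def strip_line_numbers(s):
    """Remove leading line numbers from every line of s."""
    return LINE_NUMBER.sub('', s)


cleaned = strip_line_numbers(text)
entries = [m.group(0).strip() for m in ENTRY.finditer(cleaned)]

for entry in entries:
    print(entry)
    print()
```

If real descriptions can contain digits, the \D+ part would cut them short, so the pattern would need a different stopping condition.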

(Nov-10-2020, 06:53 PM) Gribouillis wrote: You could use capturing groups

Thank you. That works really well, and now I know something about groups, which led me to learn that we can use an 'either or' condition within REs using '|'. I changed the input text slightly (added line 23) to test whether I could also extract the sentence starting with 'At 9.00 -' using the same RE. However, this doesn't do the trick and I can't figure out what I'm doing wrong.

import re
 
text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""
 
result = re.compile(r'((\w+ \d+[.]d+ - \D+)|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+))')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))

Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.


12.00 to 13.00 - with short
 shredded jeans with holes.

Any suggestions? Thank you!

What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

(Nov-13-2020, 07:36 AM) bowlofred wrote: What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

I mean the line that starts with 23 in the 'text' string, which is line 4 in the code.
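
One possible explanation for the alternation question, offered as a sketch rather than a confirmed answer: in the first alternative of the posted pattern, `[.]d+` has no backslash before the `d`, so it looks for a literal letter "d" after the dot and never matches "9.00". Even with `\d+` restored, groups A and B are None whenever the first alternative is the one that matched, so `match.group('A') + match.group('B')` would raise a TypeError. The version below gives the first alternative its own group and checks which branch matched. The group name C and the results list are hypothetical names added for this illustration.

```python
import re

text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

# Group name 'C' is hypothetical; 'A' and 'B' follow the earlier reply.
pattern = re.compile(
    r'(?P<C>\w+ \d+[.]\d+ - \D+)'                         # 'At 9.00 - ...'
    r'|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)'  # '10.00 to 11.00 - ...'
)

results = []
for match in pattern.finditer(text):
    if match.group('C') is not None:
        # First alternative matched: A and B are None here.
        results.append(match.group('C'))
    else:
        # Second alternative matched: join the text around the line number.
        results.append(match.group('A') + match.group('B'))

for result in results:
    print(result)
```

As with the earlier patterns, this relies on the descriptions containing no digits of their own, because \D+ stops at the first digit it meets.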

pprod

Gribouillis

pprod

bowlofred

pprod