Python Forum
Regular Expressions - Printable Version




Regular Expressions - pprod - Nov-10-2020

Hi everyone,

I'm working on a regular expression to extract text from a PDF and I wonder whether I can tidy up the code a bit. I'm using pdfplumber to open the file in Python and, upon doing so, Python prints line numbers (the standalone integers 23 to 28 below) that mess up the text I want to extract. For example:

23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28

The RE is:
result = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+\d+\D+')
matches = result.findall(text)
for match in matches:
    print(match)
And the output text is:
Output:
10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims. 25 12.00 to 13.00 - with short 27 shredded jeans with holes.
In the output text above, 25 and 27 are line numbers, which I want to get rid of. Is it possible to stop Python from printing these line numbers?

Thanks!
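One way to look at the question, as a rough sketch rather than a definitive answer: instead of stopping the numbers from being printed, strip them from the extracted text before matching. This is a different technique from capturing groups. It assumes a line number is a run of digits at the start of a line that is not part of a time such as 10.00, and that the descriptions themselves contain no digits. The names LINE_NUMBER, ENTRY and strip_line_numbers are made up for illustration.

```python
import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

# Assumption: a line number is a run of digits at the start of a line that is
# NOT followed by '.' or another digit, so times like 10.00 are left alone.
LINE_NUMBER = re.compile(r'^\d+(?![.\d])[ \t]*', flags=re.MULTILINE)

# The pattern from the post, without the trailing \d+\D+ part.
ENTRY = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+')


def strip_line_numbers(source):
    """Remove leading line numbers (hypothetical helper)."""
    return LINE_NUMBER.sub('', source)


cleaned = strip_line_numbers(text)

# Collapse newlines and repeated spaces so each entry sits on one line.
entries = [' '.join(m.split()) for m in ENTRY.findall(cleaned)]

for entry in entries:
    print(entry)
```

If the real PDF text has digits inside the descriptions, the \D+ part would cut an entry short, so the sketch would need adjusting.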


RE: Regular Expressions - Gribouillis - Nov-10-2020

You could use capturing groups:
import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

result = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))
Output:
10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims. 12.00 to 13.00 - with short shredded jeans with holes.
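A short sketch of why the groups help, for anyone reading along: group A captures everything up to the stray line number, the bare \d+ between the groups consumes the number without capturing it, and group B picks up what follows. The whitespace clean-up with split() and join() is an added assumption to put each entry on a single line, and extract_entries is a hypothetical helper name, not part of the code above.

```python
import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

pattern = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')


def extract_entries(source):
    """Return each entry with the embedded line number dropped (hypothetical helper)."""
    entries = []
    for match in pattern.finditer(source):
        # A is the text before the line number, B is the text after it;
        # the number itself is matched by the uncaptured \d+ in between.
        joined = match.group('A') + match.group('B')
        # Collapse newlines and repeated spaces into single spaces.
        entries.append(' '.join(joined.split()))
    return entries


entries = extract_entries(text)
for entry in entries:
    print(entry)
```

Note that the pattern requires exactly one run of digits after the start of each entry, so an entry followed by no number at all would not match.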



RE: Regular Expressions - pprod - Nov-13-2020

(Nov-10-2020, 06:53 PM)Gribouillis Wrote: You could use capturing groups

Thank you. That works really well and now I know something about groups, which led me to learn that we can use the 'either or' condition within REs using ' | '. I changed the input text slightly (added line 23) to test whether I could also extract the sentence starting with 'At 9.00 -' using the same RE. However, this doesn't do the trick and I can't figure out what I'm doing wrong.


import re
 
text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""
 
result = re.compile(r'((\w+ \d+[.]d+ - \D+)|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+))')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))
Output:
10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims. 12.00 to 13.00 - with short shredded jeans with holes.
Any suggestions? Thank you!


RE: Regular Expressions - bowlofred - Nov-13-2020

What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.


RE: Regular Expressions - pprod - Nov-13-2020

(Nov-13-2020, 07:36 AM)bowlofred Wrote: What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

I mean the line that starts with 23 in the 'text' string, which is line 4 in the code.
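A possible reading of what goes wrong in the alternation pattern above, offered as a sketch and not a confirmed answer. Two things stand out. First, the first alternative contains [.]d+, which looks for a literal letter d and not for digits, so it can never match 'At 9.00'. Second, even with \d+ there, groups A and B are None whenever the first alternative is the one that matches, so match.group('A') + match.group('B') would raise a TypeError. The group name single and the helper extract_entries are made up for illustration.

```python
import re

text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

pattern = re.compile(
    r'(?P<single>\w+ \d+[.]\d+ - \D+)'                      # e.g. "At 9.00 - ..."
    r'|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)'    # e.g. "10.00 to 11.00 - ..."
)


def extract_entries(source):
    """Collect entries from either alternative (hypothetical helper)."""
    entries = []
    for match in pattern.finditer(source):
        if match.group('single') is not None:
            # First alternative matched: A and B are None here.
            raw = match.group('single')
        else:
            # Second alternative matched: join the parts around the line number.
            raw = match.group('A') + match.group('B')
        # Collapse newlines and repeated spaces into single spaces.
        entries.append(' '.join(raw.split()))
    return entries


entries = extract_entries(text)
for entry in entries:
    print(entry)
```

The check on match.group('single') is what keeps the loop from adding None to None. This still assumes the descriptions contain no digits of their own.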