Python Forum
Regular Expressions
#1
Hi everyone,

I'm working on a regular expression to extract text from a PDF and I wonder whether I can tidy up the code a bit. I'm using pdfplumber to open the file in Python, and when I do, Python prints line numbers (the standalone numbers such as 23 to 28 below) that get mixed in with the text I want to extract. For example:

23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28

The RE is:
result = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+\d+\D+')
matches = result.findall(text)
for match in matches:
    print(match)
And the output text is:
Output:
10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims. 25 12.00 to 13.00 - with short 27 shredded jeans with holes.
In the output text above, 25 and 27 are line numbers, which I want to get rid of. Is it possible to stop Python from printing these line numbers?

Thanks!
#2
You could use capturing groups:
import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

result = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))
Output:
10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims. 12.00 to 13.00 - with short shredded jeans with holes.
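An alternative to capturing groups, sketched here rather than taken from the thread, is to strip the line numbers before matching. It rests on one assumption: a line number is a run of digits at the start of a line that is not followed by a dot or another digit, so times such as `10.00` are left alone. The names `LINE_NUMBER`, `ENTRY` and `extract_entries` are made up for illustration.

```python
import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

# Assumption: a line number is a run of digits at the start of a line that is
# not followed by a dot or another digit, so "10.00" is not treated as one.
LINE_NUMBER = re.compile(r'^\d+(?![\d.])[ \t]*', re.MULTILINE)

# Same time-range pattern as in the original post, without the \d+\D+ tail.
ENTRY = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+')


def extract_entries(raw):
    """Drop leading line numbers, flatten whitespace, then match the entries."""
    without_numbers = LINE_NUMBER.sub('', raw)
    flattened = ' '.join(without_numbers.split())
    return [entry.strip() for entry in ENTRY.findall(flattened)]


entries = extract_entries(text)
for entry in entries:
    print(entry)
```

The trade-off is that any genuine sentence starting with a bare number at the beginning of a line (for example "3 people ...") would lose that number, so this only suits text where a leading integer is always a line number.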
#3
(Nov-10-2020, 06:53 PM)Gribouillis Wrote: You could use capturing groups

Thank you. That works really well and now I know something about groups, which led me to learn that we can use the 'either or' condition within REs using ' | '. I changed the input text slightly (added line 23) to test whether I could also extract the sentence starting with 'At 9.00 -' using the same RE. However, this doesn't do the trick and I can't figure out what I'm doing wrong.


import re
 
text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""
 
result = re.compile(r'((\w+ \d+[.]d+ - \D+)|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+))')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))
Output:
10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims. 12.00 to 13.00 - with short shredded jeans with holes.
Any suggestions? Thank you!
#4
What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.
#5
(Nov-13-2020, 07:36 AM)bowlofred Wrote: What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

I mean the line that starts with 23 in the 'text' string, which is line 4 in the code.
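
For what it's worth, two things in the pattern from post #3 look like likely culprits. First, the left-hand alternative contains `[.]d+` (a literal letter "d") where `[.]\d+` was probably intended, so it never matches "9.00". Second, even once it does match, the groups `A` and `B` do not take part in that alternative, so `match.group('A')` returns `None` and the concatenation in `print` would raise a `TypeError`. Below is a minimal sketch of one way around both, not a definitive answer; the group name `single` is made up, and `[A-Za-z]+` is swapped in for `\w+` so that the leading word cannot be a number.

```python
import re

text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

pattern = re.compile(
    # Alternative 1 (hypothetical group name "single"): "At 9.00 - ..."
    r'(?P<single>[A-Za-z]+ \d+[.]\d+ - \D+)'
    # Alternative 2: the time-range pattern from post #2, unchanged
    r'|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)'
)

results = []
for match in pattern.finditer(text):
    if match.group('single') is not None:
        # Only the first alternative took part in this match.
        piece = match.group('single')
    else:
        # Second alternative: join the parts on either side of the line number.
        piece = match.group('A') + match.group('B')
    # Collapse newlines and repeated spaces into single spaces.
    results.append(' '.join(piece.split()))

for line in results:
    print(line)
```

Checking which named group is not `None` is what tells the loop which side of the `|` produced the match.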