Regular Expressions

pprod · Nov-10-2020, 04:23 PM

Hi everyone,

I'm working on a regular expression to extract text from a PDF and I wonder whether I can tidy up the code a bit. I'm using pdfplumber to open the file in Python, and the extracted text contains line numbers (23 to 28 in the sample below) that interfere with the text I want to extract. For example:

23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28

The RE is:

result = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+\d+\D+')
matches = result.findall(text)
for match in matches:
    print(match)

And the output text is:

Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25

12.00 to 13.00 - with short
27 shredded jeans with holes.

In the output text above, 25 and 27 are line numbers, which I want to get rid of. Is it possible to stop Python from printing these line numbers?

Thanks!

**Gribouillis** · Nov-10-2020, 06:53 PM

You could use capturing groups:

import re

text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

result = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))

Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.


12.00 to 13.00 - with short
 shredded jeans with holes.
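
As an alternative to capturing groups, the line numbers could be removed before matching. The following is only a sketch under an assumption: a line number is a run of digits at the start of a line, followed by spaces or by the end of the line. The names `LINE_NUMBER`, `ENTRY` and `strip_line_numbers` are made up for illustration, and a text line that legitimately starts with a number followed by a space would be mangled by this approach.

```python
import re

# Assumption: a line number is a run of digits at the start of a line,
# followed either by spaces/tabs or by the end of the line.
# "10.00 to ..." is left alone because the digits are followed by ".".
LINE_NUMBER = re.compile(r'^\d+(?:[ \t]+|$)', re.MULTILINE)

# The original pattern without the trailing \d+\D+ part.
ENTRY = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+')


def strip_line_numbers(text):
    """Return text with the assumed line numbers removed (hypothetical helper)."""
    return LINE_NUMBER.sub('', text)


text = """
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

cleaned = strip_line_numbers(text)
# strip() drops the blank lines left behind by the removed line numbers.
entries = [m.strip() for m in ENTRY.findall(cleaned)]
for entry in entries:
    print(entry)
    print()
```

With the sample text, both entries come out without 25 or 27. Because `\D+` still stops at the first digit, a description that itself contains digits would be cut short, just as in the original pattern.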

pprod · (This post was last modified: Nov-13-2020, 07:30 AM by pprod.)

(Nov-10-2020, 06:53 PM)Gribouillis Wrote: You could use capturing groups

Thank you. That works really well and now I know something about groups, which led me to learn that we can use the 'either or' condition within REs using ' | '. I changed the input text slightly (added line 23) to test whether I could also extract the sentence starting with 'At 9.00 -' using the same RE. However, this doesn't do the trick and I can't figure out what I'm doing wrong.

import re
 
text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""
 
result = re.compile(r'((\w+ \d+[.]d+ - \D+)|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+))')
matches = result.finditer(text)
for match in matches:
    print(match.group('A') + match.group('B'))

Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.


12.00 to 13.00 - with short
 shredded jeans with holes.

Any suggestions? Thank you!

bowlofred · Nov-13-2020, 07:36 AM

What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

pprod · (This post was last modified: Nov-13-2020, 11:13 AM by pprod.)

(Nov-13-2020, 07:36 AM)bowlofred Wrote: What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

I mean the line that starts with 23 in the 'text' string, which is line 4 in the code.
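
For reference, two things seem to be going on in the alternation attempt. First, `[.]d+` in the first branch is missing a backslash, so it looks for a literal "d" and that branch never matches; that is why the output is unchanged. Second, even with `\d+`, a match from the first branch leaves groups A and B set to None, so `match.group('A') + match.group('B')` would raise a TypeError. A possible sketch, where the group name C is my own choice and not anything from the thread:

```python
import re

text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

# Branch 1 (group C, a hypothetical name): "At 9.00 - ..." up to the next digit.
# Branch 2: the earlier pattern, with the line number between groups A and B.
pattern = re.compile(
    r'(?P<C>\w+ \d+[.]\d+ - \D+)'
    r'|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)'
)

results = []
for match in pattern.finditer(text):
    if match.group('C') is not None:
        # Only branch 1 matched; A and B are None here.
        results.append(match.group('C'))
    else:
        # Only branch 2 matched; C is None here.
        results.append(match.group('A') + match.group('B'))

for result in results:
    print(result)
```

Only one branch of an alternation takes part in any given match, so the code has to check which group is set before using it. With the sample text this yields three results: the "At 9.00" sentence and the two time-range entries without 25 or 27.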
