Thread: Regular Expressions (Python Forum, General Coding Help, https://python-forum.io/forum-8.html)

Regular Expressions - pprod - Nov-10-2020

Hi everyone,

I'm working on a regular expression to extract text from a PDF and I wonder whether I can tidy up the code a bit. I'm using pdfplumber to open the file in Python, and upon doing so, Python prints line numbers (the numbers at the start of each line below) that mess up the text I want to extract. For example:

    23
    24 10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims.
    25
    26 12.00 to 13.00 - with short
    27 shredded jeans with holes.
    28

The RE is:

    result = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+\d+\D+')
    matches = result.findall(text)
    for match in matches:
        print(match)

In the output text, 25 and 27 are line numbers, which I want to get rid of. Is it possible to stop Python from printing these line numbers?

Thanks!

RE: Regular Expressions - Gribouillis - Nov-10-2020

You could use capturing groups:

    import re

    text = """
    23
    24 10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims.
    25
    26 12.00 to 13.00 - with short
    27 shredded jeans with holes.
    28
    """

    # Group A holds the text before the stray line number, group B the text after it.
    result = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')
    matches = result.finditer(text)
    for match in matches:
        print(match.group('A') + match.group('B'))
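
To make it clearer why the named groups drop the line number, here is a minimal sketch of the same idea. The `\d+` between groups `A` and `B` sits outside both groups, so joining the two groups leaves it out. The line breaks in `text` are an assumption (the thread only shows the text flattened), and the helper name `extract_entries` and the whitespace cleanup are additions for illustration, not part of Gribouillis's answer.

```python
import re

# Sample text from the thread; the exact line breaks are an assumption.
text = """
23
24 10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims.
25
26 12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

# Same pattern as in the reply above: the \d+ between A and B is not captured.
pattern = re.compile(r'(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)')


def extract_entries(source):
    """Join groups A and B of each match and collapse runs of whitespace.

    extract_entries is a hypothetical helper name used only for this sketch.
    """
    entries = []
    for m in pattern.finditer(source):
        raw = m.group('A') + m.group('B')   # the line number between them is skipped
        entries.append(' '.join(raw.split()))  # newlines and double spaces become one space
    return entries


entries = extract_entries(text)
for entry in entries:
    print(entry)
```

Note that the pattern only matches when a run of digits follows the description, so it relies on every entry being followed by (or interrupted by) a line number, as in the sample.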

RE: Regular Expressions - pprod - Nov-13-2020

(Nov-10-2020, 06:53 PM)Gribouillis Wrote: You could use capturing groups

Thank you. That works really well, and now I know something about groups, which led me to learn that we can use an 'either or' condition within REs using '|'. I changed the input text slightly (added line 23) to test whether I could also extract the sentence starting with 'At 9.00 -' using the same RE. However, this doesn't do the trick and I can't figure out what I'm doing wrong.

    import re

    text = """
    23 At 9.00 - people playing banjos wearing fancy clothes
    24 10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims.
    25
    26 12.00 to 13.00 - with short
    27 shredded jeans with holes.
    28
    """

    result = re.compile(r'((\w+ \d+[.]d+ - \D+)|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+))')
    matches = result.finditer(text)
    for match in matches:
        print(match.group('A') + match.group('B'))

Any suggestions? Thank you!

RE: Regular Expressions - bowlofred - Nov-13-2020

What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

RE: Regular Expressions - pprod - Nov-13-2020

(Nov-13-2020, 07:36 AM)bowlofred Wrote: What do you mean by line 23? Your post doesn't seem to have any changes and only goes to line 19.

I mean the line that starts with 23 in the 'text' string, which is line 4 in the code.
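
The thread ends without an answer to the follow-up, so here is one possible reading of what goes wrong, offered as a sketch and not as the definitive fix. Two things stand out in the pattern with '|': the first alternative has `[.]d+` where `[.]\d+` was probably intended, and when the first alternative matches, groups `A` and `B` are `None`, so `match.group('A') + match.group('B')` raises a `TypeError`. The version below gives the first alternative its own named group and checks which branch matched. The group name `single`, the helper `extract_entries`, the line breaks in `text`, and the whitespace cleanup are all assumptions made for illustration.

```python
import re

# Sample text from the follow-up post; the exact line breaks are an assumption.
text = """
23 At 9.00 - people playing banjos wearing fancy clothes
24 10.00 to 11.00 - with a yellow overcoat with brown buttons and green rims.
25
26 12.00 to 13.00 - with short
27 shredded jeans with holes.
28
"""

pattern = re.compile(
    # Branch 1: a word, one time, a dash, then the description ("At 9.00 - ...").
    # Note the backslash in [.]\d+ that the posted pattern was missing.
    r'(?P<single>\w+ \d+[.]\d+ - \D+)'
    # Branch 2: a time range, with a stray line number between groups A and B.
    r'|(?P<A>\d+[.]\d+ to \d+[.]\d+ - \D+)\d+(?P<B>\D+)'
)


def extract_entries(source):
    """Return one cleaned string per match, whichever branch matched.

    extract_entries and the group name 'single' are hypothetical names
    introduced for this sketch.
    """
    entries = []
    for m in pattern.finditer(source):
        if m.group('single') is not None:
            raw = m.group('single')             # branch 1 matched; A and B are None
        else:
            raw = m.group('A') + m.group('B')   # branch 2 matched; skip the line number
        entries.append(' '.join(raw.split()))   # collapse newlines and extra spaces
    return entries


entries = extract_entries(text)
for entry in entries:
    print(entry)
```

Checking `m.group('single') is not None` is one way to tell the branches apart; `m.lastgroup` or `m.groupdict()` would work too. As before, the second branch still depends on a line number appearing after the start of each time-range entry.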