Nov-10-2020, 04:23 PM
Hi everyone,
I'm working on a regular expression to extract text from a PDF and I wonder whether I can tidy up the code a bit. I'm using pdfplumber to open the file in Python, and when I do so, Python prints line numbers (23 to 28 in the sample below) that get mixed into the text I want to extract. For example:
23
24
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
26
12.00 to 13.00 - with short
27 shredded jeans with holes.
28
The RE is:
result = re.compile(r'\d+[.]\d+ to \d+[.]\d+ - \D+\d+\D+')
matches = result.findall(text)
for match in matches:
    print(match)
And the output text is:
Output:
10.00 to 11.00 - with a
yellow overcoat with brown buttons
and green rims.
25
12.00 to 13.00 - with short
27 shredded jeans with holes.
In the output text above, 25 and 27 are line numbers, which I want to get rid of. Is it possible to stop Python from printing these line numbers?
Thanks!
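One possible approach, sketched here under assumptions rather than as a definitive fix: strip the line numbers from the extracted text first, then match the entries. The sketch assumes a line number is an integer at the very start of a line that is not followed by a dot or another digit (so times such as 10.00 are left alone), and that every entry starts on its own line with "HH.MM to HH.MM - ". The names `LINE_NUMBER`, `ENTRY`, `strip_line_numbers` and `extract_entries` are made up for illustration, and `sample` stands in for the text returned by pdfplumber.

```python
import re

# Assumed shape of a line number: an integer at the start of a line that is
# not followed by "." or another digit, so times such as "10.00" are kept.
LINE_NUMBER = re.compile(r'^\d+(?![.\d])[ \t]*\n?', flags=re.MULTILINE)

# One entry: "HH.MM to HH.MM - ..." up to the next entry or the end of the text.
ENTRY = re.compile(
    r'^\d+\.\d+ to \d+\.\d+ - .*?(?=^\d+\.\d+ to |\Z)',
    flags=re.MULTILINE | re.DOTALL,
)


def strip_line_numbers(text):
    """Remove leading line numbers, dropping lines that hold only a number."""
    return LINE_NUMBER.sub('', text)


def extract_entries(text):
    """Return each time-range entry with the line numbers removed."""
    cleaned = strip_line_numbers(text)
    return [entry.strip() for entry in ENTRY.findall(cleaned)]


# Stand-in for the text extracted from the PDF (copied from the example above).
sample = (
    "23\n"
    "24\n"
    "10.00 to 11.00 - with a\n"
    "yellow overcoat with brown buttons\n"
    "and green rims.\n"
    "25\n"
    "26\n"
    "12.00 to 13.00 - with short\n"
    "27 shredded jeans with holes.\n"
    "28\n"
)

entries = extract_entries(sample)
for entry in entries:
    print(entry)
    print("---")
```

A caveat with this sketch: a real text line that happens to begin with a whole number (for example "3 buttons") would lose that number too, so the `LINE_NUMBER` pattern may need tightening for other documents.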