Regular Expressions - so close yet so far

bigpapa · May-02-2023, 07:01 PM

Hi all,

I have the code below, which filters the Weight column of a data frame. I want to remove the : or ; in front of the number:

for entry in df.loc[df["Weight"]
                    .str.replace(r"\s", "", regex=True)
                    .str.contains("Weight", case=False, na=False), "Weight"].sample(10, random_state=2):

    # Strip all whitespace, lowercase, then grab whatever sits between "weight" and "kg"
    print(re.findall(r'(?<=weight).*?(?=kg)',
                     re.sub(r"\s", "", entry).lower()))

[':16.696,00']
[';16.981,44', ';13.672,10', ';16.981,44', ';16.981,44', ';16.235,86']
[':17.046,00']
[':18.345,00']
[':17.624,00']
[':17,063.00']
['6000.0000']
[':18.583,000']
[':18.520,00']
[';16.981,44']

Thus far I have tried adding [:;] to the regular expression like so:

for entry in df.loc[df["Weight"]
                    .str.replace(r"\s", "", regex=True)
                    .str.contains("Weight", case=False, na=False), "Weight"].sample(10, random_state=2):

    # Same as before, but the lookbehind now requires a : or ; after "weight"
    print(re.findall(r'(?<=weight[:;]).*?(?=kg)',
                     re.sub(r"\s", "", entry).lower()))

But it returns this:

['16.696,00']
['16.981,44', '13.672,10', '16.981,44', '16.981,44', '16.235,86']
['17.046,00']
['18.345,00']
['17.624,00']
['17,063.00']
[]
['18.583,000']
['18.520,00']
['16.981,44']

Do you see how an item that does not have a : or a ; gets dropped? How do I prevent this? I have also tried [:;?] and [:;*?].

Also, if anyone has an idea how to make the number separators consistent, that would be an added bonus.

Thank you, regular expressions are hard.

bowlofred · May-02-2023, 08:17 PM

Your findall only matches when there is a "weight" followed by one character from [;:]. That character isn't optional. If it's missing, no match.

So you want it to be an optional match. Add a ? after the character class.

r'(?<=weight[:;]?).*?(?=kg)'
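
A side note on the [:;?] and [:;*?] attempts: inside a character class, ? and * are literal characters rather than quantifiers, so those classes still require exactly one character after "weight". A minimal sketch, with sample strings made up for illustration:

```python
import re

# Inside [...], the ? is a literal question mark, not an "optional" quantifier.
pattern = re.compile(r"weight[:;?]")

# Hypothetical inputs: one with a colon, one with a literal ?, one with nothing.
samples = ["weight:16", "weight?16", "weight6000"]
results = [bool(pattern.search(s)) for s in samples]
print(results)
```

The last sample fails to match because the class still demands one of the three characters.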

bigpapa · May-02-2023, 08:23 PM

Thanks for your reply, it still does the same thing :(

['16.696,00']
['16.981,44', '13.672,10', '16.981,44', '16.981,44', '16.235,86']
['17.046,00']
['18.345,00']
['17.624,00']
['17,063.00']
[]
['18.583,000']
['18.520,00']
['16.981,44']

bowlofred · May-02-2023, 10:12 PM

I can only see your output, not what the regex is getting. For instance, if there is space between the weight and colon, that might explain things.

Can you show what the entry object is during the loop? At least a couple with and without the colon or semicolon.

bigpapa · May-02-2023, 10:26 PM

Here is how some of the data is fed to regex. Let me know if you need more info! Thanks for your help with this.

WEIGHT: 18. 520, 0 0 KGS

WEIGHT: 18. 583, 000 KGS

WEIGHT 6000. 0000 KG

WEIGHT: 17. 624, 00 KGS

WEIGHT: 17. 046, 00 KGS

WEIGHT; 16. 235, 86 KGS

WEIGHT; 13. 672

WEIGHT: 29. 631, 000

WEIGHT: 218768. 000 KGS

WEIGHT: 15 MT

WEIGHT; 14. 834, 32 KGS

WEIGHT; 11. 311, 08 KG

bowlofred · May-03-2023, 08:18 AM

Oh, right. In Python's re module, lookbehind patterns must be fixed-width, so you can't put an optional piece in them. In that case I'd just use a regular search and capture the number.

import re

text = """WEIGHT: 18. 520, 0 0 KGS
WEIGHT: 18. 583, 000 KGS
WEIGHT 6000. 0000 KG
WEIGHT: 17. 624, 00 KGS
WEIGHT: 17. 046, 00 KGS
WEIGHT; 16. 235, 86 KGS
WEIGHT; 13. 672
WEIGHT: 29. 631, 000
WEIGHT: 218768. 000 KGS
WEIGHT: 15 MT
WEIGHT; 14. 834, 32 KGS
WEIGHT; 11. 311, 08 KG"""

for entry in text.splitlines():
    entry = entry.replace(" ", "").lower()
    print(entry, end=" => ")

    # Optional : or ; after "weight", then capture the run of digits, commas and periods before "kg"
    if m := re.search(r'weight[;:]?([\d,.]+)kg', entry):
        print(m.group(1))
    else:
        print("NO MATCH")

Output:
weight:18.520,00kgs => 18.520,00
weight:18.583,000kgs => 18.583,000
weight6000.0000kg => 6000.0000
weight:17.624,00kgs => 17.624,00
weight:17.046,00kgs => 17.046,00
weight;16.235,86kgs => 16.235,86
weight;13.672 => NO MATCH
weight:29.631,000 => NO MATCH
weight:218768.000kgs => 218768.000
weight:15mt => NO MATCH
weight;14.834,32kgs => 14.834,32
weight;11.311,08kg => 11.311,08
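
To illustrate the fixed-width point, here is a small sketch that checks whether each pattern compiles with the standard library re module. The helper name compiles is made up for this example:

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the standard library re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # re rejects lookbehinds whose width can vary
        return False

# The optional [:;]? makes the lookbehind 6 or 7 characters wide.
optional_in_lookbehind = compiles(r"(?<=weight[:;]?).*?(?=kg)")
# Without the ?, the lookbehind is always exactly 7 characters wide.
fixed_lookbehind = compiles(r"(?<=weight[:;]).*?(?=kg)")

print(optional_in_lookbehind, fixed_lookbehind)
```

Moving the optional part out of the lookbehind and into an ordinary pattern with a capture group, as in the loop above, avoids the restriction.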

The inconsistent separators are more annoying. If you had something like 18,000, how would you be sure which way to interpret it?

Assuming you want a period as the decimal separator, you could check whether a comma occurs later than a period in the string. If it does, swap the two. If you only have one of the two, I don't have a good suggestion.
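
A rough sketch of that swap, assuming the number has already been extracted as a string. The function name to_period_decimal is made up for this example, and strings with only one kind of separator are deliberately returned unchanged because they stay ambiguous:

```python
# Translation table that swaps periods and commas in one pass.
SWAP = str.maketrans(".,", ",.")

def to_period_decimal(number: str) -> str:
    """Swap , and . when the last comma comes after the last period.

    Assumption: if both separators are present, the one that occurs last
    is the decimal separator. Anything else is returned as-is.
    """
    last_comma = number.rfind(",")
    last_period = number.rfind(".")
    if last_comma != -1 and last_period != -1 and last_comma > last_period:
        return number.translate(SWAP)
    return number

swapped = to_period_decimal("18.520,00")      # comma is last, so swap
untouched = to_period_decimal("17,063.00")    # already period-decimal
ambiguous = to_period_decimal("6000.0000")    # only one separator, left alone
print(swapped, untouched, ambiguous)
```

After the swap, removing the commas and calling float() would give a numeric value, but only for the strings where the interpretation is not in doubt.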
