Python Forum
Regular Expressions - so close yet so far
#1
Hi all,

I have the code below, which filters the Weight column in a data frame, and I want to remove the : and ; in front of the number:

for entry in df.loc[df["Weight"]
                    .str.replace(r"\s", "", regex=True)
                    .str.contains("Weight", case=False, na=False), "Weight"].sample(10, random_state=2):

    # Strip all whitespace and lowercase the entry before matching
    print(re.findall(r'(?<=weight).*?(?=kg)',
                     re.sub(r"\s", "", entry).lower()))
[':16.696,00']
[';16.981,44', ';13.672,10', ';16.981,44', ';16.981,44', ';16.235,86']
[':17.046,00']
[':18.345,00']
[':17.624,00']
[':17,063.00']
['6000.0000']
[':18.583,000']
[':18.520,00']
[';16.981,44']
Thus far I have tried adding [:;] to the regular expression, like so:

for entry in df.loc[df["Weight"]
                    .str.replace(r"\s", "", regex=True)
                    .str.contains("Weight", case=False, na=False), "Weight"].sample(10, random_state=2):

    # Same as before, but the lookbehind now requires a : or ; after "weight"
    print(re.findall(r'(?<=weight[:;]).*?(?=kg)',
                     re.sub(r"\s", "", entry).lower()))
But it returns this:

['16.696,00']
['16.981,44', '13.672,10', '16.981,44', '16.981,44', '16.235,86']
['17.046,00']
['18.345,00']
['17.624,00']
['17,063.00']
[]
['18.583,000']
['18.520,00']
['16.981,44']
Do you see how an item that does not have a : or a ; gets dropped? How do I prevent this? I have also tried [:;?] and [:;*?].


Also, if someone has any idea how to make the separators in the numbers consistent, that would be an added bonus.

Thank you, regular expressions are hard.
#2
Your findall only matches when there is a "weight" followed by one character from [;:]. That character isn't optional. If it's missing, no match.

So you want it to be an optional match. Add a ? after the character class.

r'(?<=weight[:;]?).*?(?=kg)'
#3
(May-02-2023, 08:17 PM)bowlofred Wrote: Your findall only matches when there is a "weight" followed by one character from [;:]. That character isn't optional. If it's missing, no match.

So you want it to be an optional match. Add a ? after the character class.

r'(?<=weight[:;]?).*?(?=kg)'

Thanks for your reply, it still does the same thing :(

['16.696,00']
['16.981,44', '13.672,10', '16.981,44', '16.981,44', '16.235,86']
['17.046,00']
['18.345,00']
['17.624,00']
['17,063.00']
[]
['18.583,000']
['18.520,00']
['16.981,44']
#4
I can only see your output, not what the regex is getting. For instance, if there is space between the weight and colon, that might explain things.

Can you show what the entry object is during the loop? At least a couple with and without the colon or semicolon.
#5
(May-02-2023, 10:12 PM)bowlofred Wrote: I can only see your output, not what the regex is getting. For instance, if there is space between the weight and colon, that might explain things.

Can you show what the entry object is during the loop? At least a couple with and without the colon or semicolon.

Here is how some of the data is fed to regex. Let me know if you need more info! Thanks for your help with this.

WEIGHT: 18. 520, 0 0 KGS

WEIGHT: 18. 583, 000 KGS

WEIGHT 6000. 0000 KG

WEIGHT: 17. 624, 00 KGS

WEIGHT: 17. 046, 00 KGS

WEIGHT; 16. 235, 86 KGS

WEIGHT; 13. 672

WEIGHT: 29. 631, 000

WEIGHT: 218768. 000 KGS

WEIGHT: 15 MT

WEIGHT; 14. 834, 32 KGS

WEIGHT; 11. 311, 08 KG
#6
Oh, right. In Python's re module, lookbehind patterns must be fixed-width, so you can't put an optional piece in them. In that case I'd just use a regular search and capture the number.

import re

text = """WEIGHT: 18. 520, 0 0 KGS
WEIGHT: 18. 583, 000 KGS
WEIGHT 6000. 0000 KG
WEIGHT: 17. 624, 00 KGS
WEIGHT: 17. 046, 00 KGS
WEIGHT; 16. 235, 86 KGS
WEIGHT; 13. 672
WEIGHT: 29. 631, 000
WEIGHT: 218768. 000 KGS
WEIGHT: 15 MT
WEIGHT; 14. 834, 32 KGS
WEIGHT; 11. 311, 08 KG"""

for entry in text.splitlines():
    entry = entry.replace(" ", "").lower()
    print(entry, end=" => ")

    if m := re.search(r'weight[;:]?([\d,.]+)kg', entry):
        print(m.group(1))
    else:
        print("NO MATCH")
Output:
weight:18.520,00kgs => 18.520,00
weight:18.583,000kgs => 18.583,000
weight6000.0000kg => 6000.0000
weight:17.624,00kgs => 17.624,00
weight:17.046,00kgs => 17.046,00
weight;16.235,86kgs => 16.235,86
weight;13.672 => NO MATCH
weight:29.631,000 => NO MATCH
weight:218768.000kgs => 218768.000
weight:15mt => NO MATCH
weight;14.834,32kgs => 14.834,32
weight;11.311,08kg => 11.311,08
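
To check the fixed-width restriction directly, and to cover entries that hold several weights (as in the original findall output), here is a minimal sketch. The sample strings are made up from the data shown in this thread and are assumed to be already lowercased with whitespace removed. With the standard re module, a lookbehind that contains an optional piece should fail at compile time, while findall with a single capturing group returns only the captured numbers.

```python
import re

# A lookbehind containing an optional piece is variable-width, which the
# standard re module rejects when the pattern is compiled.
try:
    re.compile(r'(?<=weight[:;]?).*?(?=kg)')
    lookbehind_ok = True
except re.error as exc:
    lookbehind_ok = False
    print("re.error:", exc)

# With one capturing group, findall returns the captured text of every match.
pattern = re.compile(r'weight[;:]?([\d,.]+)kg')

# Hypothetical entries modelled on the thread's data.
samples = [
    "weight:16.696,00kgs",
    "weight6000.0000kg",
    "weight;16.981,44kgsweight;13.672,10kgs",
]

results = [pattern.findall(s) for s in samples]
for s, found in zip(samples, results):
    print(s, "=>", found)
```

Because the optional [;:] sits outside any lookbehind, the entry without a separator still matches, and the separator itself never ends up in the captured group.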
The inconsistent separators are more annoying. If you had something like 18,000, how would you be sure which way to interpret it?

Assuming you want a period as the decimal separator, you could check whether a comma occurs later than a period in the string. If it does, swap the two characters. If you only have one of the two, I don't have a good suggestion.
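
A rough sketch of that heuristic follows. The function names normalize_separators and to_float are made up for illustration, and returning None for the ambiguous single-separator case is an assumption, not something settled in this thread.

```python
def normalize_separators(number):
    """Swap ',' and '.' when the last comma comes after the last period."""
    last_comma = number.rfind(",")
    last_period = number.rfind(".")
    if last_comma != -1 and last_period != -1 and last_comma > last_period:
        # e.g. 16.981,44: the comma is acting as the decimal separator
        return number.translate(str.maketrans(",.", ".,"))
    # Already in 17,063.00 style, or only one kind of separator: leave as-is
    return number


def to_float(number):
    """Convert to float only when the separators can be interpreted safely."""
    normalized = normalize_separators(number)
    has_comma = "," in normalized
    has_period = "." in normalized
    if has_comma and has_period:
        # Commas are now thousands separators, so drop them
        return float(normalized.replace(",", ""))
    if not has_comma and not has_period:
        return float(normalized)
    # Only one kind of separator (e.g. 18,000 or 6000.0000): ambiguous
    return None


for raw in ["16.981,44", "17,063.00", "6000.0000", "15"]:
    print(raw, "=>", normalize_separators(raw), "=>", to_float(raw))
```

Values that come back as None still need a decision by hand or a rule based on what you know about the data source.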