Python Forum
trying to recall a regex for re.split()
#1
i can't find where this was and can't recall it. regex stuff still doesn't sink in.

what i am wanting to do is split with all leading decimal digits going to the 1st result and the first non-decimal and everything after it going to the 2nd result. what i want to do is convert a number in a string where the number has a units designation after it, such as '144mHz' giving me ['144', 'mHz'].

i think i need to include '.' for float cases in with the digits.

people tell me regex is easy but i never "get it". i think it is because i've never seen an explanation with any example. they just show the example and show the result and expect everyone to understand how it worked.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
The following seems to work; I've tried to be exhaustive (but a simpler way should exist)
import re
data = '144mhz'
Number, Units = re.search(r"([+\-]?\d+\.\d+[eE]?[+\-]?[\d]?[\d]?|[+\-]?\d+[eE]?[+\-]?[\d]?[\d]?)[\s+]?([a-z]+)", data.lower()).groups()
With:
  • [+\-]? => an optional sign (the "?" makes it optional)
  • \d+\.\d+ => for floats (the "+" indicates one or more occurrences)
  • [eE]?[+\-]?[\d]?[\d]? => for scientific notation with or without the sign; each [\d]? matches zero or one digit, so 2 digits max here
  • | => means "or"
  • [+\-]?\d+[eE]?[+\-]?[\d]?[\d]? => same thing for integers (the "\.\d+" part has been removed from the previous alternative)
  • [\s+]? => an optional single whitespace character (inside the brackets "+" is a literal plus sign, not a quantifier; \s* would allow any amount of whitespace)
  • [a-z]+ => one or more lowercase letters, i.e. the units (the input is lowercased first)
  • remember that parentheses mark the groups you want to recover
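As a possibly simpler variant of the pattern above, here is a sketch that uses re.VERBOSE so each piece can carry its own comment. The name NUMBER_UNITS is made up for this illustration. Unlike the original, it does not lowercase the input, allows an exponent of any length, and only treats "e" as an exponent marker when digits follow it; these are assumptions, not the poster's method.

```python
import re

# Hypothetical name; a commented rewrite of the idea, not the original pattern.
NUMBER_UNITS = re.compile(r"""
    (                          # group 1: the number
        [+-]?                  #   optional sign
        \d+ (?: \. \d+ )?      #   digits, optionally followed by .digits
        (?: [eE] [+-]? \d+ )?  #   optional exponent such as E-20
    )
    \s*                        # any amount of whitespace, including none
    ( [a-zA-Z]+ )              # group 2: the units, case preserved
""", re.VERBOSE)

for text in ['144mHz', '2.304 GHz', '1.25E-20watts']:
    print(repr(text), NUMBER_UNITS.search(text).groups())
```

Keeping the case matters for units, since 'mHz' and 'MHz' differ by nine orders of magnitude.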
#3
>>> import re
>>> 
>>> n = '144mHz'
>>> re.split(r'(\d+\.?\d+)', n)[1:]
['144', 'mHz']
>>> 
>>> n = '25.99mHz'
>>> re.split(r'(\d+\.?\d+)', n)[1:]
['25.99', 'mHz']
>>> 
>>> n = '19999.9mHz'
>>> re.split(r'(\d+\.?\d+)', n)[1:]
['19999.9', 'mHz']
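One caveat, offered as an observation rather than a correction of the session above: \d+\.?\d+ needs at least two digits, so a single-digit value such as '5mHz' does not match and re.split() returns the string unsplit. A sketch of one possible adjustment makes the fractional part an optional non-capturing group. It has to be non-capturing because re.split() puts every capturing group into its result.

```python
import re

# Assumed variant of the pattern: digits, then an optional ".digits" part.
pattern = r'(\d+(?:\.\d+)?)'

for n in ['5mHz', '144mHz', '25.99mHz']:
    # [1:] drops the empty string that precedes the leading number
    print(re.split(pattern, n)[1:])
```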
#4
(May-18-2022, 12:24 PM)paul18fr Wrote: [eE]?[+\-]?[\d]?[\d]? => for scientific notation with or without the sign, and with 2 digits max here
how does it do 2 digits max? what if i want 3? what if i want no limit?
#5
\d matches a single digit, and [\d]? matches zero or one digit, so [\d]?[\d]? matches at most 2. If you wanted exactly 3, you could write \d{3}; for up to 3, \d{0,3}; for no limit, \d* (or \d+ to require at least one). The documentation tells you the syntax, so you should go there to see what things are possible. Then, there are regular expression testers, e.g. https://pythex.org/ where you can try them out.
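To make the "how many digits" question concrete, here is a small sketch of how the digit quantifiers behave side by side. The sample string and the show helper are made up for the demonstration; re.match() only tries to match at the start of the string.

```python
import re

def show(pattern, text='12345'):
    """Return the text matched at the start of `text`, or None."""
    m = re.match(pattern, text)
    return m.group() if m else None

print(show(r'\d?\d?'))      # each \d? is zero or one digit, so at most 2
print(show(r'\d{3}'))       # exactly 3 digits
print(show(r'\d{0,3}'))     # between 0 and 3 digits
print(show(r'\d+'))         # one or more digits, no upper limit
print(show(r'\d*', 'Hz'))   # zero or more: matches the empty string here
```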
#6
(May-18-2022, 12:24 PM)paul18fr Wrote: remember that parentheses indicate what you want to recover
when do i use parenthesis if i am doing re.split()? why are you using re.search()?

if i don't use re, then the way i would do this is a loop through each character and trying it, alone, in int(ch,10) or .isdigit() or .isdecimal(), then splicing up to, and from, that position where the loop breaks.
#7
like this:
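def numunits(s=None):
    if not isinstance(s,str):
        raise TypeError('string expected')
    i = iter(range(len(s)))
    # first loop: advance until s[:p] parses as a float for the first time
    for p in i:
        try:
            v = float(s[:p])
            break
        except ValueError:
            continue
    else:
        raise ValueError('no number')
    # second loop: same iterator, keep going while s[:p] still parses
    for p in i:
        try:
            v = float(s[:p])
            continue
        except ValueError:
            break
    else:
        raise ValueError('no units')
    # s[:p] is the first slice that failed, so the units normally start at p - 1
    if s[p] != ' ':
        p -= 1
    return v,s[p:]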
if __name__ == '__main__':
    a = ['144mHz','432 mHz','1.296GHz','2.304 GHz']
    for x in a:
        print(repr(x))
        print(repr(numunits(x)))
Output:
lt2a/forums/1 /home/forums 13> py numunits.py
'144mHz'
(144.0, 'mHz')
'432 mHz'
(432.0, 'mHz')
'1.296GHz'
(1.296, 'GHz')
'2.304 GHz'
(2.304, 'GHz')
lt2a/forums/1 /home/forums 14>
#8
Quote:when do i use parenthesis

In the following example, the part [\s+]?[a-z]+?\s+ is not inside parentheses, so the text it matches is not recovered

data2 = '144 XXX mhz'
AAA = re.search(r"([+\-]?\d+\.\d+[eE]?[+\-]?[\d]?[\d]?|[+\-]?\d+[eE]?[+\-]?[\d]?[\d]?)[\s+]?[a-z]+?\s+([a-z]+)", data2.lower()).groups()
If you want to get "XXX" as well, put parentheses around [a-z]+?
AAA2 = re.search(r"([+\-]?\d+\.\d+[eE]?[+\-]?[\d]?[\d]?|[+\-]?\d+[eE]?[+\-]?[\d]?[\d]?)[\s+]?([a-z]+?)\s+([a-z]+)", data2.lower()).groups()
Quote:why are you using re.search()?
It provides the same result, doesn't it? But if you prefer re.split ...
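On the re.split() half of the question, a short sketch: re.split() normally discards whatever the pattern matched, and parentheses make it keep the captured text in the result list. The sample string is taken from the first post; the patterns are simplified for illustration and only handle plain integers.

```python
import re

s = '144mHz'

# No capturing group: the matched digits act as a separator and are dropped.
print(re.split(r'\d+', s))

# Capturing group: the matched digits are kept in the result list.
print(re.split(r'(\d+)', s))

# re.search() returns a match object; groups() lists only the captured parts.
print(re.search(r'(\d+)\s*([A-Za-z]+)', s).groups())
```

The leading empty string in the re.split() results is the text before the first match, which is why it usually gets sliced off with [1:].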
#9
If you accept only the syntax of Python numbers, as in the previous post, you could use tokenize
>>> import io
>>> from tokenize import tokenize
>>> def parse(s):
...     t = tokenize(io.BytesIO(s.encode()).readline)
...     next(t)
...     x, u = next(t), next(t)
...     return (x.string, u.string)
... 
>>> for a in ['144mHz','432 mHz','1.296GHz','2.304 GHz']:
...     print(repr(a), parse(a))
... 
'144mHz' ('144', 'mHz')
'432 mHz' ('432', 'mHz')
'1.296GHz' ('1.296', 'GHz')
'2.304 GHz' ('2.304', 'GHz')
But you cannot extend the syntax to allow fancy number representations.

By the way, in order to specify the problem clearly, it would be good to write a complete syntax of the strings that you want to be able to parse.
#10
(May-18-2022, 06:29 PM)Gribouillis Wrote: By the way, in order to specify the problem clearly, it would be good to write a complete syntax of the strings that you want to be able to parse.
i'm trying to generalize this to make a function. the first part is a decimal number, although i may, someday, try to extend that to hexadecimal (including float). the 2nd part is any string of characters that could be taken as a unit suffix like 'km' or 'Hz'. the intended function splits it into a converted value and a string or raises an exception if something is bad. i hadn't thought about scientific notation but i should do that just in case someone gives '1.25E-20watts'. right now, it's about making that function. then i will be making a few app scripts that take these from command line arguments, using that function.
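A minimal sketch of such a function, under stated assumptions: the name split_value_units and the exact grammar (optional sign, decimal digits with an optional fraction, optional exponent, optional whitespace, then a units suffix made of ASCII letters only) are guesses at the requirements, not something settled in this thread. Hexadecimal input is left out.

```python
import re

# Hypothetical grammar: number, optional whitespace, letters-only units.
_VALUE_UNITS = re.compile(
    r'([+-]?(?:\d+\.?\d*|\.\d+)(?:[eE][+-]?\d+)?)'  # group 1: the number
    r'\s*'                                          # optional whitespace
    r'([A-Za-z]+)'                                  # group 2: the units
)

def split_value_units(s):
    """Split a string like '144mHz' into (144.0, 'mHz')."""
    if not isinstance(s, str):
        raise TypeError('string expected')
    # fullmatch() rejects leading or trailing text the pattern does not cover
    m = _VALUE_UNITS.fullmatch(s)
    if m is None:
        raise ValueError(f'expected a number followed by units: {s!r}')
    number, units = m.groups()
    return float(number), units

for arg in ['144mHz', '432 mHz', '1.25E-20watts']:
    print(repr(arg), split_value_units(arg))
```

Restricting the units to letters is what makes a bare number such as '144' or '1e5' raise ValueError instead of being split in an odd place; units like 'km/h' would need a wider character set.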

i wasn't thinking of this as "parsing" although i can understand that it is, even if just a small amount (kind of like str.split() is).