Python Forum

maybe re can do this but regular expressions just make no sense in my mind. i need to split a string with the separator(s) being any combination of one or more non-alphanumeric characters.

(Feb-03-2022, 06:23 PM)Skaperen Wrote: [ -> ]maybe re can do this but regular expressions just make no sense in my mind

Regex has come up in your threads before; it's not so difficult to understand if you make an effort 🤔
See the doc and regex101.

(Feb-03-2022, 06:23 PM)Skaperen Wrote: [ -> ]i need to split a string with the separator(s) being any combination of one or more non-alphanumeric characters.

>>> import re
>>> 
>>> s = 'hello@world'
>>> r = re.split('[^a-zA-Z0-9]', s)
>>> r
['hello', 'world']
>>> 
>>> s = 'hello@"!world *^ car'
>>> r = re.split('[^a-zA-Z0-9]', s)
>>> r
['hello', '', '', 'world', '', '', '', 'car']
>>> [x for x in r if x]
['hello', 'world', 'car']
>>> 
>>> s = '123` green ?... color'
>>> r = re.split('[^a-zA-Z0-9]', s)
>>> ' '.join([x for x in r if x])
'123 green color'
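
Since the question asks for separators of one or more characters, a small variation may be worth trying: adding + to the character class makes a whole run of separators count as one split point, and re.findall can collect the alphanumeric runs directly. This is only a sketch; the helper names split_alnum and find_alnum are made up for illustration.

```python
import re

def split_alnum(text):
    """Split text on runs of one or more non-alphanumeric ASCII characters."""
    # The + makes a whole run of separators count as a single split point.
    parts = re.split(r'[^a-zA-Z0-9]+', text)
    # A separator at the very start or end still yields an empty string, so drop those.
    return [p for p in parts if p]

def find_alnum(text):
    """Collect the alphanumeric runs directly instead of splitting around them."""
    return re.findall(r'[a-zA-Z0-9]+', text)

print(split_alnum('hello@"!world *^ car'))    # ['hello', 'world', 'car']
print(find_alnum('?123` green ?... color!'))  # ['123', 'green', 'color']
```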

i do know how to make some of the simpler regular expressions, but reading documentation and tutorials has left me more confused. once i have a question, it is usually so localized that i cannot understand the next part until that question is answered. vague descriptions can also do that to me. i am a detail thinker that most authors are not writing for.

in the example code i see 3 examples but the results don't make sense unless you have omitted the separation characters (i don't see any). the characters could each be any character of chr(x) for x in range(1114112) (or, at least, the printable ones). there would be a string of them. i'm thinking of a non-re way involving a loop and many calls to split() (only in my head, for now), but if re can do this, then i should do it that way.

i am totally clueless on that regex101 page. no idea how to start.

(Feb-06-2022, 01:01 AM)Skaperen Wrote: [ -> ]the characters could each be any character of chr(x) for x in range(1114112)

It would split on any other Unicode character (or characters) too.

>>> import re
>>> 
>>> s = 'hello🤨world'
>>> r = re.split('[^a-zA-Z0-9]', s)
>>> r
['hello', 'world']
>>> 
>>> s = 'hello🤨world記者car'
>>> r = re.split('[^a-zA-Z0-9]', s)
>>> r
['hello', 'world', '', 'car']
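
Note that [^a-zA-Z0-9] treats 記者 as separators, because only ASCII letters and digits are listed. If the goal is to keep all Unicode letters and digits together, one possible sketch is to split on \W instead, assuming the underscore should also act as a separator. The names UNICODE_SEPARATORS and split_unicode_alnum are made up for this example.

```python
import re

# \W matches anything that is not a Unicode word character; the underscore
# counts as a word character, so it is added to the class explicitly.
UNICODE_SEPARATORS = re.compile(r'[\W_]+')

def split_unicode_alnum(text):
    """Split on runs of characters that are not Unicode letters or digits."""
    return [p for p in UNICODE_SEPARATORS.split(text) if p]

print(split_unicode_alnum('hello🤨world記者car'))  # ['hello', 'world記者car']
```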

Quote:i am totally clueless on that regex101 page. no idea how to start.

Start simple, e.g. with hello123world, and say you want to find 123 on regex101.
Paste in the string and test a regex pattern in the field at the top; on the right you see EXPLANATION and Match info.
Then test it in Python.

>>> import re
>>> 
>>> s = 'hello123world'
>>> re.findall(r'\d+', s)
['123']

You can then test other functions of the re module.

>>> import re
>>> 
>>> s = 'hello123world'
>>> re.split(r'\d+', s)
['hello', 'world']
>>> 
>>> # Make it a group
>>> re.split(r'(\d+)', s)
['hello', '123', 'world']
>>>
>>> re.search(r'(\d+)', s)
<re.Match object; span=(5, 8), match='123'>
>>> r = re.search(r'(\d+)', s)
>>> r.group(1)
'123'
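
Another function that may be worth testing the same way is re.sub, which replaces each match instead of splitting on it. A small sketch, reusing the strings from the earlier examples:

```python
import re

s = 'hello123world'
# Replace each run of digits with a single dash.
dashed = re.sub(r'\d+', '-', s)
print(dashed)  # hello-world

# Collapse every run of non-alphanumeric characters into one space.
cleaned = re.sub(r'[^a-zA-Z0-9]+', ' ', '123` green ?... color').strip()
print(cleaned)  # 123 green color
```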

i want to specify which characters to be split on. any character not specified is not a character to split on. the example would invoke str method split() for each character in the list in a way to accumulate the splits properly. if ',' is not in the list then splitting will not happen at ',' and they will be passed along into the str objects in the result list. this is split by (each) single character, not strings.

if i do implement this, i will try to implement support for bytes, but that is not my goal and i don't expect this from re.
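
The loop-and-split() idea described above could be sketched roughly like this. It is only one way to accumulate the splits, it handles str only, and the name split_on_chars is made up for illustration.

```python
def split_on_chars(text, chars):
    """Split text at every occurrence of any single character in chars."""
    pieces = [text]
    for ch in chars:
        next_pieces = []
        for piece in pieces:
            # str.split(ch) cuts this piece at each occurrence of ch.
            next_pieces.extend(piece.split(ch))
        pieces = next_pieces
    return pieces

print(split_on_chars('a,b;c d', ';,'))  # ['a', 'b', 'c d']
print(split_on_chars('a,;b', ',;'))     # ['a', '', 'b']
```

Characters that are not in chars (the space in the first call) pass through untouched, and adjacent separators produce empty strings, the same as with a single str.split() call.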

Skaperen Wrote:i want to specify which characters to be split on.

You can specify whatever you want.
Let's say you don't want to split on these characters.

>>> print([chr(x)for x in range(5000, 5010)])
['ᎈ', 'ᎉ', 'ᎊ', 'ᎋ', 'ᎌ', 'ᎍ', 'ᎎ', 'ᎏ', '᎐', '᎑']

>>> import re
>>> 
>>> s = 'hello🤨world@carᎈᎉbus'
>>> r = re.split('[^a-zA-Z0-9ᎈᎉᎊᎋᎌᎍᎎᎏ᎐᎑]', s)
>>> r
['hello', 'world', 'carᎈᎉbus']

i see you are customizing the regular expression for the specific range of characters. do i need that ^a-zA-Z0-9 part? does this code make sense? is there no way to make a regular expression use a str argument?

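import re

def splitchars(pattern=None,chars=None):
    # pattern is the string to split; chars holds the characters to split on
    if not isinstance(pattern,str):
        raise TypeError('pattern (arg 1) is not a str')
    if not isinstance(chars,str):
        raise TypeError('chars (arg 2) is not a str')
    if not chars:
        return [pattern]
    # escape the backslash first so the backslashes added for [ and ] are not doubled
    # ('^' at the start and '-' between two characters are still special inside [...])
    e = chars.replace('\\','\\\\').replace('[',r'\[').replace(']',r'\]')
    return re.split(f'[{e}]',pattern)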

(Feb-06-2022, 11:22 PM)Skaperen Wrote: [ -> ]i see you are customizing the regular expression for the specific range of characters. do i need that ^a-zA-Z0-9 part?

You can put whatever you want in there; now it only splits on what's in the list.
I have taken out the ranges, e.g. a-z, which matches a single character in the range between a (index 97) and z (index 122) (case sensitive).

>>> import re
>>> 
>>> s = 'bus and cab'
>>> r = re.split(r'[tom🤨]', s)
>>> r
['bus and cab']
>>> 
>>> s = 'bus and taxi'
>>> r = re.split(r'[tom🤨]', s)
>>> r
['bus and ', 'axi']
>>>

Quote:does this code make sense?

Maybe, if it does what you want. Regex could simplify the line with the three replace() calls,
but 3 replaces is kind of okay.
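
One way that line might be simplified is re.escape, which backslash-escapes the regex metacharacters in a string, including \ [ ] ^ and -. The following is only a sketch of that idea; the name split_chars_escaped is made up for illustration.

```python
import re

def split_chars_escaped(text, chars):
    """Split text at any single character found in chars, using re."""
    if not chars:
        return [text]
    # re.escape backslash-escapes regex metacharacters such as \ [ ] ^ and -,
    # so every character in chars is treated literally inside the [...] class.
    char_class = '[' + re.escape(chars) + ']'
    return re.split(char_class, text)

print(split_chars_escaped('a-b^c]d', '^-]'))  # ['a', 'b', 'c', 'd']
print(split_chars_escaped('x\\y,z', '\\'))    # ['x', 'y,z']
```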

Quote:is there no way to make a regular expression use a str argument?

Of course, that is what regex takes in: a string (which is all Unicode in Python 3).

Doc Wrote:Both patterns and strings to be searched can be Unicode strings (str) as well as 8-bit strings (bytes).
However, Unicode strings and 8-bit strings cannot be mixed: that is,
you cannot match a Unicode string with a byte pattern or vice-versa; similarly,
when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.
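
To illustrate the quoted rule, here is a small sketch with bytes input. The sample data is made up; the point is only that the pattern and the string being split must be of the same type.

```python
import re

data = b'hello@world *^ car'
# A bytes pattern (note the b prefix) works on bytes input.
parts = re.split(rb'[^a-zA-Z0-9]+', data)
print(parts)  # [b'hello', b'world', b'car']

# Mixing a str pattern with bytes input raises TypeError.
mixed_raised = False
try:
    re.split(r'[^a-zA-Z0-9]+', data)
except TypeError as exc:
    mixed_raised = True
    print('TypeError:', exc)
```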
