Python Forum
i am looking for a simpler tokenizer
#1
the tokenize module is too much for what i need. what i need is basically like str.split() but with support for quoted strings, including quotes within different quotes.

    s = 'alpha  beta  " gamma  epsilon"  omega'
    r = simple_tokenize(s)
r would get
Output:
['alpha', 'beta', ' gamma  epsilon', 'omega']
triple quotes are not essential but it would be OK if they work right. backslashes are not essential but it would be OK if they work right. both single and double quotes are what is needed.

anyone know a way to get this in the Python3 library?

i edited this to use output because inline didn't show the double space.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
You could at least try the tokenize module. I have never used it, but it seems to be designed for parsing Python code.

Another option is to use ply or its newer version, sly.

Your format doesn't look too complex, so you could probably parse it yourself using regular expressions.
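For instance, a regex sketch along these lines (just an illustration, not a definitive implementation; `simple_tokenize` is a made-up name, and it assumes quotes are delimited by whitespace and ignores backslash escapes):

```python
import re

def simple_tokenize(s):
    # Match, in order: a double-quoted string, a single-quoted string,
    # or a bare run of non-whitespace. The quote characters themselves
    # are discarded; only the group contents are kept.
    pattern = r'"([^"]*)"|\'([^\']*)\'|(\S+)'
    return [dq or sq or bare for dq, sq, bare in re.findall(pattern, s)]

s = 'alpha  beta  " gamma  epsilon"  omega'
print(simple_tokenize(s))  # ['alpha', 'beta', ' gamma  epsilon', 'omega']
```

Quotes within different quotes work too, since a single quote inside `[^"]*` is just an ordinary character.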
#3
Can you show what you want to see as output for triple quotes? This doesn't even need a regex if it's what I think you want, but I need to be sure.
#4
i don't care about triple quotes for this. they won't be used. but if it does support them, i would expect a behavior similar to Python code.

this use case is not for parsing Python code. it's for simple commands that you might see given to a simple shell. things like ( or ) or [ or ] or { or } do not need to be parsed, either. but if they are supported, i expect each such character not inside quotes to be separated into its own token. they will not be used outside of quotes.

escape sequences, other than escaping quotes inside quotes, will not be used, either.
#5
Ok. In that case I think a modest loop can do all you ask. Just iterate and break on the supported grouping symbols, using counters to track nesting.
for example z = 'abc[d{e}f] "xyz 'bbb' www" '

The logic is: if it's a grouping symbol, increment or decrement its counter, stop the current token (if any) and register it, then start a new token.

Walking the example: a, ab, abc, then a grouping symbol: the bracket counter adds 1, stop, abc is a token, the bracket itself is discarded. d, then a brace: stop, the brace counter is incremented, d is a token (abc, d). e is a new token. End brace: the brace counter is decremented, stop, e is a token (abc, d, e). Find f, then the bracket counter is decremented (abc, d, e, f). Next comes a space AND all counters are at zero, so stop the current token and register it (none here, ignore). The double-quote counter increments, and so on. You end up with abc, d, e, f, xyz, bbb, www, and all the counters are back at zero at the end, so the input was legal. Likewise "abc def" gives one token, abc def, because spaces are taken as part of the token while the counters are not all zero. (You can play with that rule; maybe you want the space kept in one big token if it's quoted but not if it's in braces?)

Does that make sense? If you have more than 4 or 5 grouping symbols, you need to organize them better than loose variables, but the algorithm is the same: it's just counts and conditions in a loop.

One detail: as I stated this, your gamma epsilon keeps its leading space, as you show, due to the 'space inside groups' rule. If you want spaces outside groups kept this way as well, you need extra logic to determine whether a space is a token break or part of a token, with whatever rules you want to tie to that. If the leading space was not really wanted, you would need extra logic on the other side instead. Space may be your trickiest thing to handle if your rules are complex around it.
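A minimal sketch of that loop in Python, assuming the rules from post #4: single and double quotes group, each bracket outside quotes becomes its own token (kept rather than discarded, since the OP expects that), brackets never appear inside quotes, and quote groups don't nest. `loop_tokenize` is just an illustrative name:

```python
def loop_tokenize(s):
    tokens = []
    current = ''
    quote = None                  # active quote character, or None
    for c in s:
        if quote:                 # inside a quoted group: spaces are kept
            if c == quote:
                tokens.append(current)   # register token, quotes discarded
                current = ''
                quote = None
            else:
                current += c
        elif c in '\'"':          # opening quote stops the current token
            if current:
                tokens.append(current)
                current = ''
            quote = c
        elif c in '()[]{}':       # each grouping symbol is its own token
            if current:
                tokens.append(current)
                current = ''
            tokens.append(c)
        elif c.isspace():         # space outside groups is a token break
            if current:
                tokens.append(current)
                current = ''
        else:
            current += c
    if current:
        tokens.append(current)
    return tokens

z = 'abc[d{e}f] "xyz \'bbb\' www" '
print(loop_tokenize(z))
# ['abc', '[', 'd', '{', 'e', '}', 'f', ']', "xyz 'bbb' www"]
```

You could extend it to flag an unclosed quote at the end (quote still set) as an error, which is the "counters back at zero" legality check described above.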
#6
import string


def tokenize(s):
    # keep only ASCII letters and spaces, then split on whitespace
    return ''.join(c for c in s if c in string.ascii_letters + ' ').split()

s = 'alpha  beta  " gamma  epsilon"  omega'
tokenize(s)
Output:
['alpha', 'beta', 'gamma', 'epsilon', 'omega']
Just stripping away everything that is not in string.ascii_letters (no digits) or a space, then using a simple split to separate the tokens.

If you also want to preserve digits, add them. Same with special characters like ( or [.
Almost dead, but too lazy to die: https://sourceserver.info
All humans together. We don't need politicians!
#7
Skaperen Wrote: it's for simple commands that you might see given to a simple shell.

I suggest shlex.split()
>>> s = 'alpha  beta  " gamma  epsilon"  omega'
>>> import shlex
>>> shlex.split(s)
['alpha', 'beta', ' gamma  epsilon', 'omega']
#8
i want quoted strings to be kept whole, including all the spaces. sure, i can implement this, myself (BTDT in C). in Python, i want to use what's there. that's the Python way. i just need to know what is there.

