Python Forum

Full Version: i would like a way to parse a line of python source like split
i would like to parse a line of python source instead of merely splitting it:
    line = "blank=' '"
    parts = line.split()
    parts -> ["blank='","'"]
    pieces = ??parse??(line)
    pieces -> ["blank","=","' '"]
i want to get what pieces ends up with.
"a[0]=int(ab[16])+2" -> ["a","[","0","]","=","int","(","ab","[","16","]",")","+","2"]
i need to parse some python source only one line at a time to make some edits by other scripts, although i am not sure how best to handle continuations. comments should be one big string. if white spaces are always included, that's ok.
You can use the tokenize module to get a list of tokens from python code:
>>> import io, tokenize
>>> f = io.BytesIO(b"a[0]=int(ab[16])+2")
>>> tokens = list(tokenize.tokenize(f.readline))
>>> pp(tokens)
[
    tokenize.TokenInfo(
        type=57,
        string='utf-8',
        start=(0, 0),
        end=(0, 0),
        line=''
    ),
    tokenize.TokenInfo(
        type=1,
        string='a',
        start=(1, 0),
        end=(1, 1),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string='[',
        start=(1, 1),
        end=(1, 2),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=2,
        string='0',
        start=(1, 2),
        end=(1, 3),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string=']',
        start=(1, 3),
        end=(1, 4),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string='=',
        start=(1, 4),
        end=(1, 5),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=1,
        string='int',
        start=(1, 5),
        end=(1, 8),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string='(',
        start=(1, 8),
        end=(1, 9),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=1,
        string='ab',
        start=(1, 9),
        end=(1, 11),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string='[',
        start=(1, 11),
        end=(1, 12),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=2,
        string='16',
        start=(1, 12),
        end=(1, 14),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string=']',
        start=(1, 14),
        end=(1, 15),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string=')',
        start=(1, 15),
        end=(1, 16),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=53,
        string='+',
        start=(1, 16),
        end=(1, 17),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=2,
        string='2',
        start=(1, 17),
        end=(1, 18),
        line='a[0]=int(ab[16])+2'
    ),
    tokenize.TokenInfo(
        type=0,
        string='',
        start=(2, 0),
        end=(2, 0),
        line=''
    )
]
>>> [t.string for t in tokens]
['utf-8', 'a', '[', '0', ']', '=', 'int', '(', 'ab', '[', '16', ']', ')', '+', '2', '']
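To get exactly the list asked for, the encoding marker and the empty end-of-input tokens can be filtered out. The following is only a minimal sketch: `parse_line` and `_SKIP` are names made up for illustration, and it assumes each line is a complete statement on its own.

```python
import io
import tokenize

# Token types that carry no source text of interest for this purpose.
# (Skipping INDENT/DEDENT is a choice made here; leading whitespace is dropped.)
_SKIP = {
    tokenize.ENCODING, tokenize.NEWLINE, tokenize.NL,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}

def parse_line(line):
    """Return the token strings of one complete line of Python source."""
    readline = io.BytesIO(line.encode("utf-8")).readline
    return [tok.string for tok in tokenize.tokenize(readline)
            if tok.type not in _SKIP]

print(parse_line("blank=' '"))
print(parse_line("a[0]=int(ab[16])+2"))
print(parse_line("x = 1  # note"))  # the comment comes back as one string
```

A line that opens a bracket or a triple-quoted string without closing it will typically make tokenize raise tokenize.TokenError, so continuation lines would probably have to be joined before being passed in.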
However, if you're looking to modify the code, using ast with a custom NodeTransformer might be simpler. ast will get rid of comments though, so that might be a problem.
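As a rough illustration of the ast route, here is a sketch of a NodeTransformer that renames one variable. The class name `RenameName` and the example edit are hypothetical, and `ast.unparse` needs Python 3.9 or later. Note how the trailing comment disappears from the result.

```python
import ast

class RenameName(ast.NodeTransformer):
    """Rename every use of one variable name (hypothetical example edit)."""

    def __init__(self, old, new):
        self.old = old
        self.new = new

    def visit_Name(self, node):
        # Only identifier nodes are touched; string contents are left alone.
        if node.id == self.old:
            node.id = self.new
        return node

source = "a[0] = int(ab[16]) + 2  # update first slot"
tree = RenameName("ab", "values").visit(ast.parse(source))
result = ast.unparse(tree)  # regenerates source; comments and original spacing are lost
print(result)
```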
yeah, i am looking to modify code. but i am also looking to test if lines of code have a matching pattern. if so, the line will be subject to modification. if not, the line will be printed in whole. some modifications may need to wait until some later lines to determine what the modification is. some lines could be deleted.

the most difficult part is that things i will be looking for could be in quoted string literals. something that breaks up code from quoted string literal contents and comments would be useful, i think.
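One possible way to do that split is to sort the tokens from tokenize by type. This is only a sketch: `classify_line` is a made-up helper name, it assumes the line is a complete statement, and on Python 3.12+ f-strings are tokenized into separate FSTRING_* tokens that this version simply ignores.

```python
import io
import tokenize

def classify_line(line):
    """Split one complete line into (code, strings, comments) token lists."""
    code, strings, comments = [], [], []
    readline = io.BytesIO(line.encode("utf-8")).readline
    for tok in tokenize.tokenize(readline):
        if tok.type == tokenize.STRING:
            strings.append(tok.string)      # whole literal, quotes included
        elif tok.type == tokenize.COMMENT:
            comments.append(tok.string)     # from '#' to end of line
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.OP):
            code.append(tok.string)
    return code, strings, comments

code, strings, comments = classify_line("x = 'a # b' + y  # trailing")
print(code)
print(strings)
print(comments)
```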

it is possible that a comment line or a comment at the end of a line could match the pattern as a false positive. a # could also be in a string literal, giving the false appearance of a comment. comments could be a tough issue, though not as tough for Python as they were for the C/Pike code i have needed to modify, because of embedded comments.
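Matching against token types rather than raw text would avoid both kinds of false positive, since tokenize already decides whether a # starts a comment or sits inside a string. A small sketch, where `uses_name` is a hypothetical helper and the pattern is assumed to be a plain identifier:

```python
import io
import tokenize

def uses_name(line, name):
    """True if `name` appears as an identifier, not inside a string or comment."""
    readline = io.BytesIO(line.encode("utf-8")).readline
    return any(tok.type == tokenize.NAME and tok.string == name
               for tok in tokenize.tokenize(readline))

lines = [
    "total = count + 1",                 # real use of the name
    "print('count # not code')",         # name and '#' inside a string literal
    "x = 1  # count is mentioned here",  # name only in a comment
]
results = [uses_name(line, "count") for line in lines]
for hit, line in zip(results, lines):
    print(hit, line)
```

Only the first line reports a match; the other two mention the name solely inside a string literal or a comment.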