Python Forum
parsing python code
#1
is there a function to parse or lex python code? in particular i want to extract the name from a source line that is a def or class statement. i don't need anything else from the statement but i do want to know if it looks valid up to the first '(' in the sources.
Tradition is peer pressure from dead people

What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American.
#2
Python's lexer is tokenize.tokenize(). The syntactic analyzer is ast.parse(), but I don't think it will work with a line fragment.
>>> import io
>>> from tokenize import tokenize
>>> data = "def some_func(x, y"
>>> 
>>> for token in tokenize(io.BytesIO(data.encode()).readline):
...     print(token)
... 
TokenInfo(type=59 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def some_func(x, y')
TokenInfo(type=1 (NAME), string='some_func', start=(1, 4), end=(1, 13), line='def some_func(x, y')
TokenInfo(type=53 (OP), string='(', start=(1, 13), end=(1, 14), line='def some_func(x, y')
TokenInfo(type=1 (NAME), string='x', start=(1, 14), end=(1, 15), line='def some_func(x, y')
TokenInfo(type=53 (OP), string=',', start=(1, 15), end=(1, 16), line='def some_func(x, y')
TokenInfo(type=1 (NAME), string='y', start=(1, 17), end=(1, 18), line='def some_func(x, y')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.5/tokenize.py", line 597, in _tokenize
    raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))
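Since the tokens arrive one at a time, you can stop reading before the tokenizer reaches the unfinished part of the statement. Here is a minimal sketch of that idea. The helper name def_or_class_name is made up for illustration, and it assumes you only care about the tokens up to the first '(' (or ':' for a class without bases). It does not handle the newer "def foo[T](...)" type parameter syntax.

```python
import io
import keyword
import tokenize


def def_or_class_name(line):
    """Return (keyword, name) if `line` starts a def/class statement, else None.

    Hypothetical helper: only the leading tokens are inspected, so the rest
    of the line may be incomplete or even invalid.
    """
    # Strip leading indentation so methods inside a class are handled too.
    reader = io.StringIO(line.lstrip()).readline
    head = []
    try:
        for tok in tokenize.generate_tokens(reader):
            head.append(tok)
            # Allow one extra token for an optional leading 'async'.
            needed = 4 if head[0].string == "async" else 3
            if len(head) >= needed:
                break  # stop before the tokenizer can hit EOF
    except (tokenize.TokenError, SyntaxError):
        pass  # keep whatever tokens were produced before the error

    if head and head[0].string == "async":
        head = head[1:]
    if len(head) < 3:
        return None

    kw, name, opener = head[:3]
    if kw.type != tokenize.NAME or kw.string not in ("def", "class"):
        return None
    # The name must be an identifier, not a string literal or a keyword.
    if name.type != tokenize.NAME or keyword.iskeyword(name.string):
        return None
    allowed = {"("} if kw.string == "def" else {"(", ":"}
    if opener.string not in allowed:
        return None
    return kw.string, name.string
```

With this, def_or_class_name("def some_func(x, y") gives ('def', 'some_func') without raising, and a line such as 'def "foo" (' gives None because the second token is a STRING rather than a NAME.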
#3
then what i might need is a lexer that breaks things apart in the way a Python engine would need to do. that should work OK starting at the start of any line in the source code.

such code would leave variable names, reserved words (looking like variable names) and literals in a single string and most other things in a single character string (certain combinations would remain joined, such as ==, >=, **). thus if my line was
def foo(*bar, **baz):
i should get
('def','foo','(','*','bar',',','**','baz',')',':')
it might need to be a bit more elaborate to account for things like variable names vs string literals for the caller to know the difference. the return value should be an iterator in the ideal implementation.

i just want to tear the code apart at places that should come apart. clearly code like
def "foo" '(' bar ''')''':
makes no sense and tearing apart stuff like that needs something more.
('def','foo','(','bar',')',':')
would not be a very useful result.
#4
what i am doing right now is making a script that tears apart a big file of many functions into individual files. in the future i might need something more involved. for now i just need to detect the start of a new class or function and detect the end (2 empty lines).
#5
Quote:then what i might need is a lexer that breaks things apart in the way a Python engine would need to do.
This is what tokenize.tokenize() does.
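As a rough sketch of how that could look for a single complete line: the generator below (lex_line is a made-up name, not part of the standard library) yields (token type name, string) pairs, so the caller can tell a NAME from a STRING. It uses tokenize.generate_tokens(), the str-based variant of tokenize.tokenize(), and drops the bookkeeping tokens such as NEWLINE and ENDMARKER.

```python
import io
import tokenize

# Bookkeeping tokens that carry no source text of interest here.
_SKIP = {
    tokenize.NEWLINE,
    tokenize.NL,
    tokenize.INDENT,
    tokenize.DEDENT,
    tokenize.ENDMARKER,
}


def lex_line(line):
    """Yield (type_name, string) for each meaningful token in `line`.

    Hypothetical helper; a line that ends inside an open bracket will
    still raise tokenize.TokenError once the tokenizer reaches EOF.
    """
    for tok in tokenize.generate_tokens(io.StringIO(line).readline):
        if tok.type not in _SKIP:
            yield tokenize.tok_name[tok.type], tok.string


pieces = tuple(string for _, string in lex_line("def foo(*bar, **baz):"))
# pieces == ('def', 'foo', '(', '*', 'bar', ',', '**', 'baz', ')', ':')
```

Multi-character operators such as ** come out as one token, and a quoted "foo" is reported as STRING with its quotes kept, so it cannot be confused with the identifier foo.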
Quote:what i am doing right now is making a script that tears apart a big file of many functions into individual files.
The appropriate tool could be ast.parse(). One of the issues you'll have is that functions use global symbols defined at the file level, such as imported names or other functions' names.
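A minimal sketch of the ast approach, assuming Python 3.8 or later (for end_lineno) and a file that parses cleanly. The function name top_level_definitions is made up for illustration. It yields the name and source text of every top-level def or class, which you could then write out to separate files. It does not copy imports or other module-level names, so the pieces may not run on their own.

```python
import ast


def top_level_definitions(source):
    """Yield (name, source_text) for each top-level function or class.

    Hypothetical helper: relies on node.end_lineno (Python 3.8+).
    """
    tree = ast.parse(source)
    lines = source.splitlines(keepends=True)
    wanted = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    for node in tree.body:
        if isinstance(node, wanted):
            # Start at the first decorator, if any, rather than the def line.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            yield node.name, "".join(lines[start - 1:node.end_lineno])


sample = (
    "import os\n"
    "\n"
    "\n"
    "def foo(x):\n"
    "    return x + 1\n"
    "\n"
    "\n"
    "class Bar:\n"
    "    def baz(self):\n"
    "        return os.getcwd()\n"
)
found = dict(top_level_definitions(sample))
# found has the keys 'foo' and 'Bar'; Bar's text still refers to os,
# which is imported at the file level and is not part of the extracted piece.
```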
#6
(Feb-25-2021, 05:33 AM)Gribouillis Wrote: One of the issues you'll have is that functions use global symbols defined at the file level such as imported names or other function's names.
does that cause the name the function or class uses to change? i'm only parsing my own code. so, if it's something i don't know about then i won't code it and thus won't run into it.