Python Forum
Simple Recursive-Descent Parser - Python 3.7
#1
Hello.
I'm implementing a logic system from the ground up using this framework: https://www.logicthrupython.org/api/index.html

Now, for implementing this method: https://www.logicthrupython.org/api/pred...rse_prefix
I had to implement a recursive-descent parser.

I'm not sure my design is all that good or that my code is Pythonic enough, so I'd be glad to share my code below and hear recommendations/modifications.

Also, you can see another question I have, marked with a TODO comment in the code:

Pastebin link: https://pastebin.com/zTF26FAh
#2
WannaBePythonDev Wrote:I'd be glad to share my code below, and hear recommendations/modifications.
It seems to me that the first missing piece is a unittest.TestCase class to run the parser against a set of formulas and check that it outputs what it should. Having such tests would be a good safety net before starting modifications.
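A minimal sketch of such a test class, assuming a parse_prefix(string) entry point that returns a (parsed, remainder) pair; the stand-in parser below only recognizes a leading variable name and exists solely so the example runs on its own (the real code is in the pastebin):

```python
import re
import unittest

# Stand-in for the real parser: it only recognizes a leading variable
# name such as 'x12' and returns a (parsed_prefix, remainder) pair,
# or (None, error_message) when no valid prefix is found.
def parse_prefix(string):
    match = re.match(r"[p-z][0-9]*", string)
    if match is None:
        return None, "Expected a variable name"
    return match.group(0), string[match.end():]

class TestParsePrefix(unittest.TestCase):
    def test_whole_variable_is_consumed(self):
        parsed, remainder = parse_prefix("x12&y")
        # The entire variable name 'x12' must be parsed, not just 'x1'.
        self.assertEqual(parsed, "x12")
        self.assertEqual(remainder, "&y")

    def test_error_on_invalid_prefix(self):
        parsed, message = parse_prefix("&x")
        self.assertIsNone(parsed)
        self.assertIsInstance(message, str)
```

Run it with python -m unittest; each formula you care about becomes one more method, or a loop over (input, expected) pairs.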

Reporting errors is often a tricky part of writing parsers. This language is simple enough for the parser to tell exactly at which position it encounters an error in the input string, in the same way that Python reports the exact position when it encounters a syntax error. To implement this, you may consider tokenizing the string prior to RD parsing. You could also use Python's own tokenizer for this task, for example:
>>> from tokenize import tokenize
>>> import io
>>> source = "(p23 & p12) -> T"
>>> for token in tokenize(io.BytesIO(source.encode()).readline):
...     print(token)
... 
TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='(p23 & p12) -> T')
TokenInfo(type=1 (NAME), string='p23', start=(1, 1), end=(1, 4), line='(p23 & p12) -> T')
TokenInfo(type=54 (OP), string='&', start=(1, 5), end=(1, 6), line='(p23 & p12) -> T')
TokenInfo(type=1 (NAME), string='p12', start=(1, 7), end=(1, 10), line='(p23 & p12) -> T')
TokenInfo(type=54 (OP), string=')', start=(1, 10), end=(1, 11), line='(p23 & p12) -> T')
TokenInfo(type=54 (OP), string='->', start=(1, 12), end=(1, 14), line='(p23 & p12) -> T')
TokenInfo(type=1 (NAME), string='T', start=(1, 15), end=(1, 16), line='(p23 & p12) -> T')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 16), end=(1, 17), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
Also, using assertions is not a Pythonic way to stop parsing when there is an error; raise your own exception instead, for example:
class ParseError(RuntimeError):
    pass
...
if not op_remainder:
    raise ParseError("The remainder of the binary operator should be non-empty")
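When the parser tracks positions (as the tokenizer suggestion above makes possible), the exception can carry the offending position; a possible sketch, where the fail helper and the message layout are just one design among others:

```python
class ParseError(RuntimeError):
    """Raised when the input is not a well-formed formula."""

    def __init__(self, message, position):
        super().__init__(message)
        self.position = position

def fail(source, position, message):
    # Build a message that points at the offending character,
    # similar to Python's own SyntaxError display.
    marker = " " * position + "^"
    raise ParseError(f"{message}\n{source}\n{marker}", position)

try:
    fail("(p23 & ", 7, "unexpected end of input")
except ParseError as error:
    print(error.position)
    print(error)
```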
#3
(Sep-13-2021, 08:05 AM)Gribouillis Wrote: [quoted in full above]

First, thank you for reviewing my code and sharing your idea with me.

As for unit testing - actually, I'm testing my framework using pytest. Is there some reason unittest would be better?

As for the tokenizer - I don't think I fully understood the purpose of this tokenization with regard to my code. Are you recommending it as an addition to my code? I don't clearly see how I can use it to my benefit. I'd be glad if you could elaborate on that.

As for assertions - how do I know which standard exception I should inherit from when raising an exception? For example, you inherited from RuntimeError in your example - what's the reasoning behind this choice?

BTW - Can you look at the TODO in line 8?

Another question - is it fine to implement the Parser as a class?
This is where and how I use this parser, in a method of the Formula class:

    @staticmethod
    def _parse_prefix(string: str) -> Tuple[Union[Formula, None], str]:
        """Parses a prefix of the given string into a formula.

        Parameters:
            string: string to parse.

        Returns:
            A pair of the parsed formula and the unparsed suffix of the string.
            If the given string has as a prefix a variable name (e.g.,
            ``'x12'``) or a unary operator follows by a variable name, then the
            parsed prefix will include that entire variable name (and not just a
            part of it, such as ``'x1'``). If no prefix of the given string is a
            valid standard string representation of a formula then returned pair
            should be of ``None`` and an error message, where the error message
            is a string with some human-readable content.
        """
        # Task 1.4
        return Parser().parse_formula(string)
Makes sense?
#4
WannaBePythonDev Wrote:As for unit testing - actually, I'm testing my framework using pytest. Is there some reason unittest would be better?
Not especially; many developers prefer pytest because it's easier to parametrize tests with it. Unittest is in the standard library, so everybody can run the code without installing anything. It is too bad you didn't include the tests for your parser; they would have allowed us to run the code.

WannaBePythonDev Wrote:As for the tokenizer - I don't think I fully understood the purpose of this tokenization with regard to my code. Are you recommending it as an addition to my code? I don't clearly see how I can use it to my benefit. I'd be glad if you could elaborate on that.
In the field of language parsing, it is customary to separate the lexical analyser from the parser. Of course it is not mandatory, but in your case I see two advantages:
  • A lexer separates the input into tokens which carry additional information: the token type and the exact position of the token in the input string. You can use this position information to generate detailed error messages. This is time saved for the future users of the parser.
  • You could use Python's own lexer and filter the stream of tokens, thus using an extremely robust lexer.
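As an illustration, the filtering could be as small as this (the lex name and the set of discarded token types are my own choices, not part of the framework):

```python
import io
from tokenize import tokenize, COMMENT, ENCODING, ENDMARKER, NEWLINE, NL

def lex(source):
    """Yield (type, text, column) triples for a formula string,
    reusing Python's tokenizer and dropping its bookkeeping tokens."""
    skipped = {COMMENT, ENCODING, ENDMARKER, NEWLINE, NL}
    for token in tokenize(io.BytesIO(source.encode()).readline):
        if token.type not in skipped:
            yield token.type, token.string, token.start[1]

for kind, text, column in lex("(p23 & p12) -> T"):
    print(text, column)
```

The column of each token is exactly what a detailed error message needs; note that Python's tokenizer already treats '->' as a single operator token, which matches this language.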

WannaBePythonDev Wrote:As for assertions - how do I know which standard exception I should inherit from when raising an exception? For example, you inherited from RuntimeError in your example - what's the reasoning behind this choice?
RuntimeError is raised for any error that doesn't fall into the other built-in exception categories. Unless your own exceptions do fall into one of those categories, it is a good idea to make them subclasses of RuntimeError; that is consistent with the documentation.

WannaBePythonDev Wrote:BTW - Can you look at the TODO in line 8?
Unfortunately, I'm not a Windows programmer and I don't know how to use Microsoft tools.

WannaBePythonDev Wrote:Another question - is it fine to implement the Parser as a class?
I think it is perfectly sensible, as it is what most parsing modules do. One could object that your parser doesn't need to maintain state during parsing, which makes the object unnecessary; but it is reasonable to expect that after a few iterations of the code you'll want to add some state to the parser, so keep the class.
#5
(Sep-13-2021, 05:06 PM)Gribouillis Wrote: [quoted in full above]

Great. Thank you so much.

As for the lexer - actually, I tried implementing my Parser using a lexer, but I ran into hard typing issues. Eventually, I gave up and left it this way.

######## Start of my trial to implement a firmer hierarchy as befits a RD parser ########
# class TokenType(str, Enum):
#     And: str = "&"
#     Or: str = "|"
#     Imply: str = "->"
#     Neg: str = "~"
#     T: str = "T"
#     F: str = "F"


##### Token types: #####
# BinaryOp = Union[TokenType.And, TokenType.Or, TokenType.Imply]
# UnaryOp = Union[TokenType.Neg]

#### Tokens: ####
# Variable = str

# Constant = Union[TokenType.T, TokenType.F]

# @dataclass
# class UnaryFormula:
#     op: UnaryOp
#     operand: Optional[Formula]  # To follow the typing of Formula class,
#     # though it doesn't make sense for this field to be None

# @dataclass
# class BinaryFormula:
#     left: Optional[Formula]
#     right: Optional[Formula]
#     op: BinaryOp

# ParserFormula = Union[Variable, Constant, UnaryFormula, BinaryFormula]

# def parser_adapter_to_formula(formula_to_adapt: ParserFormula) -> Optional[Formula]:
#     if isinstance(formula_to_adapt, Variable):
#         return Formula(formula_to_adapt, None, None)
#     if isinstance(formula_to_adapt, UnaryFormula):
#         return Formula(formula_to_adapt.op, formula_to_adapt.operand, None)
#     if isinstance(formula_to_adapt, BinaryFormula):
#         return Formula(
#             formula_to_adapt.op, formula_to_adapt.left, formula_to_adapt.right
#         )
#     return None

######## End of trial to implement a firmer hierarchy ########
Of course, I tried using this hierarchy throughout the rest of the code, but it caused the typing issues I mentioned...
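For what it's worth, the construct that trips the type checker above is the Union of enum members: Union combines types, whereas typing.Literal (PEP 586, in the standard library from Python 3.8, or from typing_extensions on 3.7) is the construct that accepts specific members. A sketch of how the hierarchy might be typed, with the coupling to Formula dropped for brevity:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Literal, Union  # on Python 3.7: from typing_extensions import Literal

class TokenType(str, Enum):
    And = "&"
    Or = "|"
    Imply = "->"
    Neg = "~"
    T = "T"
    F = "F"

# Literal, not Union, expresses "exactly one of these members":
BinaryOp = Literal[TokenType.And, TokenType.Or, TokenType.Imply]
UnaryOp = Literal[TokenType.Neg]
Constant = Literal[TokenType.T, TokenType.F]
Variable = str

@dataclass
class UnaryFormula:
    op: UnaryOp
    operand: "ParserFormula"

@dataclass
class BinaryFormula:
    left: "ParserFormula"
    right: "ParserFormula"
    op: BinaryOp

ParserFormula = Union[Variable, Constant, UnaryFormula, BinaryFormula]
```

With this shape the Optional fields are no longer needed: a parse either produces a complete subtree or fails with an error, so the adapter back to Formula never has to handle None operands.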