Python Forum
Regex higher IR library
#1
Before I delve into making my own module, I was wondering whether there already exists a higher-level IR (intermediate representation) of regex tokens for Python.
I can import the actual regex parser from re like:
from re._parser import parse  # sre_parse.parse before Python 3.11

parse("reg|ex")
and it gives an output like
Output:
[(BRANCH, (None, [[(LITERAL, 114), (LITERAL, 101), (LITERAL, 103)], [(LITERAL, 101), (LITERAL, 120)]]))]
but it's not in a very useful format for doing other stuff with. Similar libraries exist elsewhere, like Rust's regex_syntax::hir or JS's regexp-tree.
For Python, I find searching for regex-related projects online pretty difficult, as most results end up being something along the lines of "how to use re or regex in Python", but I would be surprised if nothing like those other projects exists for Python. It wouldn't need to support advanced regex features like lookaheads; it would just need a better way of representing regex tokens.

Does anyone know of a module like this?
Thanks,
Dream
Reply
#2
(Jun-04-2022, 11:37 PM)DreamingInsanity Wrote: but it's not in a very useful format for doing other stuff with.
Why is this not useful enough, and what do you want to do with it? It looks like a regular syntax tree (by the way, on my system it is sre_parse.parse()).
Reply
#3
(Jun-05-2022, 05:46 AM)Gribouillis Wrote:
(Jun-04-2022, 11:37 PM)DreamingInsanity Wrote: but it's not in a very useful format for doing other stuff with.
Why is this not useful enough, and what do you want to do with it? It looks like a regular syntax tree (by the way, on my system it is sre_parse.parse()).
I called it not useful because there's no documentation on the syntax tree, which makes it hard to understand at times, and I wouldn't say it's designed to be easy to use outside the re module. Having objects or classes instead of lists of tuples just makes the syntax tree easier to work with, for example:
from dataclasses import dataclass
from typing import Any

@dataclass
class SubPattern:
	pat: Any

@dataclass
class Repetition:
	pat: Any
	greedy: bool

@dataclass
class WordChar:  # the \w character class
	pass

# (\w+) becomes:
SubPattern(Repetition(WordChar(), True))
# instead of
# [(SUBPATTERN, (1, 0, 0, [(MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_WORD)])]))]))]
Gribouillis Wrote: what do you want to do with it
I want to be able to generate a "priority" for a regex. For example, (abc) has a higher priority than ([a-c]+): the former is more explicit and will only match abc, whereas the latter matches one or more of the characters a, b, or c. And (a|bc) would have a priority higher than ([a-c]+) but lower than (abc), as its shortest match, a, is shorter than abc, and longer regexes are prioritised more.
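To make the idea concrete, here's a rough sketch of computing such a priority straight from the parser's tuple output. The weights are made up for illustration (2 per literal character, 1 per character class, an alternation scored by its weakest branch), not logos' actual rules:

```python
# Hypothetical priority scorer over sre_parse's tuple output
# (sre_parse is re._parser in Python 3.11+, where the old name
# still works but is deprecated).
import sre_parse
from sre_constants import LITERAL, IN, SUBPATTERN, MAX_REPEAT, BRANCH

def priority(pattern: str) -> int:
    def walk(tokens) -> int:
        score = 0
        for op, arg in tokens:
            if op == LITERAL:
                score += 2                    # exact character: most specific
            elif op == IN:
                score += 1                    # character class: less specific
            elif op == SUBPATTERN:
                score += walk(arg[-1])        # recurse into the group body
            elif op == MAX_REPEAT:
                lo, _hi, body = arg
                score += lo * walk(body)      # count only the guaranteed part
            elif op == BRANCH:
                # an alternation is only as specific as its weakest branch
                score += min(walk(b) for b in arg[1])
        return score
    return walk(sre_parse.parse(pattern))

print(priority("abc"))     # 6
print(priority("a|bc"))    # 2
print(priority("[a-c]+"))  # 1
```

With these weights the ordering comes out as described: abc beats a|bc, which beats [a-c]+.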
Reply
#4
I don't know of such a module, but with a little effort you could build your own by transforming those tuples into class instances, with the help of the code in re/_parser.py in the standard library (or sre_parse.py in older versions).
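For example, a minimal sketch of that transformation; the class names and the handful of opcodes covered are illustrative only:

```python
# Sketch: lift sre_parse's tuples into small dataclasses.
# sre_parse is re._parser in Python 3.11+.
from dataclasses import dataclass
import sre_parse
from sre_constants import LITERAL, IN, SUBPATTERN, MAX_REPEAT, BRANCH

@dataclass
class Literal:
    char: str

@dataclass
class CharClass:
    items: list          # raw (op, arg) items, e.g. ranges or categories

@dataclass
class Repetition:
    lo: int
    hi: int
    body: list

@dataclass
class Group:
    number: int
    body: list

@dataclass
class Alternation:
    branches: list

def lift(tokens) -> list:
    out = []
    for op, arg in tokens:
        if op == LITERAL:
            out.append(Literal(chr(arg)))
        elif op == IN:
            out.append(CharClass(list(arg)))
        elif op == MAX_REPEAT:
            lo, hi, body = arg
            out.append(Repetition(lo, hi, lift(body)))
        elif op == SUBPATTERN:
            out.append(Group(arg[0], lift(arg[-1])))
        elif op == BRANCH:
            out.append(Alternation([lift(b) for b in arg[1]]))
        else:
            out.append((op, arg))     # pass unhandled opcodes through as-is
    return out

print(lift(sre_parse.parse(r"(\w+)")))
```

Lifting r"(\w+)" this way yields a Group containing a Repetition containing a CharClass, which is much closer to the shape you sketched above.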

The idea of building an order relation for the regular expressions looks creative. What are you going to do with this order relation?
Reply
#5
(Jun-05-2022, 03:12 PM)Gribouillis Wrote: The idea of building an order relation for the regular expressions looks creative. What are you going to do with this order relation?
Just for a bit of fun (and also to use in another project), I felt like recreating the Rust logos crate to be able to create custom lexers. I know there already exist many Python modules for this; I just felt recreating it would be fun.
Logos uses a priority system to determine the priority of the tokens you define, which includes parsing regexes into an AST to derive a priority for them. I want to make my Python version as close to logos as possible, so I wanted to do that too, hence the wish for an AST that's a bit easier to work with.
Reply
#6
The closest thing I can think of in Python is the lexer in the PLY module, where tokens are also specified as regular expressions. As David Beazley explains in the documentation, ply sorts the regexes by decreasing length to define priority. Instead of building a DFA as logos seems to do, ply builds a master regex and invokes the re module.

Of course, you cannot expect a lexer in Python to be as blazingly fast as one in Rust. Apart from the regex-sorting part, logos reminds me of the venerable flex from C, and I guess it has similar performance. For most uses, however, the lexer is usually not the bottleneck.
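For illustration, the ply approach boils down to something like this; the token names and patterns here are made up:

```python
# Combine token regexes into one master regex with named groups,
# sorted longest-first as a crude priority (as ply does for string
# tokens), then let re do the matching.
import re

tokens = {
    "NUMBER": r"\d+",
    "IDENT":  r"[A-Za-z_]\w*",
    "PLUS":   r"\+",
}

# Longest pattern first, mimicking ply's sorting.
ordered = sorted(tokens.items(), key=lambda kv: len(kv[1]), reverse=True)
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in ordered))

def lex(text):
    for m in master.finditer(text):
        yield m.lastgroup, m.group()

print(list(lex("foo + 42")))  # [('IDENT', 'foo'), ('PLUS', '+'), ('NUMBER', '42')]
```

m.lastgroup tells you which alternative matched, so one pass of finditer gives you the token stream.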
Reply
#7
(Jun-05-2022, 06:21 PM)Gribouillis Wrote: The closest thing I can think of in Python is the lexer in the PLY module, where tokens are also specified as regular expressions. As David Beazley explains in the documentation, ply sorts the regexes by decreasing length to define priority. Instead of building a DFA as logos seems to do, ply builds a master regex and invokes the re module.

Of course, you cannot expect a lexer in Python to be as blazingly fast as one in Rust. Apart from the regex-sorting part, logos reminds me of the venerable flex from C, and I guess it has similar performance. For most uses, however, the lexer is usually not the bottleneck.
Yeah, speed was never an issue for me because I don't plan to use it for lexing anything large, more just smaller files, and it's a way to work on something new for a bit.
I'll check out what you linked too, it looks interesting.

Thanks,
Dream
Reply

