Python Forum
Regex higher IR library
#1
Before I delve into making my own module, I was wondering whether there already exists a higher-level IR (intermediate representation) of regex tokens for Python.
I can import the actual regex parser from re like:
from re._parser import parse  # sre_parse.parse before Python 3.11

parse("reg|ex")
and it gives an output like
Output:
[(BRANCH, (None, [[(LITERAL, 114), (LITERAL, 101), (LITERAL, 103)], [(LITERAL, 101), (LITERAL, 120)]]))]
but it's not in a very useful format for doing other stuff with. Similar libraries exist elsewhere, like Rust's regex_syntax::hir or JS's regexp-tree.
For Python, I find searching for regex-related projects online pretty difficult, as most results end up being something along the lines of "how to use re or regex in Python", but I would be surprised if nothing like those other projects exists for Python. It wouldn't need to support advanced regex features like lookaheads; it would just need a better way of representing regex tokens.

Does anyone know of a module like this?
Thanks,
Dream
Reply
#2
(Jun-04-2022, 11:37 PM)DreamingInsanity Wrote: but it's not in a very useful format for doing other stuff with.
Why is this not useful enough, and what do you want to do with it? It looks like a regular syntax tree (by the way, on my system it is sre_parse.parse()).
Reply
#3
(Jun-05-2022, 05:46 AM)Gribouillis Wrote:
(Jun-04-2022, 11:37 PM)DreamingInsanity Wrote: but it's not in a very useful format for doing other stuff with.
Why is this not useful enough, and what do you want to do with it? It looks like a regular syntax tree (by the way, on my system it is sre_parse.parse()).
I called it not useful because there's no documentation on the syntax tree, which makes it hard to understand at times, and I wouldn't say it's designed to be easy to use outside the re module. Having objects or classes instead of lists of tuples just makes the syntax tree easier to work with, for example:
from dataclasses import dataclass
from typing import Any

@dataclass
class SubPattern:
	pat: Any

@dataclass
class Repetition:
	pat: Any
	greedy: bool

@dataclass
class WordChar:  # the \w character class
	pass

# (\w+) becomes:
SubPattern(Repetition(WordChar(), True))
# instead of
# [(SUBPATTERN, (1, 0, 0, [(MAX_REPEAT, (1, MAXREPEAT, [(IN, [(CATEGORY, CATEGORY_WORD)])]))]))]
Gribouillis Wrote: what do you want to do with it
I want to be able to generate a "priority" for a regex. For example, (abc) has a higher priority than ([a-c]+): the former is more explicit and will only match abc, whereas the latter matches one or more of the characters a, b, or c. And (a|bc) would have a priority higher than ([a-c]+) but lower than (abc), as its shortest match, a, is shorter than abc, and longer regexes are prioritised more.
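To make the idea concrete, here's a rough sketch of computing such a priority straight from the parser's tuple output. The weights are made up for illustration (2 per literal character, 1 per character class, an alternation scored by its weakest branch), not logos' actual rules:

```python
# Hypothetical priority scorer over sre_parse's tuple output
# (sre_parse is re._parser in Python 3.11+, where the old name
# still works but is deprecated).
import sre_parse
from sre_constants import LITERAL, IN, SUBPATTERN, MAX_REPEAT, BRANCH

def priority(pattern: str) -> int:
    def walk(tokens) -> int:
        score = 0
        for op, arg in tokens:
            if op == LITERAL:
                score += 2                    # exact character: most specific
            elif op == IN:
                score += 1                    # character class: less specific
            elif op == SUBPATTERN:
                score += walk(arg[-1])        # recurse into the group body
            elif op == MAX_REPEAT:
                lo, _hi, body = arg
                score += lo * walk(body)      # count only the guaranteed part
            elif op == BRANCH:
                # an alternation is only as specific as its weakest branch
                score += min(walk(b) for b in arg[1])
        return score
    return walk(sre_parse.parse(pattern))

print(priority("abc"))     # 6
print(priority("a|bc"))    # 2
print(priority("[a-c]+"))  # 1
```

With these weights the ordering comes out as described: abc beats a|bc, which beats [a-c]+.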
Reply
#4
I don't know of such a module, but with a little effort you could build your own by transforming those tuples into class instances, with the help of the code in re/_parser.py in the standard library (or sre_parse.py in older versions).
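For example, a minimal sketch of that transformation; the class names and the handful of opcodes covered are illustrative only:

```python
# Sketch: lift sre_parse's tuples into small dataclasses.
# sre_parse is re._parser in Python 3.11+.
from dataclasses import dataclass
import sre_parse
from sre_constants import LITERAL, IN, SUBPATTERN, MAX_REPEAT, BRANCH

@dataclass
class Literal:
    char: str

@dataclass
class CharClass:
    items: list          # raw (op, arg) items, e.g. ranges or categories

@dataclass
class Repetition:
    lo: int
    hi: int
    body: list

@dataclass
class Group:
    number: int
    body: list

@dataclass
class Alternation:
    branches: list

def lift(tokens) -> list:
    out = []
    for op, arg in tokens:
        if op == LITERAL:
            out.append(Literal(chr(arg)))
        elif op == IN:
            out.append(CharClass(list(arg)))
        elif op == MAX_REPEAT:
            lo, hi, body = arg
            out.append(Repetition(lo, hi, lift(body)))
        elif op == SUBPATTERN:
            out.append(Group(arg[0], lift(arg[-1])))
        elif op == BRANCH:
            out.append(Alternation([lift(b) for b in arg[1]]))
        else:
            out.append((op, arg))     # pass unhandled opcodes through as-is
    return out

print(lift(sre_parse.parse(r"(\w+)")))
```

Lifting r"(\w+)" this way yields a Group containing a Repetition containing a CharClass, which is much closer to the shape you sketched above.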

The idea of building an order relation for the regular expressions looks creative. What are you going to do with this order relation?
Reply
#5
(Jun-05-2022, 03:12 PM)Gribouillis Wrote: The idea of building an order relation for the regular expressions looks creative. What are you going to do with this order relation?
Just for a bit of fun (and also to use in another project), I felt like recreating the Rust logos crate to be able to create custom lexers. I know there already exist many Python modules for this; I just felt recreating it would be fun.
Logos uses a priority system to determine the priority of the tokens you define, which includes parsing regexes into an AST to derive a priority for them. I want to make my Python version as close to logos as possible, so I wanted to do that too, hence the wish for an AST that's a bit easier to work with.
Reply
#6
The closest thing I can think of in Python is the lexer in the PLY module, where tokens are also specified as regular expressions. As David Beazley explains in the documentation, ply sorts the regexes by decreasing length to define priority. Instead of building a DFA as logos seems to do, ply builds a master regex and invokes the re module.

Of course, you cannot expect a lexer in Python to be as blazingly fast as one in Rust. Apart from the regex-sorting part, logos reminds me of the venerable flex from C, and I guess it has similar performance. For most uses, however, the lexer is usually not the bottleneck.
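For illustration, the ply approach boils down to something like this; the token names and patterns here are made up:

```python
# Combine token regexes into one master regex with named groups,
# sorted longest-first as a crude priority (as ply does for string
# tokens), then let re do the matching.
import re

tokens = {
    "NUMBER": r"\d+",
    "IDENT":  r"[A-Za-z_]\w*",
    "PLUS":   r"\+",
}

# Longest pattern first, mimicking ply's sorting.
ordered = sorted(tokens.items(), key=lambda kv: len(kv[1]), reverse=True)
master = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in ordered))

def lex(text):
    for m in master.finditer(text):
        yield m.lastgroup, m.group()

print(list(lex("foo + 42")))  # [('IDENT', 'foo'), ('PLUS', '+'), ('NUMBER', '42')]
```

m.lastgroup tells you which alternative matched, so one pass of finditer gives you the token stream.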
Reply
#7
(Jun-05-2022, 06:21 PM)Gribouillis Wrote: The closest thing I can think of in Python is the lexer in the PLY module, where tokens are also specified as regular expressions. As David Beazley explains in the documentation, ply sorts the regexes by decreasing length to define priority. Instead of building a DFA as logos seems to do, ply builds a master regex and invokes the re module.

Of course, you cannot expect a lexer in Python to be as blazingly fast as one in Rust. Apart from the regex-sorting part, logos reminds me of the venerable flex from C, and I guess it has similar performance. For most uses, however, the lexer is usually not the bottleneck.
Yeah, speed was never an issue for me because I don't plan to use it for lexing anything large, more just smaller files, and it's a way to work on something new for a bit.
I'll check out what you linked too, it looks interesting.

Thanks,
Dream
Reply

