Analysing regexp with Golang
DISCLAIMER: Let me say this first. Google’s regular expression implementations are known for not implementing features that make them, well, not regular. Both re2
and Golang’s regexp
do not support backreferences. Otherwise, the things done here would be hard, or impossible.
I may not be Golang’s biggest fan in general (lack of generics, verbose syntax, simplistic type system, etc), but I’ve written a bunch of it in the last couple of years and found an unexpectedly useful feature. Golang provides a package called regexp/syntax
that proves to be as useful as its documentation is sparse.
In essence, this package exposes the user to the finite state machines built by the regexp compiler. This can be used to do analyses on regular expressions such as “does this regexp ever match a particular character after matching n
characters?” or “does this regexp match any strings starting with a particular prefix?”. These might sound like constructed examples, but both of them actually popped up in my dayjob.
Toy Example: Loop detection
For sake of simplicity, let’s explore a constructed example first in this post: does a regexp match strings of arbitrary length? Or, in more technical terms: is there a loop in the finite state machine? Let’s get right started with the boilerplate of compiling a regular expression into something called a Prog
.
This actually produces a nice and readable representation of the generated finite state machine. Note the * next to 2 which tells us that this is the initial state.
What we see here is the textual representation of the Prog
object. It’s a struct containing
Inst
: list of instructionsStart
: initial instruction
Inst
is the type used to represent an instruction. It’s a struct containing:
Op
: type of instructionOut
: next instruction (except for InstMatch and InstFail, which are terminal instructions)Arg
: additional argument to some instructions
For the purposes of this toy example, all we care about is which instructions can follow from a given instruction. For most instructions, that is the instruction referred to in Out
. Let’s introduce the odd ones here:
InstMatch
: successfully match input string. Does not have successor instruction.InstFail
: reject input string. Does not have successor instruction.InstAlt
/InstAltMatch
: eitherOut
orArg
are the successor instruction. A string matches if either of the branches arrives atInstMatch
.
If you are curious about the difference between InstAlt
and InstAltMatch
: From all I could determine, InstAltMatch
is an optimisation where it is known that one branch leads to a match while the other branch consumes characters. I do not see the compiler or any rewriting actually using this instruction, so it does not seem to be in use. Most of the implementation treats them interchangeably, while backtrack.go in the regex evaluator appears to use it to determine which branch to take.
This information allows us to implement a helper function to determining the successor instructions, given an instruction.
Now we can implement loop-detection by a simple breadth-first search, keeping track of already visited nodes in a set (i.e. a map[uint32]bool
, because Golang does not have a set type).
A regex engine
Now let’s try to build a very inefficient regex engine based on this. To do so, let us first introduce the various rune instructions (rune is Golang for Unicode codepoint). There is InstRune
, InstRune1
, InstRuneAny
and InstRuneAnyNotNL
. Most of them (except InstRune
and InstRune1
) should be self-explanatory, but here’s the whole list:
InstRuneAny
matches any rune.InstRuneAnyNotNL
matches any rune except newlines.InstRune
has aMatchRune
method to determine whether it matches a rune.InstRune1
matches the rune provided ini.Rune[0]
(obviously.)
This leaves us with two remaining useful instructions:
InstCapture
: capture a match into a capture group. We won’t bother with this here.InstEmptyWidth
: match constrains on the current position in the string. This has aMatchEmptyWidth
to determine whether it matches.
There’s also InstNop
which, well, does nothing.
Of course, the easiest way to do this is a recursive evaluator. We pass in a program, the current instruction, and input, and the current position in the input.
Let’s first determine the previous and current rune at the current position, and use -1
for the borders of the string (to be consistent with MatchEmptyWidth
).
And now all that’s left is one fairly large switch (if you look at actual implementations of regex engines, there are also often giant switches, so it’s legit).
InstAlt
and InstAltMatch
are the same, so let’s use the fallthrough
statement for go switch statements (this is much more sane than C-style fallthrough by default).
We don’t care about InstCapture
or InstNop
here.
For InstEmptyWidth
we use the method that was given to us for this.
(I could get used to this). InstMatch
and InstFail
are obvious.
Then there come the various ways of matching runes. Note that this is the only time we have to increment the index into our input, as this is the only time we actually consume any runes.
That’s it. Now some due diligence against us being bad programmers, and we’re done.
Well, that was fun. But not terribly exciting. But what this gives is as good mental model of what exactly the different instructions mean, which can be used to build more exciting things.
Do we only match even-length strings?
Now that we can proudly proclaim we have written a regular expression engine (well, maybe we were slightly cheating and someone else helped a bit), let’s take up a bigger challenge. Given a regular expression that possibly matches arbitrarily length strings, determine whether all strings matched have an even size. Sounds like an interview question? A bit, but I also hope I’ll never get this as an actual interview question.
Let’s start with a similar prototype as for our Match
function, but instead of our input let’s have something that flips around whether we are at an even or odd step. For reasons that will make sense later, let’s encode this as an integer that flips between 0
and 1
(so it’s idx % 2
). We also need to keep track of which nodes we have seen before, or we will wait for a long time. But if you think about it a bit, we need to keep track of this for even and odd steps, an even visit and an odd visit are not the same. That results in the beautiful type of map[int]map[uint32]bool
, or a map from an integer to a set of uint32
.
Now let’s start with the easy part. I literally copy & pasted the Match
code, removed all the stuff that actually does any … matching, plumbed through the visited map and made idx
mod 2. That gives us all the rune instructions and InstCapture
, InstNop
and InstEmptyWidth
(which are all, for all intents and purposes, noop).
InstMatch
is straightforward, as we just have to check whether we are at an even step. InstFail
confusingly returns true
, as we do not care about branches that do not lead to matches.
Now to one of my least favourite parts of Golang, copying nested maps. But here we go. Let’s introduce a helper method, as when we branch for InstAlt
we will need a separate ropy of the visited map for both branches.
OK, with that out of the way, we can tackle the InstAlt
instructions.
This one is different, as we want to make sure we can only match even length strings, so we need to &&
the conditions, rather than ||
them.
Due diligence again.
If we stop for a second to see what we’ve built, it’s obvious we’ve built a potential infinite loop. Let’s rectify this. If we arrive back at an instruction that we’ve been before with the same idx % 2
, we can just prune this branch and return true
as it is exactly the same.
So let’s prepend our switch statement with the following.
And then, of course, we need to keep track of what we’ve seen before, which again involves one of my least favourite parts of Golang, nested maps.
And that’s done! To see the whole code used in this post head of to this Github Gist.
If you have made it this far, it is probably also worth noting that the Z3 SMT solver has a Regex Theory. The last time I’ve played with it it was giving obviously incorrect answers, but that seems to have been rectified since. The nice thing about regexp/syntax
is that it uses the same library your application uses if you write it in Golang, and can be used in programs with more predictable performance and fewer dependencies than Z3.