DISCLAIMER: Let me say this first. Google’s regular expression implementations are known for not implementing features that make them, well, not regular. Both re2 and Golang’s regexp do not support backreferences. Otherwise, the things done here would be hard, or impossible.

I may not be Golang’s biggest fan in general (lack of generics, verbose syntax, simplistic type system, etc), but I’ve written a bunch of it in the last couple of years and found an unexpectedly useful feature. Golang provides a package called regexp/syntax that proves to be as useful as its documentation is sparse.

In essence, this package exposes the user to the finite state machines built by the regexp compiler. This can be used to do analyses on regular expressions such as “does this regexp ever match a particular character after matching n characters?” or “does this regexp match any strings starting with a particular prefix?”. These might sound like constructed examples, but both of them actually popped up in my dayjob.

# Toy Example: Loop detection

For sake of simplicity, let’s explore a constructed example first in this post: does a regexp match strings of arbitrary length? Or, in more technical terms: is there a loop in the finite state machine? Let’s get right started with the boilerplate of compiling a regular expression into something called a Prog.

This actually produces a nice and readable representation of the generated finite state machine. Note the * next to 2 which tells us that this is the initial state.

What we see here is the textual representation of the Prog object. It’s a struct containing

• Inst: list of instructions
• Start: initial instruction

Inst is the type used to represent an instruction. It’s a struct containing:

• Op: type of instruction
• Out: next instruction (except for InstMatch and InstFail, which are terminal instructions)
• Arg: additional argument to some instructions

For the purposes of this toy example, all we care about is which instructions can follow from a given instruction. For most instructions, that is the instruction referred to in Out. Let’s introduce the odd ones here:

• InstMatch: successfully match input string. Does not have successor instruction.
• InstFail: reject input string. Does not have successor instruction.
• InstAlt / InstAltMatch: either Out or Arg are the successor instruction. A string matches if either of the branches arrives at InstMatch.

If you are curious about the difference between InstAlt and InstAltMatch: From all I could determine, InstAltMatch is an optimisation where it is known that one branch leads to a match while the other branch consumes characters. I do not see the compiler or any rewriting actually using this instruction, so it does not seem to be in use. Most of the implementation treats them interchangeably, while backtrack.go in the regex evaluator appears to use it to determine which branch to take.

This information allows us to implement a helper function to determining the successor instructions, given an instruction.

Now we can implement loop-detection by a simple breadth-first search, keeping track of already visited nodes in a set (i.e. a map[uint32]bool, because Golang does not have a set type).

# A regex engine

Now let’s try to build a very inefficient regex engine based on this. To do so, let us first introduce the various rune instructions (rune is Golang for Unicode codepoint). There is InstRune, InstRune1, InstRuneAny and InstRuneAnyNotNL. Most of them (except InstRune and InstRune1) should be self-explanatory, but here’s the whole list:

• InstRuneAny matches any rune.
• InstRuneAnyNotNL matches any rune except newlines.
• InstRune has a MatchRune method to determine whether it matches a rune.
• InstRune1 matches the rune provided in i.Rune[0] (obviously.)

This leaves us with two remaining useful instructions:

• InstCapture: capture a match into a capture group. We won’t bother with this here.
• InstEmptyWidth: match constrains on the current position in the string. This has a MatchEmptyWidth to determine whether it matches.

There’s also InstNop which, well, does nothing.

Of course, the easiest way to do this is a recursive evaluator. We pass in a program, the current instruction, and input, and the current position in the input.

Let’s first determine the previous and current rune at the current position, and use -1 for the borders of the string (to be consistent with MatchEmptyWidth).

And now all that’s left is one fairly large switch (if you look at actual implementations of regex engines, there are also often giant switches, so it’s legit).

InstAlt and InstAltMatch are the same, so let’s use the fallthrough statement for go switch statements (this is much more sane than C-style fallthrough by default).

We don’t care about InstCapture or InstNop here.

For InstEmptyWidth we use the method that was given to us for this.

(I could get used to this). InstMatch and InstFail are obvious.

Then there come the various ways of matching runes. Note that this is the only time we have to increment the index into our input, as this is the only time we actually consume any runes.

That’s it. Now some due diligence against us being bad programmers, and we’re done.

Well, that was fun. But not terribly exciting. But what this gives is as good mental model of what exactly the different instructions mean, which can be used to build more exciting things.

# Do we only match even-length strings?

Now that we can proudly proclaim we have written a regular expression engine (well, maybe we were slightly cheating and someone else helped a bit), let’s take up a bigger challenge. Given a regular expression that possibly matches arbitrarily length strings, determine whether all strings matched have an even size. Sounds like an interview question? A bit, but I also hope I’ll never get this as an actual interview question.

Let’s start with a similar prototype as for our Match function, but instead of our input let’s have something that flips around whether we are at an even or odd step. For reasons that will make sense later, let’s encode this as an integer that flips between 0 and 1 (so it’s idx % 2). We also need to keep track of which nodes we have seen before, or we will wait for a long time. But if you think about it a bit, we need to keep track of this for even and odd steps, an even visit and an odd visit are not the same. That results in the beautiful type of map[int]map[uint32]bool, or a map from an integer to a set of uint32.

Now let’s start with the easy part. I literally copy & pasted the Match code, removed all the stuff that actually does any … matching, plumbed through the visited map and made idx mod 2. That gives us all the rune instructions and InstCapture, InstNop and InstEmptyWidth (which are all, for all intents and purposes, noop).

InstMatch is straightforward, as we just have to check whether we are at an even step. InstFail confusingly returns true, as we do not care about branches that do not lead to matches.

Now to one of my least favourite parts of Golang, copying nested maps. But here we go. Let’s introduce a helper method, as when we branch for InstAlt we will need a separate ropy of the visited map for both branches.

OK, with that out of the way, we can tackle the InstAlt instructions.

This one is different, as we want to make sure we can only match even length strings, so we need to && the conditions, rather than || them.

Due diligence again.

If we stop for a second to see what we’ve built, it’s obvious we’ve built a potential infinite loop. Let’s rectify this. If we arrive back at an instruction that we’ve been before with the same idx % 2, we can just prune this branch and return true as it is exactly the same.

So let’s prepend our switch statement with the following.

And then, of course, we need to keep track of what we’ve seen before, which again involves one of my least favourite parts of Golang, nested maps.

And that’s done! To see the whole code used in this post head of to this Github Gist.

If you have made it this far, it is probably also worth noting that the Z3 SMT solver has a Regex Theory. The last time I’ve played with it it was giving obviously incorrect answers, but that seems to have been rectified since. The nice thing about regexp/syntax is that it uses the same library your application uses if you write it in Golang, and can be used in programs with more predictable performance and fewer dependencies than Z3.