« Back to home

"Stand back! I know regular expressions!"

For the last few days I’ve been taking part in the Advent of Code. Each day there are a couple of small programming problems to solve in your choice of language. There’s a Reddit community where people discuss the problems and their solutions.

After solving the Day 5 problems pretty quickly, I went to the discussions and was a bit surprised to find that everyone was talking about regular expressions. I hadn’t used them. It had seemed pretty obvious to me that the state machines and string searches required could be handled explicitly, could do the job in linear time, and would likely be faster than a regexp engine.

But… what if I was wrong? What if I’d written the extra code for nothing?

I decided to implement a second pair of solutions using precompiled regexps and the PCRE library, make sure they passed the same tests, and benchmark all four. I had to install a PCRE library, because Go’s built-in regular expression library doesn’t support all the Perl-like features needed to solve the problem concisely.

Here’s what I got when I benchmarked:

BenchmarkNice1-4            2000        738282 ns/op
BenchmarkNice1RegExp-4      1000       2269376 ns/op
BenchmarkNice2-4            2000        901082 ns/op
BenchmarkNice2RegExp-4       300       4140755 ns/op

The second number in each row is the number of runs Go decided it needed to average to get a good measure of the speed.

My intuition was right, using a compiled language and a straightforward non-regexp state machine implementation beat a PCRE-based one, at least for this fairly simple problem.

But what about Unicode?

Pondering performance some more, I realized that Unicode processing was likely impacting the performance of my code, what with Go handling strings as UTF-8 by default.

I decided that for fairness, I should make sure both the regexp and non-regexp implementations dealt properly with Unicode.

The problem specifically says that the only characters deemed to be vowels are aeiou. However, part 2 talks about letters, which would logically include things like é and ü.

Also, nowhere does it state that the algorithm is allowed to assume that the input strings are only letters, so I decided to make sure my code could process things like a☃️ee⛄️ioo🌨ee🎅u. This is complicated in some programming languages, because Santa does not reside in the the [Basic Multilingual Plane](https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane). Not a big deal in Go, though.

Once I’d done that and made sure all the tests still ran, I reran the benchmarks:

BenchmarkNice1-4            2000        674877 ns/op
BenchmarkNice1RegExp-4       500       2412329 ns/op
BenchmarkNice2-4            1000       2160843 ns/op
BenchmarkNice2RegExp-4       300       4806563 ns/op

I’d managed to speed up my part 1 code while reworking it, but part 2 was slower. Turning on UTF-8 processing for PCRE and fixing the regexps made all of them a bit slower too. Overall, the non-regexp solutions still win.

The moral of the story: Yes, regular expressions are really useful, but they’re not necessarily the first thing you should reach for.

Of course, the usual caveats about premature optimization and the benefits of writing as little code as possible still apply.