Helix

Helix is a maximal munch lexer for Haskell, which will form part of my parser. I hope to release Helix first as a standalone tool, with compatibility modes for both Alex and Flex. A prototype already exists.

Problems with Maximal Munch Lexers

Traditionally, most maximal munch lexers are O(n²) in the worst case. Note that while regular expression matching is O(n), maximal munch tokenization is a fundamentally different problem. Observing the worst case behaviour requires a pathological lexer specification.

Consider the example of a lexer with the regular expressions `a*b` and `a`, running maximal munch over a long string of 'a' characters.

This can be written as an Alex specification:

{
module Alex(alexScanTokens) where
}

%wrapper "basic"

tokens :-
 "a" { id }
 "a"*"b" { id }

And a driver written as:

import Alex

n :: Int
n = 10000

main :: IO ()
main = print $ length $ alexScanTokens $ replicate n 'a'
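
To try this (assuming the specification is saved as Alex.x and the driver as Main.hs; both file names are my choice), Alex turns the specification into a Haskell module, which GHC then compiles alongside the driver:

alex Alex.x
ghc --make Main.hs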

The performance of this is shown in the graph below.

[Graph of Alex's performance]

As you can see from this graph, the performance is O(n²); statistical tests on the timings confirm it.

The reason is that the lexer runs all the way to the end of the input looking for a 'b'. It never finds one, so it falls back to emitting the single letter 'a' as the token and restarts at the second character. Each token therefore costs a scan of the entire remaining input, for n + (n-1) + ... + 1 = n(n+1)/2 character inspections in total, which is O(n²).
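
To make the restart behaviour concrete, here is a minimal sketch of a naive maximal munch driver. This is illustrative code of my own, not Alex's implementation; `longestAB` hand-codes the `a`/`a*b` token set from above by brute force:

-- Naive maximal munch driver: take the longest token prefix, emit
-- it, then restart on the remainder of the input.
naiveTokenize :: (String -> Maybe Int) -> String -> [String]
naiveTokenize longestMatch = go
  where
    go [] = []
    go s = case longestMatch s of
        Just len | len > 0 -> let (tok, rest) = splitAt len s
                              in tok : go rest
        _ -> error "lexical error"

-- Longest prefix matching a*b or a: scan over all the a's hoping
-- for a 'b', and fall back to a single a if none appears.
longestAB :: String -> Maybe Int
longestAB s = case drop nas s of
    'b':_ -> Just (nas + 1)          -- matched a*b
    _ | nas >= 1  -> Just 1          -- fell back to the token a
      | otherwise -> Nothing         -- no token matches
  where
    nas = length (takeWhile (== 'a') s)

Evaluating `length (naiveTokenize longestAB (replicate 10000 'a'))` should show the same quadratic behaviour: every call to `longestAB` scans to the end of the input before giving up on `a*b`.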

A solution has been given by Thomas Reps, in a paper entitled "Maximal-munch" tokenization in linear time (TOPLAS, 1998). The idea is to tabulate the (state, position) pairs from which the automaton is known to fail, so no part of the input is ever rescanned. This gives O(n) performance but requires O(n) memory, in both the best case and the worst case. Contrast this to standard lexers, which require O(n) memory in the worst case, but a typical case of O(1) memory.
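
Here is a sketch of the tabulation idea; this is my own reconstruction rather than code from the paper, and the `DFA` type and `exampleDFA` encoding are hypothetical:

import qualified Data.Set as Set

-- A token automaton: a partial transition function over states
-- (numbered by Int) and an accepting-state predicate.
data DFA = DFA
    { step   :: Int -> Char -> Maybe Int
    , accept :: Int -> Bool
    }

-- Maximal munch tokenization with Reps-style tabulation: 'failed'
-- records (state, position) pairs from which no later accepting
-- state was found, so any scan reaching such a pair stops at once.
-- Each pair enters the table at most once, giving linear time at
-- the cost of a table that grows with the input.
tokenize :: DFA -> String -> [String]
tokenize dfa = go Set.empty 0
  where
    go _ _ [] = []
    go failed pos s = case munch failed pos s of
        (failed', Just len) -> let (tok, rest) = splitAt len s
                               in tok : go failed' (pos + len) rest
        _ -> error "lexical error"

    -- Longest match starting at absolute input position pos.
    -- 'visited' holds the pairs seen since the last accepting
    -- state; if the scan ends without another accept they all fail.
    munch failed pos = walk failed 0 pos Nothing []
      where
        walk f q i best visited cs
            | Set.member (q, i) f = (foldr Set.insert f visited, best)
            | accept dfa q        = extend (Just (i - pos)) []
            | otherwise           = extend best ((q, i) : visited)
          where
            extend best' visited' = case cs of
                c:rest | Just q' <- step dfa q c
                  -> walk f q' (i + 1) best' visited' rest
                _ -> (foldr Set.insert f visited', best')

-- A hand-coded DFA for the token set {a, a*b} above (hypothetical
-- encoding): state 0 is the start; state 1 is reached after one 'a'
-- (accepts the token a); state 2 after two or more a's; state 3
-- after a*b (accepts).
exampleDFA :: DFA
exampleDFA = DFA trans acc
  where
    trans 0 'a' = Just 1
    trans 1 'a' = Just 2
    trans 2 'a' = Just 2
    trans q 'b' | q <= 2 = Just 3
    trans _ _ = Nothing
    acc q = q == 1 || q == 3

main :: IO ()
main = print $ length $ tokenize exampleDFA $ replicate 10000 'a'

On a string of a's, only the first scan walks the whole input: it marks every position it passed after the last accept as failing, so each later scan stops as soon as it reaches an already-marked position.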

Tags: haskell parsing program