# Helix

Helix is a maximal munch lexer for Haskell, which will form part of my parser. I hope to release Helix first as a standalone tool, ideally with compatibility modes for both Alex and Flex. A prototype currently exists.

## Problems with Maximal Munch Lexers

Traditionally, most maximal munch lexers are *O(n^2)*. Note that while regular expression matching is *O(n)*, maximal munch tokenisation is a fundamentally different problem. Observing the worst-case behaviour requires a pathological lexer specification.

Consider a lexer with the two regular expressions ``a*b`` and ``a``, applied with maximal munch to a long string of `'a'` characters.

This can be written as an Alex specification:

```
{
module Alex(alexScanTokens) where
}

%wrapper "basic"

tokens :-
  "a"     { id }
  "a"*"b" { id }
```

And a driver written as:

```haskell
n = 10000

main = print $ length $ alexScanTokens $ replicate n 'a'
```

Plotting the runtime of this driver against *n* shows that the performance is *O(n^2)*. Statistical tests confirm this.

The reason is that the lexer scans to the end of the input looking for a `'b'`. When it fails to find one, it falls back to emitting a single `'a'` as the token and restarts at the second character, so it rescans almost the entire input for every token it produces.
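To make the restart behaviour concrete, here is a minimal sketch of a naive maximal-munch lexer for these two patterns (my own illustration, not Helix's or Alex's implementation). Each call to `nextToken` may scan to the end of the remaining input before falling back to the single-`'a'` match, which is exactly where the quadratic behaviour comes from.

```haskell
module Main where

-- Try to match the pattern a*b against a prefix of the input,
-- returning the matched prefix and the remainder.
matchAsB :: String -> Maybe (String, String)
matchAsB s =
  let (as, rest) = span (== 'a') s
  in case rest of
       ('b':rest') -> Just (as ++ "b", rest')
       _           -> Nothing

-- Maximal munch over {a*b, a}: prefer the (longer) a*b match,
-- otherwise fall back to a single 'a'.
nextToken :: String -> Maybe (String, String)
nextToken s = case matchAsB s of
  Just r  -> Just r
  Nothing -> case s of
    ('a':rest) -> Just ("a", rest)
    _          -> Nothing

-- Tokenise the whole input.  On "aaa...a" every nextToken call scans
-- to the end of the remaining input, giving O(n^2) time overall.
tokens :: String -> [String]
tokens [] = []
tokens s  = case nextToken s of
  Just (tok, rest) -> tok : tokens rest
  Nothing          -> error "lexical error"

main :: IO ()
main = print (length (tokens (replicate 10000 'a')))
```

Running this driver reproduces the quadratic slowdown as the input grows.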

A solution was given by Thomas Reps, in a paper entitled "Maximal-munch" tokenization in linear time. His algorithm runs in *O(n)* time but requires *O(n)* memory, in both the best case and the worst case. Contrast this with standard lexers, which require *O(n)* memory in the worst case but typically only *O(1)*.
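The core idea behind the linear-time bound can be sketched as tabulation: record every (DFA state, input position) pair from which a scan failed to reach any later accepting state, and cut a restarted scan short the moment it re-enters such a pair. The sketch below is my own simplification, specialised to the `a`/`a*b` example: the DFA is hard-coded, the names (`scanFrom`, `tokenise`) are mine, and it memoises only failures, whereas Reps's full algorithm tabulates more. The failure table is what costs the *O(n)* memory.

```haskell
module Main where

import qualified Data.Set as Set
import Data.Array (Array, listArray, (!), bounds)

-- Hard-coded DFA for the combined token set {a, a*b}:
-- 0 = start, 1 = one 'a' (accepting: token "a"),
-- 2 = matched a*b (accepting), 3 = two or more 'a's (not accepting).
step :: Int -> Char -> Maybe Int
step 0 'a' = Just 1
step 1 'a' = Just 3
step 3 'a' = Just 3
step s 'b' | s /= 2 = Just 2
step _ _   = Nothing

accepting :: Int -> Bool
accepting s = s == 1 || s == 2

-- Scan the longest token starting at position i.  Pairs in the
-- failure table are known dead ends, so a scan stops as soon as it
-- re-enters one; pairs visited after the last accept are recorded as
-- failing when the scan ends.
scanFrom :: Array Int Char -> Int -> Set.Set (Int, Int)
         -> (Maybe Int, Set.Set (Int, Int))
scanFrom arr i failed = go 0 i Nothing []
  where
    (_, hi) = bounds arr
    go st pos best trail
      | Set.member (st, pos) failed = stop best trail
      | accepting st = advance (Just pos) []            -- reset after an accept
      | otherwise    = advance best ((st, pos) : trail)
      where
        advance best' trail'
          | pos > hi  = stop best' trail'
          | otherwise = case step st (arr ! pos) of
              Nothing  -> stop best' trail'
              Just st' -> go st' (pos + 1) best' trail'
    stop best trail = (best, foldr Set.insert failed trail)

-- Tokenise with maximal munch, threading the failure table through
-- successive scans.  On "aaa...a" every scan after the first stops
-- after O(1) steps, giving O(n) time overall.
tokenise :: String -> [String]
tokenise s = go 0 Set.empty
  where
    n   = length s
    arr = listArray (0, n - 1) s
    go i failed
      | i >= n    = []
      | otherwise = case scanFrom arr i failed of
          (Just end, failed') -> take (end - i) (drop i s) : go end failed'
          (Nothing,  _)       -> error "lexical error"

main :: IO ()
main = print (length (tokenise (replicate 10000 'a')))
```

On the pathological all-`'a'` input, the first scan marks every `(3, pos)` pair it visits as failing; each subsequent scan hits one of those pairs almost immediately and stops, which is what turns the quadratic rescanning into linear work.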