A parser is a tool used to split a text stream, typically in some human-readable form, into a representation suitable for processing by a computer. There are many parser tools in existence, but by far the best known are Lex and Yacc, and their open source alternatives Flex and Bison.

Unfortunately there are many problems with Lex and Yacc, including language dependence and the difficulty of specifying a grammar which will work. These issues are discussed, along with the things that are frequently required and yet are hard to do in this system.

An alternative design for a parsing system is given, comprising three separate modules: Bracket, Lex and Token. Their advantages are discussed, along with their relationship to traditional Lex and Yacc. Details of implementation are given.

Some of the performance claims in the system are wrong, in particular the claim that maximal munch lexing is O(n). I am hoping to fix this.

Parsing as Types

If a parser were written as a Haskell program, then the types of Lex and Yacc would probably be given as:

parsing :: String -> Tree Token
parsing = yacc . lex

lex :: String -> List Token
yacc :: List Token -> Tree Token
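
To make these types concrete, here is a minimal runnable sketch of the traditional pipeline. Everything beyond the signatures is an assumption made for illustration: the toy `Token` and `Tree` types, the choice of parentheses as the only nesting construct, and the name `lexer` (used instead of `lex` to avoid clashing with the Prelude function of the same name). It is not how Lex or Yacc work internally; it only shows that the token list is flat and that nesting is discovered after lexing.

```haskell
import Data.Char (isSpace)

-- Hypothetical toy types; the text above gives only the signatures.
type Token = String

data Tree a = Leaf a | Node [Tree a]
  deriving (Eq, Show)

-- Lexing: a flat string becomes a flat list of tokens.
-- Parentheses are single-character tokens; other tokens are
-- runs of characters that are neither spaces nor parentheses.
lexer :: String -> [Token]
lexer [] = []
lexer (c:cs)
  | isSpace c     = lexer cs
  | c `elem` "()" = [c] : lexer cs
  | otherwise     =
      let (word, rest) = break (\x -> isSpace x || x `elem` "()") (c:cs)
      in word : lexer rest

-- Parsing: the nesting is only recovered here, from the token list.
-- Unmatched parentheses are tolerated silently in this sketch.
yacc :: [Token] -> Tree Token
yacc ts = Node (fst (go ts))
  where
    go :: [Token] -> ([Tree Token], [Token])
    go []         = ([], [])
    go (")":rest) = ([], rest)
    go ("(":rest) =
      let (inner, rest')  = go rest
          (sibs,  rest'') = go rest'
      in (Node inner : sibs, rest'')
    go (t:rest)   =
      let (sibs, rest') = go rest
      in (Leaf t : sibs, rest')

parsing :: String -> Tree Token
parsing = yacc . lexer
```

For example, `parsing "f (x y)"` evaluates to `Node [Leaf "f", Node [Leaf "x", Leaf "y"]]`.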

The alternative design presented by my parser can be thought of as:

parsing :: String -> Tree Token
parsing = group . map lex . bracket

bracket :: String -> Tree String
lex :: String -> List Token
group :: Tree Token -> Tree Token
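
The same idea can be sketched as runnable code. This is an illustration under several assumptions of mine, not the actual implementation: the toy `Token` and `Tree` types, parentheses as the only bracket pair, `words` as the lexer, and semicolons as the only thing `group` reacts to are all hypothetical. The sketch also reads `map lex` as "lex every leaf and splice the resulting tokens in place" (the helper `lexLeaves`), since lexing one string yields a list of tokens, and it uses the name `lexer` to avoid clashing with the Prelude's `lex`.

```haskell
type Token = String

data Tree a = Leaf a | Node [Tree a]
  deriving (Eq, Show)

-- Stage 1: find the bracket structure in the raw text.
-- Text between brackets is left unlexed in the leaves.
bracket :: String -> Tree String
bracket s = Node (fst (go s))
  where
    go :: String -> ([Tree String], String)
    go []         = ([], [])
    go (')':rest) = ([], rest)
    go ('(':rest) =
      let (inner, rest')  = go rest
          (sibs,  rest'') = go rest'
      in (Node inner : sibs, rest'')
    go cs         =
      let (chunk, rest) = break (`elem` "()") cs
          (sibs, rest') = go rest
      in (Leaf chunk : sibs, rest')

-- Stage 2: lex one bracket-free chunk (whitespace splitting only).
lexer :: String -> [Token]
lexer = words

-- Apply the lexer to every leaf, splicing its tokens into the parent.
lexLeaves :: Tree String -> Tree Token
lexLeaves (Leaf s)    = Node (map Leaf (lexer s))
lexLeaves (Node kids) = Node (concatMap expand kids)
  where
    expand (Leaf s) = map Leaf (lexer s)
    expand node     = [lexLeaves node]

-- Stage 3: regroup tokens within each node. As a toy rule, a node
-- containing ";" tokens is split into one sub-node per statement.
group :: Tree Token -> Tree Token
group (Leaf t) = Leaf t
group (Node kids)
  | Leaf ";" `elem` kids' = Node (map Node (splitOn kids'))
  | otherwise             = Node kids'
  where
    kids' = map group kids
    splitOn [] = []
    splitOn xs = case break (== Leaf ";") xs of
      (stmt, [])       -> [stmt]
      (stmt, _ : rest) -> stmt : splitOn rest

parsing :: String -> Tree Token
parsing = group . lexLeaves . bracket
```

For example, `parsing "f (x y) z"` evaluates to `Node [Leaf "f", Node [Leaf "x", Leaf "y"], Leaf "z"]`. The bracket structure is fixed before any lexing takes place, so the lexer only ever sees bracket-free chunks of text.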


