A parser is a tool used to split a text stream, typically in some human readable form, into a representation suitable for understanding by a computer. There are many parser tools in existence, but by far the most well known are Lex and Yacc, and their open source alternatives Flex and Bison.
Unfortunately there are many problems with Lex and Yacc, including language dependence and the difficultly of specifying grammar which will work. These issues are discussed, along with the things that are hard to do in this system, and yet are required frequently.
An alternative design for a parsing system is given, comprising of three separate modules being Bracket, Lex and Token. Their advantages are discussed, along with their relationship to traditional Lex and Yacc. Details of implementation are given.
Some of the performance claims in the system are wrong, particular the claim that maximal munch lexing is O(n). I am hoping to fix this (read more).
Parsing as Types
If a parser were written as a Haskell program, then the types of Lex and Yacc would probably be given as:
parsing :: String -> Tree Token parsing = yacc . lex lex :: String -> List Token yacc :: List Token -> Tree Token
The alternative design presented by my parser can be thought of as:
parsing :: String -> Tree Token parsing = group . map lex . bracket bracket :: String -> Tree String lex :: String -> List Token group :: Tree Token -> Tree Token