TagSoup is a library for parsing HTML/XML. It supports the HTML 5 specification and can parse either well-formed XML or the unstructured, malformed HTML found on the web. The library also provides functions for extracting information from an HTML document, which makes it well suited to screen-scraping.
The library provides a basic data type for a list of unstructured tags, a parser to convert HTML into this tag type, and useful functions and combinators for finding and extracting information.
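As a rough illustration of how these pieces fit together, the sketch below parses a deliberately malformed HTML fragment into a flat list of tags and pulls out the link targets. It assumes the `tagsoup` package is installed and uses `parseTags`, `isTagOpenName`, and `fromAttrib` from `Text.HTML.TagSoup`; the helper `extractLinks` and the sample input are hypothetical names made up for this example, not part of the library.

```haskell
import Text.HTML.TagSoup (fromAttrib, isTagOpenName, parseTags)

-- Hypothetical helper (not part of TagSoup): collect the href attribute
-- of every opening <a> tag, skipping anchors that have no href.
extractLinks :: String -> [String]
extractLinks html =
  [ href
  | tag <- parseTags html          -- flat list of tags, no tree is built
  , isTagOpenName "a" tag          -- keep only opening <a ...> tags
  , let href = fromAttrib "href" tag  -- empty string when the attribute is absent
  , not (null href)
  ]

main :: IO ()
main = mapM_ putStrLn (extractLinks sample)
  where
    -- Assumed sample input: the <p>, <b>, and first <a> are never closed.
    sample =
      "<p>See <a href=\"http://example.com/a\">one<b> and <a href=\"/b\">two</a>"
```

Because the parser yields a flat list rather than a tree, unclosed tags in the input do not need any special handling here; the list comprehension simply filters the tags it cares about.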
- TagSoup for Java - an independently written malformed HTML parser for Java, including links to other HTML parsers.
- HXT: Haskell XML Toolbox - a more comprehensive XML parser, giving the option of using TagSoup as a lexer.
- Other Related Work - as described on the HXT pages.
- Using TagSoup with Parsec - a nice combination of Haskell libraries.
- tagsoup-parsec - a library for easily using TagSoup as a token type in Parsec.
- WraXML - construct a lazy tree from TagSoup lexemes.