Introduction

The following is a tutorial on how to use the Haskell grammar-combinators parser library.

The grammar-combinators library is a next-generation parser library which addresses the fundamental limitations inherent to current parser combinator libraries. For more info, see the project's website.

In the following text, we define an example grammar and use it with some grammar algorithms. This text assumes you are more or less familiar with common context-free grammar terminology and parser combinators. This text does not attempt to explain the internals of the grammar-combinators library, but tries to explain it from the point of view of a library user.

Preliminaries

First install the grammar-combinators library through Cabal with the following commands:

cabal update
cabal install grammar-combinators

The rest of this text contains code fragments. These code fragments can be copied into a Haskell source file and executed directly. You can also download the source code of this page and export its contents to a Haskell script that you can work with in GHCi (all examples can be executed):

darcs get http://code.haskell.org/grammar-combinators/
cd grammar-combinators/website
make
ghci tutorial.hs

We will use some GHC extensions (you can't use grammar-combinators without these).

{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE GADTs #-}
{-# LANGUAGE TypeFamilies #-}

The base import for grammar-combinators is the following:

import Text.GrammarCombinators.Base

And beyond that, we use algorithms from the following modules.

import Text.GrammarCombinators.Utils.PrintGrammar
import Text.GrammarCombinators.Utils.CalcFirst
import Text.GrammarCombinators.Parser.Packrat
import Text.GrammarCombinators.Parser.Parsec
import Text.GrammarCombinators.Parser.UUParse
import Text.GrammarCombinators.Transform.FilterDies
import Text.GrammarCombinators.Transform.FoldLoops
import Text.GrammarCombinators.Transform.UniformPaull
import Text.GrammarCombinators.Transform.UnfoldDead

The grammar

The grammar we will work with is the following (in a BNF-like notation).

Line   ::= Expr EOF
Expr   ::= Expr '+' Term
         | Term
Term   ::= Term '*' Factor
         | Factor
Factor ::= '(' Expr ')'
         | Digit+
Digit  ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Non-terminals and the AST

To define this grammar using the grammar-combinators library, we start by defining data types for the Abstract Syntax Tree (AST) of the grammar. This is important because it defines the structure of the grammar and the structural relations between the non-terminals.

newtype Line  =  SExpr Expr
data Expr     =  Sum Expr Term
              |  STerm Term
data Term     =  Product Term Factor
              |  SFactor Factor
data Factor   =  Paren Expr
              |  LiteralNumber [Digit]
newtype Digit =  MkDigit Char

We can see from these definitions that for example an expression must either be a sum of a subexpression and a term or it must be a single term itself.

The grammar's domain

Next, we define a GADT representing the domain (the set of non-terminals) of our grammar. The ArithDomain ix functor contains proof terms proving that the five AST data types are elements of the domain of our grammar.

data ArithDomain ix where
    Line :: ArithDomain Line
    Expr :: ArithDomain Expr
    Term :: ArithDomain Term
    Factor :: ArithDomain Factor
    Digit :: ArithDomain Digit

Note that the GADT's constructors have the same names as the corresponding non-terminal types (this is possible because Haskell separates value level and type level name spaces).

Any domain needs to implement some type classes. The FoldFam type class defines a foldFam operation folding a function over all non-terminals in the domain. ShowFam defines function showIdx returning a String representation of a non-terminal. EqFam defines the function overrideIdx, which allows type-safe overriding a domain function for a single non-terminal. Finally, MemoFam defines fromMemo and toMemo, allowing memoization of a domain function. The Domain type class is empty, but requires the other four type classes to be implemented.

All of the following is boilerplate, and it is probably possible to generate this code from the domain definition automatically using Template Haskell, but this is not yet implemented. Contributions are welcome ;).

instance FoldFam ArithDomain where
  foldFam f n = f Line $ f Expr $ f Term $ f Factor $ f Digit n
instance ShowFam ArithDomain where
  showIdx Line = "Line"
  showIdx Expr = "Expr"
  showIdx Term = "Term"
  showIdx Factor = "Factor"
  showIdx Digit = "Digit"
instance EqFam ArithDomain where
  overrideIdx _ Line v Line = v
  overrideIdx _ Expr v Expr = v
  overrideIdx _ Term v Term = v
  overrideIdx _ Factor v Factor = v
  overrideIdx _ Digit v Digit = v
  overrideIdx f _ _ idx = f idx
instance MemoFam ArithDomain where
  data Memo ArithDomain v = MemoArithDomain (v Line) (v Expr) (v Term) (v Factor) (v Digit)
  toMemo f = MemoArithDomain (f Line) (f Expr) (f Term) (f Factor) (f Digit)
  fromMemo (MemoArithDomain v _ _ _ _) Line = v 
  fromMemo (MemoArithDomain _ v _ _ _) Expr = v 
  fromMemo (MemoArithDomain _ _ v _ _) Term = v 
  fromMemo (MemoArithDomain _ _ _ v _) Factor = v 
  fromMemo (MemoArithDomain _ _ _ _ v) Digit = v 
instance Domain ArithDomain

The pattern functor

Next, we need to define the pattern functor PFArith for our domain, defining the structure of the non-terminals in a more general way than the AST defined above. It is a GADT with constructors analogous to the AST data types defined above, but with recursive positions replaced by corresponding values of a given subtree representation functor r. Pattern functor values are tagged with the AST type they represent.

data PFArith r ix where
  LineF ::                r Expr ->               PFArith r Line
  SumF ::                 r Expr -> r Term ->     PFArith r Expr
  SubtractionF ::         r Expr -> r Term ->     PFArith r Expr
  SingleTermF ::          r Term ->               PFArith r Expr
  ProductF ::             r Term -> r Factor ->   PFArith r Term
  QuotientF ::            r Term -> r Factor ->   PFArith r Term
  SingleFactorF ::        r Factor ->             PFArith r Term
  ParenthesizedF ::       r Expr ->               PFArith r Factor
  NumberF ::              [r Digit] ->            PFArith r Factor
  DigitF ::               Char ->                 PFArith r Digit

The following type instance registers PFArith as the pattern functor for domain ArithDomain.

type instance PF ArithDomain = PFArith

The grammar

We can now define our grammar. The grammar is represented by a function grammarArith, which returns for each non-terminal its production rules.

grammarArith :: ExtendedContextFreeGrammar ArithDomain Char
grammarArith Line =
      LineF $>>             ref Expr >>>* endOfInput
grammarArith Expr =
      SubtractionF $>>      ref Expr >>>* token '-' >>> ref Term
  ||| SumF $>>              ref Expr >>>* token '+' >>> ref Term
  ||| SingleTermF $>>       ref Term
grammarArith Term =
      SingleFactorF $>>     ref Factor
  ||| QuotientF $>>         ref Term >>>* token '/' >>> ref Factor
  ||| ProductF $>>          ref Term >>>* token '*' >>> ref Factor
grammarArith Factor =
      NumberF $>>           many1Ref Digit
  ||| ParenthesizedF $>>*   token '(' >>> ref Expr >>>* token ')'
grammarArith Digit =
      DigitF $>>            tokenRange ['0' .. '9']

The production rules are composed using the following combinators:

The grammar's type is ExtendedContextFreeGrammar ArithDomain Char, meaning that it is an extended context-free grammar for domain ArithDomain and token type Char. It is a context-free grammar because it uses the ref recursion combinator and it is extended because it additionally uses many1Ref. Several other types of grammars exist in the grammar-combinators library.

First useful results

We can now already do some useful stuff with the newly defined grammar. For example, we can print a textual representation of the grammar using the printGrammar algorithm.

This gives the following result:

*Main> putStr $ printGrammar grammarArith
<Line> ::= <Expr> EOI
<Expr> ::= (<Expr> '-' <Term>) | (<Expr> '+' <Term>) | <Term>
<Term> ::= <Factor> | (<Term> '/' <Factor>) | (<Term> '*' <Factor>)
<Factor> ::= <Digit>+ | ('(' <Expr> ')')
<Digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

We could also compute FIRST sets for non-terminals with the calcFirst algorithm:

*Main> calcFirst grammarArith Line
FS {firstSet = fromList "(0123456789", canBeEmpty = False, canBeEOI = False}

Parsing

The most important thing to do with a grammar is of course to parse strings according to it. To do this, we first need to obtain a processing grammar: a grammar combined with a set of semantic actions (a semantic processor). The standard trivialProcessor returns unit values wrapped in the Multirec identity functor K0 for all non-terminals. A parser that does not produce semantic values is often called a recognizer.

recognizeGrammarArith :: ProcessingExtendedContextFreeGrammar ArithDomain Char (K0 ())
recognizeGrammarArith = applyProcessorE grammarArith trivialProcessor

Note that we absolutely need to provide a type signature for recognizeGrammarArith, because otherwise GHC will give a strange error message:

    No instances for (LoopParserRule p ArithDomain (K0 ()),
                      TokenParserRule p Char)

If we try to use this recognizer grammar with the Packrat parser, we get the following error.

*Main> parsePackrat recognizeGrammarArith Line "123"

<interactive>:1:13:
    Could not deduce (LoopParserRule p ArithDomain (K0 ()))
      from the context (ParserRule p,
                        RecParserRule p ArithDomain (K0 ()),
                        TokenParserRule p Char)
      arising from a use of `recognizeGrammarArith'
                   at <interactive>:1:13-33

This error message is not very clear (if you're not the grammar-combinators library author ;)), but what it means is that we are trying to use the packrat parser with an extended context-free grammar, which it is not capable of. The solution to this problem is to use a first grammar transformation, in this case, the equivalent of the standard translation of extended context-free grammars to context-free grammars, which we call loop folding.

Folding loops

Applying the foldLoops algorithm to our arithmetic expressions recognizer grammar gives us the following definition:

recognizeGrammarArithFolded :: ProcessingContextFreeGrammar (FoldLoopsDomain ArithDomain) Char (FoldLoopsValue (K0 ()))
recognizeGrammarArithFolded = foldAndProcessLoops recognizeGrammarArith

We get a new, non-extended, grammar over a new domain FoldLoopsDomain ArithDomain with a new semantic value family FoldLoopsValue (K0 ()). If we print this new grammar, we get the following:

*Main> putStr $ printGrammar recognizeGrammarArithFolded 
<Line*> ::= (<Line> <Line*>) | epsilon
<Expr*> ::= (<Expr> <Expr*>) | epsilon
<Term*> ::= (<Term> <Term*>) | epsilon
<Factor*> ::= (<Factor> <Factor*>) | epsilon
<Digit*> ::= (<Digit> <Digit*>) | epsilon
<Line> ::= <Expr> EOI
<Expr> ::= (<Expr> '-' <Term>) | (<Expr> '+' <Term>) | <Term>
<Term> ::= <Factor> | (<Term> '/' <Factor>) | (<Term> '*' <Factor>)
<Factor> ::= (<Digit> <Digit*>) | ('(' <Expr> ')')
<Digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

We can see that the foldLoops algorithm transformed the original grammar as follows: for each non-terminal idx, add a new non-terminal idx\*, representing a Kleene-*-quantified version of the base non-terminal. Additionally, quantified references (using manyRef and many1Ref) are translated into normal references to the new Kleene-* non-terminals.

We can now try to parse a string with this grammar, but GHCi complains that it cannot print a value of type K0 () Line.

*Main> parsePackrat recognizeGrammarArithFolded (FLBase Line) "123" 

<interactive>:1:0:
    No instance for (Show (K0 () Line))
      arising from a use of `print' at <interactive>:1:0-59
    Possible fix: add an instance declaration for (Show (K0 () Line))
    In a stmt of a 'do' expression: print it

We are not sure why Multirec doesn't provide this by default, but we can easily add a suitable Show instance:

instance (Show v) => Show (K0 v ix) where
  show (K0 v) = "K0 (" ++ show v ++ ")"

When we now try to parse, we go into an infinite loop:

*Main> parsePackrat recognizeGrammarArithFolded (FLBase Line) "123" 
^CInterrupted.

The problem is a well-known problem for parser combinator libraries, namely that they can't handle grammars with left-recursion (which means that a production rule directly references itself in the left-most position (like Term and Expr above). Standard parser combinator libraries cannot handle this problem. In fact, they cannot even detect that it occurs.

One of the most important advantages of the grammar-combinators library is that we can automatically fix this problem.

Transforming away left-recursion

Grammar-combinators actually provides two transformations that remove left-recursion from a grammar: the left-corner transformation and a uniform variant of the classic Paull transformation. We will use the latter in this text, because it is currently better optimized for grammars that only feature indirect left-recursion.

recognizeGrammarArithTP :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue (K0 ()))
recognizeGrammarArithTP = transformUniformPaullE recognizeGrammarArith

The transformation gives us a new, transformed, grammar over new domain UPDomain ArithDomain with transformed semantic value family UPValue (K0 ()).

Printing this grammar gives us a grammar that is rather difficult to read:

*Main> putStr $ printGrammar recognizeGrammarArithTP
<Line> ::= <Line_head> <Line_tail>*
<Line_head> ::= die | ((<Expr> | die | (die <Expr>) | die | (die <Expr>)) EOI) | die | (die <Expr> EOI) | die
<Line_tail> ::= die
<Expr> ::= <Expr_head> <Expr_tail>*
<Expr_head> ::= die | ((die | ((((die | die) <Expr_tail>*) | die | (die <Expr>) | die | (die <Expr>)) '-')) <Term>) | die | ((die | ((((die | die) <Expr_tail>*) | die | (die <Expr>) | die | (die <Expr>)) '+')) <Term>) | <Term> | die | (die <Term>) | die | (die ((<Expr> '-' <Term>) | (<Expr> '+' <Term>) | <Term>)) | die
<Expr_tail> ::= ('-' <Term>) | ('+' <Term>) | die
<Term> ::= <Term_head> <Term_tail>*
<Term_head> ::= <Factor> | die | (die <Factor>) | die | ((die | ((((die | die) <Term_tail>*) | die | (die <Term>) | die | (die <Term>)) '/')) <Factor>) | die | ((die | ((((die | die) <Term_tail>*) | die | (die <Term>) | die | (die <Term>)) '*')) <Factor>) | die | (die (<Factor> | (<Term> '/' <Factor>) | (<Term> '*' <Factor>))) | die
<Term_tail> ::= ('/' <Factor>) | ('*' <Factor>) | die
<Factor> ::= <Factor_head> <Factor_tail>*
<Factor_head> ::= <Digit>+ | die | (die <Digit>+) | die | ((die | (('(' | die | ((die | die | die) '(')) <Expr>) | die | (die '(' <Expr>)) ')') | die | (die (<Digit>+ | ('(' <Expr> ')'))) | die
<Factor_tail> ::= die
<Digit> ::= <Digit_head> <Digit_tail>*
<Digit_head> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' | die | (die ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')) | die | (die ('0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9')) | die
<Digit_tail> ::= die

The grammar is actually cluttered by many die rules (can never match), and non-terminals that can never match anything (e.g. Line_tail). We can remove these from the grammar using algorithms filterDies and unfoldDead:

recognizeGrammarArithTPF :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue (K0 ()))
recognizeGrammarArithTPF = filterDiesE (unfoldDeadE recognizeGrammarArithTP)

Additionally, we use print algorithm printReachableGrammar, which only prints non-terminals that are reachable from a given start non-terminal:

*Main> putStr $ printReachableGrammar recognizeGrammarArithTPF (UPBase Line)
<Digit_head> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<Digit> ::= <Digit_head>
<Factor_head> ::= <Digit>+ | ('(' <Expr> ')')
<Factor> ::= <Factor_head>
<Term_head> ::= <Factor>
<Term_tail> ::= ('/' <Factor>) | ('*' <Factor>)
<Term> ::= <Term_head> <Term_tail>*
<Expr_head> ::= <Term>
<Expr_tail> ::= ('-' <Term>) | ('+' <Term>)
<Expr> ::= <Expr_head> <Expr_tail>*
<Line_head> ::= <Expr> EOI
<Line> ::= <Line_head>

The grammar is now pretty readable and actually corresponds perfectly to what we would get after doing a standard manual removal of left recursion.

Recognizing...

After folding loops in the transformed grammar, we can parse using the packrat parser:

recognizeGrammarArithTPFF :: ProcessingContextFreeGrammar (FoldLoopsDomain (UPDomain ArithDomain)) Char (FoldLoopsValue (UPValue (K0 ())))
recognizeGrammarArithTPFF = foldAndProcessLoops recognizeGrammarArithTPF 
*Main> parsePackrat recognizeGrammarArithTPFF (FLBase (UPBase Line)) "123"
Parsed FLBV {unFLBV = UPBV {unUPBV = K0 (())}} _
*Main> parsePackrat recognizeGrammarArithTPFF (FLBase (UPBase Line)) "123+"
NoParse
*Main> parsePackrat recognizeGrammarArithTPFF (FLBase (UPBase Line)) "123+12"
Parsed FLBV {unFLBV = UPBV {unUPBV = K0 (())}} _

Our parser does not yet produce useful result values (because we used trivialProcessor), but it already correctly recognizes correct strings in our language. For succesful parses, we get a wrapped version of the value K0 () produced by trivialProcessor.

Semantic Actions

In order to produce meaningful result values using our grammar, we need a suitable semantic processor. We first define its semantic value family, which defines the set of values it produces.

data family ArithValue ix 
newtype instance ArithValue Line   = ArithValueL Int deriving (Show)
newtype instance ArithValue Expr   = ArithValueE Int deriving (Show)
newtype instance ArithValue Term   = ArithValueT Int deriving (Show)
newtype instance ArithValue Factor = ArithValueF Int deriving (Show)
newtype instance ArithValue Digit  = ArithValueD Char deriving (Show)

A semantic value family is a type family that specifies for each non-terminal what its semantic value is. In this case, we have an Int for all non-terminals except Digit, for which we use a character.

We now define a semantic processor using this semantic value family:

calcArith :: Processor ArithDomain ArithValue
calcArith Line   (LineF (ArithValueE e))                    = ArithValueL e
calcArith Expr   (SumF (ArithValueE e) (ArithValueT t))     = ArithValueE $ e + t
calcArith Expr   (SingleTermF (ArithValueT t))              = ArithValueE t
calcArith Term   (ProductF (ArithValueT e) (ArithValueF t)) = ArithValueT $ e * t
calcArith Term   (SingleFactorF (ArithValueF t))            = ArithValueT t
calcArith Factor (ParenthesizedF (ArithValueE e))           = ArithValueF e
calcArith Factor (NumberF ds)                               = ArithValueF $ read $ map unArithValueD ds
calcArith Digit  (DigitF c)                                 = ArithValueD c

unArithValueD :: ArithValue Digit -> Char
unArithValueD (ArithValueD c) = c

The function calcArith is given the non-terminal matched, and the components of the matched production rule that the grammar stored in the pattern functor value. For example, for a match of the Expr non-terminal, using the production rule "Expr ::= Expr '+' Term" (2nd case in the calcArith definition, PFArith constructor SumF), the value of the left-hand side and the right-hand side expressions are unwrapped, added and wrapped in the ArithValue constructor for Expr again. Note how all of this corresponds almost perfectly to what you will find in a syntax-directed translation definition.

We can now repeat the grammar transformations that we used with the trivialProcessor before to get the following:

calcGrammarArith :: ProcessingExtendedContextFreeGrammar ArithDomain Char ArithValue
calcGrammarArith = applyProcessorE grammarArith calcArith
calcGrammarArithTP :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue ArithValue)
calcGrammarArithTP = transformUniformPaullE calcGrammarArith
calcGrammarArithTPF :: ProcessingExtendedContextFreeGrammar (UPDomain ArithDomain) Char (UPValue ArithValue)
calcGrammarArithTPF = filterDiesE (unfoldDeadE calcGrammarArithTP) 
calcGrammarArithTPFF :: ProcessingContextFreeGrammar (FoldLoopsDomain (UPDomain ArithDomain)) Char (FoldLoopsValue (UPValue ArithValue))
calcGrammarArithTPFF = foldAndProcessLoops calcGrammarArithTPF

And parsing now gives us:

*Main> parsePackrat calcGrammarArithTPFF (FLBase (UPBase Line)) "123"
Parsed FLBV {unFLBV = UPBV {unUPBV = ArithValueL 123}} _
*Main> parsePackrat calcGrammarArithTPFF (FLBase (UPBase Line)) "123+"
NoParse
*Main> parsePackrat calcGrammarArithTPFF (FLBase (UPBase Line)) "123+12"
Parsed FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}} _

Our grammar can now indeed calculate the values for the strings that it parses. Note how we have used a single grammar with multiple semantic processors, without using the AST as an intermediate representation.

Other Parsers

Note how we can actually use the exact same grammar with Parsec or UUParse parse algorithms:

*Main> parseParsec calcGrammarArithTPFF (FLBase (UPBase Line)) "" "123+12"
Right (FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}})
*Main> parseUU calcGrammarArithTPFF (FLBase (UPBase Line)) "123+12"
FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}}

One of the features we can take advantage of by using UUParse is its error correction facilities. If we parse a faulty string, we get a corrected result, and there is a way to obtain a list of errors it found and corrections it applied.

*Main> parseUU calcGrammarArithTPFF (FLBase (UPBase Line)) (0 :: Int) "123+(12"
FLBV {unFLBV = UPBV {unUPBV = ArithValueL 135}}