Parsers

The parser combinator library described here is based on a library written for the Clean pure functional programming language and described in chapter 5 of the 'Clean Book' (PDF available here). Based on the description in that chapter I developed a version for Factor, a concatenative language.

A parser is a word or quotation that, when called, processes an input string on the stack, performs some parsing operation on it, and returns a result indicating the success of the parsing operation.

The result returned by a parser is known as a 'list of successes'. It is a lazy list of standard Factor cons cells. Each cons cell is a result of a parse. The car of the cell is the remaining input left to be parsed and the cdr of the cell is the result of the parsing operation.

A lazy list is used for the result as a parse operation can potentially return many successful results. For example, a parser that parses one or more digits will return more than one result for the input "123". A successful parse could be "1", "12" or "123".

The list is lazy so if only one parse result is required the remaining results won't actually be processed if they are not requested. This improves efficiency.

The cdr of the result pair can be any value that the parser wishes to return. It could be the successful portion of the input string parsed, an abstract syntax tree representing the parsed input, or even a quotation that should get called for later processing.

A Parser Combinator is a word that takes one or more parsers and returns a parser that when called uses the original parsers in some manner.

Example Parsers

The following are some very simple parsers that demonstrate how general parsers work and the 'list of sucesses' that are returned as a result.

  (1) : char-a ( inp -- result )
        0 over string-nth CHAR: a = [
          1 swap string-tail CHAR: a cons unit delay lunit
        ] [
          drop lnil
        ] ifte ;
  (2) "atest" char-a [ [ . ] leach ] when*
      => [[ "test" 97 ]]
  (3) "test"  char-a [ [ . ] leach ] when*
      =>

'char-a' is a parser that only accepts the character 'a' in the input string. When passed an input string with a string with a leading 'a' then the 'list of successes' has 1 result value. The cdr of that result value is the character 'a' successfully parsed, and the car is the remaining input string. On failure of the parse an empty list is returned.

The parser combinator library provides a combinator, <&>, that takes two parsers off the stack and returns a parser that calls the original two in sequence. An example of use would be calling 'char-a' twice, which would then result in an input string expected with two 'a' characters leading:

  (1) "aatest" [ char-a ] [ char-a ] <&> call
      => < list of successes >
  (2) [ . ] leach
      => [[ "test" [[ 97 97 ]] ]]

Tokens

Creating parsers for specfic characters and tokens can be a chore so there is a word that, given a string token on the stack, returns a parser that parses that particular token:

  (1) "begin" token 
      => < a parser that parses the token "begin" >
  (2) dup "this should fail" swap call lnil? .
      => t
  (3) "begin a successfull parse" swap call 
      => < lazy list >
  (4) [ . ] leach
      => [[ " a successfull parse" "begin" ]]

Predicate matching

The word 'satisfy' takes a quotation from the top of the stack and returns a parser than when called will call the quotation with the first item in the input string on the stack. If the quotation returns true then the parse is successful, otherwise it fails:

  (1) : digit-parser ( -- parser )
        [ digit? ] satisfy ;
  (2) "5" digit-parser call [ . ] leach
      => [[ "" 53 ]]
  (3) "a" digit-parser call lnil? .
      => t

Note that 'digit-parser' returns a parser, it is not the parser itself. It is really a parser generating word like 'token'. Whereas our 'char-a' word defined originally was a parser itself.

Zero or more matches

Now that we can parse single digits it would be nice to easily parse a string of them. The '<*>' parser combinator word will do this. It accepts a parser on the top of the stack and produces a parser that parses zero or more of the constructs that the original parser parsed. The result of the '<*>' generated parser will be a list of the successful results returned by the original parser.

  (1) digit-parser <*>
      => < parser >
  (2) "123" swap call
      => < lazy list >
  (3) [ . ] leach
      => [ "" [ 49 50 51 ] ]
           [ "3" [ 49 50 ] ]
           [ "23" [ 49 ] ]
           [ "123" ]

In this case there are multiple successful parses. This is because the occurrence of zero or more digits happens more than once. There is also the 'f' case where zero digits is parsed. If only the 'longest match' is required then the lcar of the lazy list can be used and the remaining parse results are never produced.

Manipulating parse trees

The result of the previous parse was the list of characters parsed. Sometimes you want this to be something else, like an abstract syntax tree, or some calculation. For the digit case we may want the actual integer number.

For this we can use the '<@' parser combinator. This combinator takes a parser and a quotation on the stack and returns a new parser. When the new parser is called it will call the original parser to produce the results, then it will call the quotation on each successfull result, and the result of that quotation will be the result of the parse:

  (1) : digit-parser2 ( -- parser )
        [ digit? ] satisfy [ digit> ] <@ ;
  (2) "5" digit-parser2 call [ . ] leach
      => [[ "" 5 ]]

Notice that now the result is the actual integer '5' rather than character code '53'.

  (1) : digit-list>number ( list -- number )
         #! Converts a list of digits to a number
         [ >digit ] map >string dup empty? [ 
           drop 0 
         ] [
	   str>number 
         ]  ifte ;
  (2) : natural-parser ( -- parser )
        digit-parser2 <*> [ car digit-list>number unit  ] <@  ;
  (3) "123" natural-parser call
      => < lazy list >
  (4) [ . ] leach
      => [ "" 123 ]
           [ "3" 12 ]
           [ "23" 1 ]
           [ "123" 0 ]
           [ [ 123 ] | "" ]

The number parsed is the actual integer number due to the operation of the '<@' word. This allows parsers to not only parse the input string but perform operations and transformations on the syntax tree returned.

A useful debugging method to work out what to use in the quotation passed to <@ is to write an initial version of the parser that just displays the topmost item on the stack:

  (1) : natural-parser-debug ( -- parser )
        digit-parser2 <*> [ "debug: " write dup . ] <@  ;
  (3) "123" natural-parser-debug call lcar .
      => debug: [ [ 1 2 3 ] ]
           [ "" [ 1 2 3 ] ]

From the debug output we can see how to manipulate the result to get what we want. In this case it's the quotation in the previous example.

Sequential combinator

To create a full grammar we need a parser combinator that does sequential compositions. That is, given two parsers, the sequential combinator will first run the first parser, and then run the second on the remaining text to be parsed. As the first parser returns a lazy list, the second parser will be run on each item of the lazy list. Of course this is done lazily so it only ends up being done when those list items are requested. The sequential combinator word is <&>.

  ( 1 ) "number:" token 
       => < parser that parses the text 'number:' >
  ( 2 ) natural-parser
       => < parser that parses natural numbers >
  ( 3 ) <&>
       => < parser that parses 'number:' followed by a natural >
  ( 4 ) "number:100" swap call
       => < list of successes >
  ( 5 ) [ . ] leach
       => [ "" "number:" 100 ]
            [ "0" "number:" 10 ]
            [ "00" "number:" 1 ]
            [ "100" "number:" 0 ]

In this example we might prefer not to have the parse result contain the token, we want just the number. Two alternatives to <&> provide the ability to select which result to use from the two parsers. These operators are <& and &>. The < or > points in the direction of which parser to retain the results from. So our example above could be:

  ( 1 ) "number:" token 
       => < parser that parses the text 'number:' >
  ( 2 ) natural-parser
       => < parser that parses natural numbers >
  ( 3 ) &>
       => < parser that parses 'number:' followed by a natural >
  ( 4 ) "number:100" swap call
       => < list of successes >
  ( 5 ) [ . ] leach
       => [ "" 100 ]
            [ "0" 10 ]
            [ "00" 1 ]
            [ "100" 0 ]

Notice how the parse result only contains the number due to &> being used to retain the result of the second parser.

Choice combinator

As well as a sequential combinator we need an alternative combinator. The word for this is <|>. It takes two parsers from the stack and returns a parser that will first try the first parser. If it succeeds then the result for that is returned. If it fails then the second parser is tried and its result returned.

  ( 1 ) "one" token
        => < parser that parses the text 'one' >
  ( 2 ) "two" token 
        => < parser that parses the text 'two' >
  ( 3 ) <|>
        => < parser that parses 'one' or 'two' >
  ( 4 ) "one" over call [ . ] leach
        => [[ "" "one" ]]
  ( 5 ) "two" swap call [ . ] leach
        => [[ "" "two" ]]

Option combinator

The option combinator, <?> allows adding optional elements to a parser. It takes one parser off the stack and if the parse succeeds add it to the result tree, otherwise it will ignore it and continue. The example below extends our natural-parser to parse integers with an optional leading minus sign.

  ( 1 ) : integer-parser
          "-" token <?> natural-parser <&> ;
  ( 2 ) "200" integer-parser call [ . ] leach 
       => [ "" [ ] 200 ]
            [ "0" [ ] 20 ]
            [ "00" [ ] 2 ]
            [ "200" [ ] 0 ]
  ( 3 ) "-200" integer-parser call [ . ] leach
       => [ "" [ "-" ] 200 ]
            [ "0" [ "-" ] 20 ]
            [ "00" [ "-" ] 2 ]
            [ "200" [ "-" ] 0 ]
            [ "-200" [ ] 0 ]
  ( 4 ) : integer-parser2
          integer-parser [ uncons swap [ car -1 * ] when ] <@ ;
  ( 5 ) "200" integer-parser2 call [ . ] leach 
       => [ "" 200 ]
            [ "0" 20 ]
            [ "00" 2 ]
            [ "200" 0 ]
  ( 6 ) "-200" integer-parser2 call [ . ] leach
       => [ "" -200 ]
            [ "0" -20 ]
            [ "00" -2 ]
            [ "200" 0 ]
            [ "-200" 0 ]

Skipping Whitespace

A parser transformer exists, the word 'sp', that takes an existing parser and returns a new one that will first skip any whitespace before calling the original parser. This makes it easy to write grammers that avoid whitespace without having to explicitly code it into the grammar.

  ( 1 ) "  123" natural-parser call [ . ] leach
        => [ "  123" 0 ]
  ( 2 ) "  123" natural-parser sp call [ . ] leach
        => [ "" 123 ]
             [ "3" 12 ]
             [ "23" 1 ]
             [ "123" 0 ]

Eval grammar example

This example presents a simple grammar that will parse a number followed by an operator and another number. A factor expression that computes the entered value will be executed.

  ( 1 ) natural-parser
        => < a parser for natural numbers >
  ( 2 ) "/" token "*" token "+" token "-" token <|> <|> <|>
        => < a parser for the operator >
  ( 3 ) sp [ "\\ " swap cat2 eval unit ] <@
        => < operator parser that skips whitespace and converts to a 
             factor expression >
  ( 4 ) natural-parser sp
        => < a whitespace skipping natural parser >
  ( 5 ) <&> <&> [ uncons uncons swap append append call ] <@
        => < a parser that parsers the expression, converts it to
             factor, calls it and puts the result in the parse tree >
  ( 6 ) "123 + 456" over call lcar .
        => [[ "" 579 ]]
  ( 7 ) "300-100" over call lcar .
        => [[ "" 200 ]]
  ( 8 ) "200/2" over call lcar .
        => [[ "" 100 ]]

It looks complicated when expanded as above but the entire parser, factored a little, looks quite readable:

  ( 1 ) : operator ( -- parser )
          "/" token 
          "*" token <|>
          "+" token <|>
          "-" token <|>
          [ "\\ " swap cat2 eval unit ] <@ ;
  ( 2 ) : expression ( -- parser )
          natural-parser 
          operator sp <&>  
          natural-parser sp <&> 
          [ uncons swap uncons -rot append append reverse call ] <@ ;
  ( 3 ) "40+2" expression call lcar .
        => [[ "" 42 ]]