libs/parser-combinators/parser-combinators.html

   1 <html>
   2   <head>
   3     <title>Parser Combinators</title>
   4     <link rel="stylesheet" type="text/css" href="style.css">
   5       </head>
   6   <body>
   7     <h1>Parsers</h1>
   8 <p class="note">The parser combinator library described here is based
   9   on a library written for the Clean pure functional programming language and
  10   described in chapter 5 of the 'Clean Book' (<a
  11   href="ftp://ftp.cs.kun.nl/pub/Clean/papers/cleanbook/II.05.ParserCombinators.pdf">PDF
  12   available here</a>). Based on the description
  13   in that chapter I developed a version for Factor, a concatenative
  14   language.</p>
  15 <p>A parser is a word or quotation that, when called, processes
  16    an input string on the stack, performs some parsing operation on
  17    it, and returns a result indicating the success of the parsing
  18    operation.</p>
  19 <p>The result returned by a parser is known as a 'list of
  20 successes'. It is a lazy list of standard Factor cons cells. Each cons
  21 cell is a result of a parse. The car of the cell is the remaining
  22 input left to be parsed and the cdr of the cell is the result of the
  23 parsing operation.</p>
  24 <p>A lazy list is used for the result as a parse operation can potentially
  25 return many successful results. For example, a parser that parses one
  26 or more digits will return more than one result for the input "123". A
  27 successful parse could be "1", "12" or "123".</p>
  28 <p>The list is lazy so if only one parse result is required the
  29 remaining results won't actually be processed if they are not
  30 requested. This improves efficiency.</p>
  31 <p>The cdr of the result pair can be any value that the parser wishes
  32 to return. It could be the successful portion of the input string
  33 parsed, an abstract syntax tree representing the parsed input, or even
  34 a quotation that should get called for later processing.</p>
  35 <p>A Parser Combinator is a word that takes one or more parsers and
  36 returns a parser that when called uses the original parsers in some
  37 manner.</p>
  38 <h1>Example Parsers</h1>
  39 <p>The following are some very simple parsers that demonstrate how
  40 general parsers work and the 'list of sucesses' that are returned as a
  41 result.</p>
  42 <pre class="code">
  43   (1) : char-a ( inp -- result )
  44         0 over string-nth CHAR: a = [
  45           1 swap string-tail CHAR: a cons unit delay lunit
  46         ] [
  47           drop lnil
  48         ] ifte ;
  49   (2) "atest" char-a [ [ . ] leach ] when*
  50       =&gt; [[ "test" 97 ]]
  51   (3) "test"  char-a [ [ . ] leach ] when*
  52       =&gt;
  53 </pre>
  54 <p>'char-a' is a parser that only accepts the character 'a' in the
  55 input string. When passed an input string with a string with a leading
  56 'a' then the 'list of successes' has 1 result value. The cdr of that
  57 result value is the character 'a' successfully parsed, and the car is
  58 the remaining input string. On failure of the parse an empty list is
  59 returned.</p>
  60 <p>The parser combinator library provides a combinator, &lt;&amp;&gt;, that takes
  61 two parsers off the stack and returns a parser that calls the original
  62 two in sequence. An example of use would be calling 'char-a' twice,
  63 which would then result in an input string expected with two 'a'
  64 characters leading:</p>
  65 <pre class="code">
  66   (1) "aatest" [ char-a ] [ char-a ] &lt;&amp;&gt; call
  67       =&gt; < list of successes >
  68   (2) [ . ] leach
  69       =&gt; [[ "test" [[ 97 97 ]] ]]
  70 </pre>
  71 <h2>Tokens</h2>
  72 <p>Creating parsers for specfic characters and tokens can be a chore
  73 so there is a word that, given a string token on the stack, returns
  74 a parser that parses that particular token:</p>
  75 <pre class="code">
  76   (1) "begin" token
  77       =&gt; < a parser that parses the token "begin" >
  78   (2) dup "this should fail" swap call lnil? .
  79       =&gt; t
  80   (3) "begin a successfull parse" swap call
  81       =&gt; < lazy list >
  82   (4) [ . ] leach
  83       =&gt; [[ " a successfull parse" "begin" ]]
  84 </pre>
  85 <h2>Predicate matching</h2>
  86 <p>The word 'satisfy' takes a quotation from the top of the stack and
  87 returns a parser than when called will call the quotation with the
  88 first item in the input string on the stack. If the quotation returns
  89 true then the parse is successful, otherwise it fails:</p>
  90 <pre class="code">
  91   (1) : digit-parser ( -- parser )
  92         [ digit? ] satisfy ;
  93   (2) "5" digit-parser call [ . ] leach
  94       =&gt; [[ "" 53 ]]
  95   (3) "a" digit-parser call lnil? .
  96       =&gt; t
  97 </pre>
  98 <p>Note that 'digit-parser' returns a parser, it is not the parser
  99 itself. It is really a parser generating word like 'token'. Whereas
 100 our 'char-a' word defined originally was a parser itself.</p>
 101 <h2>Zero or more matches</h2>
 102 <p>Now that we can parse single digits it would be nice to easily
 103 parse a string of them. The '<*>' parser combinator word will do
 104 this. It accepts a parser on the top of the stack and produces a
 105 parser that parses zero or more of the constructs that the original
 106 parser parsed. The result of the '<*>' generated parser will be a list
 107 of the successful results returned by the original parser.</p>
 108 <pre class="code">
 109   (1) digit-parser <*>
 110       =&gt; < parser >
 111   (2) "123" swap call
 112       =&gt; < lazy list >
 113   (3) [ . ] leach
 114       =&gt; [ "" [ 49 50 51 ] ]
 115            [ "3" [ 49 50 ] ]
 116            [ "23" [ 49 ] ]
 117            [ "123" ]
 118 </pre>
 119 <p>In this case there are multiple successful parses. This is because
 120 the occurrence of zero or more digits happens more than once. There is
 121 also the 'f' case where zero digits is parsed. If only the 'longest
 122 match' is required then the lcar of the lazy list can be used and the
 123 remaining parse results are never produced.</p>
 124 <h2>Manipulating parse trees</h2>
 125 <p>The result of the previous parse was the list of characters
 126 parsed. Sometimes you want this to be something else, like an abstract
 127 syntax tree, or some calculation. For the digit case we may want the
 128 actual integer number.</p>
 129 <p>For this we can use the '&lt;@' parser
 130 combinator. This combinator takes a parser and a quotation on the
 131 stack and returns a new parser. When the new parser is called it will
 132 call the original parser to produce the results, then it will call the
 133 quotation on each successfull result, and the result of that quotation
 134 will be the result of the parse:</p>
 135 <pre class="code">
 136   (1) : digit-parser2 ( -- parser )
 137         [ digit? ] satisfy [ digit> ] &lt;@ ;
 138   (2) "5" digit-parser2 call [ . ] leach
 139       =&gt; [[ "" 5 ]]
 140 </pre>
 141 <p>Notice that now the result is the actual integer '5' rather than
 142 character code '53'.</p>
 143 <pre class="code">
 144   (1) : digit-list>number ( list -- number )
 145          #! Converts a list of digits to a number
 146          [ >digit ] map >string dup empty? [
 147            drop 0
 148          ] [
 149            str>number
 150          ]  ifte ;
 151   (2) : natural-parser ( -- parser )
 152         digit-parser2 <*> [ car digit-list>number unit  ] &lt;@  ;
 153   (3) "123" natural-parser call
 154       =&gt; < lazy list >
 155   (4) [ . ] leach
 156       =&gt; [ "" 123 ]
 157            [ "3" 12 ]
 158            [ "23" 1 ]
 159            [ "123" 0 ]
 160            [ [ 123 ] | "" ]
 161 </pre>
 162 <p>The number parsed is the actual integer number due to the operation
 163 of the '&lt;@' word. This allows parsers to not only parse the input
 164 string but perform operations and transformations on the syntax tree
 165 returned.</p>
 166 <p>A useful debugging method to work out what to use in the quotation
 167 passed to &lt;@ is to write an initial version of the parser that just
 168 displays the topmost item on the stack:</p>
 169 <pre class="code">
 170   (1) : natural-parser-debug ( -- parser )
 171         digit-parser2 <*> [ "debug: " write dup . ] &lt;@  ;
 172   (3) "123" natural-parser-debug call lcar .
 173       =&gt; debug: [ [ 1 2 3 ] ]
 174            [ "" [ 1 2 3 ] ]
 175 </pre>
 176 <p>From the debug output we can see how to manipulate the result to
 177 get what we want. In this case it's the quotation in the previous example.</p>
 178
 179 <h2>Sequential combinator</h2>
 180 <p>To create a full grammar we need a parser combinator that does
 181 sequential compositions. That is, given two parsers, the sequential
 182 combinator will first run the first parser, and then run the second on
 183 the remaining text to be parsed. As the first parser returns a lazy
 184 list, the second parser will be run on each item of the lazy list. Of
 185 course this is done lazily so it only ends up being done when those
 186 list items are requested. The sequential combinator word is &lt;&amp;&gt;.</p>
 187 <pre class="code">
 188   ( 1 ) "number:" token
 189        =&gt; < parser that parses the text 'number:' >
 190   ( 2 ) natural-parser
 191        =&gt; < parser that parses natural numbers >
 192   ( 3 ) &lt;&amp;&gt;
 193        =&gt; < parser that parses 'number:' followed by a natural >
 194   ( 4 ) "number:100" swap call
 195        =&gt; < list of successes >
 196   ( 5 ) [ . ] leach
 197        =&gt; [ "" "number:" 100 ]
 198             [ "0" "number:" 10 ]
 199             [ "00" "number:" 1 ]
 200             [ "100" "number:" 0 ]
 201 </pre>
 202 <p>In this  example we might prefer not to have the parse result
 203 contain the token, we want just the number. Two alternatives to &lt;&amp;&gt;
 204 provide the ability to select which result to use from the two
 205 parsers. These operators are &lt;&amp; and &amp;&gt;. The &lt; or &gt; points
 206 in the direction of which parser to retain the results from. So our
 207 example above could be:</p>
 208 <pre class="code">
 209   ( 1 ) "number:" token
 210        =&gt; < parser that parses the text 'number:' >
 211   ( 2 ) natural-parser
 212        =&gt; < parser that parses natural numbers >
 213   ( 3 ) &amp;&gt;
 214        =&gt; < parser that parses 'number:' followed by a natural >
 215   ( 4 ) "number:100" swap call
 216        =&gt; < list of successes >
 217   ( 5 ) [ . ] leach
 218        =&gt; [ "" 100 ]
 219             [ "0" 10 ]
 220             [ "00" 1 ]
 221             [ "100" 0 ]
 222 </pre>
 223 <p>Notice how the parse result only contains the number due to &&gt;
 224 being used to retain the result of the second parser.</p>
 225
 226 <h2>Choice combinator</h2>
 227 <p>As well as a sequential combinator we need an alternative
 228 combinator. The word for this is &lt;|&gt;. It takes two parsers from the
 229 stack and returns a parser that will first try the first parser. If it
 230 succeeds then the result for that is returned. If it fails then the
 231 second parser is tried and its result returned.</p>
 232 <pre class="code">
 233   ( 1 ) "one" token
 234         =&gt; < parser that parses the text 'one' >
 235   ( 2 ) "two" token
 236         =&gt; < parser that parses the text 'two' >
 237   ( 3 ) &lt;|&gt;
 238         =&gt; < parser that parses 'one' or 'two' >
 239   ( 4 ) "one" over call [ . ] leach
 240         =&gt; [[ "" "one" ]]
 241   ( 5 ) "two" swap call [ . ] leach
 242         =&gt; [[ "" "two" ]]
 243 </pre>
 244
 245 <h2>Option combinator</h2>
 246 <p>The option combinator, &lt;?&gt; allows adding optional elements to
 247 a parser. It takes one parser off the stack and if the parse succeeds
 248 add it to the result tree, otherwise it will ignore it and
 249 continue. The example below extends our natural-parser to parse
 250 integers with an optional leading minus sign.</p>
 251 <pre class="code">
 252   ( 1 ) : integer-parser
 253           "-" token &lt;?&gt; natural-parser &lt;&amp;&gt; ;
 254   ( 2 ) "200" integer-parser call [ . ] leach
 255        =&gt; [ "" [ ] 200 ]
 256             [ "0" [ ] 20 ]
 257             [ "00" [ ] 2 ]
 258             [ "200" [ ] 0 ]
 259   ( 3 ) "-200" integer-parser call [ . ] leach
 260        =&gt; [ "" [ "-" ] 200 ]
 261             [ "0" [ "-" ] 20 ]
 262             [ "00" [ "-" ] 2 ]
 263             [ "200" [ "-" ] 0 ]
 264             [ "-200" [ ] 0 ]
 265   ( 4 ) : integer-parser2
 266           integer-parser [ uncons swap [ car -1 * ] when ] &lt;@ ;
 267   ( 5 ) "200" integer-parser2 call [ . ] leach
 268        =&gt; [ "" 200 ]
 269             [ "0" 20 ]
 270             [ "00" 2 ]
 271             [ "200" 0 ]
 272   ( 6 ) "-200" integer-parser2 call [ . ] leach
 273        =&gt; [ "" -200 ]
 274             [ "0" -20 ]
 275             [ "00" -2 ]
 276             [ "200" 0 ]
 277             [ "-200" 0 ]
 278
 279 </pre>
 280
 281 <h2>Skipping Whitespace</h2>
 282 <p>A parser transformer exists, the word 'sp', that takes an existing
 283 parser and returns a new one that will first skip any whitespace
 284 before calling the original parser. This makes it easy to write
 285 grammers that avoid whitespace without having to explicitly code it
 286 into the grammar.</p>
 287 <pre class="code">
 288   ( 1 ) "  123" natural-parser call [ . ] leach
 289         =&gt; [ "  123" 0 ]
 290   ( 2 ) "  123" natural-parser sp call [ . ] leach
 291         =&gt; [ "" 123 ]
 292              [ "3" 12 ]
 293              [ "23" 1 ]
 294              [ "123" 0 ]
 295 </pre>
 296 <h2>Eval grammar example</h2>
 297 <p>This example presents a simple grammar that will parse a number
 298 followed by an operator and another number. A factor expression that
 299 computes the entered value will be executed.</p>
 300 <pre class="code">
 301   ( 1 ) natural-parser
 302         =&gt; < a parser for natural numbers >
 303   ( 2 ) "/" token "*" token "+" token "-" token &lt;|&gt; &lt;|&gt; &lt;|&gt;
 304         =&gt; < a parser for the operator >
 305   ( 3 ) sp [ "\\ " swap cat2 eval unit ] &lt;@
 306         =&gt; < operator parser that skips whitespace and converts to a
 307              factor expression >
 308   ( 4 ) natural-parser sp
 309         =&gt; < a whitespace skipping natural parser >
 310   ( 5 ) &lt;&amp;&gt; &lt;&amp;&gt; [ uncons uncons swap append append call ] &lt;@
 311         =&gt; < a parser that parsers the expression, converts it to
 312              factor, calls it and puts the result in the parse tree >
 313   ( 6 ) "123 + 456" over call lcar .
 314         =&gt; [[ "" 579 ]]
 315   ( 7 ) "300-100" over call lcar .
 316         =&gt; [[ "" 200 ]]
 317   ( 8 ) "200/2" over call lcar .
 318         =&gt; [[ "" 100 ]]
 319 </pre>
 320 <p>It looks complicated when expanded as above but the entire parser,
 321 factored a little, looks quite readable:</p>
 322 <pre class="code">
 323   ( 1 ) : operator ( -- parser )
 324           "/" token
 325           "*" token &lt;|&gt;
 326           "+" token &lt;|&gt;
 327           "-" token &lt;|&gt;
 328           [ "\\ " swap cat2 eval unit ] &lt;@ ;
 329   ( 2 ) : expression ( -- parser )
 330           natural-parser
 331           operator sp &lt;&amp;&gt;
 332           natural-parser sp &lt;&amp;&gt;
 333           [ uncons swap uncons -rot append append reverse call ] &lt;@ ;
 334   ( 3 ) "40+2" expression call lcar .
 335         =&gt; [[ "" 42 ]]
 336 </pre>
 337 <p class="footer">
 338 News and updates to this software can be obtained from the authors
 339 weblog: <a href="http://radio.weblogs.com/0102385">Chris Double</a>.</p>
 340 <p id="copyright">Copyright (c) 2004, Chris Double. All Rights Reserved.</p>
 341 </body> </html>