4 <title>Parser Combinators</title>
5 <link rel="stylesheet" type="text/css" href="style.css">
9 <p class="note">The parser combinator library described here is based
10 on a library written for the Clean pure functional programming language and
11 described in chapter 5 of the 'Clean Book' (<a
12 href="ftp://ftp.cs.kun.nl/pub/Clean/papers/cleanbook/II.05.ParserCombinators.pdf">PDF
13 available here</a>). Based on the description
14 in that chapter I developed a version for Factor, a concatenative
16 <p>A parser is a word or quotation that, when called, processes
17 an input string on the stack, performs some parsing operation on
18 it, and returns a result indicating the success of the parsing
20 <p>The result returned by a parser is known as a 'list of
21 successes'. It is a lazy list of standard Factor cons cells. Each cons
22 cell is a result of a parse. The car of the cell is the remaining
23 input left to be parsed and the cdr of the cell is the result of the
24 parsing operation.</p>
25 <p>A lazy list is used for the result as a parse operation can potentially
26 return many successful results. For example, a parser that parses one
27 or more digits will return more than one result for the input "123". A
28 successful parse could be "1", "12" or "123".</p>
29 <p>The list is lazy so if only one parse result is required the
30 remaining results won't actually be processed if they are not
31 requested. This improves efficiency.</p>
32 <p>The cdr of the result pair can be any value that the parser wishes
33 to return. It could be the successful portion of the input string
34 parsed, an abstract syntax tree representing the parsed input, or even
35 a quotation that should get called for later processing.</p>
36 <p>A Parser Combinator is a word that takes one or more parsers and
37 returns a parser that when called uses the original parsers in some
39 <h1>Example Parsers</h1>
40 <p>The following are some very simple parsers that demonstrate how
41 general parsers work and the 'list of sucesses' that are returned as a
44 (1) : char-a ( inp -- result )
45 0 over string-nth CHAR: a = [
46 1 swap string-tail CHAR: a cons unit delay lunit
50 (2) "atest" char-a [ [ . ] leach ] when*
52 (3) "test" char-a [ [ . ] leach ] when*
55 <p>'char-a' is a parser that only accepts the character 'a' in the
56 input string. When passed an input string with a string with a leading
57 'a' then the 'list of successes' has 1 result value. The cdr of that
58 result value is the character 'a' successfully parsed, and the car is
59 the remaining input string. On failure of the parse an empty list is
61 <p>The parser combinator library provides a combinator, <&>, that takes
62 two parsers off the stack and returns a parser that calls the original
63 two in sequence. An example of use would be calling 'char-a' twice,
64 which would then result in an input string expected with two 'a'
65 characters leading:</p>
67 (1) "aatest" [ char-a ] [ char-a ] <&> call
68 => < list of successes >
70 => [[ "test" [[ 97 97 ]] ]]
73 <p>Creating parsers for specfic characters and tokens can be a chore
74 so there is a word that, given a string token on the stack, returns
75 a parser that parses that particular token:</p>
78 => < a parser that parses the token "begin" >
79 (2) dup "this should fail" swap call lnil? .
81 (3) "begin a successfull parse" swap call
84 => [[ " a successfull parse" "begin" ]]
86 <h2>Predicate matching</h2>
87 <p>The word 'satisfy' takes a quotation from the top of the stack and
88 returns a parser than when called will call the quotation with the
89 first item in the input string on the stack. If the quotation returns
90 true then the parse is successful, otherwise it fails:</p>
92 (1) : digit-parser ( -- parser )
94 (2) "5" digit-parser call [ . ] leach
96 (3) "a" digit-parser call lnil? .
99 <p>Note that 'digit-parser' returns a parser, it is not the parser
100 itself. It is really a parser generating word like 'token'. Whereas
101 our 'char-a' word defined originally was a parser itself.</p>
102 <h2>Zero or more matches</h2>
103 <p>Now that we can parse single digits it would be nice to easily
104 parse a string of them. The '<*>' parser combinator word will do
105 this. It accepts a parser on the top of the stack and produces a
106 parser that parses zero or more of the constructs that the original
107 parser parsed. The result of the '<*>' generated parser will be a list
108 of the successful results returned by the original parser.</p>
115 => [ "" [ 49 50 51 ] ]
120 <p>In this case there are multiple successful parses. This is because
121 the occurrence of zero or more digits happens more than once. There is
122 also the 'f' case where zero digits is parsed. If only the 'longest
123 match' is required then the lcar of the lazy list can be used and the
124 remaining parse results are never produced.</p>
125 <h2>Manipulating parse trees</h2>
126 <p>The result of the previous parse was the list of characters
127 parsed. Sometimes you want this to be something else, like an abstract
128 syntax tree, or some calculation. For the digit case we may want the
129 actual integer number.</p>
130 <p>For this we can use the '<@' parser
131 combinator. This combinator takes a parser and a quotation on the
132 stack and returns a new parser. When the new parser is called it will
133 call the original parser to produce the results, then it will call the
134 quotation on each successfull result, and the result of that quotation
135 will be the result of the parse:</p>
137 (1) : digit-parser2 ( -- parser )
138 [ digit? ] satisfy [ digit> ] <@ ;
139 (2) "5" digit-parser2 call [ . ] leach
142 <p>Notice that now the result is the actual integer '5' rather than
143 character code '53'.</p>
145 (1) : digit-list>number ( list -- number )
146 #! Converts a list of digits to a number
147 [ >digit ] map >string dup empty? [
152 (2) : natural-parser ( -- parser )
153 digit-parser2 <*> [ car digit-list>number unit ] <@ ;
154 (3) "123" natural-parser call
163 <p>The number parsed is the actual integer number due to the operation
164 of the '<@' word. This allows parsers to not only parse the input
165 string but perform operations and transformations on the syntax tree
167 <p>A useful debugging method to work out what to use in the quotation
168 passed to <@ is to write an initial version of the parser that just
169 displays the topmost item on the stack:</p>
171 (1) : natural-parser-debug ( -- parser )
172 digit-parser2 <*> [ "debug: " write dup . ] <@ ;
173 (3) "123" natural-parser-debug call lcar .
174 => debug: [ [ 1 2 3 ] ]
177 <p>From the debug output we can see how to manipulate the result to
178 get what we want. In this case it's the quotation in the previous example.</p>
180 <h2>Sequential combinator</h2>
181 <p>To create a full grammar we need a parser combinator that does
182 sequential compositions. That is, given two parsers, the sequential
183 combinator will first run the first parser, and then run the second on
184 the remaining text to be parsed. As the first parser returns a lazy
185 list, the second parser will be run on each item of the lazy list. Of
186 course this is done lazily so it only ends up being done when those
187 list items are requested. The sequential combinator word is <&>.</p>
189 ( 1 ) "number:" token
190 => < parser that parses the text 'number:' >
192 => < parser that parses natural numbers >
194 => < parser that parses 'number:' followed by a natural >
195 ( 4 ) "number:100" swap call
196 => < list of successes >
198 => [ "" "number:" 100 ]
201 [ "100" "number:" 0 ]
203 <p>In this example we might prefer not to have the parse result
204 contain the token, we want just the number. Two alternatives to <&>
205 provide the ability to select which result to use from the two
206 parsers. These operators are <& and &>. The < or > points
207 in the direction of which parser to retain the results from. So our
208 example above could be:</p>
210 ( 1 ) "number:" token
211 => < parser that parses the text 'number:' >
213 => < parser that parses natural numbers >
215 => < parser that parses 'number:' followed by a natural >
216 ( 4 ) "number:100" swap call
217 => < list of successes >
224 <p>Notice how the parse result only contains the number due to &>
225 being used to retain the result of the second parser.</p>
227 <h2>Choice combinator</h2>
228 <p>As well as a sequential combinator we need an alternative
229 combinator. The word for this is <|>. It takes two parsers from the
230 stack and returns a parser that will first try the first parser. If it
231 succeeds then the result for that is returned. If it fails then the
232 second parser is tried and its result returned.</p>
235 => < parser that parses the text 'one' >
237 => < parser that parses the text 'two' >
239 => < parser that parses 'one' or 'two' >
240 ( 4 ) "one" over call [ . ] leach
242 ( 5 ) "two" swap call [ . ] leach
246 <h2>Option combinator</h2>
247 <p>The option combinator, <?> allows adding optional elements to
248 a parser. It takes one parser off the stack and if the parse succeeds
249 add it to the result tree, otherwise it will ignore it and
250 continue. The example below extends our natural-parser to parse
251 integers with an optional leading minus sign.</p>
253 ( 1 ) : integer-parser
254 "-" token <?> natural-parser <&> ;
255 ( 2 ) "200" integer-parser call [ . ] leach
260 ( 3 ) "-200" integer-parser call [ . ] leach
261 => [ "" [ "-" ] 200 ]
266 ( 4 ) : integer-parser2
267 integer-parser [ uncons swap [ car -1 * ] when ] <@ ;
268 ( 5 ) "200" integer-parser2 call [ . ] leach
273 ( 6 ) "-200" integer-parser2 call [ . ] leach
282 <h2>Skipping Whitespace</h2>
283 <p>A parser transformer exists, the word 'sp', that takes an existing
284 parser and returns a new one that will first skip any whitespace
285 before calling the original parser. This makes it easy to write
286 grammers that avoid whitespace without having to explicitly code it
287 into the grammar.</p>
289 ( 1 ) " 123" natural-parser call [ . ] leach
291 ( 2 ) " 123" natural-parser sp call [ . ] leach
297 <h2>Eval grammar example</h2>
298 <p>This example presents a simple grammar that will parse a number
299 followed by an operator and another number. A factor expression that
300 computes the entered value will be executed.</p>
303 => < a parser for natural numbers >
304 ( 2 ) "/" token "*" token "+" token "-" token <|> <|> <|>
305 => < a parser for the operator >
306 ( 3 ) sp [ "\\ " swap cat2 eval unit ] <@
307 => < operator parser that skips whitespace and converts to a
309 ( 4 ) natural-parser sp
310 => < a whitespace skipping natural parser >
311 ( 5 ) <&> <&> [ uncons uncons swap append append call ] <@
312 => < a parser that parsers the expression, converts it to
313 factor, calls it and puts the result in the parse tree >
314 ( 6 ) "123 + 456" over call lcar .
316 ( 7 ) "300-100" over call lcar .
318 ( 8 ) "200/2" over call lcar .
321 <p>It looks complicated when expanded as above but the entire parser,
322 factored a little, looks quite readable:</p>
324 ( 1 ) : operator ( -- parser )
329 [ "\\ " swap cat2 eval unit ] <@ ;
330 ( 2 ) : expression ( -- parser )
332 operator sp <&>
333 natural-parser sp <&>
334 [ uncons swap uncons -rot append append reverse call ] <@ ;
335 ( 3 ) "40+2" expression call lcar .
339 News and updates to this software can be obtained from the authors
340 weblog: <a href="http://radio.weblogs.com/0102385">Chris Double</a>.</p>
341 <p id="copyright">Copyright (c) 2004, Chris Double. All Rights Reserved.