(chibi parse)

A parser combinator library with optional memoization and convenient syntax.

Parse Streams

Parse streams are an abstraction to treat ports as proper streams so that we can backtrack from previous states. A single Parse-Stream record represents a single buffered chunk of text.

(make-parse-stream filename [port len])

Create a parse stream open on the given filename, with a possibly already opened port.

(file->parse-stream filename)

Open filename and create a parse stream on it.

(string->parse-stream str)

Create a parse stream on a string str.

(parse-stream-tail source)

Access the next buffered chunk of a parse stream.

(parse-stream-start? source i)

Returns true iff i is the first character position in the parse stream source.

(parse-stream-end? source i)

Returns true iff i is the last character position in the parse stream source.

(parse-stream-ref source i)

Returns the character in parse stream source indexed by i.

(parse-stream-substring s0 i0 s1 i1)

Returns a string composed of the characters starting at parse stream s0 index i0 (inclusive), and ending at s1 index i1 (exclusive).

Parser Interface

(parse-failure s i reason)

Combinator to indicate failure.

(call-with-parse f source index sk [fk])

Call the parser combinator f on the parse stream source, starting at index index, passing the result to the given success continuation sk, which should be a procedure of the form (result source index fail). The optional failure continuation should be a procedure of the form (source index reason), and defaults to just returning #f.

(parse f source [index])

Call the parser combinator f on the parse stream source, at index index, and return the result, or #f if parsing fails.

(parse-fully f source [index])

Call the parser combinator f on the parse stream source, at index index. If the entire source is not parsed, raises an error, otherwise returns the result.

(parse-fold f kons knil source [index])

The fundamental parse iterator. Repeatedly applies the parser combinator f to source, starting at index, as long as a valid parse is found. On each successful parse applies the procedure kons to the parse result and the previous kons result, beginning with knil. If no parses succeed returns knil.

(parse->list f source [index])

Parse as many of the parser combinator f from the parse stream source, starting at index, as possible, and return the result as a list.

(parse-fully->list f source [index])

As parse->list but requires the entire source be parsed with no left over characters, signalling an error otherwise.

(parse-with-failure-reason f reason)

Return a new parser combinator with the same behavior as f, but on failure replaces the reason with reason. This can be useful to provide more descriptive parse failure reasons when chaining combinators. For example, parse-string just expects to parse a single fixed string. If it were defined in terms of parse-char, failure would indicate some char failed to match, but it's more useful to describe the whole string we were expecting to see.

Basic Parsing Combinators

(parse-epsilon source index sk fk)

Parse nothing successfully.

(parse-anything source index sk fk)

Parse any single character successfully. Fails at end of input.

(parse-nothing source index sk fk)

Always fail to parse.

(parse-or f . o)

The disjunction combinator. Returns the first combinator that succeeds parsing from the same source and index.

(parse-and f g)

The conjunction combinator. If both f and g parse successfully starting at the same source and index, returns the result of g. Otherwise fails.

(parse-not f)

The negation combinator. If f succeeds, fails, otherwise succeeds with #t.

(parse-seq . o)

The sequence combinator. Each combinator is applied in turn just past the position of the previous. If all succeed, returns a list of the results in order, skipping any ignored values.

(list->parse-seq ls)

Convert the list of parser combinators ls to a parse-seq sequence.

(parse-optional f . o)

The optional combinator. Parse the combinator f (in sequence with any additional combinator args o), and return the result, or parse nothing successully on failure.

(parse-repeat f [lo hi])

The repetition combinator. Parse f repeatedly and return a list of the results. lo is the minimum number of parses (deafult 0) to be considered a successful parse, and hi is the maximum number (default infinite) before stopping.

(parse-repeat+ f)

Parse f one or more times.

(parse-map f proc)

Parse f and apply the procedure proc to the result on success.

(parse-map-substring f [proc])

Parse f and apply the procedure proc to the substring of the parsed data. proc defaults to the identity.

(parse-ignore f)

Parses the same streams as f but ignores the result on success. Inside a parse-seq the result will not be included in the list of results. Useful for discarding boiler-plate without the need for post-processing results.

(parse-assert f check?)

Parse with f and further require check? to return true when applied to the result.

(parse-atomic f)

Parse with f once and keep the first result, not allowing further backtracking within f.

(parse-commit f)

Parse with f once, keep the first result, and commit to the current parse path, discarding any prior backtracking options.

Boundary Checks

(parse-beginning source index sk fk)

Returns true iff index is the first index of the first parse stream source.

(parse-end source index sk fk)

Returns true iff index is the last index of the last parse stream source.

(parse-beginning-of-line source index sk fk)

Returns true iff source, index indicate the beginning of a line (or the entire stream).

(parse-end-of-line source index sk fk)

Returns true iff source, index indicate the end of a line (or the entire stream).

(parse-beginning-of-word source index sk fk)

Returns true iff source, index indicate the beginning of a word (or the entire stream).

(parse-end-of-word source index sk fk)

Returns true iff source, index indicate the end of a word (or the entire stream).

(parse-word [word])

Parse the combinator word (default a parse-token of char-alphabetic? or underscores), ensuring it begins and ends on a word boundary.

(parse-word+)

As parse-word, but instead of an arbitrary word combinator takes a character predicate pred (conjoined with char-alphabetic? or underscore), and parses a sequence of those characters with parse-token. Returns the parsed substring.

Constant Parsers

(parse-char x)

Parse a single char which matches x, which can be a character, character set, or arbitrary procedure.

(parse-not-char x)

Parse a single char which does not match x, which can be a character, character set, or arbitrary procedure.

(parse-string str)

Parse the exact string str.

(parse-token x)

Parse a sequence of characters matching x as with parse-char, and return the resulting substring.

(parse-sre x)

We provide an extended subset of SRE syntax (see SRFI 115), taking advantage of more general parsing features. These are just translated directly into parser combinators, with characters and strings implicitly matching themselves. For example, '(or "foo" "bar") matches either of the strings "foo" or "bar". Existing parser combinators may be embedded directly. This is of course more powerful than SREs since it is not restricted to regular languages (or in fact any languages), though it does not provide the same performance guarantees.

Laziness

A delayed combinator. This is equivalent to the parser combinator f, but is delayed so it can be more efficient if never used and f is expensive to compute. Moreover, it can allow self-referentiality as in:

(letrec* ((f (parse-lazy (parse-or (parse-seq g f) h))))
  ...)

Memoization

(parse-memoize name f)

Parse the same strings as f, but memoize the result at each source and index to avoid exponential backtracking. name is provided for debugging only.

Syntax

The four basic interfaces are grammar, define-grammar, and their unmemoized variants grammar/unmemoized and define-grammar/unmemoized. This is optimized for the common case - generally you want to memoize grammars, and may or may not want to memoize the smaller lexical components.

(grammar/unmemoized init (rule (clause [action]) ...) ...)

Describe an grammar for the given named rules and return the rule named init. The rules are parser combinators which match the first clause which succeeds, and returns the corresponding action. Each clause is an SRE parser as in parse-sre, which may include embdedded parser combinators with unquote (,). In particular, the rule itself and any other rules can be referenced in this way. The optional action, which defaults to the normal result of the clause parser, is a normal Scheme expression with all -> named expressions in clause bound to the corresponding result. Alternately, action can be of the form => receiver to send the results directly to a success continuation as in call-with-parse.

(grammar init (rule (clause [action]) ...) ...)

Equivalent to grammar but memoizes each clause. Parsers nested within each clause are not automatically memoized, so if necessary should be memoized explicitly or split out into separate rules.

(define-grammar/unmemoized name (rule (clause [action]) ...) ...)

Similar to grammar/unmemoized, instead of returning a single entry point parser defines each rule as its own parser. Also defines name as an alist mapping rule names to their values.

(define-grammar name (rule (clause [action]) ...) ...)

The memoized version of define-grammar/unmemoized.

Example:

(define-grammar calc
  (space ((* ,(parse-char char-whitespace?))))
  (number ((-> n (+ ,(parse-char char-numeric?)))
           (string->number (list->string n))))
  (simple ((-> n ,number) n)
          ((: "(" (=> e1 ,term) ")") e1))
  (term-op ("*" *)
           ("/" /)
           ("%" modulo))
  (term ((: (-> e1 ,simple) ,space (-> op ,term-op) ,space (-> e2 ,term))
         (op e1 e2))
        ((-> e1 ,simple)
         e1)))
(parse term "12 / (2*3)")
=> 2