Version 0.9.0
About
mpc is a lightweight and powerful Parser Combinator library for C.
Using mpc might be of interest to you if you are…
- Building a new programming language
- Building a new data format
- Parsing an existing programming language
- Parsing an existing data format
- Embedding a Domain Specific Language
- Implementing Greenspun’s Tenth Rule
Features
- Type-Generic
- Predictive, Recursive Descent
- Easy to Integrate (One Source File in ANSI C)
- Automatic Error Message Generation
- Regular Expression Parser Generator
- Language/Grammar Parser Generator
Alternatives
The current main alternative for a C based parser combinator library is a branch of Cesium3.
mpc provides a number of features that this project does not offer, and also overcomes a number of potential downsides:
- mpc Works for Generic Types
- mpc Doesn’t rely on Boehm-Demers-Weiser Garbage Collection
- mpc Doesn’t use
setjmp
andlongjmp
for errors - mpc Doesn’t pollute the namespace
Here is how one would use mpc to create a parser for a basic mathematical expression language.
” product :
” value : /[0-9]+/ | ‘(‘
” maths : /^/
Expr, Prod, Value, Maths, NULL);
mpc_result_t r;
if (mpc_parse(“input”, input, Maths, &r)) {
mpc_ast_print(r.output);
mpc_ast_delete(r.output);
} else {
mpc_err_print(r.error);
mpc_err_delete(r.error);
}
mpc_cleanup(4, Expr, Prod, Value, Maths);”>
mpc_parser_t *Expr = mpc_new("expression"); mpc_parser_t *Prod = mpc_new("product"); mpc_parser_t *Value = mpc_new("value"); mpc_parser_t *Maths = mpc_new("maths"); mpca_lang(MPCA_LANG_DEFAULT, " expression :(('+' | '-') )*; " " product :(('*' | '/') )*; " " value : /[0-9]+/ | '('')'; " " maths : /^//$/; " , Expr, Prod, Value, Maths, NULL); mpc_result_t r; if (mpc_parse("input", input, Maths, &r)) { mpc_ast_print(r.output); mpc_ast_delete(r.output); } else { mpc_err_print(r.error); mpc_err_delete(r.error); } mpc_cleanup(4, Expr, Prod, Value, Maths);
If you were to set input
to the string (4 * 2 * 11 + 2) - 5
, the printed output would look like this.
Parser Combinators are structures that encode how to parse particular languages. They can be combined using intuitive operators to create new parsers of increasing complexity. Using these operators detailed grammars and languages can be parsed and processed in a quick, efficient, and easy way.
The trick behind Parser Combinators is the observation that by structuring the library in a particular way, one can make building parser combinators look like writing a grammar itself. Therefore instead of describing how to parse a language, a user must only specify the language itself, and the library will work out how to parse it … as if by magic!
mpc can be used in this mode, or, as shown in the above example, you can specify the grammar directly as a string or in a file.
Basic Parsers
String Parsers
All the following functions construct new basic parsers of the type mpc_parser_t *
. All of those parsers return a newly allocated char *
with the character(s) they manage to match. If unsuccessful they will return an error. They have the following functionality.
mpc_parser_t *mpc_any(void);
Matches any individual character
mpc_parser_t *mpc_char(char c);
Matches a single given character c
mpc_parser_t *mpc_range(char s, char e);
Matches any single given character in the range s
to e
(inclusive)
mpc_parser_t *mpc_oneof(const char *s);
Matches any single given character in the string s
mpc_parser_t *mpc_noneof(const char *s);
Matches any single given character not in the string s
mpc_parser_t *mpc_satisfy(int(*f)(char));
Matches any single given character satisfying function f
mpc_parser_t *mpc_string(const char *s);
Matches exactly the string s
Other Parsers
Several other functions exist that construct parsers with some other special functionality.
mpc_parser_t *mpc_pass(void);
Consumes no input, always successful, returns NULL
mpc_parser_t *mpc_fail(const char *m); mpc_parser_t *mpc_failf(const char *fmt, ...);
Consumes no input, always fails with message m
or formatted string fmt
.
mpc_parser_t *mpc_lift(mpc_ctor_t f);
Consumes no input, always successful, returns the result of function f
mpc_parser_t *mpc_lift_val(mpc_val_t *x);
Consumes no input, always successful, returns x
mpc_parser_t *mpc_state(void);
Consumes no input, always successful, returns a copy of the parser state as a mpc_state_t *
. This state is newly allocated and so needs to be released with free
when finished with.
mpc_parser_t *mpc_anchor(int(*f)(char,char));
Consumes no input. Successful when function f
returns true. Always returns NULL
.
Function f
is a anchor function. It takes as input the last character parsed, and the next character in the input, and returns success or failure. This function can be set by the user to ensure some condition is met. For example to test that the input is at a boundary between words and non-words.
At the start of the input the first argument is set to ' '
. At the end of the input the second argument is set to ' '
.
Parsing
Once you’ve build a parser, you can run it on some input using one of the following functions. These functions return 1
on success and 0
on failure. They output either the result, or an error to a mpc_result_t
variable. This type is defined as follows.
typedef union { mpc_err_t *error; mpc_val_t *output; } mpc_result_t;
where mpc_val_t *
is synonymous with void *
and simply represents some pointer to data – the exact type of which is dependant on the parser.
int mpc_parse(const char *filename, const char *string, mpc_parser_t *p, mpc_result_t *r);
Run a parser on some string.
int mpc_parse_file(const char *filename, FILE *file, mpc_parser_t *p, mpc_result_t *r);
Run a parser on some file.
int mpc_parse_pipe(const char *filename, FILE *pipe, mpc_parser_t *p, mpc_result_t *r);
Run a parser on some pipe (such as stdin
).
int mpc_parse_contents(const char *filename, mpc_parser_t *p, mpc_result_t *r);
Run a parser on the contents of some file.
Combinators
Combinators are functions that take one or more parsers and return a new parser of some given functionality.
These combinators work independently of exactly what data type the parser(s) supplied as input return. In languages such as Haskell ensuring you don’t input one type of data into a parser requiring a different type is done by the compiler. But in C we don’t have that luxury. So it is at the discretion of the programmer to ensure that he or she deals correctly with the outputs of different parser types.
A second annoyance in C is that of manual memory management. Some parsers might get half-way and then fail. This means they need to clean up any partial result that has been collected in the parse. In Haskell this is handled by the Garbage Collector, but in C these combinators will need to take destructor functions as input, which say how clean up any partial data that has been collected.
Here are the main combinators and how to use then.
mpc_parser_t *mpc_expect(mpc_parser_t *a, const char *e); mpc_parser_t *mpc_expectf(mpc_parser_t *a, const char *fmt, ...);
Returns a parser that runs a
, and on success returns the result of a
, while on failure reports that e
was expected.
mpc_parser_t *mpc_apply(mpc_parser_t *a, mpc_apply_t f); mpc_parser_t *mpc_apply_to(mpc_parser_t *a, mpc_apply_to_t f, void *x);
Returns a parser that applies function f
(optionality taking extra input x
) to the result of parser a
.
mpc_parser_t *mpc_check(mpc_parser_t *a, mpc_dtor_t da, mpc_check_t f, const char *e); mpc_parser_t *mpc_check_with(mpc_parser_t *a, mpc_dtor_t da, mpc_check_with_t f, void *x, const char *e); mpc_parser_t *mpc_checkf(mpc_parser_t *a, mpc_dtor_t da, mpc_check_t f, const char *fmt, ...); mpc_parser_t *mpc_check_withf(mpc_parser_t *a, mpc_dtor_t da, mpc_check_with_t f, void *x, const char *fmt, ...);
Returns a parser that applies function f
(optionally taking extra input x
) to the result of parser a
. If f
returns non-zero, then the parser succeeds and returns the value of a
(possibly modified by f
). If f
returns zero, then the parser fails with message e
, and the result of a
is destroyed with the destructor da
.
mpc_parser_t *mpc_not(mpc_parser_t *a, mpc_dtor_t da); mpc_parser_t *mpc_not_lift(mpc_parser_t *a, mpc_dtor_t da, mpc_ctor_t lf);
Returns a parser with the following behaviour. If parser a
succeeds, then it fails and consumes no input. If parser a
fails, then it succeeds, consumes no input and returns NULL
(or the result of lift function lf
). Destructor da
is used to destroy the result of a
on success.
mpc_parser_t *mpc_maybe(mpc_parser_t *a); mpc_parser_t *mpc_maybe_lift(mpc_parser_t *a, mpc_ctor_t lf);
Returns a parser that runs a
. If a
is successful then it returns the result of a
. If a
is unsuccessful then it succeeds, but returns NULL
(or the result of lf
).
mpc_parser_t *mpc_many(mpc_fold_t f, mpc_parser_t *a);
Runs a
zero or more times until it fails. Results are combined using fold function f
. See the Function Types section for more details.
mpc_parser_t *mpc_many1(mpc_fold_t f, mpc_parser_t *a);
Runs a
one or more times until it fails. Results are combined with fold function f
.
mpc_parser_t *mpc_count(int n, mpc_fold_t f, mpc_parser_t *a, mpc_dtor_t da);
Runs a
exactly n
times. If this fails, any partial results are destructed with da
. If successful results of a
are combined using fold function f
.
mpc_parser_t *mpc_or(int n, ...);
Attempts to run n
parsers in sequence, returning the first one that succeeds. If all fail, returns an error.
mpc_parser_t *mpc_and(int n, mpc_fold_t f, ...);
Attempts to run n
parsers in sequence, returning the fold of the results using fold function f
. First parsers must be specified, followed by destructors for each parser, excluding the final parser. These are used in case of partial success. For example: mpc_and(3, mpcf_strfold, mpc_char('a'), mpc_char('b'), mpc_char('c'), free, free);
would attempt to match 'a'
followed by 'b'
followed by 'c'
, and if successful would concatenate them using mpcf_strfold
. Otherwise would use free
on the partial results.
mpc_parser_t *mpc_predictive(mpc_parser_t *a);
Returns a parser that runs a
with backtracking disabled. This means if a
consumes more than one character, it will not be reverted, even on failure. Turning backtracking off has good performance benefits for grammars which are LL(1)
. These are grammars where the first character completely determines the parse result – such as the decision of parsing either a C identifier, number, or string literal. This option should not be used for non LL(1)
grammars or it will produce incorrect results or crash the parser.
Another way to think of mpc_predictive
is that it can be applied to a parser (for a performance improvement) if either successfully parsing the first character will result in a completely successful parse, or all of the referenced sub-parsers are also LL(1)
.
Function Types
The combinator functions take a number of special function types as function pointers. Here is a short explanation of those types are how they are expected to behave. It is important that these behave correctly otherwise it is easy to introduce memory leaks or crashes into the system.
typedef void(*mpc_dtor_t)(mpc_val_t*);
Given some pointer to a data value it will ensure the memory it points to is freed correctly.
typedef mpc_val_t*(*mpc_ctor_t)(void);
Returns some data value when called. It can be used to create empty versions of data types when certain combinators have no known default value to return. For example it may be used to return a newly allocated empty string.
typedef mpc_val_t*(*mpc_apply_t)(mpc_val_t*); typedef mpc_val_t*(*mpc_apply_to_t)(mpc_val_t*,void*);
This takes in some pointer to data and outputs some new or modified pointer to data, ensuring to free the input data if it is no longer used. The apply_to
variation takes in an extra pointer to some data such as global state.
typedef int(*mpc_check_t)(mpc_val_t**); typedef int(*mpc_check_with_t)(mpc_val_t**,void*);
This takes in some pointer to data and outputs 0 if parsing should stop with an error. Additionally, this may change or free the input data. The check_with
variation takes in an extra pointer to some data such as global state.
typedef mpc_val_t*(*mpc_fold_t)(int,mpc_val_t**);
This takes a list of pointers to data values and must return some combined or folded version of these data values. It must ensure to free any input data that is no longer used once the combination has taken place.
Combinator Method
Using the above combinators we can create a parser that matches a C identifier.
When using the combinators we need to supply a function that says how to combine two char *
.
For this we build a fold function that will concatenate zero or more strings together. For this sake of this tutorial we will write it by hand, but this (as well as many other useful fold functions), are actually included in mpc under the mpcf_*
namespace, such as mpcf_strfold
.
mpc_val_t *strfold(int n, mpc_val_t **xs) { char *x = calloc(1, 1); int i; for (i = 0; i < n; i++) { x = realloc(x, strlen(x) + strlen(xs[i]) + 1); strcat(x, xs[i]); free(xs[i]); } return x; }
We can use this to specify a C identifier, making use of some combinators to say how the basic parsers are combined.
mpc_parser_t *alpha = mpc_or(2, mpc_range('a', 'z'), mpc_range('A', 'Z')); mpc_parser_t *digit = mpc_range('0', '9'); mpc_parser_t *underscore = mpc_char('_'); mpc_parser_t *ident = mpc_and(2, strfold, mpc_or(2, alpha, underscore), mpc_many(strfold, mpc_or(3, alpha, digit, underscore)), free); /* Do Some Parsing... */ mpc_delete(ident);
Notice that previous parsers are used as input to new parsers we construct from the combinators. Note that only the final parser ident
must be deleted. When we input a parser into a combinator we should consider it to be part of the output of that combinator.
Because of this we shouldn’t create a parser and input it into multiple places, or it will be doubly freed.
Regex Method
There is an easier way to do this than the above method. mpc comes with a handy regex function for constructing parsers using regex syntax. We can specify an identifier using a regex pattern as shown below.
mpc_parser_t *ident = mpc_re("[a-zA-Z_][a-zA-Z_0-9]*"); /* Do Some Parsing... */ mpc_delete(ident);
Library Method
Although if we really wanted to create a parser for C identifiers, a function for creating this parser comes included in mpc along with many other common parsers.
mpc_parser_t *ident = mpc_ident(); /* Do Some Parsing... */ mpc_delete(ident);
Building parsers in the above way can have issues with self-reference or cyclic-reference. To overcome this we can separate the construction of parsers into two different steps. Construction and Definition.
mpc_parser_t *mpc_new(const char *name);
This will construct a parser called name
which can then be used as input to others, including itself, without fear of being deleted. Any parser created using mpc_new
is said to be retained. This means it will behave differently to a normal parser when referenced. When deleting a parser that includes a retained parser, the retained parser will not be deleted along with it. To delete a retained parser mpc_delete
must be used on it directly.
A retained parser can then be defined using…
mpc_parser_t *mpc_define(mpc_parser_t *p, mpc_parser_t *a);
This assigns the contents of parser a
to p
, and deletes a
. With this technique parsers can now reference each other, as well as themselves, without trouble.
mpc_parser_t *mpc_undefine(mpc_parser_t *p);
A final step is required. Parsers that reference each other must all be undefined before they are deleted. It is important to do any undefining before deletion. The reason for this is that to delete a parser it must look at each sub-parser that is used by it. If any of these have already been deleted a segfault is unavoidable – even if they were retained beforehand.
void mpc_cleanup(int n, ...);
To ease the task of undefining and then deleting parsers mpc_cleanup
can be used. It takes n
parsers as input, and undefines them all, before deleting them all.
mpc_parser_t *mpc_copy(mpc_parser_t *a);
This function makes a copy of a parser a
. This can be useful when you want to
use a parser as input for some other parsers multiple times without retaining
it.
mpc_parser_t *mpc_re(const char *re); mpc_parser_t *mpc_re_mode(const char *re, int mode);
This function takes as input the regular expression re
and builds a parser
for it. With the mpc_re_mode
function optional mode flags can also be given.
Available flags are MPC_RE_MULTILINE
/ MPC_RE_M
where the start of input
character ^
also matches the beginning of new lines and the end of input $
character also matches new lines, and MPC_RE_DOTALL
/ MPC_RE_S
where the
any character token .
also matches newlines (by default it doesn’t).
Common Parsers
mpc_soi |
Matches only the start of input, returns NULL |
mpc_eoi |
Matches only the end of input, returns NULL |
mpc_boundary |
Matches only the boundary between words, returns NULL |
mpc_boundary_newline |
Matches the start of a new line, returns NULL |
mpc_whitespace |
Matches any whitespace character " fnrtv" |
mpc_whitespaces |
Matches zero or more whitespace characters |
mpc_blank |
Matches whitespaces and frees the result, returns NULL |
mpc_newline |
Matches 'n' |
mpc_tab |
Matches 't' |
mpc_escape |
Matches a backslash followed by any character |
mpc_digit |
Matches any character in the range '0' – '9' |
mpc_hexdigit |
Matches any character in the range '0 – '9' as well as 'A' – 'F' and 'a' – 'f' |
mpc_octdigit |
Matches any character in the range '0' – '7' |
mpc_digits |
Matches one or more digit |
mpc_hexdigits |
Matches one or more hexdigit |
mpc_octdigits |
Matches one or more octdigit |
mpc_lower |
Matches any lower case character |
mpc_upper |
Matches any upper case character |
mpc_alpha |
Matches any alphabet character |
mpc_underscore |
Matches '_' |
mpc_alphanum |
Matches any alphabet character, underscore or digit |
mpc_int |
Matches digits and returns an int* |
mpc_hex |
Matches hexdigits and returns an int* |
mpc_oct |
Matches octdigits and returns an int* |
mpc_number |
Matches mpc_int , mpc_hex or mpc_oct |
mpc_real |
Matches some floating point number as a string |
mpc_float |
Matches some floating point number and returns a float* |
mpc_char_lit |
Matches some character literal surrounded by ' |
mpc_string_lit |
Matches some string literal surrounded by " |
mpc_regex_lit |
Matches some regex literal surrounded by / |
mpc_ident |
Matches a C style identifier |
Useful Parsers
mpc_startswith(mpc_parser_t *a); |
Matches the start of input followed by a |
mpc_endswith(mpc_parser_t *a, mpc_dtor_t da); |
Matches a followed by the end of input |
mpc_whole(mpc_parser_t *a, mpc_dtor_t da); |
Matches the start of input, a , and the end of input |
mpc_stripl(mpc_parser_t *a); |
Matches a first consuming any whitespace to the left |
mpc_stripr(mpc_parser_t *a); |
Matches a then consumes any whitespace to the right |
mpc_strip(mpc_parser_t *a); |
Matches a consuming any surrounding whitespace |
mpc_tok(mpc_parser_t *a); |
Matches a and consumes any trailing whitespace |
mpc_sym(const char *s); |
Matches string s and consumes any trailing whitespace |
mpc_total(m |