Advice on writing parsers with sid

B. Advice on writing parsers with sid

B.1. Handling EOF
B.2. Adding common mistakes to the grammar
B.3. Throwing exceptions inside exception handlers
B.4. Continuing from an exception handler
B.5. Manifesting semantic checks as syntax errors
B.6. Implementing “panic mode”
B.7. Duplicating token values

This appendix lists a few points of advice for writing parsers using sid. This list is intended to cover a few of the areas which might be confusing (particularly to somebody coming from a background with another parser generator), and an explanation of generally what is considered good technique when using sid. Hopefully this list will expand over time.

In general, the examples included with sid are designed to help orientate the reader and illustrate common practises. Also look at the implementation of sid itself (both its .sid file and .act file parsers are written using itself), which demonstrates more complex uses. There are also several other programs using sid throughout the TenDRA repository which may be of interest, not least of which is the compiler itself.

B.1. Handling EOF

Given the action:

<is-eof>: () -> (eof:Boolean) = @{
	@eof = (@. == @$EOF);
@};

used in a grammar as a predicate, we can have an empty alternative which is only reached during EOF. This can be used as a “get-out” clause to end recursion; for example, in a shell-like language, the entry point may be expressed as:

script = {
	? = <is-eof>;
||
	list-of-statements;
};

%entry% script;

where list-of-statements is the usual recursive structure for expressing lists. Since predicates used in alternatives only permit parsing to continue if they return true, the <is-eof> alternative is only parsed if EOF has been reached. Otherwise, the predicate raises an exception.

Usually EOF would be considered an error if encountered during a more complex rule, and hence would not appear elsewhere in the grammar. Similarly a parser for a file format would expect an eof token at the end of its input stream. sid itself is a good example for this typical requirement; the last rule for its entry point reads simply:

{
	eof;
##
	<expected-eof>;
};

where <expected-eof> produces an error.

B.2. Adding common mistakes to the grammar

For a given language there is often a well-known set of common mistakes people make. One approach to producing more useful error messages is to add these well-known mistakes as rules in the grammar, so that these specific cases may be identified. (These production would of course go on to call actions that raise an error).

This technique is particularly appropriate for errors which are not simply “missing” items. Instead, those are best handled with empty alternatives, such as:

{
	semicolon;
##
	<err-expected-semicolon>;
}

The approach of encoding common mistakes into the grammar as productions also lends itself particularly well to warnings; for example in C a missing semicolon after a struct at the end of a header, or the semicolon in for (...); { ... }.

B.3. Throwing exceptions inside exception handlers

Error handling may be centralised to a higher-level point whilst keeping the specific errors which caused the parse to fail local to their respective areas. This allows for specific error messages without having to repeat the same handling mechanisms. One way to achieve this is to have local error handlers call actions which raise a new exception (by calling the @! command). This newly-raised exception may then be caught by an error handler further up the grammar.

B.4. Continuing from an exception handler

An exception handler in sid may contain rules just as any other alternative does, so that the parse may continue after an error. This is helpful for having the parser perform semantic checks for non-syntax errors such as overflows or type mismatches:

variable-declaration = {
	v = variable; equals; l = literal;
	<instantiate-variable>(v, l);
};

statement = {
	variable-declaration;
||
	...
};

list-of-statements = {
	statement;
||
	statement;
	semicolon;
	list-of-statements;
##
	list-of-statements;
};

Here the <instantiate-variable> action may raise an exception (say, that the literal used to instantiate the variable is of an incompatible type); this exception bubbles up in the grammar to the list-of-statements rule. Since the error is relatively minor (especially as it is syntactically valid), rather than abort at this point it may be more convenient to the user to continue parsing in order to identify any other possible errors later on in the input.

In the example above, the most sensible approach seems to be to recur into the list-of-statements rule when such errors are encountered. In this case, this permits additional valid lists of statements to follow.

It is also especially good for interpreters, where exiting the program would be inappropriate. In the case of the calculator example, which reads and executes expressions, after a user has entered something invalid the parser is recurred through an error handler all of the time for the remainder of execution.

B.5. Manifesting semantic checks as syntax errors

Often grammars express constructs which are syntactically identical, but who's validity depends on other context. For example, a language offering declarations of several optional items may consider re-declaring the same item as an error. One approach is to simply list these items in order in the grammar, and make each optional (as sid's grammar does for the %types%, %terminals% etc sections).

Another approach would be to consider the order of these irrelevant, as long as each item is present once at most. Hence in the following example, A and/or B may be present, in any order, but it is a syntax error to define either A or B twice:

rule = {
	? = <a-not-defined>;
	A;
||
	? = <b-not-defined>;
	B;
##
	rule;	/* TODO check this: to handle predicates raising on false */
};

The predicates responsible for the semantic checks direct the syntax accordingly, altering what is henceforth considered to be a valid parse. A following redefinition will be a syntax error according to the permissible rules in the grammar (now excluding the predicates' alternatives). It (perhaps) seems natural to present these redefinitions as syntax errors to the user.

B.6. Implementing “panic mode”

When implementing parsers for compilers, it is often more useful to the user to try to continue a little way rather than exit on the first error encountered. This can help show unrelated errors, so that the user may address those whilst debugging and not resort to tediously dealing with a single error for each compilation attempt.

The traditional scheme for providing this is to implement a “panic mode” for parsing which returns to a high-level parsing rule when a syntax or lexical error is detected. Commonly it is sensible to skip the erroneous region and re-synchronise the input stream at a statement rule (or whatever suits the grammar).

After a syntax error has been encountered, panic mode is provided for by skipping over tokens until a statement separator (commonly a semicolon) is found ^[a] and tokens are discarded until the separator is reached. The parse may then continue. This is not perfect, but often helpful enough to the user.

The exception handling mechanism of sid provides a means to change the parse for error situations (without resorting to non-local goto commands such as longjmp for C); this exception handle may be entered from any of the children of the given rule:

rule = {
	...
##
	<panic>;
}

In sid, panic mode is quite naturally implemented by providing an action which performs this token-skipping:

<panic>: () -> () = @{
	while (@. != @$SEMI && @. != @$EOF) {
		/* Advance to the next token */
		@>;
	}
@};

For larger languages, it may be sensible to skip until more appropriate tokens depending which point in the grammar has failed to parse. This works nicely with exception handlers strategically placed throughout the grammar, each calling an action to skip to an appropriate token. The parser for sid's .sid files serve as a good example for this.

B.7. Duplicating token values

Often a parser needs to retrieve the values lexed when producing a token - to obtain the spelling for identifiers, for example. Consider the following fragment:

rule = {
	w = word;
	command = <duplicate>(w);
	empty-list = <empty-list>;
	argument-list = list-of-words(empty-list);
	<do-execute>(execute, command, argument-list);
}

Here the terminal word produces a value (the spelling for a word, as stored by the lexer), which is passed to the <duplicate> action:

<duplicate>: (in :STRING) -> (out :STRING) = @{
	@out = strdup(@in);
	if (@out == NULL) {
		perror("strdup");
		exit(EXIT_FAILURE);
	}
@};

So the value from the token buffer which stored this spelling is duplicated in memory such that the buffer may be reused for the next token. Or at least, that was the intention.

As it turns out, what actually happens is that the one-token look-ahead for the LL(1) grammar causes the parser to advance one token before executing the <duplicate> action. So when <duplicate> is called, a new token has already been read into the buffer. This gives the effect of these tokens appearing to be “along one”, as each value is one ahead of where it ought to be.

To avoid this, we can have the word terminal perform this duplication itself, and instead simply call:

w = word;
empty-list = <empty-list>;
argument-list = list-of-words(empty-list);
<do-execute>(execute, w, argument-list);

This also reads more naturally as the need for a separate <duplicate> action is removed.

^[a]
Apparently this is why so many languages use semicolons.