The sid grammar file

3. The sid grammar file

3.1. Lexical conventions
3.2. The type declaration section
3.3. The terminal declaration section
3.4. The rule definition section
3.5. The grammar entry points section

The sid grammar file should always be the first input file specified on the command line. It provides an output language independent description of the language to be recognised. The file is split up into a number of different sections, each of which begins with a section header. All sections must be included, although it is possible to leave most of them empty. The following sections exist at present: type declaration, terminal declaration, rule definition, and grammar entry points. They must appear in that order. The sections are detailed below, after the lexical conventions.

3.1. Lexical conventions

Almost all non-printable characters (but definitely space, tab, newline and form-feed) are considered to be white space, and are ignored other than to separate other tokens. In addition, two comment forms are recognised: the C comment syntax, where comments begin with /*, and end with */, although unlike C these comments nest properly; and the C++ line comment syntax, where comments begin with //, and are terminated at the end of the line. Comments are treated in the same way as white space characters.

Section headers are enclosed in percent characters, and are case insensitive. The following section headers are currently recognised:

%types%
%terminals%
%productions%
%entry%

Identifiers must begin with a letter, an underscore or a hyphen. This may be followed by zero or more letters, digits, underscores and hyphens. Identifiers are case sensitive. The following are all legal identifiers:

expression mult-expr plus_expr expr-1

Identifiers are split into two namespaces: local names have one space; types, actions, rules, non-local names and terminals share the other namespace, so it is not possible to have an identifier that is a type as well as being a rule for example.

The following symbols are also used:

& ; = [ : ] ! , || $ ? { } ( ) < > -> :: ##

All other characters are illegal.

3.2. The type declaration section

The first section is the type declaration section. This begins with the types section header, followed by the names of all of the types used by the grammar. Each type name should be terminated by a semicolon. An example of the type declaration section follows:

%types%
NumberT ;
StringT ;

This declares two types: NumberT and StringT. There is no requirement for the type names to resemble names in the target language (in fact this should be avoided, as it is possible to use many different target languages). All types used in the grammar must be declared here. Similarly, all types declared here must be used in the grammar.

Types may be prefixed with ! if they are not to be used in the grammar. Such ignored types may however be present in ignored terminals and actions. This is intended to facilitate the use of types which are specific to actions only used by one grammar, where the actions are shared between several grammars.

3.3. The terminal declaration section

After the type declaration section comes the terminal declaration section. This section declares the terminals that will be used by the grammar. A terminal is a recogniser for a symbol in the input alphabet of the language to be recognised. It is possible to declare terminals that are not used by the grammar.

The section begins with its section header, followed by the declarations of the terminals. Each terminal declaration begins with the name of the terminal being defined, followed by its type, and terminated by a semicolon. If the terminal is not used in the grammar, the declaration should be preceded by a ! symbol.

A type (for terminals, rules and actions) is written as a colon, followed by the parameter tuple, followed by the -> symbol, followed by the result tuple. If the type is zero-tuple to zero-tuple, then the type may be omitted. A tuple consists of a comma separated sequence of name-type pairs (with the name and type being separated by a colon), surrounded by parentheses. For parameter tuples, the type may be suffixed by a & symbol, indicating that call by reference should be used (the default is call by copy). For declarations, the names should be omitted. For terminals, the parameter type must be the zero-tuple.

The simplest type of terminal declaration is as follows:

terminal1 ;

This means the same as:

terminal1 : () -> () ;

An example of a more complex terminal declaration is:

terminal2 : () -> ( :StringT ) ;

If these terminals were not to be used in the grammar, they should be declared as:

!terminal1 ;
!terminal2 : () -> ( :StringT ) ;

Terminals may optionally be quoted; this permits the use of non-alphanumeric characters, which is helpful if they don't have convenient names. It can also help make the grammar a little clearer to read, if the associated lexer is not to hand. For example:

!"->" ;
"*" ;
"{" : () -> ( :BraceT ) ;

As with other symbols, these are transformed in the generated parser according to the respective language-specific rules for producing legal identifiers in that language.

3.4. The rule definition section

The rule definition section follows the terminal declaration section. It begins with the section header, followed by the definitions and declarations of all of the rules used in the grammar, and the declarations of all of the actions used in the grammar.

Rule declarations look identical to terminal declarations, e.g.:

rule1 : ( :NumberT ) -> ( :NumberListT ) ;
rule2 ;

Action declarations are similar, although the names are surrounded by angle brackets, e.g.:

<action1> : ( :StringT & ) -> () ;
<action2> ;

Action declarations may be prefixed with ! if they are not to be used in the grammar. This is particularly useful for several grammars which share the same actions, where not all of the grammars make use of the actions provided.

A declaration (or a definition) may be prefixed with the :: symbol. This symbol forces the definition into the outermost scope. Scopes are described later on.

A rule definition (called a production) looks something like the following:

add-expr : () -> ( v : NumberT ) = {
	v1 = mul-expr ;
	plus ;
	v2 = expr ;
	v	= <add> ( v1, v2 ) ;
||
	v1 = mul-expr ;
	minus ;
	v2 = expr ;
	v	= <subtract> ( v1, v2 ) ;
##
	v = <ex> ;
} ;

The production begins with the rule name, followed by the parameter and result names and types (in this case, the rule name is add-expr, there are no parameters, and there is one result name v of type NumberT). This may optionally be followed by local declarations (there are none here - they are described later).

The left hand side of the rule is followed by the = symbol, a list of alternatives surrounded by curly braces, and is terminated by a semicolon. The alternatives are separated by the || symbol, and the last alternative may be separated from its predecessor (there must be one) using the ## symbol; if this is the case, then this alternative is the exception handler for the production (otherwise it has no exception handler).

An alternative may match the empty string, using the symbol $ and the terminator symbol ;, i.e.:

$ ;

unless it is an exception handler alternative (in which case it must do something), or a sequence of items. The empty string is only valid if the production has no results. If you want to match an empty string in a production that has a result, it is necessary to use an action (or identity) to provide a result.

An item is an identity, or a call to a (possibly anonymous) rule, a terminal, an action or a predicate. An identity looks like an assignment in conventional programming languages:

( a, b, c ) = ( d, e, f ) ;

Each tuple must contain the same number of names. In the case of a one-tuple, the parentheses may be dropped, e.g.:

a = d ;

Note that this is a binding operation, not an assignment. Each name on the left hand side must be a new name. It is illegal to redeclare a name that is already in scope. It is possible to assign to a name which is already in scope by prefixing that name with the & symbol, e.g.:

( a, &b, c ) = ( d, e, f ) ;

would assign to the name b, which must have been previously defined (it may be a parameter; if it is a call by reference parameter, then the change will propagate outside to the calling rule).

It is also possible to use the ! symbol in the result tuple to ignore results, e.g.:

( a, !, b, ! ) = ( c, d, e, f ) ;

This is not particularly useful in an identity, but may be more useful in a call to a rule, terminal or action. A call to a terminal or rule looks like a call to a function in a conventional programming language, e.g.:

( a, b ) = rule1 ( c, d ) ;
( e, f ) = terminal1 () ;

Calls to actions have the same form, except that action names are surrounded by angle brackets, e.g.:

( g, h, i ) = <action1> ( a, e ) ;

In addition, one of the names in the result tuple of the call to the action may be the predicate result symbol ?, in which case the action is used as a predicate (more details on predicates are given later).

When calling a rule, terminal or action, it is necessary to have declared it (or in the case of a rule declared or defined it) before the call.

If a rule or action is being invoked, and it takes one or more call by reference parameters, then the corresponding arguments should be prefixed by the & symbol, e.g.:

length = <string-length> ( &string ) ;

If the rule, terminal or action has the zero-tuple as a result, then only the right hand side of the definition is required, e.g.:

rule2 ( a, b ) ;
terminal2 () ;
<action2> ( c, d ) ;

If the rule, terminal or action has the zero-tuple as a parameter, then the parameter tuple may be omitted, e.g.:

( a, b ) = rule3 ;
terminal3 ;
c = <action3> ;

In older versions of sid, it used to be possible to have ambiguous items, e.g.:

a = b ;

where b was both a rule and a name. As local names may not shadow non-local and global names, then this is no longer a problem.

In each case, the data flow through the rule is indicated using names. In the previous example of a production, both alternatives have the same data flow: the call to mul-expr returns one value, which is stored in the name v1, and the call to expr returns one value, which is stored in the name v2. Both of these names are passed to the action (add in the first alternative, subtract in the second), which returns a value into the name v (presumably the sum or difference of the previous two values). The name v is the name of the result, so this will be returned as the result of the rule. The exception handler (which is invoked if something fails) calls the action ex to produce the result v.

It is necessary that the types of the data flow through the production are correct. It is also necessary to define all of the result names for the production in each of the alternatives in that production.

An anonymous rule is written in the same way as the body of a normal rule, e.g.:

list : () -> ( l : ListT ) = {
	n = number ;

	/* START ANONYMOUS RULE */ {
		? = <at-eof> ;
		l = <make-list> ( n ) ;
	||
		comma ;
		l1 = list ;
		l	= <cons> ( n, l1 ) ;
	##
		l = <com-or-eof> ( n ) ;
	} ; /* END ANONYMOUS RULE */
} ;

An anonymous rule is always inlined.

The rule name may be followed by a sequence of definitions and declarations surrounded by the [ and ] symbols (which are followed by the rest of the rule). In this case, the definitions are local to the rule, e.g.:

x-list [
	x = {
		terminal1 ;
		terminal2 ;
	||
		terminal3 ;
	} ;
] = {
	x ;
||
	x ;
	x-list ;
} ;

In this case, the rule x would be local to the rule x-list and no other rule would be able to use it. In error messages, the name of the rule x would be written as x-list::x. All declarations and definitions that occur inside of the [ and ] symbols have the scope of the enclosing rule, unless they are preceded by the :: symbol, in which case they have global scope. This is particularly necessary for actions, as actions can only be defined with global scope.

It is also possible to define non-local names. These are declared as an identifier (the name), followed by the : symbol, followed by another identifier (its type), in a similar manner to an entry in a type tuple. Non-local names are not allowed at the outermost level (so they may not be prefixed with the :: symbol either). When a non-local name is defined, it is in scope for all of the rules in its scope that are defined after it is, plus its defining rule.

Non-local names have their values saved on entry to their defining rule, and the value will be restored when the rule is exited. This includes exiting the rule tail recursively or because of an exception (if the rule has an exception handler, the non-local name will not be restored until the exception handler has exited). In almost all other respects non-local names are the same as local names. An example follows:

rule1 [
	name1 : Type1T ;
	rule1_1 = {
		<action1> ( name1 ) ;
		rule2 ( name1 ) ;
	} ;
] = {
	<action2> ( &name1 ) ;
	rule1_1 ;
} ;

It is possible to associate an initialiser with a non-local name, by following the type name with a = symbol and the action name in angle brackets, e.g.:

rule1 [
	name1 : Type1T = <action1> ;
] = {
	// ....
} ;

In this case the action is called at the start of the rule to initialise the non-local name. The action should return an object of the same type as the non-local name. Normally, the action takes no parameters, however it may take one parameter of the same type as the non-local name (or a reference to that type), in which case it will be given the saved value of the non-local name as an argument (this may be used to build a stack automatically for example).

3.5. The grammar entry points section

The final section lists the entry points to the grammar. It begins with the section header, followed by a comma separated list of rule names, terminated by a semicolon, e.g.:

%entry% expr ;

If you are going to use a rule as an entry point into the grammar (i.e. you wish to call the associated function), you must list it in the entry points list. If not, the function may not exist.