2. The Lexi Specification

  1. 2.1. Lexical Conventions
  2. 2.2. Comments
  3. 2.3. Identifiers
  4. 2.4. Function Calls
  5. 2.5. Character Sets and Sequences

There are several concepts common to multiple commands.

2.1. Lexical Conventions

The following productions are used throughout this document:

either-identifier   := sid-identifier | function-identifier | literal-identifier;
sid-identifier      := "$" + identifier ;
literal-identifier  := identifier ;
function-identifier := identifier + "(" + ")" ;

And for brevity , the following groups are defined:

GROUP alpha    = {A-Z} + {a-z} + "_" ;
GROUP digit    = {0-9} ;

2.2. Comments

Lexi permits comments in lexical analyser specifications. These are C style comments, opening /* and closing */. They may not nest, and they must close. Since version 2.0, Lexi supports // style comments.specifications. These are C++ style comments, opening // and extending to the end of the line. For example:

/* This is a comment */
/*
 * And so is this.
 */
// And this is a C++ style comment too

2.3. Identifiers

An identifier is a sequence of alphanumeric characters including the underscore beginning by an underscore or an alphabetic character. A $sid; identifier may also contain (but not begin with) a hyphen.

Literal identifiers have no prefix symbol; their names are used verbatim. They may be used for defining groups, keywords and tokens. They may also be used for conditional statements and for the names of functions to be called.

Sid identifiers are prefixed $. They may contain a hyphen. Since hyphens are not valid preprocessor or variable identifiers for C, the hyphens are substituted with _H. So for example the identifier $a-b would become a_Hb in the generated code, prefixed with -t to become lex_a_Hb.

Sid identifiers may be used for defining keywords and tokens only.

2.4. Function Calls

The parameters specified to functions are stated as empty (that is, an open bracket immediately followed by a close bracket).

Do not confuse function names with Sid Identifiers. For instance, $get-comment() is not a legal function.

It is rare that the character values read by a token would need to be identified by a function reading these characters. However, they are passed as arguments so that the characters are available. This is a convenience so that two similar tokens may share the same function:

TOKEN "/*" -> get_comment () ;
TOKEN "//"  -> get_comment () ;

In the above example the function is called as either get_comment('/', '*') or as get_comment('/', '/'), depending on which token was matched. In this way, get_comment() can decide which type of comment it is expected to retrieve, and therefore handle the differences appropiately.

The parameters passed to the function are the characters read to match the token, as separate int variables. So for the above example, get_comment() would be declared as:

void get_comment(int c0, int c1);

Calling a function from two tokens of different lengths is undefined behaviour.

2.5. Character Sets and Sequences

Sets and sequences of characters are used by all Lexi commands. They share the same syntax, but have different meanings: a set is an unordered intersection of all characters specified, whereas a sequence is a sequence which must be entirely present in the order given, as per a string comparison. The syntax is:

chars       := item + { "+" item } * ;
item        := string | range ;
range       := "{0-9}" | "{a-z}" | "{A-Z}" ;
string      := <"> + { CHAR | escapedchar | group } * + <"> ;
escapedchar := "\" + ( "n" | "t"  ) ;
group       := "[" + GROUPNAME + "]" ;

There are three available pre-defined ranges. Note that these ranges are fixed, and that start points or end points other than those given are invalid. These ranges are:

RangeContents
{0-9}The digits 0 to 9.
{a-z}The lower case characters a to z.
{A-Z}The upper case characters a to z.
Table 1. Predefined Character Ranges

Ranges are inclusive. The numeric values of the characters includes in ranges are of the character set of the C implementation used to compile Lexi (it is reasonable to assume this is ASCII). Note that the range production always represents a set (as opposed to a sequence), even when used as part of a TOKEN command.

Strings are sequences of ASCII characters delimited by double-quotes (").

The following special characters may be written by backslash escapes:

EscapeASCII ValueName
\t0x09Horizontal tab
\n0x0ANew line
\v0x0BVertical tab
\f0x0CNew page (form feed)
\r0x0DCarriage return
\"0x22Double quote
\[0x5BLeft bracket
\\0x5CBackslash
\eEnd of file (EOF)
Table 1. Escaped Characters

For example, a string specifying the character a, a newline, the character b, a backslash, a horizontal tab, and the character c is:

"a\nb\\\tc"

Strings must contain at least one character, or one character coded for by an escape sequence. Empty strings are not permitted.

These escape sequence may be used in any string. Note that "[" requires escaping to distinguish from the delimiter for a group name (see below).

A set of characters defined by a GROUP command may be included into a character set by specifying the group name between [ and ] within a string. For example:

"[xyz]"
"abc[def]hij[klm]nmo"

These include the characters defined by the xyz group to the first string, and from the def and klm groups to the second string. It makes no difference if the string is a set or a sequence; the groups included are always treated as a set, even though the rest of the string may be a sequence.

The group to be included must already have been defined at the point where it is referenced. Note that this does not permit circular references.

See the commands below for further examples of character sets and sequences

Strings and ranges may be concatenated by the plus operator. For example:

"abc" + "d" + "efg" + {0-9}

is equivalent to the sequence:

"abcdefg[digit]"

(where the group digit is suitably defined), and to the set:

"abcdefg0123456789"