5. Interface

  5.1. Instantiation
  5.2. Terminals
  5.3. Functions
  5.4. Character Groups
  5.5. Keywords

Lexi generates and outputs C89 code implementing the specified lexical analyser. This code is intended to be linked as part of a larger C program by the user.

5.1. Instantiation

Input to the generated lexical analyser is by way of a single function (or macro, if desired) that the user is expected to provide to read a character:

int read_char(void);

Output from the lexical analyser is by way of a function provided by Lexi that returns the next token to the user's program:

int read_token(void);

This is the user's primary interface to the lexical analyser. Note that the type of characters is int, so that EOF may be expressed if necessary.

Calling read_token() will result in the lexical analyser making as many calls to read_char() as are necessary.
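
As an illustration of the user's side of this interface, the sketch below supplies read_char() over an in-memory string, together with a one-character pushback of the kind used in §5.3. The names input, pos, pushback and set_input() are assumptions made for this example; only the read_char() signature comes from the text above.

```c
#include <stdio.h>   /* EOF, size_t */

/* Hypothetical input state for this sketch; not part of Lexi. */
static const char *input = "";
static size_t pos;
static int pushback = EOF;   /* holds at most one pushed-back character */

/* Hypothetical helper: point the reader at a new string. */
void set_input(const char *s) {
	input = s;
	pos = 0;
	pushback = EOF;
}

/* User-provided: return the next character, or EOF at end of input. */
int read_char(void) {
	int c;

	if (pushback != EOF) {
		c = pushback;
		pushback = EOF;
		return c;
	}

	if (input[pos] == '\0') {
		return EOF;
	}

	return (unsigned char) input[pos++];
}

/* User-provided: push one character back; pushing EOF has no effect. */
void unread_char(int c) {
	pushback = c;
}
```

A program reading from a file could instead wrap getc() and ungetc() from <stdio.h> in the same two functions.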

5.2. Terminals

Lexi does not define the values of terminals in the generated code; these are expected to be supplied by the user (most commonly from a parser generator such as Sid). For example, a token defined by:

TOKEN "--" -> $dash ;

would return the value of lex_dash from read_token() when matched. The prefix of these identifier names may be specified with the -l option. See the Sid documentation for further discussion of the C representation of Sid's terminals.
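
For a program that does not use Sid, the terminal values might be supplied by hand. The following is a minimal sketch under that assumption, with the lex_ prefix shown in this section; the enumeration itself and the helper token_name() are hypothetical and are not produced by Lexi.

```c
/* Hypothetical hand-written terminal values; Sid would normally supply these. */
enum {
	lex_eof,
	lex_dash,
	lex_identifier
};

/* Hypothetical helper for diagnostics: name a value returned by read_token(). */
const char *token_name(int t) {
	switch (t) {
	case lex_eof:        return "eof";
	case lex_dash:       return "dash";
	case lex_identifier: return "identifier";
	default:             return "unknown";
	}
}
```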

5.3. Functions

Within the C implementation of functions, the usual Lexi API functions may be called, for example read_char(). This is especially useful for calling the macros that test membership in groups (see §5.4). A common case is reading tokens of variable length, such as identifiers. For example (where unread_char() is a user-defined function that pushes a character back onto the input):

GROUP identstart = {a-z} + {A-Z} + "_";
GROUP identbody = "[identstart]" + {0-9} + "-";
TOKEN "[identstart]" -> read_identifier();

int read_identifier(int c) {
	for (;;) {
		if (c == EOF) {
			return lex_eof;
		}

		if (!is_identbody(lookup_char(c))) {
			unread_char(c);
			return lex_identifier;
		}

		/* store character here */

		c = read_char();
	}
}

Functions called by tokens are passed each character forming the token. The example above would result in the call:

read_identifier(c);

where c is the content of the token (that is, the character matched by "[identstart]"). For multiple characters, each character is passed as a separate argument:

TOKEN "abc" -> f();

would result in the call:

f('a', 'b', 'c');

See §5 for further details of the C interface the generated lexer will call.

Note that it is undefined behaviour to have tokens of different lengths call the same function, since that function would then be called with differing numbers of arguments.
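
The read_identifier() example earlier in this section leaves the storage of characters open. One possible approach, sketched here, is a fixed-size buffer. Everything other than read_identifier() is a stand-in so that the sketch compiles on its own: in a real program read_char() and unread_char() are the user's, is_identbody() and lookup_char() come from the generated code, and lex_eof and lex_identifier come from the terminal definitions. The names token_buf, src and pending, the buffer size, and the truncation policy are assumptions.

```c
#include <stdio.h>   /* EOF, size_t */

/* Stand-ins for names normally supplied elsewhere (see the text above). */
enum { lex_eof, lex_identifier };
static const char *src = "";
static int pending = EOF;

static int read_char(void) {
	int c;
	if (pending != EOF) { c = pending; pending = EOF; return c; }
	return *src != '\0' ? (unsigned char) *src++ : EOF;
}
static void unread_char(int c) { pending = c; }
static int lookup_char(int c) { return c; }
static int is_identbody(int c) {
	/* the letter range checks assume an ASCII-compatible character set */
	return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
	    || (c >= '0' && c <= '9') || c == '_' || c == '-';
}

/* Hypothetical token buffer; longer identifiers are silently truncated. */
static char token_buf[64];

int read_identifier(int c) {
	size_t n = 0;

	for (;;) {
		if (c == EOF) {
			return lex_eof;
		}

		if (!is_identbody(lookup_char(c))) {
			unread_char(c);
			token_buf[n] = '\0';
			return lex_identifier;
		}

		/* store the character, leaving room for the terminator */
		if (n < sizeof token_buf - 1) {
			token_buf[n++] = (char) c;
		}

		c = read_char();
	}
}
```

A caller would read the spelling of the identifier from token_buf after read_token() returns lex_identifier. A growable buffer would avoid the truncation at the cost of managing allocation.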

5.4. Character Groups

Should the user wish to check if a character is in a group, the generated code provides macros of the form is_groupname(). These are intended to be used as:

is_digit(lookup_char(c))

assuming a group named digit is defined. See §3.2 and §3.3 for further details on group names.
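
To make the calling convention concrete, the sketch below uses hand-written stand-ins for lookup_char() and is_digit(), since the real ones are emitted by Lexi. The table-and-bitmask layout is an assumption chosen for illustration and is not claimed to match the generated code; group_digit, lookup_tab, init_groups() and count_digits() are hypothetical names.

```c
#include <limits.h>  /* UCHAR_MAX */

/* Stand-ins for generated code: one bit per group, looked up per character. */
enum { group_digit = 1 << 0 };
static unsigned char lookup_tab[UCHAR_MAX + 1];

#define lookup_char(c) (lookup_tab[(unsigned char) (c)])
#define is_digit(t)    (((t) & group_digit) != 0)

/* Hypothetical setup; generated code would carry a prebuilt table instead. */
void init_groups(void) {
	int c;

	for (c = '0'; c <= '9'; c++) {
		lookup_tab[c] = group_digit;
	}
}

/* Hypothetical caller: count the characters of s that are in the digit group. */
int count_digits(const char *s) {
	int n = 0;

	for (; *s != '\0'; s++) {
		if (is_digit(lookup_char(*s))) {
			n++;
		}
	}

	return n;
}
```

The point of the two-step form is that the character is looked up once and the result can then be tested against several groups.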

5.5. Keywords

Neither the keyword calls output by -k nor the lexical analyser itself depends on any headers beyond those required by the user's own code.