The Lexi Users' Guide

  1. i. Introduction
  2. 1. Lexical Analysis
    1. 1.1. Overview
    2. 1.2. Concepts Lexi Expresses
  3. 2. The Lexi Specification
    1. 2.1. Lexical Conventions
    2. 2.2. Comments
    3. 2.3. Identifiers
    4. 2.4. Function Calls
    5. 2.5. Character Sets and Sequences
  4. 3. Commands
    1. 3.1. Pre-pass Mappings
    2. 3.2. Character Group Definitions
    3. 3.3. Whitespace Definition
    4. 3.4. Token Definitions
    5. 3.5. Default definition
    6. 3.6. Keyword definitions
    7. 3.7. Zone Definitions
    8. 3.8. Type Definitions
    9. 3.9. Action prototypes
  5. 4. The lexi action file
    1. 4.1. Introduction
    2. 4.2. The header and trailer statements
    3. 4.3. Action definition statement
  6. 5. Interface
    1. 5.1. Instantiation
    2. 5.2. Terminals
    3. 5.3. Functions
    4. 5.4. Character Groups
    5. 5.5. Keywords
  7. A. Appendix
    1. A.1. Undefined Behaviour
    2. A.2. Obscure Features

First published .

Revision History

kate

Combined C99 and C90 code generation. The applicable API is now selected by `#if` on `__STDC_VERSION__`.

kate

Removed -a; I can't think of any reason why you might want to generate standard C code eliding assertions, as opposed to compiling with `-DNDEBUG`. So now assertions are always generated.

kate

The dependency on libexds is now resolved internally for convenience of building.

kévin

Removed ##.

kévin

Implemented group merge.

kévin

Added ARGUMENT name:ctype;

kate

EOF is no longer permitted in groups.

Empty groups are not present in the lookup table.

Groups may now contain character \0.

Group inversion is supported for including subsets: GROUP a = "[^b]";

kate

Removed legacy support for tokens mapping to functions (and in particular, literally-embedded fragments of source code specified as string literals for function arguments). These were written in the form:

TOKEN "abc" -> f(...);

with various arguments permitted inside the parentheses. These have been superseded by SID-like actions defined in the .lct file, which look like:

TOKEN "abc" -> <f>(...);

These provide a language-agnostic manner to achieve the same effect, keeping all language-specific portions in the .lct file.

Any existing .lxi files ought to be updated to use these with an accompanying .lct file, instead.

kévin

Added the “test” language.

kate

The .lct headers are now output after the generated lexer's header is included. This is a convenience to permit the .lct headers to make use of symbols from the generated header (notably LEXI_EOF in static support functions, for example).

kate

Added a new command line option, -i, for specifying a prefix for user-supplied interface functions separately from the -p prefix. Currently the only such user-supplied interfaces are *getchar() and *unknown_token.

The rationale for introducing this option is to permit these interfaces to be shared from a common source whilst keeping the -p prefixes separate, as would be the case for a .lct file shared between many lexers each with their own .lxi specification.

kévin

Added .lct actions files. These are an analogue to sid's .act files.

Four predefined types have been added: TERMINAL, INTEGER, STRING and CHARACTER. Scopes (instructions_list) can now retain local names with their types.

kévin

Added [...) range markers for zones. This is mostly useful for identifiers.

kate

Added Graphviz “Dot” output.

kate

lexi_group() now returns a bool for C99.

kate

Added -a; assertions are now disabled by default.

kate

Added C99 as an output language.

kate

Unknown tokens are now provided by lexi_unknown_token, which forms part of the generated API. The character passed is removed, as it serves little purpose.

kate

LEX_EOF is now known as LEXI_EOF.

kate

Moved the token buffer into lexi_state. This introduces state to the buffer-manipulation API, which breaks backwards compatibility.

kate

The state for zones is now maintained as part of a slightly more formal interface; unfortunately the contents of this struct (though intended to be private) are visible for the convenience of allocation, but defining an instance of this struct on the user's side of the generated API allows for multiple instances of Lexi to run concurrently.

The API is now consistent for both with and without zones.

kate

Added -l for the output language.

kate

The group-querying functions are now behind a single interface, to which an enumeration of group identifiers is passed. This breaks compatibility for is_*(), which no longer exist. This helps provide a static set of functions should somebody wish to wrap calls to lexi-generated code; the contents of the enumeration scale with groups, rather than the size of the API.

This change brings the lookup table behind the API, making it static. The type of the lookup table is also removed from the public interface.

kate

Reworked prefix generation. All generated symbols should now be prefixed, which breaks compatibility.

kate

Removed single-file output; the generated header is now mandatory.

kate

Implemented an alternate keywords mechanism. This introduces a new function, lexi_keyword(), which returns an int representing the keyword found. The intention is that this would share a token ID along with Lexi's main interface.

Removed the -k option.

kate

Removed lookup_char() in favour of passing the character directly to the is_*() macros. This is an API change.

kévin

Added the -p prefix option.

kévin

Added COPYRIGHT and the -C option for specifying copyright files on the command line. The original “first comment” behaviour is kept, if neither of these are specified.

kévin

Added [^...] for complements of groups.

kévin

Added #n and a literal string for function arguments.

kévin

Added optional support for a generated C header file.

kévin

Added support for groups in zones.

kévin

Added zones.

kate

Merged in support for up to (and including) 31 groups. This is a rework of a patch originally submitted by Rob Andrews, but not included in our tag for Lexi 1.2. Notably it has been adapted to use stdint.h instead of assuming the size of various C types.

kate

Lexi now maintains its own internal fixed-length buffer when reading tokens. This obsoletes the unread_char() function, which is now no longer called by lexi.

This change aims to be backwards-compatible; it should not affect existing programs, though their unread_char() functions may now be redundant and can be removed if not used by their own routines.

kate

Tagged Lexi 1.3.

kate

Wrote the Lexi Users' Guide.

This was mostly reverse-engineered from the source, and by experimenting with Lexi.

kate

Added examples.

kate

Added -h.

kate

Moved out Lexi to a standalone tool.

asmodai

Tagged Lexi 1.2.

This corresponds to Lexi 1.2, which was developed privately by Rob after the 4.1.2 TenDRA release. We are skipping version 1.2 to avoid confusion with his version.

DERA

Lexi 1.1; TenDRA 4.1.2 release.

i. Introduction

Lexi translates a description of a lexical analyser into C code implementing that analyser. It aims to provide simple, straightforward features for lexical analysis, and to leave more complex behaviour to its calling program.

This document describes how to use the Lexi lexer generator. It was written for Lexi version 2.0.

This version of Lexi (trunk, pre-2.0) is the current working version. Once all new features have been added, we will tag a 2.0-proto-1 release. The 2.0 release will happen once we have frozen the syntax. Backwards compatibility with 1.3 is not preserved; however, we are currently working on an option switch that would emulate the old behaviour.

This document is a modification of the user guide for Lexi 1.3. Lexi 1.3 was an interim release to mark a stable point before the API changes scheduled for the next release. It provided no features over 1.2 save for some minor points (listed in the revision history above); it served mostly to collate the restructuring of the codebase during development. 1.3 was the first release of Lexi as a stand-alone product, separate from the TenDRA distribution.

Any original documentation Lexi once had prior to version 1.3 is lost. Since there is no surviving documentation, this guide has been written from scratch, based on observations of the source code. It is difficult to distinguish between obscure features and undefined behaviour; we are taking the opportunity here to explicitly state which features are undefined behaviour and which are intentional. We try to maintain backwards compatibility with these features, based on the Lexi files present in the TenDRA repository.

1. Lexical Analysis

  1. 1.1. Overview
  2. 1.2. Concepts Lexi Expresses

1.1. Overview

Lexical analysis is the process of categorising an input stream into tokens. Each of these categories (called lexemes) is determined by the spelling of the contents of its tokens. For convenience of discussion, this guide assumes that tokens are formed of characters, although this is not necessarily the case.

Categories of tokens are arranged such that the possible sets of spellings forming each category are unambiguous. For example, commonly floating point literals are distinguishable from integer literals because they must contain a decimal point. If two different concepts must be expressed using the same (or overlapping) sets of spellings, a further stage after lexical analysis would be required to distinguish them. For example, commonly an identifier may represent either a variable or a function name, and it is the role of a parser to distinguish between them based on the ordering of the tokens.

Usually it is the responsibility of the lexical analyser to skip whitespace (and often also comments). Figure 1 illustrates an example of tokenisation:

[Figure: the input characters int i = 53; are tokenised into <type> "int", <ident> "i", <assign>, <literal> "53" and <sep>, which are passed on to the parser.]
Figure 1. Forming a Token Stream

In this example, the lexer skips over whitespace characters (indicated in the input text with a shaded background) and categorises the remaining characters according to their spelling. For example, a series of digits is categorised into the <literal> type. Notice that whitespace is optional; there happens to be none between the 53 and ; tokens.

The spelling of the characters which form each token is passed on, but is usually ignored for punctuation tokens such as statement separators and assignment operators. The spelling is more relevant for identifiers, literals and other non-punctuation tokens.

The traditional use for a lexer is to feed this stream of categorised tokens to a parser; lexers are unaware of the order of tokens, since they are concerned with categorisation only. The process of parsing then asserts that these tokens are in an order permitted by the given grammar. Usually storing the spellings associated with each token is a side-effect of the parse, resulting in a symbol table or equivalent (this depends on the purpose of the parser). Parsing is beyond the scope of this document, and will not be discussed here; see the SID Users' Guide for details.

For simple uses, a parser may not be required at all.

1.2. Concepts Lexi Expresses

Keyword Definitions

Keywords are sequences of inseparable characters treated whole, intended to be passed back to the calling program as single tokens. These differ from most tokens in that their spellings are fixed, rather than a set of possible spellings.

Lexi itself does not perform processing of keywords; instead it outputs their definitions separately (see the -k option), and these definitions are intended to be checked within a function called by a token.

Pre-pass Mappings

For mappings of sequences of characters to a single character, pre-pass substitutions offer a mechanism to implement effects similar to trigraphs in C.

Whitespace Definitions

Whitespace serves both to separate tokens whose concatenated spellings would be ambiguous (e.g. to distinguish abc and xyz from abcxyz), and to improve readability of the source. Whitespace is not passed on in the generated token stream.

Group definitions

Group definitions define sets of characters which may be specified in place of a single character in tokens. The presence of a character in a group may also be tested via the generated API, much like C's ctype.h functions.

Token definitions

Token definitions specify the set of spellings which form a lexeme. These are central to lexical analysis.

Default definitions

Default definitions specify which action should be performed upon encountering an unrecognised token.

Zone definitions

Zone definitions specify zones in which lexical analysis is subject to different rules. Zone definitions may contain token definitions, default definitions, group definitions and whitespace definitions.

Prior to 2.0, zones and defaults did not exist, and any other facility was provided by matching the start of a token and passing control to an externally defined function. This separation provided a general mechanism to allow the user to handle situations unforeseen by the authors, without making Lexi itself over-complex by attempting to provide built-in mechanisms to express every corner-case (for example, switching to a "comment-parsing" mode). See the function calls section for details. In 2.0, zones (at least once buffers are implemented) should allow most of these possibilities to be expressed, even ones not foreseen by the authors.

2. The Lexi Specification

  1. 2.1. Lexical Conventions
  2. 2.2. Comments
  3. 2.3. Identifiers
  4. 2.4. Function Calls
  5. 2.5. Character Sets and Sequences

There are several concepts common to multiple commands.

2.1. Lexical Conventions

The following productions are used throughout this document:

either-identifier   := sid-identifier | function-identifier | literal-identifier;
sid-identifier      := "$" + identifier ;
literal-identifier  := identifier ;
function-identifier := identifier + "(" + ")" ;

And for brevity, the following groups are defined:

GROUP alpha    = {A-Z} + {a-z} + "_" ;
GROUP digit    = {0-9} ;

2.2. Comments

Lexi permits comments in lexical analyser specifications. These are C style comments, opening /* and closing */. They may not nest, and they must close. Since version 2.0, Lexi also supports C++ style comments, opening // and extending to the end of the line. For example:

/* This is a comment */
/*
 * And so is this.
 */
// And this is a C++ style comment too

2.3. Identifiers

An identifier is a sequence of alphanumeric characters and underscores, beginning with an underscore or an alphabetic character. A Sid identifier may also contain (but not begin with) a hyphen.

Literal identifiers have no prefix symbol; their names are used verbatim. They may be used for defining groups, keywords and tokens. They may also be used for conditional statements and for the names of functions to be called.

Sid identifiers are prefixed $. They may contain a hyphen. Since hyphens are not valid in C preprocessor or variable identifiers, hyphens are substituted with _H. So for example the identifier $a-b would become a_Hb in the generated code, prefixed (by default with lex_, or as given by the -t option) to become lex_a_Hb.

Sid identifiers may be used for defining keywords and tokens only.
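For example (an illustrative fragment; the names are hypothetical):

TOKEN "++" -> $increment ;	/* Sid identifier: appears as lex_increment in the generated code */
TOKEN "--" -> decrement ;	/* literal identifier: used verbatim */
TOKEN "/*" -> get_comment() ;	/* function identifier */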

2.4. Function Calls

The parameters specified to functions are stated as empty (that is, an open bracket immediately followed by a close bracket).

Do not confuse function names with Sid identifiers: for instance, $get-comment() is not a legal function name.

It is rare that a function needs to identify which characters were read to match the token; however, they are passed as arguments so that the characters are available. This is a convenience so that two similar tokens may share the same function:

TOKEN "/*" -> get_comment () ;
TOKEN "//"  -> get_comment () ;

In the above example the function is called as either get_comment('/', '*') or as get_comment('/', '/'), depending on which token was matched. In this way, get_comment() can decide which type of comment it is expected to retrieve, and therefore handle the differences appropriately.

The parameters passed to the function are the characters read to match the token, as separate int variables. So for the above example, get_comment() would be declared as:

void get_comment(int c0, int c1);

Calling a function from two tokens of different lengths is undefined behaviour.

2.5. Character Sets and Sequences

Sets and sequences of characters are used by all Lexi commands. They share the same syntax, but have different meanings: a set is the unordered union of all the characters specified, whereas a sequence must be matched entirely, in the order given, as per a string comparison. The syntax is:

chars       := item + { "+" item } * ;
item        := string | range ;
range       := "{0-9}" | "{a-z}" | "{A-Z}" ;
string      := <"> + { CHAR | escapedchar | group } * + <"> ;
escapedchar := "\" + ( "n" | "t"  ) ;
group       := "[" + GROUPNAME + "]" ;
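For example, the same syntax has different meanings in a set and in a sequence (illustrative definitions):

GROUP vowels = "aeiou" ;	/* a set: matches any one of a, e, i, o or u */
TOKEN "for" -> $for ;		/* a sequence: matches f, then o, then r */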

There are three available pre-defined ranges. Note that these ranges are fixed, and that start points or end points other than those given are invalid. These ranges are:

Range	Contents
{0-9}	The digits 0 to 9.
{a-z}	The lower case letters a to z.
{A-Z}	The upper case letters A to Z.
Table 1. Predefined Character Ranges

Ranges are inclusive. The numeric values of the characters included in ranges are those of the character set of the C implementation used to compile Lexi (it is reasonable to assume this is ASCII). Note that the range production always represents a set (as opposed to a sequence), even when used as part of a TOKEN command.

Strings are sequences of ASCII characters delimited by double-quotes (").

The following special characters may be written by backslash escapes:

Escape	ASCII Value	Name
\t	0x09	Horizontal tab
\n	0x0A	New line
\v	0x0B	Vertical tab
\f	0x0C	New page (form feed)
\r	0x0D	Carriage return
\"	0x22	Double quote
\[	0x5B	Left bracket
\\	0x5C	Backslash
\e	—	End of file (EOF)
Table 2. Escaped Characters

For example, a string specifying the character a, a newline, the character b, a backslash, a horizontal tab, and the character c is:

"a\nb\\\tc"

Strings must contain at least one character, or one character coded for by an escape sequence. Empty strings are not permitted.

These escape sequences may be used in any string. Note that "[" requires escaping to distinguish it from the delimiter for a group name (see below).

A set of characters defined by a GROUP command may be included into a character set by specifying the group name between [ and ] within a string. For example:

"[xyz]"
"abc[def]hij[klm]nmo"

These include the characters defined by the xyz group into the first string, and the characters from the def and klm groups into the second. It makes no difference whether the string is a set or a sequence; the groups included are always treated as a set, even though the rest of the string may be a sequence.

The group to be included must already have been defined at the point where it is referenced. Note that this does not permit circular references.
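For example, the following is illegal, because bar is referenced before it has been defined (hypothetical names):

GROUP foo = "[bar]xyz" ;	/* illegal: bar is not yet defined */
GROUP bar = "abc" ;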

See the commands below for further examples of character sets and sequences.

Strings and ranges may be concatenated by the plus operator. For example:

"abc" + "d" + "efg" + {0-9}

is equivalent to the sequence:

"abcdefg[digit]"

(where the group digit is suitably defined), and to the set:

"abcdefg0123456789"

3. Commands

  1. 3.1. Pre-pass Mappings
  2. 3.2. Character Group Definitions
  3. 3.3. Whitespace Definition
  4. 3.4. Token Definitions
  5. 3.5. Default definition
  6. 3.6. Keyword definitions
  7. 3.7. Zone Definitions
  8. 3.8. Type Definitions
  9. 3.9. Action prototypes

Lexical analysers are described to Lexi by a sequence of commands. This section provides an explanation of each possible command, and explains their respective intended uses.

3.1. Pre-pass Mappings

The lexical analysis runs in two passes. The first pass, or pre-pass stage, permits replacements to be substituted before the main pass, in which tokenisation takes place. This gives a convenient mechanism for expressing trigraph-like substitutions as found in C. The syntax to define pre-pass replacements is:

MAPPING sequence + "->" + char ;

The string on the right (i.e. the value with which the matched string is replaced) may only contain one character, or an escape sequence which yields one character.

For example, to replace the trigraph ??= with a single #:

MAPPING "??=" -> "#" ;

This would replace instances of ??= with # before any tokenisation takes place. So the input a??=b would match the token definition:

TOKEN "a#b" -> a ;

(and so would simply a#b, as usual).

A group may be included in the character sequence to be replaced. For example:

MAPPING "[alpha]" -> " " ;

will replace any alphabetic character with a blank, assuming the alpha group is suitably defined at that point. See §2.5 for details of including groups in sequences.

A complemented group may be used to match any character not in the group:

MAPPING "[^alpha]" -> " " ;

will replace any non-alphabetic character with a blank.

Mappings are substituted repeatedly until no further mappings match. The order of replacement for mappings matching strings of equal length is undefined, and so it is an error to define a mapping which produces a character used at the start of any mapping, including itself. For example:

MAPPING "???" -> "?" ;

is illegal. To see why, consider the input aab for the (illegal) mappings:

MAPPING "aa" -> "x" ;
MAPPING "xb" -> y ;

Since the order of substitution for mappings matching strings of equal length is undefined, it is unclear whether this will result in xb or y. Notice that C does not define a ??? trigraph, perhaps for this very reason (or perhaps simply because it would be redundant). This restriction applies no matter how the string defining the characters to be mapped is formed: for example, it is also illegal to define a mapping which maps to a character present in a group included at the start of another mapping.
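For instance (hypothetical definitions):

GROUP g = "?x" ;
MAPPING "[g]a" -> "b" ;
MAPPING "zz" -> "x" ;	/* illegal: x is in g, which begins the previous mapping */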

Mappings bind from left to right. For example:

MAPPING "ab" -> "d" ;
MAPPING "bc" -> "d" ;

For the input abc, these will produce dc, not ad.

Mappings matching longer sequences take precedence over mappings matching shorter sequences which begin the same way. For example:

MAPPING "abcdef" -> "x" ;
MAPPING "abcd" -> "y" ;

for the input abcdef will produce x, not yef.

3.2. Character Group Definitions

A group is an unordered set of characters; groups name these sets for use in strings. The syntax of a group definition is:

group-defn := "GROUP" + identifier + "=" + chars ;

The identifier specified is the name by which the group may be referenced.

For example, the following are valid group definitions:

GROUP alpha    = {A-Z} + {a-z} + "_" ;
GROUP digit    = {0-9} ;
GROUP odds     = "13579" ;
GROUP even     = "02468" ;
GROUP vowels   = "aeiou" ;
GROUP anything = "atrf" + "HGMP" + {0-9} ;

Groups may include the sets of previously-defined groups. Any character in the referenced set will be included into the group definition. For example:

GROUP hexdigit = "[digit]ABCDEFabcdef" ;
GROUP alphanum = "[alpha]" + {0-9} ;

assuming the groups alpha and digit have already been defined at that point. See §2.5 for details of the syntax for the chars production.

Groups may not contain characters which are present in other groups (i.e. they may not overlap). See §3.4 for further discussion of why this is so.

Macros to test if a character belongs to a group are provided in the generated code, of the form is_groupname(). These must be passed the index into the look-up table containing the given character, obtained by lookup_char(c). For example:

is_alpha(lookup_char('a'))

would yield true, assuming the group alpha is suitably defined.

The group name white may not be used for groups other than the whitespace definition; see §3.3 for details.

A group name is unique amongst groups; groups may only be defined once.

3.3. Whitespace Definition

Consecutive whitespace characters outside of tokens are skipped by the lexical analyser before each token is recognised. Whitespace is treated with the semantics of a single token delimiter. Lexi specifies whitespace by the special group name white, which may not be used as an identifier to name other groups.

For special cases where whitespace has significance (a common example is inside string literals), token definitions may call user-defined functions which purposefully circumvent the whitespace-skipping features of Lexi.

The syntax is the same as for any group definition (§3.2), but with the special group name white:

white-defn := "GROUP" + "white" + "=" + chars ;

The whitespace definition may be omitted, in which case it defaults to space, horizontal tabulation and newline. Therefore it is always present, even if the declaration is implicit. As with any group, it may not be defined multiple times.

For example:

GROUP white = " \t\n\r";

Aside from the additional semantics explained above, the whitespace group is present as any other group: it is accessible in the API as is_white(), and may be included in character sets and sequences (§2.5) as [white].

It is illegal for the whitespace group to contain characters which are present in token definitions, including groups which those tokens use.

3.4. Token Definitions

Tokens are sequences of characters read by the lexical analyser and produced as output. Each token has a unique identifier, which is passed to the code calling Lexi, along with the characters read which form the token's spelling.

Tokens are usually the main component of a lexical analyser. In Lexi's case, the only situation in which there would be no token declarations is if the lexical analyser is to exclusively perform pre-pass mappings. The effect of specifying neither tokens nor pre-pass mappings is undefined.

The syntax for specifying tokens is:

token-defn   := "TOKEN" + chars + "->" + command-list ;
command-list := command | command + "," + command-list ;

A command is either a return-terminal command, a discard command, or a function-call command. Inline actions (Sid-like) will be introduced in version 2.0, but are not ready in trunk yet.

The token discard command

It is represented by the token $$. It must appear in the last position. If this is the only command, the token will be read and its spelling discarded.

The return terminal command

A return terminal command is either a Sid identifier or a non-prefixed identifier.

return-terminal-command := "$" + identifier
			 | identifier ;

An example of these two forms:

TOKEN "token1" -> $sid-identifier ;
TOKEN "token2" -> nonsid_identifier; 

A Sid identifier is an identifier that will be prefixed by lex_, or by the prefix given with the -t option. Non-prefixed identifiers will not be prefixed at all; the non-prefixed form is considered obsolete and might be removed without notice. The first form of token definition states that upon encountering token1 the lexer should return the terminal corresponding to $sid-identifier.

A return terminal command must appear last in the list of commands.

The function call command

A function call command has the form

function-call-command :=  identifier + "(" + argument-list + ")" ;

An argument-list is a comma-separated list. Here are examples of arguments:

TOKEN "[alpha1][digit]" -> get_identifier1(##); // Old form, will probably be obsoleted calls get_identifier1(c0,c1)
TOKEN "[alpha2][digit]" -> get_identifier2(); // Old form equivalent to previous form will probably be obsoleted
TOKEN "[alpha3]" -> get_identifier3(#$); // calls get_identifier3() with no argument
TOKEN "[alpha4]a" -> get_identifier3(#); // calls get_identifier4()
TOKEN "[alpha4][digit]b" -> get_identifier4("globalbuffer", #1,#0); // calls get_identifier4(globalbuffer,c1,c0)
TOKEN "[alpha5]a" -> get_identifier5(#*); // should call get_identifier5(char*), not completely implemented yet.

where c0, c1, c2, ... match the first, second, third (and so on) characters of the token. It is possible to combine the arguments (except for #$, which is only valid if it is the only argument in the list).

The return value of a function will be ignored unless it is the last function call in the command list, in which case it is expected to return a valid terminal. If you do not want the last called function to return a terminal, you must add a trailing discard $$:

TOKEN "[digit]" -> push_buffer(#0), $$; // do not return a terminal.
TOKEN "[alpha]" -> get_identifier(#0); // return a terminal

Inline action calls

An inline action call has the form

( /* comma-separated output list */ ) = <action-name>( /* comma-separated argument list */ );

The <> delimiters may or may not be removed from the syntax; the decision will happen prior to 2.0. The argument list can contain local variables and #-style arguments. Actions and types must have been declared previously; see §3.8 and §3.9.

See §5.2 for the C representation of the terminals returned by read_token().

The second form states that the lexer should return the result of the call to the given function identifier. See §5.3 for details of the function call made in C.

In more complex cases (most notably where tokens need to include whitespace), tokens are mapped to user-defined functions. For example, for comments in a C-style language, the lexical analyser is expected to discard characters until the end of the comment is found. In Lexi, this is specified as:

TOKEN "/*" -> get_comment() ;

Where get_comment() is an externally defined function which simply reads characters until the corresponding */ is found. See §2.4 for further details of calling functions.

In most languages, keywords are usually a subset of identifiers. In order to handle these and simplify the user-defined read_identifier() function (or equivalent), Lexi provides the keywords mechanism discussed in §3.6.

Note that this example does not illustrate storing the characters read. A real-world case would usually store spellings in order to be useful to a later stage, such as parsing.

Unlike many lexical analysers, tokens in Lexi are not specified by regular expressions. However, as sequences of characters may contain references to groups (which are treated as sets), similar effects may be achieved for simple cases. For example:

TOKEN "[alpha]" ->  get_identifier ();
TOKEN "$[alpha]" -> get_sid_identifier ();

assuming the group alpha is defined. Another example:

TOKEN "A[digit]" -> $papersize ;

would match paper sizes such as A3, A4 and so on. A token may only be defined once, but different tokens may share the same terminal or call the same function. So to extend our (rather lax) implementation of ISO 216 paper sizes:

TOKEN "A[digit]" -> $papersize ;
TOKEN "A10" -> $papersize ;
TOKEN "C[digit]" -> $envelopesize ;
TOKEN "B[digit]" -> $envelopesize ;
TOKEN "DL" -> $envelopesize ;	/* BS 4264 specifies a transparent window for DL */
TOKEN "ID-[digit]" -> $identificationcardsize ;

Note that A10 codes for the same lexeme as single-digit A sizes. See §2.4 for further examples of multiple tokens sharing one function, and §2.5 for further examples of using sets within sequences.

Using groups in this way is especially useful in combination with functions for reading variable-length tokens. For example:

TOKEN "[alpha]" ->  get_identifier() ;
TOKEN "$[alpha]" -> get_sid_identifier() ;

3.5. Default definition

A new feature in 2.0 is the ability to specify commands (usually a token return) to be performed by default. For example:

TOKEN "[alpha]" ->  get_identifier() ;
TOKEN DEFAULT -> $unknown ;

specifies that the terminal $unknown should be returned upon encountering a sequence of characters that cannot be matched by any other token definition.

3.6. Keyword definitions

The syntax of keywords resembles the syntax used for tokens:

keyword-defn := "KEYWORD" + string + "->" + either-identifier ;

For example:

KEYWORD "keyword" -> $key ;
KEYWORD "special" -> get_special () ;

Usually keywords are simply identifiers with a special meaning. Using the main pass to identify keywords with the token constructs is possible, but awkward, since the spellings of keywords are usually a subset of the much bigger set of identifiers. The keyword construct facilitates the identification of keywords after a token has been found; keywords have effect only under the -k option and are otherwise not present in the output generated by Lexi. Therefore the only difference between keywords and tokens (and indeed their purpose) is the programmatic interface that they provide.[a]

Code generated by Lexi under the -k option consists of a succession of calls to define each keyword, one per KEYWORD statement:

MAKE_KEYWORD ( "keyword", "lex_key" )
MAKE_KEYWORD ( "special", "get_special ()" )

where the identifier has been transformed according to the rules for Sid identifiers. It is then left to the user to implement MAKE_KEYWORD, usually by way of a macro. The generated keyword code is intended to be included with a #include directive. Suppose that keyword.h contains the keyword code; then, building on existing token definitions, the intended use for keywords is as follows (for example, with a lexer required to identify variable names):

KEYWORD "if" -> keyword_if ;
KEYWORD "else" -> keyword_else ;
TOKEN "[alpha]" -> get_variable() ;

where get_variable() checks to see if the given token is actually a keyword, like so:

<type> get_variable(int c) {
	char *token;
	/* token points to the characters read */
	...
	#define MAKE_KEYWORD(A, B)\
		if (!strcmp(token, (A))) return(B);
	#include "keyword.h"
	return(lex_variable);
	...
}

Here keyword.h was generated by Lexi's -k option. If the variable name read by get_variable() and pointed at by char *token is a keyword, keyword.h's calls to MAKE_KEYWORD() will result in string comparisons of token to each possible keyword in turn (that is, token is compared to "if" and "else"). If either of these match, the identifiers keyword_if and keyword_else are returned, respectively. Otherwise, if no keywords match, the token is known to be a variable name, and so get_variable() falls through to return the lex_variable identifier.

Unlike functions associated with tokens, functions associated with keywords are generated to be called with no arguments passed:

KEYWORD "sx" -> fx() ;

Results in the generated call:

MAKE_KEYWORD("sx", fx());

And so fx() should be declared to accept no parameters, that is, with prototype:

<type> fx(void);

3.7. Zone Definitions

Zones are the major new feature for 2.0. A zone is a part of a file where lexing rules change. The general syntax is

ZONE zonename : "entrytoken" [ -> $entry-terminal ] ... "leavingtoken" [ -> $leaving-terminal ] {
	/* zone body */
}

Upon encountering the entry token, the lexer changes the lexing rules and uses the ones defined in the zone body until it encounters the leaving token. Optionally, a terminal may be returned on either zone entry or zone leaving (or both). In the zone body, one can have whitespace definitions, group definitions, token definitions, default token definitions, mappings and other zone definitions. Keyword definitions are not allowed inside zones for now, but they are a planned feature.

Zones are used in place of user-defined functions. Their goal is to allow greater expressivity in the .lxi language. For example, we can define comments using zones:

ZONE comment: "/*" ... "*/"  {
	GROUP white = "";
	TOKEN DEFAULT -> $$; /* $$ means discard the token */
}

ZONE linecomment: "//" ... "\n"  {
	GROUP white = "";
	TOKEN DEFAULT -> $$; /* $$ means discard the token */
}

It is also possible to use zones to express strings:

ZONE string : "\"" -> clear_buffer(#$) [...] "\"" ->$string  {
	GROUP white = "";
	TOKEN DEFAULT -> push_buffer(#0), $$; 
}

It is planned to add buffers to Lexi 2.1, removing the need for user functions such as push_buffer. The range operators ... and [...] are equivalent. To express identifiers, one needs to use [...), for which the leaving token is not considered part of the zone:

ZONE string : "[alpha]" -> clear_buffer(#$) [...) "[^alphanum]" ->$identifier  {
	GROUP white = "";
	TOKEN DEFAULT -> push_buffer(#0), $$; 
}

Syntactic sugar for identifiers, comments and strings should be added in either 2.1 or 2.2.

3.8. Type Definitions

We can declare types in Lexi. These will be used for type checking of instructions (inline actions mostly; see §3.9) that must happen upon encountering a token. Type declarations may only appear in the global zone. Here are examples:

TYPE buffer;
TYPE integer;

There are several predefined types used by various action parameters; these need mapping to appropriate C types by typedef if used. The predefined types are:

Type		Use
CHARACTER	TODO: $# etc.
STRING		TODO
TERMINAL	TODO: $xyz etc.
Table 3. Predefined Types
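For example, if these types are used, a .lct header might map them to C types along the following lines (a sketch only; the particular C types chosen here are assumptions):

typedef int   TERMINAL;		/* terminals are ints, as returned by read_token() */
typedef int   CHARACTER;	/* characters are read as int (see §5.1) */
typedef char *STRING;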

3.9. Action prototypes

Any inline actions called inside a list of commands must have been previously prototyped.

action-decl := "ACTION" + identifier + [ ":" + "(" + typelist + ")" + "->" + "(" + typelist + ")" ] ;

Here is an example:

ACTION actionname : (:TYPE1,:TYPE2) ->(:TYPE3,:TYPE4);
[a] This historic interface for producing keywords is expected to change drastically for the next version of Lexi.

4. The lexi action file

  1. 4.1. Introduction
  2. 4.2. The header and trailer statements
  3. 4.3. Action definition statement

4.1. Introduction

The Lexi action file (.lct) is the second input file to Lexi. In it you can specify a header and trailer for both of the output files. Before the release of 2.0, it will also contain the definitions of the inline actions. The syntax is not frozen yet, but will be prior to the 2.0 tag.

4.2. The header and trailer statements

These are optional statements. They are used to specify headers and trailers for both output files. The header is mostly used to add #include directives to the output files, to typedef mappings to Lexi's types, and to provide other preamble before the generated code. For example:

HEADERS = @{
	/* header for the C file */
@}, @{
	/* Header for the h file */
@};
TRAILERS = @{
	/* Trailer for the C file */
@}, @{
	/* Trailer for the h file */
@};
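For instance, a .lct file might use these statements as follows (an illustrative sketch):

HEADERS = @{
	/* emitted into the generated .c file */
	#include <string.h>
@}, @{
	/* emitted into the generated .h file */
	#include <stdint.h>
@};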

4.3. Action definition statement

The syntax is accepted, but the semantics are not fully coded yet. An action definition defines the inline code that will be inserted upon encountering a token whose command list contains a call to this action. Here is an example of the intended syntax:

ACTION actionname: (inputvar1 :T1, inputvar2 :T2) -> (outputvar1 :T3) = @{
	/* code block */
@},

5. Interface

  1. 5.1. Instantiation
  2. 5.2. Terminals
  3. 5.3. Functions
  4. 5.4. Character Groups
  5. 5.5. Keywords

Lexi generates and outputs C89 code implementing the specified lexical analyser. This code is intended to be linked as part of a larger C program by the user.

5.1. Instantiation

Input to the generated lexical analyser is by way of a single function (or macro, if desired) that the user is expected to provide to read a character:

int read_char(void);

Output from the lexical analyser is by way of a function provided by Lexi that returns the next token to the user's program:

int read_token(void);

This is the user's primary interface to the lexical analyser. Note that the type of characters is int, so that EOF may be expressed if necessary.

Calling read_token() will result in the lexical analyser making as many calls to read_char() as are necessary.
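A minimal sketch of a program driving the generated analyser, assuming no -p or -i prefixes, a generated header named lexer.h (a hypothetical name), and a user-defined lex_eof terminal as in the example of §5.3:

#include <stdio.h>

#include "lexer.h"	/* hypothetical name for the generated header */

int read_char(void) {
	return getchar();	/* feed the lexer from stdin */
}

int main(void) {
	int t;

	while ((t = read_token()) != lex_eof) {
		printf("token %d\n", t);
	}

	return 0;
}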

5.2. Terminals

Lexi does not define the values of terminals in the generated code; these are expected to be specified by the user (most usually from a parser-generator such as Sid). For example, a token defined by:

TOKEN "--" -> $dash ;

would return the value of lex_dash from read_token() when matched. The prefix of these identifier names may be specified with the -t option. See the Sid documentation for further discussion of the C representation of Sid's terminals.
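Since Lexi does not define these values, the user might declare them with an enumeration (an illustrative sketch; in practice a parser generator such as Sid would normally supply these definitions):

enum {
	lex_dash,
	lex_eof,
	lex_identifier
};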

5.3. Functions

Within the C implementation of these functions, the usual Lexi API may be called; for example, read_char(). This is especially useful for calling the macros which identify membership of groups. A common case is reading tokens of variable length, such as identifiers. For example (where unread_char() is a user-defined function with the obvious effect):

GROUP identstart = {a-z} + {A-Z} + "_";
GROUP identbody = "[identstart]" + {0-9} + "-";
TOKEN "[identstart]" -> read_identifier();
int read_identifier(int c) {
	for (;;) {
		if (c == EOF) {
			return lex_eof;
		}

		if (!is_identbody(lookup_char(c))) {
			unread_char(c);
			return lex_identifier;
		}

		/* store character here */

		c = read_char();
	}
}

Functions called by tokens are passed each character forming the token. The example above would result in the call:

read_identifier(c);

where c is the content of the token (that is, the character matched by "[identstart]"). Where a token matches multiple characters, each character is passed:

TOKEN "abc" -> f();
f('a', 'b', 'c');

See §5 for further details of the C interface the generated lexer will call.

Note that it is undefined behaviour to have tokens of different lengths call the same function.

5.4. Character Groups

Should the user wish to check if a character is in a group, the generated code provides macros of the form is_groupname(). These are intended to be used as:

is_digit(lookup_char(c))

assuming a group named digit is defined. See §3.2 and §3.3 for further details of group names.

5.5. Keywords

Neither the keyword calls output by -k nor the lexical analyser itself depend on any headers, other than those the user's own code requires.

A. Appendix

  1. A.1. Undefined Behaviour
  2. A.2. Obscure Features
    1. A.2.1. Ranges Within Tokens
    2. A.2.2. Whitespace Within Tokens and Keywords

A.1. Undefined Behaviour

Undefined behaviours are actions that generate output which may be invalid, nonsensical, undesired or simply legal but obscure.

The following constructs are syntactically legal input to Lexi, but produce effects which are undefined behaviour. They may be disallowed entirely in future versions and should be avoided. This particular release of Lexi permits these as undefined to offer a transition period only.

Mapping to the start of another mapping

To define a mapping which produces a character used at the start of any mapping, including itself. For example:

MAPPING "???" -> "?" ;
Calling one function from multiple different-length tokens

For example:

TOKEN "//" -> get_comment () ;
TOKEN "#"  -> get_comment () ;

One workaround is to call two macros of different prototypes which call the same function.
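For instance (a hypothetical sketch; the macro and function names are illustrative):

TOKEN "//" -> get_comment2() ;
TOKEN "#"  -> get_comment1() ;

/* each macro's arity matches its token's length; both delegate to one function */
#define get_comment2(c0, c1) do_comment()
#define get_comment1(c0)     do_comment()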

A.2. Obscure Features

This section describes legal constructs of doubtful interest.

A.2.1. Ranges Within Tokens

The grammar for Lexi permits strings formed of pre-defined character ranges to be used for token definitions:

TOKEN {A-Z} + "something" -> $x ;

Character ranges are intended to be used for sets (that is, in GROUP), not sequences. Since tokens are defined using sequences, this really means:

TOKEN "ABCDEFGHIJKLMNOPQRSTUVWXYZsomething" -> $x ;

This is probably not the intended effect. Using a group is suggested instead:

GROUP alpha = {A-Z} ;
TOKEN "[alpha]something" -> $x ;

A.2.2. Whitespace Within Tokens and Keywords

The white group may be used in tokens and keywords as any other group would:

TOKEN "a[white]" -> $something ;
TOKEN "[white]ab" -> $neverscanned ;

The above are both legal tokens definitions. The second one will never be scanned since white characters are discarded before tokens are matched.