1. Lexical Analysis

  1.1. Overview
  1.2. Concepts Lexi Expresses

1.1. Overview

Lexical analysis is the process of categorising an input stream into tokens. Each of these categories (called lexemes) is determined by the spelling of the contents of the token. For convenience of discussion, this guide assumes that tokens are formed of characters, although this is not necessarily the case.

Categories of tokens are arranged such that the possible sets of spellings forming each category are unambiguous. For example, floating-point literals are commonly distinguishable from integer literals because they must contain a decimal point. If two different concepts must be expressed using the same (or overlapping) sets of spellings, a further stage after lexical analysis would be required to distinguish them. For example, an identifier may commonly represent either a variable or a function name, and it is the role of a parser to distinguish between them based on the ordering of the tokens.

Usually it is the responsibility of the lexical analyser to skip whitespace (and often also comments). Figure 1 illustrates an example of tokenisation:

[Figure: the input characters "int i = 53;" are tokenised into <type> "int", <ident> "i", <assign>, <literal> "53" and <sep>, which are passed on to the parser.]
Figure 1. Forming a Token Stream

In this example, the lexer skips over whitespace characters (indicated in the input text with a shaded background) and categorises the remaining characters according to their spelling. For example, a series of digits is categorised into the <literal> type. Notice that whitespace is optional; there happens to be none between the 53 and ; tokens.
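
As a concrete (if simplified) illustration, the following hand-written C sketch categorises the same input by hand. It is not Lexi output, and the token names are invented for this example; it merely demonstrates the behaviour described: whitespace is discarded, runs of letters and digits are categorised by their spelling, and punctuation characters become tokens of their own.

    #include <ctype.h>
    #include <stdio.h>
    #include <string.h>

    /*
     * Hand-written sketch of the categorisation shown in Figure 1.
     * Whitespace is skipped; runs of letters become <ident> (or <type>
     * when the spelling is "int"), runs of digits become <literal>,
     * and '=' and ';' become <assign> and <sep> respectively.
     */
    int main(void)
    {
        const char *p = "int i = 53;";

        while (*p != '\0') {
            if (isspace((unsigned char) *p)) {
                p++;    /* whitespace is not passed on */
            } else if (isalpha((unsigned char) *p)) {
                const char *start = p;
                while (isalpha((unsigned char) *p)) p++;
                if (p - start == 3 && strncmp(start, "int", 3) == 0) {
                    printf("<type> \"int\"\n");
                } else {
                    printf("<ident> \"%.*s\"\n", (int) (p - start), start);
                }
            } else if (isdigit((unsigned char) *p)) {
                const char *start = p;
                while (isdigit((unsigned char) *p)) p++;
                printf("<literal> \"%.*s\"\n", (int) (p - start), start);
            } else if (*p == '=') {
                printf("<assign>\n");
                p++;
            } else if (*p == ';') {
                printf("<sep>\n");
                p++;
            } else {
                p++;    /* characters this sketch does not recognise */
            }
        }

        return 0;
    }

Running this prints the token stream of Figure 1: <type> "int", <ident> "i", <assign>, <literal> "53", <sep>.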

The spelling of the characters which form each token is passed on, but is usually ignored for punctuation tokens such as statement separators and assignment operators. Knowing the spelling is more relevant for identifiers, literals and other non-punctuation tokens.

The traditional use for a lexer is to feed this stream of categorised tokens to a parser; lexers are unaware of the order of tokens, since they are concerned with categorisation only. The process of parsing then asserts that these tokens are in an order permitted by the given grammar. Usually storing the spellings associated with each token is a side-effect of the parse, resulting in a symbol table or equivalent (this depends on the purpose of the parser). Parsing is beyond the scope of this document and will not be discussed here; see the SID Users' Guide for details.

For simple uses, a parser may not be required at all.

1.2. Concepts Lexi Expresses

Keyword Definitions

Keywords are sequences of inseparable characters treated as a whole, intended to be passed back to the calling program as single tokens. They differ from most tokens in that their spellings are fixed, rather than being drawn from a set of possible spellings.

Lexi itself does not process keywords; instead it outputs their definitions separately (see the -k option), and these definitions are intended to be checked by a token which calls a function.
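
For illustration only, such a check might look like the hand-written C fragment below; the function and token names are invented for this sketch and are not part of Lexi's generated interface.

    #include <string.h>

    /* Hypothetical keyword check, called from the token that matched
     * an identifier.  The identifier's spelling is compared against
     * the fixed spellings of the keywords. */
    enum token { TOK_IDENT, TOK_INT, TOK_RETURN };

    static enum token check_keyword(const char *spelling)
    {
        if (strcmp(spelling, "int") == 0)    return TOK_INT;
        if (strcmp(spelling, "return") == 0) return TOK_RETURN;

        return TOK_IDENT;    /* not a keyword: an ordinary identifier */
    }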

Pre-pass Mappings

Pre-pass mappings substitute a sequence of characters with a single character before the main pass of lexical analysis. This offers a mechanism to implement effects similar to trigraphs in C.
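
The effect being described can be pictured with a small hand-written C sketch (not Lexi syntax or output) that rewrites the C trigraph "??=" to '#' before tokenisation begins:

    #include <stdio.h>

    /* Replace the three-character sequence "??=" with the single
     * character '#' before the main tokenisation pass, in the spirit
     * of C's trigraphs. */
    static void prepass(const char *in, FILE *out)
    {
        while (*in != '\0') {
            if (in[0] == '?' && in[1] == '?' && in[2] == '=') {
                fputc('#', out);
                in += 3;
            } else {
                fputc(*in, out);
                in++;
            }
        }
    }

    int main(void)
    {
        prepass("??=define X 1\n", stdout);    /* prints "#define X 1" */
        return 0;
    }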

Whitespace Definitions

Whitespace serves both to separate tokens whose concatenated spellings would be ambiguous (e.g. to distinguish abc and xyz from abcxyz), and to improve readability of the source. Whitespace is not passed on in the generated token stream.

Group Definitions

Group definitions define sets of characters which may be specified in place of a single character in tokens. The presence of a character in a group may also be tested via the generated API, much like C's ctype.h functions.
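
A membership test of this kind can be pictured as a hand-written C function in the style of <ctype.h>; the group name and function name below are invented for the illustration and are not the names Lexi generates.

    #include <stdbool.h>

    /* Membership test for an illustrative group "alpha" containing
     * the letters and underscore, analogous to C's isalpha(). */
    static bool in_group_alpha(int c)
    {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || c == '_';
    }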

Token Definitions

Token definitions specify the set of spellings which form a lexeme. These are central to lexical analysis.
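
As a rough picture of what a token definition expresses, the hand-written C sketch below forms a numeric literal by consuming a maximal run of characters from a digit group; the names and buffer handling are invented for the illustration.

    #include <stdbool.h>
    #include <stddef.h>

    static bool in_group_digit(int c)
    {
        return c >= '0' && c <= '9';
    }

    /* Form a <literal> token: consume a maximal run of digits and
     * keep its spelling, since literals are non-punctuation tokens. */
    static size_t read_literal(const char **pp, char *buf, size_t bufsz)
    {
        size_t n = 0;

        while (in_group_digit((unsigned char) **pp) && n + 1 < bufsz) {
            buf[n++] = *(*pp)++;
        }
        buf[n] = '\0';

        return n;
    }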

Default Definitions

Default definitions specify which action should be performed upon encountering an unrecognised token.
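
The behaviour being specified corresponds to something like the following hand-written C fragment; the diagnostic wording and the function name are invented for the illustration.

    #include <stdio.h>

    /* Action performed when no token definition matches: report the
     * offending character rather than silently dropping it. */
    static void unknown_token(int c, int line)
    {
        fprintf(stderr, "line %d: unrecognised character '%c'\n", line, c);
    }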

Zone Definitions

Zone definitions specify zones in which lexical analysis is subject to different rules. Zone definitions may contain token definitions, default definitions, group definitions and whitespace definitions.
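
The kind of rule-switching a zone expresses can be pictured with a hand-written C sketch for a comment zone delimited by /* and */: once inside the zone, everything is discarded until the closing delimiter. With Lexi this switching is described declaratively in the zone definition rather than coded by hand as below.

    #include <stdio.h>

    /* Skip a comment zone: the opening delimiter has already been
     * consumed, and every character is discarded until the closing
     * star-slash delimiter is seen. */
    static void skip_comment_zone(FILE *fp)
    {
        int c, prev = '\0';

        while ((c = getc(fp)) != EOF) {
            if (prev == '*' && c == '/') {
                return;    /* leave the zone */
            }
            prev = c;
        }
    }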

Prior to 2.0, zones and defaults did not exist, and any other facilities were provided by matching the start of a token and passing control to an externally defined function. This separation provided a general mechanism to allow the user to handle situations unforeseen by the authors, without making Lexi itself overly complex by attempting to provide built-in mechanisms to express every corner case (for example, switching to a "comment-parsing" mode). See the function calls section for details. In 2.0, zones (at least once buffers are implemented) should allow most of these possibilities to be expressed, even ones not foreseen by the authors.