Other implementation details

10. Other implementation details

10.1. Parsing C++
10.2. Undefined conversions
10.3. Integer division
10.4. Dynamic initialisation
10.5. The std namespace
10.6. Error catalogue

10.1. Parsing C++

The parser used in the C++ producer is generated using the sid tool. Because of the large size of the generated code (1.3MB), sid is instructed to split its output into a number of more manageable modules.

sid is designed as a parser generator for grammars which can be transformed into LL(1) grammars. The distinguishing feature of these grammars is that the parser can always decide what to do next based on the current terminal. This is not the case in C++; in some circumstances a potentially unlimited look-ahead is required to distinguish, for example, declaration statements from expression statements. In technical terms, the C++ grammar is LL(k) rather than LL(1), with no fixed bound on k. Fortunately there are relatively few such situations, and sid provides a mechanism, predicates, for bypassing the normal parsing mechanism in these cases. Thus it is possible, although difficult, to express C++ as a sid grammar.
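The following fragment sketches the kind of ambiguity involved. It is an illustration only: the names T, f and declaration_versus_expression are invented for this example and are not part of the producer. Both marked statements begin with a name followed by a parenthesised identifier, so the choice between a declaration and an expression cannot be made from the current terminal alone.

```cpp
#include <cassert>

// Hypothetical type used only for this illustration.
struct T {
    int v;
    T() : v(0) {}
    explicit T(int x) : v(x) {}
};

// Hypothetical function used only for this illustration.
void f(int& x) { x += 1; }

int declaration_versus_expression() {
    int a = 41;
    int seen = -1;
    {
        T(a);        // declaration statement: equivalent to "T a;", hiding the outer a
        seen = a.v;  // the inner a was default-constructed, so v is 0
    }
    f(a);            // expression statement: a call, because f does not name a type
    assert(seen == 0);
    return a;
}
```

The two statements have the same shape up to the closing parenthesis; only the result of looking up the leading name, and in general the tokens that follow, distinguish them.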

The sid grammar file, syntax.sid, is closely based on the ISO C++ grammar. In particular, the same production names have been used. The grammar has been extended slightly to allow common syntactic errors to be detected elegantly. Other parsing errors are handled by sid's exception mechanism. At present there is only limited recovery after such errors.

The lexical analysis routines in the C++ producer are hand-crafted, based on an initial version generated by the simple lexical analyser generator, lexi. lexi has been used more directly to generate the lexical analysers for certain of the other automatic code generating tools, including calculus, used in the producer.

The sid grammar contains a number of entry points. The most important is parse_file, which is used to parse a complete C++ translation unit. The syntax for the #pragma TenDRA directives (see tdfc2pragma) is included within the same grammar with two entry points: parse_tendra in normal use, and parse_preproc for use in preprocessing mode. There are also entry points in the grammar for each of the kinds of token argument. The parsing routines for token and template arguments are largely hand-crafted, based on these primitives.

Certain parsing operations are performed before control passes to the sid grammar. As mentioned above, these include the processing of token and template applications. The other important case concerns nested name specifiers. For example, in:

class A {
    class B {
        static int c ;
    } ;
} ;

int A::B::c = 0 ;

the qualified identifier A::B::c is split into two terminals, a nested name specifier, A::B::, and an identifier, c, which is looked up in the corresponding namespace. Note that it is at this stage that name look-up occurs. An identifier can be mapped to one of a number of terminals, including keywords, type names, namespace names and other identifiers, according to the result of this look-up. If the look-up gives a macro then this is expanded at this stage.
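As a rough sketch of this split, the function below separates a qualified identifier, given as a string, into its nested name specifier and its final identifier. The function split_qualified is hypothetical and works on plain text; the producer itself works on preprocessing tokens and performs name look-up at the same time, which this sketch does not attempt.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: returns (nested name specifier, identifier).
// The specifier keeps its trailing "::" and is empty for an unqualified name.
std::pair<std::string, std::string> split_qualified(const std::string& qid) {
    assert(!qid.empty());
    const std::string::size_type pos = qid.rfind("::");
    if (pos == std::string::npos) {
        return std::make_pair(std::string(), qid);
    }
    return std::make_pair(qid.substr(0, pos + 2), qid.substr(pos + 2));
}

// Usage example for the qualified identifier discussed in the text.
bool split_example_matches_text() {
    const std::pair<std::string, std::string> parts = split_qualified("A::B::c");
    return parts.first == "A::B::" && parts.second == "c";
}
```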

10.2. Undefined conversions

Several conversions in C and C++ can only be represented by undefined TDF. For example, converting a pointer to an integer can only be represented in TDF by forming a union of the pointer and integer shapes, putting the pointer into the union and pulling the integer out. Such conversions are tokenised. Undefined conversions not mentioned below may be performed by combining those given with the standard, well-defined, conversions.

The token:

~ptr_to_ptr : ( ALIGNMENT a, ALIGNMENT b, EXP POINTER a ) -> EXP POINTER b

is used to convert between two incompatible pointer types. The first alignment describes the source pointer shape while the second describes the destination pointer shape. Note that if the destination alignment is greater than the source alignment then the source pointer can be used in most TDF constructs in place of the destination pointer, so the use of ~ptr_to_ptr can be omitted (the exception is pointer_test which requires equal alignments). Base class pointer conversions are examples of these well-behaved, alignment preserving conversions.

The tokens:

~f_to_pv : ( EXP PROC ) -> EXP pv
~pv_to_f : ( EXP pv ) -> EXP PROC

are used to convert pointers to functions to and from void * (these conversions are not allowed in ISO C/C++ but are in older dialects).

The tokens:

~i_to_p : ( VARIETY v, ALIGNMENT a, EXP INTEGER v ) -> EXP POINTER a
~p_to_i : ( ALIGNMENT a, VARIETY v, EXP POINTER a ) -> EXP INTEGER v
~i_to_pv : ( VARIETY v, EXP INTEGER v ) -> EXP pv
~pv_to_i : ( VARIETY v, EXP pv ) -> EXP INTEGER v

are used to convert integers to and from void * and other pointers.
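At source level, the conversions which give rise to these tokens look like the casts below. This is a sketch only: the function names are invented, the correspondence with ~p_to_i and ~i_to_p in the comments is an assumption based on the signatures above, and std::uintptr_t is an optional type which is assumed to exist on the target.

```cpp
#include <cassert>
#include <cstdint>

// Pointer to integer: assumed to be the kind of conversion ~p_to_i represents.
std::uintptr_t pointer_to_integer(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

// Integer to pointer: assumed to be the kind of conversion ~i_to_pv represents.
void* integer_to_pointer(std::uintptr_t i) {
    return reinterpret_cast<void*>(i);
}

// Usage example: a round trip through a sufficiently wide integer type
// gives back a pointer which compares equal to the original.
bool pointer_round_trip_demo() {
    int object = 0;
    int* original = &object;
    void* recovered = integer_to_pointer(pointer_to_integer(original));
    assert(recovered != nullptr);
    return recovered == original;
}
```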

10.3. Integer division

The precise form of the integer division and remainder operations in C and C++ is left unspecified with respect to the sign of the result if either operand is negative. The tokens:

~div : ( EXP INTEGER v, EXP INTEGER v ) -> EXP INTEGER v
~rem : ( EXP INTEGER v, EXP INTEGER v ) -> EXP INTEGER v

are used to represent integer division and remainder. They will map onto one of the pairs of TDF constructs, div0 and rem0, div1 and rem1 or div2 and rem2.
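The two rounding conventions which differ for negative operands can be sketched as follows. The names DivRem, div_truncating and div_flooring are invented for this example, and which TDF pair corresponds to which convention is deliberately left open here. Note that the sketch relies on the C++11 rule that the built-in / and % truncate towards zero; the unspecified behaviour described above applies to earlier versions of the language.

```cpp
#include <cassert>

// Hypothetical result type for this illustration.
struct DivRem {
    int quot;
    int rem;
};

// Rounds the quotient towards zero; the remainder takes the sign of a.
// Assumes b != 0 and that a / b does not overflow.
DivRem div_truncating(int a, int b) {
    assert(b != 0);
    DivRem r = { a / b, a % b };
    return r;
}

// Rounds the quotient towards minus infinity; the remainder takes the sign of b.
DivRem div_flooring(int a, int b) {
    assert(b != 0);
    DivRem r = { a / b, a % b };
    if (r.rem != 0 && ((r.rem < 0) != (b < 0))) {
        r.quot -= 1;
        r.rem += b;
    }
    return r;
}
```

In both conventions quot * b + rem equals a; they agree whenever the operands have the same sign or the division is exact.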

10.4. Dynamic initialisation

The dynamic initialisation of variables with static storage duration in C++ is implemented by means of the TDF initial_value construct. However, so that the producer can maintain control over the order of initialisation, the variables are not each initialised separately using initial_value. Instead a single expression is created which initialises all the variables in a module, and this initialiser expression is used to initialise a single dummy variable using initial_value. Note that, while this enables the variables within a single module to be initialised in the order in which they are defined, the order of initialisation between different modules is unspecified.
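The idea can be imitated at source level as in the sketch below, where one dummy variable carries the initialiser for a whole module. All the names are hypothetical, and the sketch is an analogy only; the producer builds the equivalent expression directly in TDF.

```cpp
#include <cassert>
#include <string>

namespace module_sketch {

// Records the order in which the initialisers run.
std::string& init_log() {
    static std::string log;
    return log;
}

int make_value(const char* name, int value) {
    init_log() += name;
    return value;
}

// The module's variables: zero-initialised statically, assigned dynamically.
int first;
int second;
int third;

// The single initialiser expression for the module, in definition order.
int initialise_module() {
    assert(first == 0 && second == 0 && third == 0);
    first = make_value("first ", 1);
    second = make_value("second ", first + 1);
    third = make_value("third ", second + 1);
    return 0;
}

// The single dummy variable whose initialisation triggers all the others.
int module_dummy = initialise_module();

} // namespace module_sketch
```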

The implementation needs to keep a list of those variables with static storage duration which have been initialised so that it can call the destructors for these objects at the end of the program. This is done by declaring a variable of shape:

~cpp.destr.type : () -> SHAPE

for each such object with a non-trivial destructor. Each element of an array is considered a distinct object. Immediately after the variable has been initialised the token:

~cpp.destr.global : ( EXP pd, EXP POINTER c, EXP PROC ) -> EXP TOP

is called to add the variable to the list of objects to be destroyed. The first argument is the address of the dummy variable just declared, the second is the address of the object to be destroyed, and the third is the destructor to be used. In this way a list giving the objects to be destroyed, and the order in which to destroy them, is built up. Note that partially constructed objects are destroyed within their constructors (see §6.3) so that only completely constructed objects need to be considered.
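A minimal sketch of such a list is given below. The record type stands in for ~cpp.destr.type and destr_global for ~cpp.destr.global, but the layout, the names and the choice to destroy in reverse order of registration are assumptions made for the example, not a description of the default token definitions.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the shape given by ~cpp.destr.type.
struct DestrRecord {
    DestrRecord* next;
    void* object;
    void (*destructor)(void*);
};

static DestrRecord* destr_list = nullptr;

// Hypothetical stand-in for ~cpp.destr.global: links the record into the list.
void destr_global(DestrRecord* rec, void* object, void (*destructor)(void*)) {
    assert(rec != nullptr && destructor != nullptr);
    rec->object = object;
    rec->destructor = destructor;
    rec->next = destr_list;
    destr_list = rec;
}

// Walks the list, destroying the most recently registered object first.
void run_destructors() {
    while (destr_list != nullptr) {
        DestrRecord* rec = destr_list;
        destr_list = rec->next;
        rec->destructor(rec->object);
    }
}

// Usage example: two "objects" append a letter to a log when destroyed.
static std::string destruction_log;
static void destroy_a(void*) { destruction_log += 'a'; }
static void destroy_b(void*) { destruction_log += 'b'; }

std::string destruction_order_demo() {
    static DestrRecord rec_a, rec_b;
    destruction_log.clear();
    destr_global(&rec_a, nullptr, destroy_a);  // registered first
    destr_global(&rec_b, nullptr, destroy_b);  // registered second
    run_destructors();
    return destruction_log;
}
```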

The implementation also needs to ensure that it calls the destructors in this list at the end of the program, including calls of exit. This is done by calling the token:

~cpp.destr.init : () -> EXP TOP

at the start of each initial_value construct. In the default implementation this uses atexit to register a function, __TCPPLUS_term, which calls the destructors. To aid alternative implementations the token:

~cpp.start : () -> EXP TOP

is called at the start of the main function; however, this has no effect in the default implementation.
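The registration step can be sketched as follows. The function term_sketch stands in for __TCPPLUS_term and destr_init_sketch for ~cpp.destr.init; both names, and the guard which makes the registration happen only once however many modules are initialised, are assumptions made for this example.

```cpp
#include <cassert>
#include <cstdlib>

namespace {
    bool term_registered = false;
    int registration_count = 0;
}

// Hypothetical stand-in for __TCPPLUS_term; a real version would call the
// destructors recorded in the list of objects to be destroyed.
void term_sketch() {}

// Hypothetical stand-in for ~cpp.destr.init: registers term_sketch with
// atexit the first time it is called and does nothing afterwards.
void destr_init_sketch() {
    if (term_registered) {
        return;
    }
    if (std::atexit(term_sketch) == 0) {
        term_registered = true;
        ++registration_count;
    }
}

// Usage example: three modules each run the token at the start of their
// initialiser, but the termination function is registered only once.
int registrations_after_three_modules() {
    destr_init_sketch();
    destr_init_sketch();
    destr_init_sketch();
    assert(term_registered);
    return registration_count;
}
```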

10.5. The `std` namespace

Several classes declared in the std namespace arise naturally as part of the C++ language specification. These are as follows:

Type	Purpose
`std::type_info`	type of `typeid` construct
`std::bad_cast`	thrown by `dynamic_cast` construct
`std::bad_typeid`	thrown by `typeid` construct
`std::bad_alloc`	thrown by `new` construct
`std::bad_exception`	used in exception specifications

The definitions of these classes are found, when needed, by looking up the appropriate class name in the std namespace. Depending on the context, an error may be reported if the class is not found. It is possible to modify the namespace which is searched for these classes using the directive:

#pragma TenDRA++ set std namespace : scope-name

where scope-name can be an identifier giving a namespace name or ::, indicating the global namespace.

10.6. Error catalogue

This section describes the error catalogue which lies at the heart of the C++ producer's error reporting routines. The full error catalogue syntax is documented under make_err. A typical entry in the catalogue is as follows:

class_union_deriv ( CLASS_TYPE: ct )
{
    USAGE:              serious
    PROPERTIES:         ansi
    KEY (ISO)           "9.5"
    KEY (STANDARD)      "The union '"ct"' can't have base classes"
}

This defines an error, class_union_deriv, which takes a single parameter ct of type CLASS_TYPE. The severity of this error is serious; that is to say, a constraint error. The error property ansi indicates that the error arises from the ISO C++ standard, the associated ISO key indicating section 9.5. Finally the text to be printed for this error, including a reference to ct, is given. Looking up section 9.5 in the ISO C++ standard reveals the corresponding constraint in paragraph 1:

A union shall not have base classes.

Each constraint within the ISO C++ standard has a corresponding error in this way. The errors are named in a systematic fashion using the section names used in the draft standard. For example, section 9.5 is called class.union, so all the constraint errors arising from this section have names of the form class_union_*. These error names can be used in the low level directives described under tdfc2pragma, such as:

#pragma TenDRA++ error "class_union_deriv" allow

to modify the error severity. The effect of reducing the severity of a constraint error in this way is undefined.
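The naming convention amounts to a simple textual mapping from section name to error name prefix, sketched below. The function error_prefix is hypothetical and is not part of make_err or the producer.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: maps a section name from the standard, such as
// "class.union", to the prefix shared by that section's error names.
std::string error_prefix(const std::string& section_name) {
    assert(!section_name.empty());
    std::string prefix = section_name;
    for (std::string::size_type i = 0; i < prefix.size(); ++i) {
        if (prefix[i] == '.') {
            prefix[i] = '_';
        }
    }
    prefix += '_';
    return prefix;
}

// Usage example: the error named in the text carries its section's prefix.
bool deriv_error_has_section_prefix() {
    const std::string name = "class_union_deriv";
    const std::string prefix = error_prefix("class.union");
    return name.compare(0, prefix.size(), prefix) == 0;
}
```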

In addition to the obvious error severity levels, serious, warning and none, the error catalogue specifies a list of optional severity levels along with their default values. For example, the entry:

link_incompat = serious

sets up an option named link_incompat which is a constraint error by default. Errors with this severity, such as:

dcl_stc_external ( LONG_ID: id, PTR_LOC: loc )
{
    USAGE:              link_incompat
    PROPERTIES:         ansi
    KEY (ISO)           "7.1.1"
    KEY (STANDARD)      "'"id"' previously declared with external
             linkage (at "loc")"
}

are therefore constraint errors. The severity associated with link_incompat can be modified either directly, using the directive:

#pragma TenDRA++ option "link_incompat" allow

or indirectly using the directive:

#pragma TenDRA incompatible linkage allow

the effect being to modify the severity of the associated error messages.

The error catalogue is processed by a simple tool, make_err, which generates C code which is compiled into the C++ producer. Each error in the catalogue is assigned a number (there are currently 873 errors in the catalogue) which gives an index into an automatically generated table of error information. It is this error number, together with a list of error arguments, which forms the associated ERROR object. make_err generates a macro for each error in the catalogue which takes arguments of the appropriate types (which may be statically checked) and creates an ERROR object. For example, for the entry above this macro takes the form:

ERROR ERR_class_union_deriv ( CLASS_TYPE ) ;

These macros hide the error catalogue numbers from the rest of the C++ producer.

It is also possible to join a number of simple ERROR objects to form a single composite ERROR. The severity of the composite error is the maximum of the severities of the component errors. For this purpose a dummy error severity level, whatever, is introduced which is less severe than any other level. This is intended for use with error messages which are only ever used to add information to existing errors, and which inherit their severity level from the main error.
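The rule for composite severities can be sketched as below. The types and names are invented for the example, and the relative order of none, warning and serious is an assumption; the only ordering stated above is that whatever lies below every other level.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical severity levels; larger values are taken to be more severe.
enum Severity {
    SEV_WHATEVER = 0,  // dummy level, less severe than any other
    SEV_NONE,
    SEV_WARNING,
    SEV_SERIOUS
};

// Hypothetical simple error: a catalogue number and its severity.
struct SimpleError {
    int number;
    Severity severity;
};

// The severity of a composite error is the maximum over its components.
Severity composite_severity(const std::vector<SimpleError>& parts) {
    Severity result = SEV_WHATEVER;
    for (std::vector<SimpleError>::size_type i = 0; i < parts.size(); ++i) {
        result = std::max(result, parts[i].severity);
    }
    return result;
}

// Usage example: an informational message joined to a main error
// inherits the severity of the main error.
Severity joined_severity_demo() {
    std::vector<SimpleError> parts;
    const SimpleError main_error = { 1, SEV_SERIOUS };
    const SimpleError extra_info = { 2, SEV_WHATEVER };
    parts.push_back(main_error);
    parts.push_back(extra_info);
    assert(parts.size() == 2);
    return composite_severity(parts);
}
```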

The text of a simple error message can be found in the table of error information. The text contains certain escape sequences indicating where the error arguments are to be printed. For example, %1 indicates the second argument. The error argument sorts, referred to as the error signature, are also stored in the table of error information as an array of characters, each corresponding to an ERR_KEY_type macro. The producer defines printing routines for each of the types given by these values, and calls the appropriate routine to print the argument.
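The substitution of arguments into the message text can be sketched as follows. The function expand_message is hypothetical: it assumes single-digit, zero-based escapes of the form %0, %1 and so on, and it takes the arguments as ready-made strings rather than dispatching on the error signature as the producer does.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: replaces each escape %N in text by argument number N,
// counting from zero. Other characters are copied unchanged.
std::string expand_message(const std::string& text,
                           const std::vector<std::string>& args) {
    std::string out;
    for (std::string::size_type i = 0; i < text.size(); ++i) {
        const bool is_escape = text[i] == '%' && i + 1 < text.size() &&
            std::isdigit(static_cast<unsigned char>(text[i + 1])) != 0;
        if (is_escape) {
            const std::size_t n = static_cast<std::size_t>(text[i + 1] - '0');
            assert(n < args.size());
            out += args[n];
            ++i;  // skip the digit
        } else {
            out += text[i];
        }
    }
    return out;
}

// Usage example based on the class_union_deriv message.
std::string union_message_demo() {
    std::vector<std::string> args;
    args.push_back("U");
    return expand_message("The union '%0' can't have base classes", args);
}
```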

There are several command-line options which can be used to modify the form in which the error message is printed. The default format is as follows:

"file.C", line 42: Error: [ISO 9.5]: The union 'U' can't have base classes.

The ISO section number can be suppressed using -m-s. The -mc option causes the source code line giving rise to the error to be printed as part of the message, with !!!! marking the position of the error within the line. The -me option causes the error name, class_union_deriv, to be printed as part of the message. The -ml option causes the full file location, including the list of #include directives used in reaching the file, to be printed. The -mt option causes typedef names to be used when printing types, rather than expanding to the type definition.