Structural Organisation

2. Structural Organisation

This section describes the basic organisation of the source code for the C++ producer. This includes the division of the code into separate modules and the type system conventions.

2.1. Source code modules

For convenience, the source code is divided between a number of directories:

base/: The base directory contains only the module containing the main function, the basic type descriptions and the Makefile.
obj_c/
obj_tok/
obj_templ/: The directories obj_c and obj_tok contain respectively the C and #pragma token headers generated from the type algebra by calculus. The directory obj_templ contains certain calculus template files.
utility/: Routines for such utility operations as memory allocation and error reporting, including the error catalogue.
parse/: Routines concerned with parsing and preprocessing the input, including the sid grammar.
construct/: Routines for building up and analysing the internal representation of the parsed code.
output/: Routines for outputting the internal representation in various formats including as a TDF capsule (see TDF Specification), a C++ spec file (see C/C++ Producer Implementation), or a tdfc2dump symbol table dump file.

Each module consists of a C source file, file.c say, containing function definitions, and a corresponding header file file.h containing the declarations of these functions. The header is included within its corresponding source file to check these declarations; it is protected against multiple inclusions by a macro of the form file_INCLUDED. The header contains a brief comment describing the purpose of the module; each function in the source file contains a comment describing its purpose, its inputs and its output.

The following table lists all the source modules in the C++ producer with a brief description of the purpose of each:

Module	Directory	Purpose
access	construct	member access control
allocate	construct	`new` and `delete` expressions
assign	construct	assignment expressions
basetype	construct	basic type operations
buffer	utility	buffer reading and writing routines
c_class	obj_c	`calculus` support routines
capsule	output	top-level TDF encoding routines
cast	construct	cast expressions
catalog	utility	error catalogue definition
char	parse	character sets
check	construct	expression checking
chktype	construct	type checking
class	construct	class and enumeration definitions
compile	output	TDF tag definition encoding routines
constant	parse	integer constant evaluation
construct	construct	constructors and destructors
convert	construct	standard type conversions
copy	construct	expression copying
debug	utility	development aids
declare	construct	variable and function declarations
decode	output	bitstream reading routines
derive	construct	base class graphs; inherited members
destroy	construct	garbage collection routines
diag	output	TDF diagnostic output routines
dump	output	symbol table dump routines
encode	output	bitstream writing routines
error	utility	error output routines
exception	construct	exception handling
exp	output	TDF expression encoding routines
expression	construct	expression processing
file	parse	low-level I/O routines
function	construct	function definitions and calls
hash	parse	hash table and identifier name routines
identifier	construct	identifier expressions
init	output	TDF initialiser expression encoding routines
initialise	construct	variable initialisers
instance	construct	template instances and specialisations
inttype	construct	integer and floating point type routines
label	construct	labels and jumps
lex	parse	lexical analysis
literal	parse	integer and string literals
load	output	C++ spec reading routines
macro	parse	macro expansion
main	–	main routine; command-line arguments
mangle	output	identifier name mangling
member	construct	member selector expressions
merge	construct	intermodule merge routines
namespace	construct	namespaces; name look-up
operator	construct	overloaded operators
option	utility	compiler options
overload	construct	overload resolution
parse	parse	low-level parser routines
pragma	parse	`#pragma` directives
predict	parse	parser look-ahead routines
preproc	parse	preprocessing directives
print	utility	error argument printing routines
quality	construct	extra expression checks
redeclare	construct	variable and function redeclarations
rewrite	construct	inline member function definitions
save	output	C++ spec writing routines
shape	output	TDF shape encoding routines
statement	construct	statement processing
stmt	output	TDF statement encoding routines
struct	output	TDF structure encoding routines
syntax[0-9]*	parse	`sid` parser output
system	utility	system dependent routines
table	parse	portability table reading
template	construct	template declarations and checks
throw	output	TDF exception handling encoding routines
tok	output	TDF standard tokens encoding
tokdef	construct	token definitions
token	construct	token declarations and expansion
typeid	construct	run-time type information
unmangle	output	identifier name unmangling
variable	construct	variable analysis
virtual	construct	virtual functions
xalloc	utility	memory allocation routines

2.2. Type system

This section describes the type system used in the C++ producer. Unless otherwise stated the types are declared using the calculus tool as part of the algebra, c_class.alg. The design of this type algebra was clearly largely based on the concepts underlying the C++ language; however TDF provided an important influence, not merely as the intended target language, but also because of its clear presentation of essential language features.

2.2.1. Primitive types

The primitive types used within the algebra c_class are defined as follows:

int = "int" ;
unsigned = "unsigned" ;
string = "character *" ;
ulong_type (ulong) = "unsigned long" ;
BITSTREAM_P (bits) = "BITSTREAM *" ;
PPTOKEN_P (pptok) = "PPTOKEN *" ;

The integral types are self-explanatory. All string literals used in the C++ producer are based on the character type:

typedef unsigned char character ;

hence the definition of string. The remaining primitive give links to those portions of the type system which are defined outside of the algebra. The types BITSTREAM and PPTOKEN are described below.

2.2.2. `CV_SPEC`

The enumeration type CV_SPEC (short name cv) is used to represent a C++ type qualifier. It takes the form of a bitfield, the elements of which can be or-ed together to represent combinations of type qualifiers. The cv-qualifiers are represented by cv_const and cv_volatile in the obvious manner. The value cv_lvalue is used as a qualifier to indicate whether a type is an lvalue or an rvalue. Other values are used in function types to represent the function language linkage.

2.2.3. `BUILTIN_TYPE`

The enumeration type BUILTIN_TYPE (ntype) is used to represent the built-in C++ types (char, float, void etc.). It is used chiefly as an index into tables of type information.

2.2.4. `BASE_TYPE`

The enumeration type BASE_TYPE (btype) is used to represent a C++ simple type specifier such as signed, short or int. It takes the form of a bitfield, the elements of which can be or-ed together to represent combinations of type specifiers. Its chief use is when reading a type from the input file; the various simple type specifiers are combined to give a value of this type, which is then mapped to an actual C++ type.

2.2.5. `INT_TYPE`

The union type INT_TYPE (itype) is used to represent an integral or bitfield C++ type. The basic integral types are given by the basic field. Bitfield types are represented by the bitfield field. There are also fields representing target dependent integral promotion, arithmetic and integer literal types, plus VARIETY tokens. Only one INT_TYPE object is created for each integral type.

2.2.6. `FLOAT_TYPE`

The union type FLOAT_TYPE (ftype) is used to represent a floating point C++ type. The basic floating point types are given by the basic field. There are also fields representing target dependent argument promotion and arithmetic types, plus FLOAT tokens. Only one FLOAT_TYPE object is created for each floating point type.

2.2.7. `CLASS_INFO`

The enumeration type CLASS_INFO (cinfo) is used to represent information relating to a class or enumeration definition. It takes the form of a bitfield, the elements of which can be or-ed together to represent various combinations of properties.

2.2.8. `CLASS_USAGE`

The enumeration type CLASS_USAGE (cusage) is used to represent information relating to the way a class is used. It takes the form of a bitfield, the elements of which can be or-ed together to represent various combinations of properties.

2.2.9. `CLASS_TYPE`

The union type CLASS_TYPE (ctype) is used to represent a C++ class or union. The main components are an identifier giving the class name, class information and class usage fields, a namespace giving the class members, a graph representing the base class structure, and a virtual function table. Only one CLASS_TYPE object is created for each class or union.

Each class maintains a list, pals, of class and function identifiers which are declared as friends of that class. It also maintains a list, chums, of those class types which declare it to be a friend (this is what is actually used in the access checks). Similarly each function identifier maintains a list, chums, of those class types which declare it to be a friend.

Each class maintains a list of its constructors, destructors and conversion functions (included inherited conversion functions). It also maintains a list of its virtual base classes. This information can be obtained by other means but it is more convenient to record it within the class type itself.

2.2.10. `GRAPH`

The union type GRAPH (graph) is used to represent a directed acyclic graph arising from the base classes of a class. Each node of the graph has a head which is a class type, and several tails which give the base class graphs for that class. Each node has pointers, top, to the top of the graph (i.e. the most derived class), and up, to the node of which the current node is a direct base. Each node also has an access field which gives information on the base access, whether it is virtual or not, and so on, in the form of a DECL_SPEC. Virtual bases are handled by the equal field which defines an equivalence relation on the graph which identifies equivalent virtual bases.

2.2.11. `VIRTUAL`

The union type VIRTUAL (virt) is used to represent the virtual functions declared in a class. The table field is used to represent a virtual function table, and consists primarily of a list of VIRTUAL objects giving the virtual functions for the associated class. These virtual functions are of four kinds, each represented by a union field. A virtual function first declared in a class is represented by the simple field; a virtual function in a class which overrides an inherited virtual function is represented by the override field; an inherited, non-overridden virtual function which is not overridden in a base class is represented by the inherit field; a inherited, non-overridden virtual function which is overridden in some base class is represented by the complex field.

2.2.12. `ENUM_TYPE`

The union type ENUM_TYPE (etype) is used to represent a C++ enumeration type. This consists primarily of an identifier giving the enumeration name, a class information field, a type giving the underlying representation of the enumeration type, and a list of identifiers giving the enumerators comprising the enumeration.

2.2.13. `TYPE`

The union type TYPE (type) is used to represent a C++ type. Every type has an associated type qualifier, qual, which determines whether the type is const, volatile or an lvalue. A type may also have an associated identifier, name, giving the corresponding type name (the null identifier being used for unnamed types). The other type components are determined by the union tag. Each of the type constructs above has a corresponding field in the TYPE union: integer for integral types, floating for floating point types, bitfield for bitfield types, compound for class or union types, and enumerate for enumeration types. There are also fields top and bottom corresponding to void and bottom (the type used to represent values which never return).

Other fields of the TYPE union represent composite types; for example, the array field, representing array types, comprises a base type, sub, and an integer constant giving the array bound, size. These are generally simple, apart from func, representing a function type. This has the obvious components: a return type, ret, a list of parameter types, ptypes, and a flag indicating ellipsis functions, ellipsis. It also has an associated namespace, pars, in which the function parameters are declared. The parameter identifiers are extracted from this as a list, pids. Member function qualifiers and language linkage information are represented by a CV_QUAL, mqual. The implicit extra parameter for member functions is recorded in the list mtypes, which adds this extra type to the start of ptypes. Finally except gives any exception specifiers; the case where the exception specifier is absent being represented by the special value, univ_type_set.

2.2.14. `DECL_SPEC`

The enumeration type DECL_SPEC (dspec) is used to represent information on the declaration and usage of an identifier. It takes the form of a bitfield, the elements of which can be or-ed together to represent various combinations of properties. The 32 bits in this bitfield (the maximum which can be represented portably) are a significant restriction. This means that the same member of DECL_SPEC is often used to mean different things in different contexts. This can prove confusing on occasions.

2.2.15. `HASHID`

The union type HASHID (hashid) is used to represent a C++ identifier name. The simplest form of identifier name, name, consists of just a string of characters, such as foo. Extended identifier names, ename, are similar, but may contain Unicode characters. There are however other forms of identifier name in C++: conversion function names (conv ) such as operator int, overloaded operator names (op) such as operator+, constructor names (constr), and destructor names (destr). There are also names which are used for anonymous identifiers (anon).

Note the distinction between an identifier name and an actual identifier. The latter is a meaning associated with a name in a particular context. Every identifier name has an associated underlying meaning, id. This is used to handle keywords and macros, but for most identifier names this will be a dummy identifier. Nested underlying meanings (such as a macro hiding a keyword) are handled by linking the alias fields of the corresponding identifiers. Every identifier name also has a cache field which is used to record the look-up of this name as an unqualified identifier. This may be set to the null identifier to indicate that the look-up needs to be re-evaluated.

Identifier names are stored in one of a small number of hash tables, linked using their next field. Each name has only one entry in these tables, allowing equality of names to be implemented as EQ_hashid.

2.2.16. `QUALIFIER`

The enumeration type QUALIFIER (qual) is used to represent the various ways in which an identifier name can be qualified. For example, ::A::a is represented by qual_full. The value qual_mark is used in the representation of function identifier expressions to indicate that overload resolution has been performed.

2.2.17. `IDENTIFIER`

The union type IDENTIFIER (id) is used to represent the various kinds of C++ identifiers. Every identifier has an associated identifier name, a parent namespace, a declaration information field, and a location for its declaration or definition. Each identifier also has an alias field which is normally used to represent the aliasing which can occur in inheritance or using declarations.

The various fields of the IDENTIFIER union correspond to the various kinds of identifier which can arise in C++ - class names, functions, variables, class members, macros, keywords etc. Each field has appropriate components giving its type, its definition or whatever other information is required. For example, the variable field has a type and two expressions, giving the constructor and destructor values for the object.

Most of these identifier components are self-explanatory, however the treatment of overloaded functions bears discussion. The various fields representing functions have an over component which is used to link overloaded functions together. A set of overloaded functions is treated as if it were a single IDENTIFIER - the first in the list - for the purposes of storing in a namespace member; the other overloaded meanings are accessed by chasing down the over components. In other situations, whether a function identifier represents a single function or a set of overloaded functions can be worked out from the context. For example, in identifier expressions the identifier qualifier is used to mark whether overload resolution has taken place.

2.2.18. `MEMBER`

The union type MEMBER (member) is used to represent a member of a namespace. Each member contains two identifiers, id and alt. The id field gives the meaning associated with a particular name in this namespace; the alt field is used to represent a type name which may be hidden by a non-type name.

There are two kinds of member, small and large, corresponding to whether the namespace holds its members in a simple linked list or in a hash table.

2.2.19. `NAMESPACE`

The union type NAMESPACE (nspace) is used to represent the set of identifiers declared in a particular scope. For example, the members declared in a C++ class or namespace, the parameters declared in a function declarator and the local variables declared in a block all form scopes. The various kinds of scope are distinguished as different fields of the union, but there are basically two categories. The first, such as function blocks, which have relatively small numbers of elements, store their members as a simple linked lists. The second, such as classes, which have larger numbers of elements, store their members in hash tables. In both cases the elements are stored using the MEMBER type.

The key operation on a namespace is to look up a particular identifier name in its linked list or hash table of members to find the meaning, if any, associated with that name in the namespace. This can be a complex operation because of the need to take base classes and using directives (as stored in the use component) into account.

2.2.20. `NAT`

The union type NAT (nat) is used to represent an integer constant expression. Values are represented as lists of 16 bit 'digits'. Values which fit into a single digit are represented by the small field; larger values by the large field. Negated values can be represented by the neg field. Folding of integer constant expressions is performed in the producer, however the result can only be represented as described above if its value is target independent. Target dependent values are represented by the calc field which contains an expression describing how to calculate the value. The token field is used to represent NAT tokens.

Objects representing small integer constants are created at the start of the program and stored in a table for ease of access. Larger constants are created as and when they are required.

2.2.21. `FLOAT`

The union type FLOAT (flt) is used to represent a floating point constant expression. There is only one field, simple , which corresponds to a floating point literal. No folding of floating point constant expressions is attempted in the producer (it is virtually impossible to do so in a target independent manner).

Objects representing useful floating point constants (0.0, 1.0 etc.) are created for each floating point type and stored as part of the corresponding FLOAT_TYPE. Other values are created as and when they are required.

2.2.22. `STRING`

The union type STRING (str) is used to represent a string constant expression. There is only one field, simple, which corresponds to a character string literal, however the kind field can be used to modify the interpretation put on the characters appearing in the text field. By default, each character in text corresponds to a single character in the literal; however an alternative representation, in which text consists of a sequence of multibyte characters - one control character plus four value characters - is used in more complex cases.

All strings are stored in a hash table intended to ensure that the same STRING object is used for equal string literals. This not only saves space during the processing of the input file, but also facilitates the output of shared string literals in the TDF capsule.

Note that the terminal zero character does not form part of the STRING object. Instead information on this is stored as part of the type of a string literal expression. The text of the string literal is either truncated or padded with zeros until its length matches the size of the array bound in the type of the corresponding literal expression.

2.2.23. `NTEST`

The enumeration type NTEST (ntest) is used to represent the various C++ relational operators (==, !=, > etc.). The values correspond to the encoding of the TDF NTEST sort, which facilitates code generation. The values also have the property that the values for complementary operators (such as < and >=) always add up to the same value, ntest_negate, allowing operators to be complemented in a straightforward manner.

2.2.24. `RMODE`

The enumeration type RMODE (rmode) is used to represent the various C++ rounding modes (towards zero, towards smaller etc.). The values correspond to the encoding of the TDF RMODE sort, which facilitates code generation.

2.2.25. `EXP`

The union type EXP (exp) is used to represent a C++ expression or statement. Each expression has an associated type, type, but most of the information about an expression is stored in one of the large number of fields of the EXP union. Most of these fields are fairly simple. For example, there are fields corresponding to integer literals, floating point literals, string literals and identifiers. Composite expressions are formed in the normal way; for example, there are various binary operators comprising two argument expressions. The EXP fields corresponding to statements are slightly more complex. They each have a parent field which points to the enclosing statement. A couple of cases bear additional discussion.

The sequence field represents a compound statement or block. This contains a namespace, in which any local variables are declared, and a list of expressions, giving the statements comprising the block. The null namespace is used if the block does not constitute a scope. The first statement in the list is always a dummy to enable first and last pointers to be maintained to the start and end of the list without having to worry about null lists.

The solve_stmt field corresponds to the TDF labelled construct (in early versions of TDF this construct was called solve, hence the terminology). The problem is that C and C++ labels and gotos are totally unstructured, whereas the TDF label constructs are structured. Any statement which contains unstructured labels is enclosed in a solve_stmt construct, enclosing both the labelled statement and all jumps to it (in general this cannot be done until the end of the function). Any labels or variables which are bypassed by such unstructured jumps also need to be pulled out to the solve_stmt construct. It is not just explicit labels which can cause such problems; complex switch statements have the same effect.

2.2.26. `OFFSET`

The union type OFFSET (off) is used to represent an offset expression. This is used as an adjunct to the normal expression representation. The OFFSET union has fields corresponding to a type offset (used in pointer arithmetic), the offset of a member of a class and the offset of a base class. There are also simple operations on offsets, such as multiplication by an expression.

2.2.27. `TOKEN`

The union type TOKEN (tok) is used to represent one of a number of different categories within the C++ language. It corresponds to the sort of a token declared using the tdfc2pragma #pragma token syntax. Thus there are fields corresponding to expression, statement, integer constant, type, function, member and procedure tokens. The similarities between PROC tokens and templates have been remarked above; for example, the parameters of the template:

template < class T, int n > class A {
    T a [n] ;
    // ....
} ;

are essentially equivalent to those in the procedure token:

PROC ( TYPE T, EXP const : int : n ) ....

(recall that non-type template arguments are always constant expressions). Thus a field, templ, of the TOKEN union is used to represent lists of template parameters. Note that a further field, class, is also required to represent template template parameters. A template type is represented by a field, templ, of the union TYPE, which comprises a template sort and a sub-type expressed in terms of the template parameters.

In addition to representing token and template sorts in this way, the TOKEN union is used to represent token and template arguments. Each of the parameter sorts listed above has an appropriate value component which can store a value of that sort. Many of the union types in the algebra, including types and expressions, have a field of the form:

token -> {
    IDENTIFIER tok ;
    LIST TOKEN args ;
}

representing the given token identifier applied to the given list of arguments.

Template instances are represented slightly differently from token applications. Each instance of a template class or a template function gives rise to a new class or function identifier. This identifier has an underlying form giving the template identifier and the template arguments. This is expressed as a token member of the TYPE union (although it is not technically a type, this happens to be the most convenient representation). Each such form has an associated INSTANCE component which gives further information about the template instance. The form for a template function instance is stored in the form component of the corresponding identifier. The form for a template class instance is stored in the form component of the corresponding class type.

Members of instances of template classes also have a form type, but in this case the form is an instance type. This gives a link back to the corresponding member of the template class.

2.2.28. `INSTANCE`

The union type INSTANCE (inst) is used to represent a particular instance of a template or token. Each template sort has an associated list of all the instances of that template, which is used to ensure that the same template applied with the same arguments always has the same value. Information on partial or explicit specialisations and usage information are stored as part of the corresponding INSTANCE. Each template instance identifier has a link back to its corresponding INSTANCE via its form component.

2.2.29. `ERROR`

The union type ERROR (err) is used to represent an error arising during the compilation of a C++ program. Errors are first class objects within the producer and can be passed to and from procedures. Each error has an associated severity (serious, warning, none etc.). Simple errors are represented by the simple field, which consists of an index, number, into the error catalogue, plus a variable length list of error arguments. Errors can be combined into composite errors using the compound field, which represents the join of two errors - head followed by tail.

The chief operation on an error after it has been built up is to report it. Each error report consists of an error object and a file location indicating where the error occurred.

2.2.30. `VARIABLE`

The structure type VARIABLE (var) is used to represent a variable state and is used in the variable analysis checks.

2.2.31. `LOCATION`

The structure type LOCATION (loc) is used to represent a location in an input file. It comprises a pointer to an input file position, posn, modified by a line number, taking #line directives into account, line. Note that character positions within the line are not currently recorded.

2.2.32. `POSITION`

The structure type POSITION (posn) is used to represent a position in an input file. It consists of two file names, file taking #line directives into account, and input giving the actual file name, plus a line number offset, offset, which gives the difference between the line number taking #line directives into account and the actual line number. Other information stored includes the datestamp on the input file, datestamp, and a pointer to a file location which, for files included using #include, gives the location the file was included from.

2.2.33. `BITSTREAM`

The structure BITSTREAM is not part of the calculus type system. It is used to represent a sequence of bits such as is used, for example, in the encoding of TDF.

2.2.34. `BUFFER`

The structure BUFFER is not part of the calculus type system. It is used to represent a sequence of characters.

2.2.35. `OPTIONS`

The structure OPTIONS is not part of the calculus type system. It is used to represent the state of the tdfc2pragma compiler options at a particular point in the input file.

2.2.36. `PPTOKEN`

The structure PPTOKEN is not part of the calculus type system. It is used to represent a linked list of preprocessing tokens. Each token has an associated sid lexical token number, tok, plus additional data dependent on the token type. Each token also records a pointer to the current OPTIONS value.

2. Structural Organisation

2.1. Source code modules

2.2. Type system

2.2.1. Primitive types

2.2.2. CV_SPEC

2.2.3. BUILTIN_TYPE

2.2.4. BASE_TYPE

2.2.5. INT_TYPE

2.2.6. FLOAT_TYPE

2.2.7. CLASS_INFO

2.2.8. CLASS_USAGE

2.2.9. CLASS_TYPE

2.2.10. GRAPH

2.2.11. VIRTUAL

2.2.12. ENUM_TYPE

2.2.13. TYPE

2.2.14. DECL_SPEC

2.2.15. HASHID

2.2.16. QUALIFIER

2.2.17. IDENTIFIER

2.2.18. MEMBER

2.2.19. NAMESPACE

2.2.20. NAT

2.2.21. FLOAT

2.2.22. STRING

2.2.23. NTEST

2.2.24. RMODE

2.2.25. EXP

2.2.26. OFFSET

2.2.27. TOKEN

2.2.28. INSTANCE

2.2.29. ERROR

2.2.30. VARIABLE

2.2.31. LOCATION

2.2.32. POSITION

2.2.33. BITSTREAM

2.2.34. BUFFER

2.2.35. OPTIONS

2.2.36. PPTOKEN