Inlining languages for TenDRA

1. Introduction
2. Scheme proposed
3. Implementation

First published 2007.

Abstract

A design is proposed which provides the ability to inline foreign languages at specific points in the grammar of a language being compiled. A mechanism for languages (both inlined and otherwise) to provide machine-specific features is added to TDF. Together, these provide a system capable of expressing inlined assembly. An overview of implementation and a schedule are given.

1. Introduction

This proposal outlines a scheme by which we may work towards providing support for inlined assembly. This scheme consists of several independent features which can be used together to provide this support. Within each feature, there are unresolved choices between various options; the purpose of this proposal is not to decide these minutiae, but rather to lay out the scheme itself, which keeps the effects of these unresolved issues isolated from one another.

The syntax of such an extension is not discussed in this proposal. Rather, the focus here is on the underlying architecture, and the division of information between various components of the system.

Purity control: this proposal discusses the implementation of a TenDRA-based system in general, not tcc specifically. It is beyond the scope of this proposal to lay out where non-portable features should go, or whether they might be cause to create a separate interface in conjunction with tcc. Such decisions depend on the direction of the project as a whole; this proposal deals only with the mechanism for the features under discussion, rather than proposing where those features are to be implemented.

This proposal does not discuss the syntax for assemblies supported; one of its aims is to provide a syntax-independent architecture. It is this architecture which is discussed.

The proposal acts as a placeholder representing the various ways in which assembly may be provided, illustrating one specific scheme. This permits discussion of other issues to remain on-topic, without becoming distracted by otherwise unresolved points regarding possible mechanisms for embedding languages.

1.1. The current situation

Historically, programs did not inline assembly. Routines written in assembly were kept in separate source files, provided with the support code required so that those routines may be called and return data, and linked into a program as any other object. The main use-case for moving to inlined assembly appears to have been the convenience gained by eliminating the need for that supporting code by making use of the C compiler to set up and return from functions. The author may then concentrate on the (usually relatively few) lines of code in assembly which do the real work, and leave such menial tasks to the compiler.

Presumably as a result of this convenience, much code now exists which uses inlined assembly - particularly operating system kernels. TenDRA currently provides no support for the extensions which other compilers (most notably gcc) have implemented and popularised.

The ISO C standard suggests that implementations may provide inlined languages as a common extension. See sections J.5.9 The FORTRAN keyword and J.5.10 The asm keyword for details.

1.2. Feature requirements

This is a highly sought-after feature given interest in TenDRA as a compiler for operating systems. It's one of the famous features which appear on comparison lists when products are summarised; people expect it to be present. This document, however, does not serve to rationalise an argument for providing this support - it merely proposes an implementation should the support be required.

Implementation in the straightforward sense (of embedding a string of assembly into the IR) looks trivial at first glance, but on closer inspection turns out to be wholly unsatisfying. Some points to resolve are:

Which assembly syntax to use
How to relate platform support with TDF. Given the intention of portability, this is a large area to address.
Syntax analysis occurring during installation - whilst this is not an issue for compilers whose IR is non-portable, for TenDRA this would allow syntax errors to pass unnoticed until a user downloads and installs distributed TDF. In practice, testing by a developer would avoid this.
Minimising extensions to TDF. Adding new tokens should be avoided where possible; constantly extending TDF with sets of largely similar tokens for each architecture directly violates the original design goal of providing a language-neutral IR.
Maintainability both in terms of complexity, and scalability of feature additions relative to the number of developers involved with the project.

This proposal attempts to address the above points.

1.3. Use cases

The use-cases for inlining assembly are not particularly surprising. There appear to be two main groups:

Technical cases

These are the usual use-cases for inlining assembly for any compiler, and will not be repeated here.

The general motivation seems to be that assembly language is required for accessing machine-specific operations, for example to read a specific CPU register, to call a specific opcode for atomicity, or for syscalls. For such small tasks, it seems cumbersome to go through a function call sequence just for that single operation.

Marketing

Compilers are seen to be either system compilers (that is, supplied with an operating system, and used to compile that operating system), or as application compilers (that is, used to compile applications). In some historical cases (notably HP-UX), these have been separate products, often confusing and irritating users with incompatibilities. Hence it is desirable that the same compiler may be used for both.

While TenDRA was originally intended to be used for both system and application code (indeed two third-party usability analyses were undertaken), it was intended to do so for C only.

Given the technical motivations of convenience outlined above, the ability to inline assembly language is one of the checkboxes people look for in feature support.

2. Scheme proposed

This proposal first generalises the problem, and then goes on to suggest a specific case for providing support for assembly languages. This scheme is broken down into two stages: §2.2, which is not assembly-specific, and §2.3.

Before addressing these stages, a quick recap of TDF's approach as an IR: TDF is designed to express the semantics of the concepts provided by multiple languages, rather than expressing a superset of processor functionality. This is an important distinction, and runs contrary to the design of most other IRs. gives a deeper explanation.

Language-specific entities in TDF are categorised under two groups: LPI (Language Programming Interface) and API (Application Programming Interface). Introducing new LPIs is relatively rare, and is intended to provide a means for a specific producer to output higher-level code closer to its own concepts.

TDF APIs cover most of the requirements needed for language-specific features (for example, setjmp() in C). It is important to note that these APIs do not necessarily correlate to function calls within the language (although setjmp() happens to).

2.1. Token organisation

Language-specific concepts are grouped within TDF under their own API sections (for example, setjmp for C). Concepts used by many languages are available to all as generic tokens. gives a listing of these language-specific tokens. Tokens serve as parametric substitutions; these are strict functional applications. See for a detailed explanation of the purposes of tokens.

For the purposes of inlined languages, no tokens of the following classes are proposed to be added: Target dependency tokens (where all installers define all tokens), Basic mapping tokens (likewise), and TDF Interface tokens (which are support routines private to other TDF tokens). See for details of these classes.

Tokens are added as both a new LPI and a new API for inlined assembly. The details of these are discussed in §2.3. Portable inlined languages need neither.

2.2. Embedding a language

Within the region of embedded language (whatever the mechanism for introducing that region may be; here __lang() is given as an example), the compiler switches to a parser which can parse that particular language. Hence several such parsers may be provided, each providing an alternate language, or dialect of that language.

__lang("logo") {
	REPEAT 4 [ FORWARD 10 RIGHT 90 ]
}

That language's parser outputs its own tree of TDF, which is grafted into the host language's tree at the point where __lang() was encountered.

The parser for the embedded language generates output of that language as a tree of TDF tokens, which is of equal status to the TDF tokens output from the main parser. Hence from the perspective of the TDF generated, there needn't be any knowledge that alternate languages were used, other than by inspecting the contents of those tokens. If that embedded language happened to be FORTRAN, it would contain FORTRAN-specific TDF tokens. If it were C, it would contain C-specific TDF tokens, and so on.
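
The grafting step described above can be sketched roughly as follows. This is a minimal illustration in Python rather than TenDRA code: the Node class, the PARSERS registry and the toy Logo parser are all hypothetical names invented for the example, and the string tags merely stand in for real TDF constructs.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A stand-in for a node in a tree of TDF (hypothetical, not real TDF)."""
    tag: str
    children: list = field(default_factory=list)


def parse_logo(source):
    """Toy parser for a single Logo form: REPEAT n [ CMD arg CMD arg ... ]."""
    words = source.replace("[", " [ ").replace("]", " ] ").split()
    if len(words) < 4 or words[0] != "REPEAT" or words[2] != "[" or words[-1] != "]":
        raise SyntaxError("only REPEAT n [ ... ] is handled in this sketch")
    body = words[3:-1]
    if len(body) % 2 != 0:
        raise SyntaxError("each command needs exactly one argument")
    # Each command is a (name, argument) pair in this tiny subset.
    commands = [Node(f"{name.lower()}:{arg}")
                for name, arg in zip(body[0::2], body[1::2])]
    return Node(f"repeat:{words[1]}", commands)


# One parser per embedded language or dialect, keyed by the name given to __lang().
PARSERS = {"logo": parse_logo}


def graft(host, index, language, source):
    """Replace the placeholder child at `index` with the embedded language's tree."""
    if language not in PARSERS:
        raise LookupError(f"no parser registered for {language!r}")
    host.children[index] = PARSERS[language](source)
    return host


# The host tree has a placeholder at the point where __lang() was encountered.
host = Node("function_body",
            [Node("c:declaration"), Node("lang_placeholder"), Node("c:return")])
graft(host, 1, "logo", "REPEAT 4 [ FORWARD 10 RIGHT 90 ]")
```

After grafting, nothing in the host tree records that a different syntax was involved; the embedded subtree is distinguishable only by the tags it happens to contain.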

After this tree grafting, the syntax for the embedded language is now gone; the content is expressed as TDF (even if it uses some language-specific tokens). Therefore installers will not be aware of the original syntax; this is discussed in §2.3. Note also that this permits the same TDF to be expressed by multiple syntaxes - hence for assembly, we could provide both AT&T and Intel syntax, using the same installers for both.

Externally-executed programs may be used to perform parsing of embedded regions, as long as they output TDF in some form.

Regarding the mechanism for entering embedded regions, __lang() could be enabled by a #pragma of some sort, or as a compiler option. Alternative suggestions include delimiting regions along the lines of:

#pragma TenDRA embed lisp
	...
#pragma TenDRA embed end
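
As a rough sketch of how such delimited regions might be separated from the host source before being handed to their parsers, consider the following. It assumes the pragma spelling shown above; the function name split_regions and the chunk representation are hypothetical, and one consequence of this spelling is that a language could not itself be named "end".

```python
import re

# Hypothetical delimiters, following the '#pragma TenDRA embed' example.
BEGIN = re.compile(r"^\s*#pragma\s+TenDRA\s+embed\s+(\w+)\s*$")
END = re.compile(r"^\s*#pragma\s+TenDRA\s+embed\s+end\s*$")


def split_regions(source):
    """Split source text into ('host', text) and (language, text) chunks."""
    chunks = []
    language = "host"
    lines = []
    for line in source.splitlines():
        begin = BEGIN.match(line)
        # END is tested first, because 'end' would also match BEGIN's pattern.
        if END.match(line):
            if language == "host":
                raise SyntaxError("'embed end' without a matching 'embed'")
            chunks.append((language, "\n".join(lines)))
            language, lines = "host", []
        elif begin:
            if language != "host":
                raise SyntaxError("nested embedded regions are not handled here")
            if lines:
                chunks.append(("host", "\n".join(lines)))
            language, lines = begin.group(1), []
        else:
            lines.append(line)
    if language != "host":
        raise SyntaxError(f"unterminated embedded region for {language!r}")
    if lines:
        chunks.append(("host", "\n".join(lines)))
    return chunks


example = """int f(void) {
#pragma TenDRA embed lisp
(+ 1 2)
#pragma TenDRA embed end
}"""
chunks = split_regions(example)
```

Each non-host chunk would then be passed to the parser for the named language, and the host chunks to the usual parser.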

Other than the syntax used for defining regions of embedded languages, this scheme is equally applicable to host languages other than C.

2.3. Code generation

The problem of code generation now becomes: we have a tree of multiple-language TDF, so how do we output code for a specific machine? Obviously for portable languages this is no particular problem. Hence the remainder of this proposal focuses on extending the TDF token register with assembly-specific tokens, rather than embedding a string of assembly as a single token. The distinction is critical, since it implies that embedded languages have already been parsed away at an earlier stage.

The proposal for extending TDF with an assembly language token section (TDF currently has several existing language-specific sections) involves expressing these as a superset of assembly machines, rather than as several distinct machines. For example, this would provide generic "write to a register" operations. When this comes to be translated into machine-specific assembly, it becomes the job of Trans to assert that the requested register actually exists. In this manner, TDF does not need to keep being extended with each new machine supported.
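
To illustrate the division of responsibility, here is a minimal sketch in which a single generic token is shared by two installers, each of which asserts that the register exists on its own machine. The WriteRegister token, the Trans classes, the (deliberately partial) register sets and the rendered strings are all assumptions made for the example, not part of TDF or of any existing installer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteRegister:
    """Hypothetical generic token: write an immediate value to a named register."""
    register: str
    value: int


class Trans:
    """Stand-in for one installer; `registers` is the set its machine provides."""
    registers = frozenset()

    def emit(self, token):
        # The token itself is machine-neutral; validity is decided here.
        if token.register not in self.registers:
            raise ValueError(
                f"{type(self).__name__}: no such register {token.register!r}")
        return self.render(token)


class X86Trans(Trans):
    registers = frozenset({"eax", "ebx", "ecx", "edx"})  # partial, for illustration

    def render(self, token):
        return f"movl ${token.value}, %{token.register}"


class SparcTrans(Trans):
    registers = frozenset({"g1", "o0", "l0", "i0"})  # partial, for illustration

    def render(self, token):
        return f"mov {token.value}, %{token.register}"
```

The same token type serves both machines; a request for a register which one machine lacks is rejected by that machine's installer alone.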

Under this proposal, the TDF Token Register is to be extended under two sections. Firstly, LPI tokens (chapter 6): There is no need for much to be added here, because our superset assembly consists almost entirely of function calls. However, it is used to provide fundamental concepts such as register (which would be passed in arguments to API calls), address, offset and so forth.

The author suggests a prefix of ~s. for LPI tokens, to correspond to .s files.

The contents of these tokens (for example, register names) are encoded opaquely either as BITSTREAM or BYTESTREAM as appropriate. These contents are understood by each Trans implementation supporting that token.

Secondly, the API (chapter 7) is to be extended to provide each assembly instruction. These are viewed as an analogue of standard library functions in other languages.

With this approach, the code generation for various machines (Trans implementations) is free to bail out if it is unable to generate code for the requested tokens. This would usually indicate that the code being installed was written for a different assembly language.
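
The bail-out behaviour might look something like the sketch below, where each installer holds a table covering only the instruction tokens it implements. The token names, the tables and the UnsupportedToken exception are hypothetical; the rendered strings are illustrative only and ignore details such as branch delay slots.

```python
class UnsupportedToken(Exception):
    """Raised when an installer has no code generation for a token."""


# Hypothetical instruction tokens, as (name, operands) pairs in a superset assembly.
program = [("jump", ("loop",)), ("save_window", ())]

# Each installer implements only the subset of tokens its machine can express.
X86_TABLE = {
    "jump": lambda label: f"jmp {label}",
}
SPARC_TABLE = {
    "jump": lambda label: f"ba {label}",
    "save_window": lambda: "save",
}


def install(table, tokens):
    """Render every token, or bail out naming the first one not implemented."""
    output = []
    for name, operands in tokens:
        if name not in table:
            raise UnsupportedToken(name)
        output.append(table[name](*operands))
    return output
```

Here install(SPARC_TABLE, program) succeeds, whereas install(X86_TABLE, program) raises UnsupportedToken for the register-window token, which is the expected outcome when code written for one machine is installed on another.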

2.4. Questions

Why not use tnc preprocessing to get target-specific things?

Quoting section 6.1:

Target specific versions of this capsule are obtained by transformation, using the preprocessing action of the TDF tool tnc, with definitions of the target dependency and C mapping tokens that are provided with the target installer.

This is used (amongst other things) for providing machine-specific constants - for example, the maximum value that may be stored in an integer.

Unfortunately, this mechanism cannot be used to provide replacements such as making registers become machine-specific, as replacements can only occur by replacing tokens with more TDF, which must also be portable tokens (even if the values of those tokens are not).

What is the advantage of having a full set of assembly tokens as opposed to just one?

The advantage is twofold. Firstly, we need only extend the set if one assembly provides a feature which is not already present for other assemblies. For example, register windows are specific to the UltraSPARC.

Secondly, re-use means we can share code between Trans implementations. This possibly allows for simple peephole optimisations to be centralised for multiple Trans implementations. For example, most assemblies have a jump instruction, although the specifics may vary.
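
As one hedged example of what a centralised peephole optimisation could look like, the sketch below removes any jump whose target is the label immediately following it. It operates purely on generic tokens, so in principle it could run before any particular Trans is involved. The tuple representation of tokens and the function name are assumptions for illustration.

```python
def drop_redundant_jumps(tokens):
    """Remove any jump whose target is the label that immediately follows it."""
    result = []
    for position, token in enumerate(tokens):
        following = tokens[position + 1] if position + 1 < len(tokens) else None
        if token[0] == "jump" and following == ("label", token[1]):
            continue  # falling through reaches the same place
        result.append(token)
    return result


# Hypothetical token stream: the first jump is redundant, the second is not.
stream = [("jump", "L1"), ("label", "L1"), ("jump", "L2"), ("nop",), ("label", "L2")]
optimised = drop_redundant_jumps(stream)
```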

How would you handle special purpose registers?

It is tempting to create new tokens for general purpose instructions which have special purpose operands. However, under the scheme proposed, no new tokens need be defined; to restrict operations on specific registers, Trans implementations would carry their own semantic checks, asserting in essence "you can't perform this instruction with that register".

Note that each Trans implements only a subset of the available assembly tokens.

Therefore all these special-purpose features are restrictive, rather than permissive.
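
A restrictive check of this kind could be as small as the following sketch. The table name, the token names and the check function are hypothetical; the one entry reflects the fact that on x86 the code segment register cannot be loaded by an ordinary move.

```python
# Hypothetical per-installer restrictions: (instruction, destination register)
# pairs which the machine does not permit. Everything not listed is allowed,
# so special cases only ever remove possibilities from the generic tokens.
X86_FORBIDDEN = {
    ("move", "cs"),  # the code segment register cannot be loaded by a plain move
}


def check(forbidden, instruction, destination):
    """Return an error message for a forbidden combination, else None."""
    if (instruction, destination) in forbidden:
        return f"cannot perform {instruction} with register {destination}"
    return None
```

No new token is needed for the special-purpose register; the generic move token is simply refused for that operand by the installer concerned.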

Would the frontend syntax follow platform conventions or TDF conventions?

The short answer here is that it doesn't matter: we can add as many parsers as we like for each syntax, without the added cost of separate backends.

By having the TDF assembly tokens be a processor superset, we are able to duplicate implementations without needing to implement additional sets of tokens for each syntax.

Likewise, each syntax need only implement a subset of its assembly - this is attractive as a pragmatic route to getting existing software to compile, based on the observation that most uses of assembly seem to be short snippets for specific purposes (for example, atomic operations for semaphores) - in this situation, only the few instructions used by that particular fragment need be implemented.

This gives the situation of being able to practically support many languages by way of subsets, and nicely avoids the "which syntax?" dilemma by providing any and all syntaxes. Adding more syntaxes should be as trivial as providing a lexer and grammar for the new syntax; the actions underlying the grammar may be shared by all syntaxes for that language.
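
The sharing of actions between syntaxes can be sketched as two tiny grammars feeding one action. The example handles a single instruction form (loading an immediate into a register) in AT&T and Intel spellings; the function names and the tuple used as the resulting token are assumptions for illustration only.

```python
import re


def action_move(destination, value):
    """Shared action: build the (hypothetical) syntax-independent token."""
    return ("move", destination, int(value))


# One small grammar per syntax; both feed the same action.
ATT = re.compile(r"^\s*movl\s+\$(\d+)\s*,\s*%(\w+)\s*$")  # movl $1, %eax
INTEL = re.compile(r"^\s*mov\s+(\w+)\s*,\s*(\d+)\s*$")    # mov eax, 1


def parse_att(line):
    """AT&T operand order: source first, destination second."""
    match = ATT.match(line)
    if match is None:
        raise SyntaxError(line)
    value, destination = match.groups()
    return action_move(destination, value)


def parse_intel(line):
    """Intel operand order: destination first, source second."""
    match = INTEL.match(line)
    if match is None:
        raise SyntaxError(line)
    destination, value = match.groups()
    return action_move(destination, value)
```

Both parse_att("movl $1, %eax") and parse_intel("mov eax, 1") yield the same token, so everything downstream of the parsers is unaware of which syntax was written.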

Will platform specific tokens be logically separated in some form from portable tokens?

Optionally, with their own group of TDF tokens per platform; if there's no chance they'll be used elsewhere (i.e. they are a concept specific to one processor), then it makes no sense to group them with the generic assembly tokens.

Could inlined assembly be output by a producer to perform platform-specific optimisations?

Producers of any language should not be aware of platform-specific optimisations. Having them emit inlined assembly may seem tempting, but TDF already provides a simpler (and portable) mechanism to achieve the same: an LPI token may be defined for the language construct of interest, which is installed accordingly per machine.

These LPI tokens are replaced with machine-specific tokens by the installers. This way portability is maintained, since LPI tokens are expected to be present in all installers. Note that these replacements are also portable, but would vary from each other in terms of which algorithms suit particular architectures.

2.5. Conclusion

In summary:

Assembly is parsed out before it reaches Trans. What remains is a set of tokens which form an interface between components of the compiler. Trans implementations may therefore become as complex as required (both in terms of optimisation and in strictness) without affecting other Transes.
TDF is extended with tokens which are a processor superset - as opposed to TDF itself which is a language superset.
The proposal appears to be maintainable - i.e. it scales without needing to duplicate effort. It seems practical to implement.
Furthermore, it gives Trans the chance to know about side effects and other properties of assembly instructions, as opposed to having to deal with opaque closures that can't be parallelised or otherwise integrated with high level languages.
The scheme seems elegant in that the various logic is in the component where it belongs: Trans knows about the semantics of assembly, but only for its own machine, and so forth.

3. Implementation

The implementation affects the frontend of the compiler (since it introduces a new concept to the language); this need not necessarily affect the parser or lexer for the host languages. For each embedded language, a parsing mechanism must be provided; these may be provided by external programs, with a little glue code. Each of these returns a TDF tree; these trees are grafted on to the tree produced by the host language. This grafting code also belongs in the compiler frontend. Additionally, for non-portable languages, the TDF Token Register and the Trans implementations are to be extended.

3.1. Implementation roadmap

The implementation proposed lends itself naturally to incremental development, grouped within each of the areas affected.^[a] These features should be able to be developed independently of each other (save for the dependencies they form), and are also intended to provide useful features at each step along the way.

This incremental development forms a clear roadmap, concentrating on one item at a time without affecting unrelated parts of the system. This minimises impact on the rest of the project. The order of implementation proposed is:

To get the framework for §2.2 in place, PL_TDF makes a perfect example. It does not require extending the token register, and a compiler to generate TDF is already implemented (provided by ANDFutils), and so those distractions may be avoided.

Hence work here is on dealing with the compiler's lexer and parser, in order to break out for embedded regions.
Furthering the implementation to additionally support tnc's TDF notation gives another portable language, permitting generalisation of the system for multiple languages. Again, since tnc generates only existing TDF tokens, the register is not to be extended at this stage. As for PL_TDF, an implementation of tnc is provided by ANDFutils.
With two languages in place, the system should be able to be generalised under a realistic API. Further implementations of languages are expected to be more complex than the initial two, and so having this API in place helps contain that complexity.
The tnc and PL_TDF implementations are converted to use the API formed.
We now have __lang (or whichever syntax) in place, and the rest of the framework required for §2.2. Now work may begin on §2.3, which begins by designing the extended tokens required for each language introduced. Care should be taken in their design, if we are going to reuse them well.

3.2. Features provided

The most notable features provided by this scheme are itemised below:

Various different languages may be inlined. These may be either alternate frontends to the same set of actions (for example, multiple alternate syntaxes for assembly), or they may be entirely distinct languages.
Inlined languages may optionally provide machine-specific features which would not otherwise be accessible to the C implementation.
Machine-specific features are implemented using an existing mechanism (namely that language), hence we do not need to invent new assembly languages.
Trans implementations need only implement a subset of the available features.
Areas of non-portability are minimised throughout the system. This reduces the set of various semantics which need to be kept track of (namely out of TDF).
Duplication is reduced, therefore developer effort is minimised, meaning the project can reasonably maintain support for multiple platforms.

3.3. Side-effects

Side-effects of the implementation produce desirable features other than the original goal of inlined assembly. These features are inherent to the architecture of the scheme proposed, and may be capitalised on both to assist the implementation, and as features in their own rights.

Embedding other languages has various use-cases. PL_TDF would be rather suited for micro-optimisation in generated code, for example.
A modular system made accessible via a library interface could be attractive to developers of little languages; these would form DSLs for the host language.
Providing different syntaxes for assemblies is straightforward and low-cost, and widens the userbase.
Since it is our installers which render the assembly (rather than passing it through as-is), semantic analysis of assembly may be undertaken.
If semantic analysis of assembly is performed, embedded assembly may share the same optimisation as the usual generated code.
One use-case would be for TenDRA as a tool to convert assembly from one syntax to another.
With a little further framework, it would be possible to provide embedded languages as host languages in their own rights.

Higher-level languages may also be embedded under this scheme. For example, embedding SQL could be implemented by rendering the SQL syntax down to C TDF tokens (namely to call the SQL library) before Trans is reached. This makes for an interesting alternative to preprocessing source to embed SQL.

3.4. Axioms observed

Given the above scheme the following axioms hold:

Parsing of inlined languages is performed during compilation (as opposed to a string of assembly passed to installers).
Installers may be as complex as they need (with machine-specific details, that is), without bringing complexity to other areas of code.
Semantic checks are performed in the component which most naturally understands them (namely assembly is checked by installers).
TDF tokens may contain nonsensical items (such as references to registers which do not exist).
Nonsensical items present are checked for validity on a specific implementation during installation.
An installer need not implement all features present in the intermediate superset assembly...
...therefore we cannot statically check all possible assembly expressed by TDF unless there is an installer which implements all features. Such an installer would not represent a real processor, and would be unable to perform semantic checks.

Therefore we cannot statically check all possible assembly expressed by TDF...
...however the subset implemented by an installer may be checked for permissible semantics by that installer.

Therefore in order to check a subset which is being used, it needs to be run through the installer for that machine.

This subset may be the entire capabilities for that machine, but (as explained above) it cannot represent all possible assembly languages.
Features not implemented by installers but present in the TDF tree will cause errors during installation.

Usually these would either be attempting to install assembly written for a different machine or a user error in writing the assembly.

From these, it can be seen (perhaps unsurprisingly) that while we can identify all reasonable errors, static analysis of the correctness of assembly cannot be performed before installation (because the target machine is unknown), but rather only after installation, which restricts the analysis to the capabilities of one particular target machine.

Whether an installer actually performs this analysis is a decision of the authors of that installer.

^[a]
Perhaps as far as top-down versus bottom-up approaches go, this is inside-out, since the parsers and code generation are specifically not discussed.