libexds released
The first (and only) version of libexds has been released; see the Downloads page.
It provides a handful of I/O facilities and data structures built around an exception-handling mechanism based on abstracting longjmp, which I've mentioned here a couple of times. This code was part of the original TenDRA distribution, spread across SID and TLD. I've extracted those and collated them into one place, and that's what libexds is.
It's a bit of a failed experiment. I'm releasing it for historical purposes, since it seems a shame that the code behind it would vanish without a tangible waypoint. It seems that most of the facilities it provides would do equally well without the exception-handling mechanism behind it, as I mentioned in the previous post here about overengineering. So, I'll be working to remove that, starting with the error reporting facility, which is so complex as to warrant its own language!
So don't go and use it.
Overengineering
Here's a look at one of the over-engineering issues we face with TenDRA. I'm going to walk through how the error reporting mechanisms for SID and TLD work. This is a simple concept: print an error and exit. Printing useful messages is surprisingly difficult.
All applications need this, and so it makes sense to centralise the error reporting facility between them. We've started on this by bringing together a few things into /shared, which a few tools use already. Of the remaining tools, SID and TLD in particular have their own system, which is what I'll be looking at here, taking a couple of examples from SID.
These calls to report errors look like:
E_cannot_compute_first_set(rule);
This produces rather nice output along the lines of:
clarion% sid -l test ~/svn/projects/libcst/src/syntax.sid
sid: Error: cannot compute first set for production
247: () -> () = {
() = 247 ();
};
Generally these error messages illustrate the problem in question; in this case, the message is stating that the first set for a grammar production cannot be found, which is a requirement for LL(1) grammars. So our input grammar is at fault, and it is helpful to the user to show why. Here, the first set cannot be computed because the rule recurs into itself infinitely, without an alternative which would terminate it.
SID and TLD have their various data structures abstracted away behind specific types in their adt/ directories. These types have their own APIs, including routines to print their contents. The error messages, like everything else, stick to this abstraction. One of these types, unsurprisingly, represents a rule in the grammar; that is what was passed in the call above.
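To make that concrete, here's a rough sketch of what one of these error functions might boil down to if written by hand. Everything in it is an assumption for illustration: struct rule, rule_print() and report_cannot_compute_first_set() are hypothetical stand-ins, not SID's actual types or functions, and this version counts the error rather than exiting.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for one of the types under adt/. */
struct rule {
	const char *name;
	const char *body;
};

/* The type's own print routine; callers never look inside the struct. */
void
rule_print(FILE *f, const struct rule *r)
{
	assert(f != NULL && r != NULL);

	fprintf(f, "%s: () -> () = {\n%s\n};\n", r->name, r->body);
}

static unsigned error_count;

/* A hand-written error function: print the message, then the rule. */
void
report_cannot_compute_first_set(FILE *f, const struct rule *r)
{
	assert(f != NULL && r != NULL);

	fprintf(f, "sid: Error: cannot compute first set for production\n");
	rule_print(f, r);   /* the message sticks to the type's API */
	error_count++;
}
```

The point is only that the error function never reaches into the rule itself; it hands the rule to the type's own print routine.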
In order to save humans the effort of writing many tedious E_*() functions, these are defined in a domain-specific language; the sources are .e files, containing inlined portions of C code which do the dirty work of printing out the various arguments which are passed. These look like:
(...)
SID, TLD, Lexi and libexds
SID, TLD and Lexi now depend on libexds. I thought I'd take a moment to talk about what's been going on in that direction.
SID recently gained persistent variables; these help towards thread-safety, automatically passing these variables through to each action, so that globals need not be used.
TLD was written by the same author as SID; he took much of the supporting data structure code and copied it over when work on TLD began. The two copies then diverged over time.
That rather unique code has now been centralised, and is named libexds, which I've mentioned here on occasion. It consists of APIs formed around an exception-handling mechanism based on longjmp(). Strange, and I feel guilty for liking it, but it is surprisingly pleasant to work with.
Quite a lot of work has been done for Lexi, changing both the generated APIs and the syntax used to specify its input. Most recently, it's gained a .lct format for specifying language-dependent actions, similar to SID's .act files. This should leave the lexer definitions language-independent.
Looking back a little for Lexi, the original code inherited from DERA was documented with a user-guide reverse engineered from the source. This was written up, and Lexi 1.3 released. The intention has been to make these large changes now that release is out of the way, which is what we have been up to for Lexi.
Subversion via DAV
This is just a quick note to say the Subversion repository is now accessible through DAV in preference to svnserve.
What was svn://svn.tendra.org/tendra/tendra (don't ask) is now http://svn.tendra.org/ - see the development page for details.
You can switch over a working copy by (for example):
svn switch --relocate svn://svn.tendra.org/tendra/tendra \
http://svn.tendra.org/
Please do so, as svnserve will be removed soon.
New website
The website for http://www.tendra.org/ has now been merged with the development tracking. This makes visible the wiki, tickets, source browser and timeline, and this blog.
For the moment, http://docs.tendra.org/ remains as-is; this will be revamped at a later stage.
Error messages
Here's a quick tour of TenDRA's error messages.
int main(void) {
"abc"[1] = 'x';
return 0;
}
gives:
clarion% tcc -Xs a.c
"a.c", line 3: Warning:
[ISO 6.1.4]: String literals are not modifiable.
[ISO 6.3.16]: Left operand of '=' should be modifiable.
clarion%
Error messages quote the section of the standard (in this case, C90) for context.
clarion% tcc -Xs a.c
"a.c", line 4: Warning:
[ISO 6.3.2.2]: In call of function 'gets'.
[ISO 6.6.3]: Discarded function return.
/tmp/tcc00024749aa/a.o(.text+0x6): In function `main':
: warning: warning: this program uses gets(), which is unsafe.
clarion%
Here my system's linker also produces a warning about gets(). Link time is perhaps the only place where its use can be reliably identified, short of detecting it at runtime.
The level of checking is configurable, from -Xa for "lenient ANSI" to -Xp for "strict ANSI with some extra checks".
For example, not casting discarded return values to (void) will warn only on the strictest setting.
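As a small illustration of the kind of code that check is about, here are both forms side by side. The function name greet() is made up for the example, and I've not shown tcc's output for it; which line is reported depends on the mode in use.

```c
#include <stdio.h>

/* Hypothetical example function; returns 0 so callers can check it ran. */
int
greet(void)
{
	puts("hello");          /* return value silently discarded */
	(void) puts("hello");   /* discard made explicit with a cast */

	return 0;
}
```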
These settings simply cherry-pick various pragmas for each mode; the pragmas enable or disable each check. I'll not go into the gritty details of these pragmas in this post, though.
TenDRA provides a concept of APIs:
clarion% tcc -Yansi a.c
"a.c", line 3: Error:
[ISO 6.3.1]: The identifier 'EINTR' hasn't been declared in this scope.
clarion% tcc -Yposix a.c
clarion%
Here the symbols and headers available are specific to each API. As a fictional example, let's say ANSI defines struct a { int x; } and POSIX defines struct a { int x; int y; }, then tcc -Yansi will error with "no such field" for a.y, whereas tcc -Yposix will be fine.
There is no #ifdef trickery behind this mechanism; we have abstract definitions of the APIs, and check the symbol table against them. Again this is performed by pragmas doing the work behind the scenes; each API definition is compiled into a set of pragmas asserting the presence and type of each symbol.
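A file along the following lines would be expected to behave that way. It's an illustrative stand-in rather than the a.c used above, and was_interrupted() is a made-up name. It relies on EINTR being specified by POSIX, whereas C90 requires only EDOM and ERANGE from <errno.h>.

```c
#include <errno.h>

/*
 * EINTR comes from POSIX, not from ISO C; C90 requires only
 * EDOM and ERANGE to be defined by <errno.h>.
 */
int
was_interrupted(int err)
{
	return err == EINTR;
}
```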
(...)
Feature requests separated from other tickets
There's a new ticket type specifically for feature requests. This is an attempt to keep proposals separate from other general tickets (enhancements or otherwise). Please use this new type when proposing things like user-facing syntax or semantics changes, so as not to get them lost in the various tickets for enhancements behind the scenes.
I've gone through and recategorised the existing feature requests under this new ticket type; you can see them with a custom query.
Visualising lexers
Lexi now supports outputting to multiple languages. The first addition is to output the generated lexer in Graphviz Dot format. This gives a representation of the lexer's structure in graph form, which may be rendered to visualise the trie formed by its token mappings. A simple example,
GROUP digit = {0-9};
TOKEN "hello abc" -> $abc;
TOKEN "hello def" -> $def;
TOKEN "hello" -> $hi;
TOKEN "$[digit]" -> $index;
is rendered as:
As a real-world example, the lexer for make_tdf looks like:
Keywords and such are not currently implemented; they will follow at a later point.
Just for fun, here's the lexer and visualisation for Tspec, and the lexer and visualisation for Calculus.
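For the curious, the general idea of writing a trie out as Dot can be sketched in C as below. This is not Lexi's implementation: struct node, trie_add() and trie_dot() are hypothetical names, it handles literal strings only (no groups such as [digit]), and it never frees its nodes.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical trie node; Lexi's own structures will differ. */
struct node {
	char c;               /* character on the edge leading to this node */
	const char *token;    /* token name if a mapping ends here, else NULL */
	struct node *child;   /* first child */
	struct node *next;    /* next sibling */
};

/* Find or create the child of n reached by the character c. */
static struct node *
step(struct node *n, char c)
{
	struct node *p;

	for (p = n->child; p != NULL; p = p->next) {
		if (p->c == c) {
			return p;
		}
	}

	p = malloc(sizeof *p);
	if (p == NULL) {
		return NULL;
	}

	p->c = c;
	p->token = NULL;
	p->child = NULL;
	p->next = n->child;
	n->child = p;

	return p;
}

/* Map the literal string s to a token. Returns 0 on success, -1 on error. */
int
trie_add(struct node *root, const char *s, const char *token)
{
	struct node *n = root;

	assert(root != NULL && s != NULL && token != NULL);

	for (; *s != '\0'; s++) {
		n = step(n, *s);
		if (n == NULL) {
			return -1;
		}
	}

	n->token = token;

	return 0;
}

/* Write out everything below n; returns the number of edges written. */
static int
dot_edges(FILE *f, struct node *n)
{
	struct node *p;
	int count = 0;

	for (p = n->child; p != NULL; p = p->next) {
		/* quotes and backslashes need escaping inside a Dot label */
		const char *esc = (p->c == '"' || p->c == '\\') ? "\\" : "";

		fprintf(f, "\t\"%p\" -> \"%p\" [label=\"%s%c\"];\n",
			(void *) n, (void *) p, esc, p->c);
		if (p->token != NULL) {
			fprintf(f, "\t\"%p\" [shape=doublecircle, label=\"%s\"];\n",
				(void *) p, p->token);
		}
		count += 1 + dot_edges(f, p);
	}

	return count;
}

/* Write the whole trie as a Dot digraph; returns the number of edges. */
int
trie_dot(FILE *f, struct node *root)
{
	int count;

	assert(f != NULL && root != NULL);

	fprintf(f, "digraph G {\n\tnode [shape=circle, label=\"\"];\n");
	count = dot_edges(f, root);
	fprintf(f, "}\n");

	return count;
}
```

Starting from a zeroed root node and adding the three "hello" tokens from the example above, the shared prefix is stored once and the branches split after the space. The output can be fed to Graphviz (dot -Tpng, say) to render it.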
Repository reorganisation
The repository layout has been reorganised according to the proposal in [1793]. This removes the majority of svn:externals links, moving all projects under a common /trunk.
Branching and tagging still occur per-project, but now take the entire code base along with them, bringing along the various shared code and other interdependencies. Those dependencies are now expressed as symlinks within the trunk. When the next release of each project is taken, tar can fold in these links to produce a self-contained project.
The previous tags have been maintained where appropriate (and in some cases, fixed up, given their now absent externals). These now live under /tags. Hopefully everything else is self-explanatory.
Lexi's output language
The -l option (previously used for the sid token prefix) is now known as -t. It is not SID-specific.
Meanwhile, -l is used for the output language (this corresponds to the option for setting language for other tools, in an attempt at consistency). This breaks backwards compatibility for people using token prefixes.
The default language is C90. C99 will certainly follow, as -a makes a good test case for language-specific options. Other languages may follow in the future; of particular interest would be a language to output the trie formed from tokens.
Lexi's groups
Lexi now presents groups as an enumeration, should you wish to use them elsewhere in your programs (other than seeing if a character belongs to a group).
For example, from lexi itself (zone contents omitted for brevity):
ZONE comment : "/*" ... "*/" {
GROUP white = "";
....
}
ZONE line_comment : "//" ... "\n" {
GROUP white = "";
...
}
GROUP white = " \t\n\r" ;
GROUP alpha = {A-Z} + {a-z} + "_" ;
GROUP digit = {0-9} ;
GROUP alphanum = "[alpha][digit]" ;
produces:
enum lexi_groups {
lexi_group_white = 0x01,
lexi_group_alpha = 0x02,
lexi_group_digit = 0x04,
lexi_group_alphanum = 0x08,
lexi_group_comment_white = 0x10,
lexi_group_line_comment_white = 0x20
};
You're not supposed to rely on the values, but unfortunately those are accidentally exposed by the API.
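Each enumerator is a distinct bit, which suggests testing membership with a mask. Here's a minimal sketch of that idea, using only the enumerator names and never their values. The table, init_groups() and in_group() are hypothetical, made up for illustration; they are not what Lexi generates.

```c
#include <assert.h>
#include <limits.h>

enum lexi_groups {
	lexi_group_white    = 0x01,
	lexi_group_alpha    = 0x02,
	lexi_group_digit    = 0x04,
	lexi_group_alphanum = 0x08
};

/* Hypothetical lookup table: one bitmask of groups per character. */
static unsigned char group_table[UCHAR_MAX + 1];

/* Mark every character in chars as belonging to the groups in mask. */
static void
add_group(const char *chars, unsigned mask)
{
	for (; *chars != '\0'; chars++) {
		unsigned char i = (unsigned char) *chars;

		group_table[i] = (unsigned char) (group_table[i] | mask);
	}
}

void
init_groups(void)
{
	add_group(" \t\n\r", lexi_group_white);
	add_group("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz_",
		lexi_group_alpha | lexi_group_alphanum);
	add_group("0123456789",
		lexi_group_digit | lexi_group_alphanum);
}

/* True if c belongs to any of the groups in mask. */
int
in_group(int c, unsigned mask)
{
	assert(c >= 0 && c <= UCHAR_MAX);

	return (group_table[c] & mask) != 0;
}
```

Because the groups are bits, several can be tested at once: in_group(c, lexi_group_alpha | lexi_group_digit) asks whether c is in either.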
Exception-handling examples
libexds provides an exception-handling mechanism for C, based on setjmp() and longjmp(). It originally came from SID, then was later copied over to tld. Now libexds gives it a home so that it might be reusable in other projects, too.
Here's the interface. It provides an API which hides the gory details, and allows for nestable exceptions, with exceptions re-thrown from handlers.
A couple of days ago, I wrote up an example demonstrating how to use it.
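To give a flavour of the general technique (as opposed to libexds's actual interface, which hides all of this), here's a minimal sketch of nestable exceptions built on setjmp() and longjmp(), with an inner handler re-throwing to an outer one. All the names in it (exc_frame, exc_throw() and so on) are hypothetical, invented for illustration.

```c
#include <assert.h>
#include <setjmp.h>
#include <stddef.h>

/* Hypothetical names throughout; this is not the libexds API. */
struct exc_frame {
	jmp_buf env;
	struct exc_frame *prev;   /* enclosing handler, if any */
};

static struct exc_frame *exc_top;   /* innermost active handler */
static const char *exc_current;     /* exception being propagated */

/* Unwind to the innermost handler. Never returns. */
static void
exc_throw(const char *name)
{
	struct exc_frame *f = exc_top;

	assert(f != NULL);   /* throwing with no handler is a bug in this sketch */

	exc_top = f->prev;   /* so a handler may itself throw to the next frame out */
	exc_current = name;
	longjmp(f->env, 1);
}

/* Something which fails several calls deep. */
static void
parse(int depth)
{
	if (depth == 0) {
		exc_throw("syntax error");
	}

	parse(depth - 1);
}

/* Returns the name of the exception which reached the outer handler. */
const char *
example(void)
{
	struct exc_frame outer, inner;

	outer.prev = exc_top;
	exc_top = &outer;
	if (setjmp(outer.env) == 0) {
		inner.prev = exc_top;
		exc_top = &inner;
		if (setjmp(inner.env) == 0) {
			parse(3);
			exc_top = inner.prev;   /* normal exit: pop the frame */
		} else {
			/* inner handler: clean up here, then re-throw outwards */
			exc_throw(exc_current);
		}

		exc_top = outer.prev;
		return NULL;
	}

	/* outer handler */
	return exc_current;
}
```

The chain of frames is what makes nesting work: each throw pops the innermost frame before jumping, so a throw from inside a handler lands in the next handler out.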
Lexi's lookup_char()
As of [1696], lookup_char() is removed from lexi's generated output. This breaks compatibility; characters are now passed directly to the is_*() macros.
Unit tests
I'm using libexds as a good excuse to make a start on some unit testing for TenDRA. It has well thought-out interfaces, and so I intend to test each API. These tests should reflect the way a user would typically use each API; that is, they should not break any abstractions. In this regard, they also serve to illustrate any flaws in API design.
For unit testing, I'm using a minimal test harness from elsewhere. So far I've implemented unit tests for the dstring API, and I'm happy with how it went.
Running the tests looks as follows:
clarion% make test
harness .libs/*.so
SUCCESS: true where expected true for "empty"
SUCCESS: true where expected true for "appendc"
SUCCESS: true where expected true for "appends"
SUCCESS: true where expected true for "appendn"
SUCCESS: true where expected true for "lastchar"
SUCCESS: true where expected true for "tonstring"
SUCCESS: true where expected true for "tocstring"
SUCCESS: true where expected true for "destroyc"
clarion%
Those tests live at tests/unit/dstring.c.
libexds is underway
I've started on libexds, which is intended to centralise the exception-handling data structures from sid and tld. See RoadmapLibexds for an overview of where it's headed.
It should reduce the lines of code for both sid and tld, and act as a platform for trying out a few exception-handling proposals and various other experiments for C. One idea is to provide a memory pool allocation system (which would stand by itself, Apache style), and then to tie that in with the exception handling system. To use, this would feel somewhat like PostgreSQL's exception-handling pool allocation system.
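Nothing like this exists yet, but as a rough sketch of how a pool tied to exceptions might feel, consider the following. All of the names (struct pool, pool_alloc(), pool_destroy()) are hypothetical, and a bare jmp_buf stands in for a proper exception-handling API.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdlib.h>

/* Hypothetical pool: every allocation is chained, and freed together. */
struct block {
	struct block *next;
	void *mem;
};

struct pool {
	struct block *head;
	jmp_buf *handler;   /* where to jump if allocation fails */
};

/* Allocate from the pool, "throwing" on failure rather than returning NULL. */
void *
pool_alloc(struct pool *p, size_t size)
{
	struct block *b;

	assert(p != NULL && p->handler != NULL);

	b = malloc(sizeof *b);
	if (b != NULL) {
		b->mem = malloc(size != 0 ? size : 1);
	}

	if (b == NULL || b->mem == NULL) {
		free(b);
		longjmp(*p->handler, 1);
	}

	b->next = p->head;
	p->head = b;

	return b->mem;
}

/* Free everything allocated from the pool so far. */
void
pool_destroy(struct pool *p)
{
	struct block *b, *next;

	assert(p != NULL);

	for (b = p->head; b != NULL; b = next) {
		next = b->next;
		free(b->mem);
		free(b);
	}

	p->head = NULL;
}

/* Returns 0 on success, or -1 if something threw part-way through. */
int
example(struct pool *pool, int fail)
{
	jmp_buf env;

	pool->handler = &env;
	if (setjmp(env) != 0) {
		pool_destroy(pool);   /* one cleanup point, however far we got */
		return -1;
	}

	(void) pool_alloc(pool, 16);
	(void) pool_alloc(pool, 32);
	if (fail) {
		longjmp(env, 1);      /* stand-in for an error deep in the call stack */
	}

	pool_destroy(pool);

	return 0;
}
```

The appeal is that callers need neither check each allocation nor track what to free on each error path; the handler destroys the pool, and that's that.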
That might sit really nicely alongside the error catalogues, when they are eventually moved out from tcc into a reusable system. That won't happen for a while yet, but it will be interesting when it does.