The SID Developers' Guide

Kevin Santugini, The TenDRA Project
Katherine Flavel, The TenDRA Project (editor)

First published 2007.

Revision History

2008-02-04

kate

Moved out much of the datastructure comments to the code itself.

This reverse engineering better serves its purpose kept with the code, so that it may evolve as the code changes. The developer guide (now slightly spartan) can grow anew giving structual and organisational information rather than a field-by-field analysis.

2007-07-09

kate

Initial import (and conversion from LaTeX) of the SID Developer Guide, researched and written by Kevin.

sid is an LL(1) parser. It reads a grammar specification (documented in the sid users' guide) then outputs a parser.

In this documentation, we explain sid's internal representation of grammars. It was reverse engineered from the code itself and the accompanying comments. Most of the reverse engineering was done reading sid's own grammar file and most notably the actions in the parser.act file. This document does not explain in detail how sid parses the .sid and the .act files.

[TODO this guide has been gutted; what remains gives a skeletal outline which will serve as a tour of the use and contents of the internal structures, without concerning itself with the implementation minuate. (Comments by the structures and API routines serve more accurately there, and will not fall out of sync with the code so easily). Here we can state how they all fit together, and what one might do with the data. dumping implementation to graphviz would be great, too.]

ii. Prerequisites

Before reading this guide, you should have read the sid users' guide and be able to use sid to output a simple parser.

iii. Who Should Read This Document?

Anyone interested in the post-processing steps sid does to a grammar after having parsed the .sid and the .act files. In particular, if you want to work on the grammar transforms and improve them you need to know sid's internal representations of grammars or at least know the interface to the internal representation. You may not want to reverse engineer this representation yourself.

1. sid's Organisation

1.1. The main Function
1.2. Adding a New Output Language: main_language_list
1.3. Code Organisation and Conventions

When you call sid, the main operations it performs are

Reads the grammar .sid file and stores its internal representation.
Reads the grammar output language specific .act file and complete the representation of the grammars with the action code. (After this step, sid only works on the internal representation.)
Transforms and Optimises the Grammar. Most notably, it removes left recursion and tries to transform the context free grammar provided in an equivalent LL(1) grammar.
Outputs the parser.

1.1. The `main` Function

Let's look at some parts of the code in main.c. In main, you can see HANDLE and WITH: these are macros that emulate an exception mechanism in C. These come from libexds.

The function main itself doesn't do much: it initializes some structures, calls main_init to process the command line options then calls main_1 to do all the interesting work. Now look at the function main_1, it calls in order (forgetting about all the initialisation stuff and the error handling stuff)

sid_parse_grammar(): parses the .sid file and converts it into an internal representation.
grammar_check_complete(&grammar): verifies that all rules are accessible.
(*(main_language->input_proc))(output_closure, &grammar): parses the action file and completes the internal representation of the grammar.
grammar_remove_left_recursion(&grammar): TODO
grammar_compute_first_set(&grammar): computes the first set of each rule in the grammar.
grammar_factor(&grammar): TODO
grammar_simplify(&grammar): TODO
grammar_compute_inlining(&grammar): TODO
grammar_check_collisions(&grammar): TODO
grammar_recompute_alt_names(&grammar): TODO
(*(main_language->output_proc))(output_closure, &grammar): outputs the parser in the chosen language.

You may wonder what main_language is. We explain it in the next section.

1.2. Adding a New Output Language: `main_language_list`

The global variable main_language allows us to easily modify the output language. It's a pointer to a structure called LangListT. This pointer will always point to an element in the table main_language_list, which contains callbacks for the various stages of processing.

The first member indicates the option name. The second one is a pointer to the initialisation function. The third one contains a pointer to the top input routine for the action file. The fourth one is an integer indicating the number of input file (2 for outputting C, 1 for test). Then we have the top output language specific output function and finally the number of outputted file. Don't remove the last line: it serves as a guard. To add a new output language, add a line to main_language_list and implement the new top level functions.

1.3. Code Organisation and Conventions

If you read this guide, it is probably because you want to modify sid. In this section, we say how sid is organised and how one should modify the code to keep the code readable.

sid defines many types for the internal representation of a grammar. These types are defined in the header files, begins with a majuscule and ends with T, e.g. RuleT. If a type, MytypeT is declared in myfile.h, then any function that directly touches the members of MytypeT begins with mytype_ and is defined in file.c. No other function should touch MytypeT directly. If you want to access an object of a certain type, do not access its members directly. Instead, use the interface declared in the header (the same header where the type is declared).

2. The Internal Representation of a Grammar

2.1. The GrammarT Type and Structure
2.2. The TableT Type
2.3. Internal Representation of Actions
2.4. Internal Representation of Rules
2.5. Internal Representation of Non-Locals

We present the internal representation in a top down structure. Many functions are provided that hides the implementation; all access to these structures is performed through their respective APIs. Those interfaces are not documented here; rather, a brief functional tour of content is given instead.

The implementation details behind the interfaces are documented in their respective datastructures; see the various header files for details per-field. Virtually all of these are stored in the adt directory.

2.1. The `GrammarT` Type and Structure

sid stores grammars in a structure called GrammarT.

It contains a TableT hash map containing elements of type EntryT. We'll talk about EntryT later. For the moment remember that an EntryT is a structure that can be used to store either an action, a rule, a type, a terminal, or a nonlocal.

Please note that in the code terminals are referred as basics in the code, probably for historical reasons.

2.2. The `TableT` Type

It holds most of the information about a grammar, rules, actions, terminals, non-local names, types and local names along with their scopes.

Its definition is very simple, containing only a hash table of EntryT.

2.3. Internal Representation of Actions

Actions are stored in a ActionT. More exactly, they are stored in an EntryT whose union member is an ActionT.

2.4. Internal Representation of Rules

Look at an extremely simple sid rule:

basic-arithmetic = {
        number;plus;number;
    ||  number;minus;number;
};

It has two alternatives, each one holding three items. Alternatives have their special type: AltT.

It also contains a list of ItemT. In our previous example, the list of items of the first alternate would be number, plus and number.

An ItemT is therefore a rule, an action call, a terminal or TODO(rename, name) already declared in table.

And now the RuleT type; this structure is very long.

2.5. Internal Representation of Non-Locals

A non-local is an EntryT whose type is ET_NON_LOCAL and whose member u is also an EntryT*: the target of this pointer is a type giving the type of the non-local. The key member is of course the name of the non-local but also contains its scope. IE, non-locals are accessible to all rules contained in the rule for which the non-local is defined. The scope is indicated in the key string by outer-enclosing-rule::inner-enclosing-rule::name-of-non-local.

3. The Representation of an Action Code

3.1. The CCodeT Type

This depends on the output language but as only C is supported at the moment, we'll describe how the body of an action is represented.

3.1. The `CCodeT` Type

To represent the body of an action when the output language is C, sid uses the type CCodeT

4. The Scope System

The scope system is only used when reading the grammar definition file (the .sid file). The only trace of it once the scope system has been deleted are the key members of the scoped rules and the always scope non-locals that begins with outer-enclosing-rule1::inner-enclosing-rule.

5. API Organisation

5.1. Interface to TableT
5.2. Interface to EntryT
5.3. Interface to RuleT
5.4. Interface to BasicT
5.5. Interface to ActionT
5.6. Interface to NonLocalT

[TODO we'll automatically generate this using Puredocs]

Most of the implementation is hidden to the user. One should only use the following functions to access everything.

5.1. Interface to TableT

As a general rule, you add entries to a TableT by using the functions listed in adt/table.h.

5.2. Interface to EntryT

TODO

5.3. Interface to RuleT

TODO

5.4. Interface to BasicT

TODO

5.5. Interface to ActionT

TODO

5.6. Interface to NonLocalT

TODO

The SID Developers' Guide

Revision History

i. Introduction

ii. Prerequisites

iii. Who Should Read This Document?

1. sid's Organisation

1.1. The main Function

1.2. Adding a New Output Language: main_language_list

1.3. Code Organisation and Conventions

2. The Internal Representation of a Grammar

2.1. The GrammarT Type and Structure

2.2. The TableT Type

2.3. Internal Representation of Actions

2.4. Internal Representation of Rules

2.5. Internal Representation of Non-Locals

3. The Representation of an Action Code

3.1. The CCodeT Type

4. The Scope System

5. API Organisation

5.1. Interface to TableT

5.2. Interface to EntryT

5.3. Interface to RuleT

5.4. Interface to BasicT

5.5. Interface to ActionT

5.6. Interface to NonLocalT

1.1. The `main` Function

1.2. Adding a New Output Language: `main_language_list`

2.1. The `GrammarT` Type and Structure

2.2. The `TableT` Type

3.1. The `CCodeT` Type