Archive for November, 2008

OWL BASIC Control Flow Graph

November 23rd, 2008

Since BBC BASIC does not use C-like nested scopes marked by braces to delineate loops and other blocks, the structure of such blocks is not captured in the Abstract Syntax Tree (AST); that is to say, the statements within a loop will not be child nodes of the opening loop statement such as FOR or WHILE. Furthermore, there is not necessarily a one-to-one correspondence between loop entry points such as FOR and loop ends such as NEXT – a FOR loop may contain more than one NEXT statement, and, in convoluted cases involving GOTOs, a single NEXT statement may even be re-used by several FOR loops. See Next please! Compiling Iteration Structures in BBC BASIC for more information.

It is therefore necessary to pair each closing loop statement (e.g. NEXT or ENDWHILE) with its corresponding opening loop statement(s), and this in turn requires that the possible paths of execution through the program can be determined statically (i.e. at compile time), so that for each closing statement we can trace backwards through the program execution to the previous opening statement. This also makes it possible to identify illegal constructs, such as ending a WHILE loop with a NEXT statement.

To achieve this, we must build a data structure called the Control Flow Graph (CFG). For each statement in the AST we determine all possible following statements. Ordinarily this is simply the next statement in the program, but we must also resolve line-number references in GOTO, GOSUB, etc., and at each IF statement control flow may go in either of two directions. We do not resolve function or procedure calls when building the CFG, so each function or procedure becomes a separate connected component in the resulting graph.
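
As a sketch of the idea (using hypothetical statement tuples rather than the real OWL BASIC AST nodes), the successor map can be built in a single pass: follow the fall-through edge by default, and add the extra edges for GOTO and IF.

```python
# Sketch of CFG construction over a flat statement list (hypothetical
# node representation; the real OWL BASIC AST classes differ).
def build_cfg(statements, line_index):
    """Return a dict mapping each statement index to its successor indices.

    statements -- flat list of (kind, target_line) tuples
    line_index -- maps a BASIC line number to a statement index
    """
    cfg = {}
    for i, (kind, target) in enumerate(statements):
        succs = []
        if kind == "GOTO":
            succs.append(line_index[target])      # only the jump target
        else:
            if i + 1 < len(statements):
                succs.append(i + 1)               # fall through
            if kind == "IF":
                succs.append(line_index[target])  # branch taken
        cfg[i] = succs
    return cfg

# A tiny program: 10 IF ... THEN 40 / 20 PRINT / 30 GOTO 10 / 40 END
stmts = [("IF", 40), ("PRINT", None), ("GOTO", 10), ("END", None)]
lines = {10: 0, 20: 1, 30: 2, 40: 3}
print(build_cfg(stmts, lines))  # {0: [1, 3], 1: [2], 2: [0], 3: []}
```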

I’ve been working with the code to Acornsoft’s Sphinx Adventure from 1982 as a real-world BBC BASIC test program for the BBC Micro.

Acornsoft Sphinx Adventure

I don’t know who the author is, but I’m willing to bet they never imagined their by now long-forgotten code would undergo the sort of analysis it’s about to receive! Maybe the code in Sphinx Adventure is intentionally obfuscated, or maybe it’s just badly written, but some of its non-structured programming techniques, such as jumping out of the middle of procedures never to return, are ‘interesting’ to say the least.

Control Flow Graph for Acornsoft Sphinx Adventure

Each node in this image is a single statement in the program, and the arrows between the nodes indicate the program flow. Each connected group of nodes is a separate procedure, and the largest group of nodes to the top-left of the chart is the main program starting from the first line through a long trail of DATA statements.

The Sphinx.graphml file is browsable with yEd, which was also used to produce these diagrams.

Zooming in, we can see the CFG for a single procedure – in this case PROCF,

463 IFO?W <> L ANDO?W <> 1 THEN PROCR(7): PRINTA2$;" here.": GOTO465 ELSE IFO?W2 <> 1 THEN PROCR(8): PRINTA2$;".": GOTO465
464 IFW=27 ANDL=O?28 ANDWA=0 THEN PRINT"Your bottle is now full of water.":WA=1 ELSE IFW=27 ANDL=O?28 THEN PRINT"your bottle is already full." ELSE PROCR(21)

Control Flow Graph for Acornsoft Sphinx Adventure PROCF

The label in each node indicates its logical line number in the original code and the type of statement the node represents at the abstract syntax level.

Now that we have the basic CFGs, we can analyse them to match up corresponding loop statements, and optionally perform optimizations such as moving loop-invariant code out of loops and eliminating dead (unreachable) code fragments.
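
The backwards trace that pairs a closing statement with its opener(s) can be sketched as follows (hypothetical data structures, and loop nesting is ignored for brevity): invert the CFG edges, then search backwards from the closing statement, stopping each path at the first opener of the right kind.

```python
from collections import deque

def find_openers(cfg, statements, close_index, opener_kind):
    """Breadth-first search backwards through the CFG from a closing
    statement (e.g. NEXT), collecting every reachable opener (e.g. FOR).
    Each path stops at the first opener it meets; a real implementation
    must also account for nested loops."""
    preds = {i: [] for i in cfg}
    for i, succs in cfg.items():
        for s in succs:
            preds[s].append(i)
    openers, seen = set(), {close_index}
    queue = deque(preds[close_index])
    while queue:
        i = queue.popleft()
        if i in seen:
            continue
        seen.add(i)
        if statements[i] == opener_kind:
            openers.add(i)        # stop tracing past an opener
        else:
            queue.extend(preds[i])
    return openers

# FOR / PRINT / NEXT in a straight line: the NEXT pairs with the FOR.
stmts = ["FOR", "PRINT", "NEXT"]
cfg = {0: [1], 1: [2], 2: []}
print(find_openers(cfg, stmts, 2, "FOR"))  # {0}
```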

Categories: .NET, 8 bit, computing, OWL BASIC, software

OWL BASIC parser and abstract syntax tree

November 23rd, 2008

Other personal projects, of both a software and a familial nature, have lately taken priority over my project to implement a BBC BASIC-alike compiler for the .NET CLR, now called OWL BASIC in honour of the owl logo of the BBC’s Computer Literacy Project. Now I have some spare capacity to devote to the project, and my brother Ian, who is part-way through his Computing Science degree and is another long-time BBC/RISC OS/Acorn computing enthusiast, has joined the project – he’s starting work on the optional front-end detokenizer. We have opted to handle tokenized source files by detokenizing into text and then retokenizing with our lexer, the reason being that tokenized BBC BASIC files do not contain all of the tokens we need for our grammar, and detokenizing even very large programs is minimal overhead compared to the parsing and compilation.

In this post, I want to take a look at the structure of the PLY parser and the classes defining storage for the Abstract Syntax Tree that is the product of streaming the lexer tokens through the parser.

IronPython with PLY

PLY requires that the grammar rules are placed in Python docstrings within methods that follow a specific naming convention; these docstrings are then parsed during parser initialisation to build the required parsing tables. These tables are then cached in files for quicker start-up on subsequent runs. PLY stores its cached tables as large files of generated Python code containing some very long expressions. Unfortunately, these large expressions triggered pathological behaviour in the IronPython parser, including stack overflows (although they were fine with regular CPython), so some modifications to the PLY source were required to use Python pickling, rather than code generation, for storage and retrieval of the cached parsing tables. Performance is now much better, with a compiler start-up time of many minutes reduced to a few seconds.
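
The shape of that change is roughly as follows (a simplified sketch; the actual modification lives inside PLY’s table-handling code, and the names here are invented):

```python
import os
import pickle
import tempfile

def load_or_build_tables(cache_path, build_tables):
    """Load pre-computed parsing tables from a pickle file, falling back
    to building (and then caching) them on the first run. Pickled data
    avoids the huge generated-source expressions that upset IronPython."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    tables = build_tables()
    with open(cache_path, "wb") as f:
        pickle.dump(tables, f, pickle.HIGHEST_PROTOCOL)
    return tables

# 'tables' here stands in for PLY's LR action/goto tables.
cache = os.path.join(tempfile.gettempdir(), "owl_parse.cache")
if os.path.exists(cache):
    os.remove(cache)   # force a rebuild for this demonstration
tables = load_or_build_tables(cache, lambda: {"action": {}, "goto": {}})
```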

BBC BASIC grammar structure

I’ll start by pointing out that our current grammar definition is not perfect; even after the suitably caffeinated debugging session recommended in the PLY manual we still have 14 shift/reduce conflicts and 12 reduce/reduce conflicts. The former group are not serious, but the latter need to be addressed. There is no guarantee that it is possible to parse BBC BASIC with a look-ahead left-to-right (LALR) grammar, owing to the lack of a formal grammar definition for the language. The issues are resolvable, but may require stepping outside of the facilities provided by PLY – as we have done already for DATA statements.

The main structural part of our grammar is as follows:

program : statement_list;

statement_list : compound_statement
               | statement_list compound_statement;

compound_statement : lone_stmt_body stmt_terminator
                   | multi_statement stmt_terminator;
multi_statement : statement statements_tail;
statements_tail : statement_separator statement statements_tail
                | empty;

statement_separator : COLON;

stmt_terminator : EOL;
statement : stmt_body
          | empty;
lone_stmt_body : case_stmt;
stmt_body : beats_stmt
          | bput_stmt
          | call_stmt
          | circle_stmt
          | while_stmt
          | width_stmt
          | wait_stmt;

We have two recursively defined rules here: statement_list is defined in terms of itself, as is statements_tail. The former is used for collecting lists of statements or compound statements separated by line-breaks. The latter is used for accumulating the statements forming compound statements separated by colons. The correct grammar for this was awkward to determine because many published grammars are for C-style languages, which use statement terminators (e.g. the semicolon), whereas BASIC uses Pascal-style statement separators, which can run in contiguous sequences with empty statements. Some inspiration from a Pascal grammar and experimentation with UltraGram led to a workable solution. A minority of statements in BBC BASIC, such as CASE, must occupy their own line, which must be handled in the grammar as lone_stmt_body – the less fussy majority of statements make up stmt_body.

Abstract Syntax Tree construction

As the parser matches each rule, an Abstract Syntax Tree (AST) node is generated. I’ll use the BPUT rule as a simple example which shows many features:

def p_bput_stmt(p):
    '''bput_stmt : BPUT channel COMMA expr
                 | BPUT channel COMMA expr SEMICOLON'''
    if len(p) == 5:
        p[0] = Bput(channel = p[2], data = p[4], newline=True)
    elif len(p) == 6:
        p[0] = Bput(channel = p[2], data = p[4], newline=False)
    p[0].lineNum = p.lineno(1)

Grammar rule encoding

PLY encodes grammar rules in the docstrings of module-level functions, each of which accepts p, a list of the productions of other referenced rules, such as channel, or terminals (i.e. tokens) such as COMMA. Our BPUT token represents the string “BPUT” in the source, without the following hash. The way the manual is written seems to suggest that the hash is part of the BPUT keyword, but simple experimentation with the BASIC interpreter, inserting a space as in BPUT #f%, 65, shows that the hash is a prefix to the integer representing a channel number.
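
The distinction is easy to demonstrate with a toy tokenizer (plain re here, not our PLY lexer, and the token names are illustrative): BPUT is a keyword token in its own right, while the hash is lexed separately as the prefix of the channel.

```python
import re

# Toy tokenizer: BPUT is its own keyword token; the hash belongs to the
# channel, so "BPUT #f%, 65" and "BPUT#f%,65" yield the same tokens.
TOKEN_RE = re.compile(r"""
    (?P<BPUT>BPUT)
  | (?P<HASH>\#)
  | (?P<IDENT>[A-Za-z][A-Za-z0-9]*%?)
  | (?P<COMMA>,)
  | (?P<INT>\d+)
  | (?P<WS>\s+)
""", re.VERBOSE)

def tokenize(text):
    return [m.lastgroup for m in TOKEN_RE.finditer(text)
            if m.lastgroup != "WS"]

print(tokenize("BPUT #f%, 65"))   # ['BPUT', 'HASH', 'IDENT', 'COMMA', 'INT']
print(tokenize("BPUT#f%,65"))     # ['BPUT', 'HASH', 'IDENT', 'COMMA', 'INT']
```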

The BPUT command has two syntaxes: both accept a channel number (a single value prefixed by a hash) and an expression, with the second form terminated by a semicolon. There is no mention here of the type of the expressions – type-checking is a phase distinct from parsing, which comes later. These two forms of BPUT are not one-for-one equivalent with the two forms given in the BBC BASIC guide,

        (1) BPUT#factor,numeric-expression
        (2) BPUT#factor,string-expression[;]

since our first rule matches syntaxes 1 and 2 (without the semicolon) and our second rule matches only syntax 2 (with the semicolon), illustrating how the syntax information in the manual must be converted into a YACC grammar suitable for PLY.

Creating the AST node

Crucially, our two syntax rules are of different lengths, allowing us to use the length to determine which rule was matched. By PLY convention, the rule must place its resulting AST node in p[0]. For either match we create a Bput AST node – a Python object representing BPUT – and configure it with keyword parameters for the channel number, the data to be output, and a flag indicating whether a newline should be output following the data (n.b. this also depends on the type of the data – we only output a newline for string data – but we account for that later in the type-checker, once we have determined the type). Each of these items will itself be the root of an AST subtree of nodes representing anything from a fairly mundane literal integer to a complex expression. As a final step in configuring our new AST node, we attach line-number information, which is only available from terminal tokens – in this case we use the line number from the BPUT token itself.

Declaring the AST node classes

Our requirements for our AST node classes are as follows:

  • a uniform interface to facilitate AST traversal in the later compiler stages
  • a convenient syntax for declaring the structure of the many different types of AST nodes the language requires
  • a convenient syntax for initializing new AST nodes as they are constructed by the grammar rules
  • a way to attach formal type information and other data, such as line numbers, to the nodes

Each AST node class started out being written longhand, but it was clear that this was going to mean a lot of boilerplate code for each class, in functions such as __init__ methods which could not easily be factored out into base classes. What was needed was a code generator for producing the classes from a simple specification, and this is where Python metaclasses come into their own.

Our specification for BPUT is something like this,

  1. A sub-node expression evaluating to a channel integer
  2. A sub-node expression evaluating to a string or a number (i.e. a non-array type – a scalar)
  3. A flag indicating whether a newline should be appended after the data

which we encode in Python using a declarative style as:

class AstNode(object):
    "AstNode - base class for all AST nodes"
    __metaclass__ = AstMeta
    formal_type = TypeOption()
    actual_type = TypeOption()
    line_num = IntegerOption()

    # ... Code omitted here ...

class AstStatement(AstNode):
    "AstStatement - base class for all statements"
    formal_type = TypeOption(VoidType) # All statements have a type of 'void'
    actual_type = formal_type

class Bput(AstStatement):
    channel = Node(formalType=ChannelType)
    data    = Node(formalType=IntegerType)
    newline = BoolOption(False)

The work of transforming this declarative style into the boilerplate code we need is done by the AstMeta metaclass, which introspects each class as it is constructed and converts the declarative class attributes above, such as channel, into the getters, setters, properties, initializers and storage we need in the resulting class. Our declared Nodes can accept AST subtrees, whereas our Option types store simple integer, string, or bool properties attached to the node.
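
A much-simplified AstMeta illustrates the mechanism (hypothetical code, using the Python 3 metaclass spelling for brevity; the real class also distinguishes Node children from Options and carries type information):

```python
class Option(object):
    "Declarative marker for a simple stored value with a default."
    def __init__(self, default=None):
        self.default = default

class AstMeta(type):
    """Turn each Option class attribute into a property backed by a
    per-instance dictionary, and gather defaults for initialisation."""
    def __new__(meta, name, bases, namespace):
        options = {k: v.default for k, v in namespace.items()
                   if isinstance(v, Option)}
        for key in options:
            def getter(self, _k=key):
                return self._options[_k]
            def setter(self, value, _k=key):
                self._options[_k] = value
            namespace[key] = property(getter, setter)
        cls = super(AstMeta, meta).__new__(meta, name, bases, namespace)
        # Merge defaults inherited from base classes with our own.
        cls._defaults = dict(getattr(cls, "_defaults", {}), **options)
        return cls

class AstNode(metaclass=AstMeta):
    def __init__(self, **kwargs):
        self._options = dict(self._defaults)
        self._options.update(kwargs)

class Bput(AstNode):
    channel = Option()
    data    = Option()
    newline = Option(False)

b = Bput(channel=1, data=65)
print(b.newline)   # False - the declared default
```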

Thus, the contents of the Bput class that is generated are quite different from what is shown above. When the Bput class is created, its metaclass actually produces something which behaves like this:

class AstNode(object):
    "AstNode - base class for all AST nodes"

    def __init__(self, lineNum = None):
        self.lineNum = lineNum

    # Line Number
    def _getLineNum(self):
        return self._getOption("line_num")

    def _setLineNum(self, line_num):
        self._setOption("line_num", line_num)

    lineNum = property(_getLineNum, _setLineNum)

class AstStatement(AstNode):
    "AstStatement - base class for all statements"

    def __init__(self, formalType = None, actualType = None, **kwargs):
        self.formalType = formalType
        self.actualType = actualType
        super(AstStatement, self).__init__(**kwargs)

    # Formal Type
    def _getFormalType(self):
        return self._getOption("formal_type")

    def _setFormalType(self, formal_type):
        self._setOption("formal_type", formal_type)

    formalType = property(_getFormalType, _setFormalType)

    # Actual Type
    def _getActualType(self):
        return self._getOption("actual_type")

    def _setActualType(self, actual_type):
        self._setOption("actual_type", actual_type)

    actualType = property(_getActualType, _setActualType)

class Bput(AstStatement):

    # Channel
    def _getChannel(self):
        return self._getProperty("channel")

    def _setChannel(self, channel):
        self._setProperty("channel", channel)

    channel = property(_getChannel, _setChannel)

    # Data
    def _getData(self):
        return self._getProperty("data")

    def _setData(self, data):
        self._setProperty("data", data)

    data = property(_getData, _setData)

    # Newline
    def _getNewline(self):
        return self._getOption("newline")

    def _setNewline(self, newline):
        self._setOption("newline", newline)

    newline = property(_getNewline, _setNewline)
This provides a uniform programming interface allowing simple manipulation of the Bput object itself, but also a standard interface to its child nodes, allowing simple traversal by the various visitors which transform and analyse the AST in the later stages of the compiler.

Visiting the AST

Once the AST has been created by the parser, it can be analysed or manipulated by a visitor. One such visitor, which is an aid to debugging, transcribes each AST node into an XML representation. This is useful since it allows the AST to be conveniently examined using standard XML capable tools (e.g. a web browser) to aid debugging.
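
A cut-down version of such a visitor might look like this (the Node interface here is hypothetical; the real AST nodes expose their children and options through the metaclass-generated properties):

```python
from xml.sax.saxutils import quoteattr

def to_xml(node, indent=0):
    """Recursively render an AST node as indented XML: simple options
    become attributes, child nodes become nested elements."""
    pad = "  " * indent
    name = type(node).__name__
    attrs = "".join(" %s=%s" % (k, quoteattr(str(v)))
                    for k, v in node.options.items())
    if not node.children:
        return "%s<%s%s />" % (pad, name, attrs)
    inner = "\n".join(to_xml(c, indent + 1) for c in node.children)
    return "%s<%s%s>\n%s\n%s</%s>" % (pad, name, attrs, inner, pad, name)

class Node(object):
    "Minimal stand-in for an AST node (hypothetical interface)."
    def __init__(self, options=None, children=None):
        self.options = options or {}
        self.children = children or []

class LiteralInteger(Node): pass
class Goto(Node): pass

tree = Goto({"line_num": 374},
            [LiteralInteger({"value": 378})])
print(to_xml(tree))
# <Goto line_num="374">
#   <LiteralInteger value="378" />
# </Goto>
```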

The following code, taken from logical line 374 (BASIC line number 378) of Acornsoft’s Sphinx Adventure,

378 IF ABS(L-26.5)<1 ANDO?34=26 THEN PROCR(27): GOTO378

results in the following AST when rendered as XML, after various other visitor passes have performed some simplification and type-checking,

<If line_num="374" actual_type="Void" formal_type="Void">
  <None />
  <CallProcedure line_num="374" actual_type="Void" formal_type="Void" name="R">
    <LiteralInteger line_num="374" value="27" actual_type="Integer" />
  </CallProcedure>
  <Goto line_num="374" actual_type="Void" formal_type="Void">
    <target_logical_line formal_type="Integer">
      <LiteralInteger line_num="374" value="378" actual_type="Integer" />
    </target_logical_line>
  </Goto>
  <condition formal_type="Integer">
    <And line_num="374" actual_type="Integer" formal_type="Integer">
      <lhs formal_type="Integer">
        <LessThan line_num="374" actual_type="Integer" formal_type="Integer">
          <AbsFunc line_num="374">
            <factor formal_type="Numeric">
              <Minus line_num="374" actual_type="Float">
                <lhs formal_type="Numeric">
                  <Variable line_num="374" actual_type="Float" identifier="L" />
                </lhs>
                <rhs formal_type="Numeric">
                  <LiteralFloat line_num="374" value="26.5" actual_type="Float" />
                </rhs>
              </Minus>
            </factor>
          </AbsFunc>
          <LiteralInteger line_num="374" value="1" actual_type="Integer" />
        </LessThan>
      </lhs>
      <rhs formal_type="Integer">
        <Equal line_num="374" actual_type="Integer" formal_type="Integer">
          <DyadicByteIndirection line_num="374" actual_type="Byte" formal_type="Byte">
            <base formal_type="Address">
              <Cast line_num="374" actual_type="Address" source_type="Float" target_type="Address" formal_type="Address">
                <Variable line_num="374" actual_type="Float" identifier="O" />
              </Cast>
            </base>
            <offset formal_type="Integer">
              <LiteralInteger line_num="374" value="34" actual_type="Integer" />
            </offset>
          </DyadicByteIndirection>
          <LiteralInteger line_num="374" value="26" actual_type="Integer" />
        </Equal>
      </rhs>
    </And>
  </condition>
</If>

Categories: .NET, computing, OWL BASIC, software