Archive

Archive for June, 2007

Compiling variant function return types in BBC BASIC

June 19th, 2007 4 comments

Some significant challenges to compiling BBC BASIC have been raised by Richard Russell, maintainer of BBC BASIC for Windows. I have decided to explore the issues he raises through further posts here, rather than through responses to his earlier comments, since my replies will be quite long.

Today, we will look at the problem of the return type of functions not being known until run-time. Richard presented an example which, by his own admission, is somewhat contrived, but which nonetheless serves to illustrate the point. The same issue also applies to the built-in function EVAL, which may return any of the standard BBC BASIC types.

Richard’s example was for FNunsuretype, reproduced below, which I’ve incorporated into a simple, runnable program:

a% = FNunsuretype
b$ = FNunsuretype
PRINT FNunsuretype
END

DEF FNunsuretype : LOCAL R%
R% = RND(2)
IF R% = 1 THEN = 12345 ELSE = "a string"

As we can see, the function gives a fifty-fifty chance of returning either an integer or a string. Obviously the program has a 75% chance of failing with a Type Mismatch error when run, but we’ll ignore that for now. Let’s take a look at the equivalent code in C#, which is most definitely a compiled language:

using System;

class Program
{
    static System.Random rnd = new System.Random();

    static void Main(string[] args)
    {
        int a = (int) UnsureType();
        string b = (string) UnsureType();
        Console.WriteLine(UnsureType());
    }

    static object UnsureType()
    {
        int r = rnd.Next(2); // 0 or 1, each with equal probability
        if (r == 0)
        {
            return 12345;
        }
        else
        {
            return "a string";
        }
    }
}

Here the function UnsureType() also returns either a number or a string with 50% probability, but we use the boxing feature of C# to return the value as an object: the integer is boxed, while the string, being a reference type, is simply upcast. Note that to recover the value, we must use a cast, although not in the case of Console.WriteLine, which calls the ToString method on the object, which in turn dispatches to the boxed integer or to the string. Note that this program also has a 75% chance of failing with an InvalidCastException, which is equivalent to a Type Mismatch in BBC BASIC.

Using static analysis techniques it should be possible to determine at compile time whether a BBC BASIC function needs to box its return value in an object. Most functions will have exit points where the return type is well-defined, or can be inferred from the type of the returned expression. Such functions can be compiled with a fixed return type. Those functions which have multiple but differing return types, or unknowable return types, such as returned by EVAL, will need to box their return values in an object. It should then be possible at the call-site of the function to determine whether an implicit cast is needed to convert the returned object to a useful value.
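The decision procedure at the heart of this analysis can be sketched in a few lines of Python. The function and type names here are hypothetical illustrations, not part of any real compiler: assume each function's exit points have already been tagged with an inferred type, or with None where the type is unknowable (as for EVAL).

```python
# A minimal sketch of the return-type analysis described above, assuming a
# hypothetical representation in which each exit point of a function carries
# an inferred type tag ('integer', 'string', 'float'), or None when the type
# cannot be known at compile time (e.g. a value returned from EVAL).

def infer_return_type(exit_point_types):
    """Return the single static type if all exit points agree,
    or 'object' if the function must box its return value."""
    distinct = set(exit_point_types)
    if None in distinct or len(distinct) != 1:
        return 'object'    # multiple or unknowable types: box in an object
    return distinct.pop()  # one well-defined type: compile a fixed return

# FNunsuretype has an integer exit and a string exit, so it must box:
print(infer_return_type(['integer', 'string']))   # 'object'
# A function whose exits all return integers gets a fixed return type:
print(infer_return_type(['integer', 'integer']))  # 'integer'
```

A call-site pass could then insert the implicit cast only when the callee's inferred return type is 'object'.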

Categories: .NET, computing, OWL BASIC, software Tags:

BBC BASIC lexer with PLY

June 11th, 2007 No comments

So having decided to build a BBC BASIC compiler for the CLR, in IronPython using PLY, the first step is to build a functioning lexer to tokenize the text input of a BBC BASIC program. BBC BASIC is normally stored in a binary tokenized format, where each keyword is stored as a one- or two-byte value, although I won’t be supporting that format at this stage in the project. BBC Basic for Windows (BB4W) also uses a tokenized storage format, although this page suggests that the token formats are different. For now, we’ll stick to plain text.

The function of the lexer is to break down the stream of text into discrete tokens which can be passed on to the parser for syntactic analysis.

Identity Parade

My first attempt at building the lexer was by the PLY book. Most of the tokens for operators were specified as simple strings, and the BASIC commands such as INPUT, PRINT, TAB and LEFT$ were handled in the manner described in the PLY manual, by using a dictionary mapping keywords to token names, and using the token rule for identifiers to catch these:

reserved = {
    'ABS' : 'ABS',
    'ACS' : 'ACS',
    'ADVAL' : 'ADVAL',

    ...

    'STR$' : 'STR_STR',
    'STRING$' : 'STRING_STR',

    ...

    'VPOS' : 'VPOS',
    'WAIT' : 'WAIT'
}

def t_ID(t):
    r'[@a-zA-Z_`][a-zA-Z_0-9`]*[$%&]?'
    t.type = reserved.get(t.value, 'ID') # Check for reserved
    return t

Note how the regex specifying a valid identifier has been adjusted for BBC BASIC identifiers, which can take the following forms: A% (static integer variable), bravo% (integer variable), charlie& (byte variable – from BB4W), delta (float variable) and echo$ (string variable). I was also surprised to learn that ASCII 96 is valid in identifiers! On the BBC Micro this was the £ symbol, but it is more correctly the ` grave accent, including on RISC OS.
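As a quick sanity check, the identifier pattern from t_ID does match all five example forms in full, as we can verify with Python's re module:

```python
import re

# The identifier pattern from the t_ID rule, exercised against the
# example identifier forms listed above.
ident = re.compile(r'[@a-zA-Z_`][a-zA-Z_0-9`]*[$%&]?')

for name in ['A%', 'bravo%', 'charlie&', 'delta', 'echo$']:
    m = ident.match(name)
    print('%s -> %s' % (name, m.group(0)))  # each matches in full
```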

I soon discovered several problems with this simple scheme. In general, BBC BASIC keywords cannot form the first part of an identifier. For example, FOREVER will be lexed as ('FOR', 'FOR'), ('ID', 'EVER'), where the first member of each tuple is the token type, and the second is the token value.

The PLY manual has the following to say on the topic:

Note: You should avoid writing individual rules for reserved words. For example, if you write rules like this,

t_FOR = r'for'
t_PRINT = r'print'

those rules will be triggered for identifiers that include those words as a prefix such as “forget” or “printed”. This is probably not what you want.

Au contraire!

In the case of BBC BASIC this is (almost) exactly what we want.

However, there is additional complexity. For obscure reasons, possibly to do with squeezing all this into the 16 kB of the original ROM, there are 38 exceptions to this rule; for example, TIMER is a valid identifier, even though it begins with the keyword TIME. And as if this complexity weren't enough, some keywords form the prefix of other keywords. In fact this is surprisingly common: GET > GET$; END > ENDIF, ENDPROC, ENDCASE, ENDWHILE; INKEY > INKEY$; OF > OFF; TO > TOP. So we must ensure the lexer matches longer keywords before shorter ones.
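The effect of match ordering can be sketched with a plain regular-expression alternation (a toy illustration, not the PLY rules themselves): an alternation tried longest-first resolves the prefix clashes above, whereas shortest-first would mislex ENDWHILE as END followed by garbage.

```python
import re

# A handful of the clashing keywords listed above.
keywords = ['GET$', 'GET', 'ENDWHILE', 'ENDPROC', 'END',
            'INKEY$', 'INKEY', 'OFF', 'OF', 'TOP', 'TO']

# Build an alternation with the longest keywords first, so that a regex
# engine that takes the first matching alternative prefers ENDWHILE to END.
longest_first = re.compile('|'.join(
    re.escape(k) for k in sorted(keywords, key=len, reverse=True)))

print(longest_first.match('ENDWHILE').group(0))  # 'ENDWHILE', not 'END'
print(longest_first.match('GET$').group(0))      # 'GET$', not 'GET'
```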

By default, PLY sorts lexing rules specified as strings by decreasing length, after all of the lexing rules specified by functions. Since our rule for identifiers must be a function, somewhat tediously, all of the keywords which cannot form the start of identifiers must have tokens specified by functions placed before the rule for identifiers. Keywords which can form the start of identifiers are handled using the original mechanism of looking up their value in a dictionary mapping values to token names.

Following the above changes, our lexer code is now structured like this:

# Nine letter keywords
def t_OTHERWISE(t):
    r'OTHERWISE'
    return t

def t_RECTANGLE(t):
    r'RECTANGLE'
    return t

# Eight letter keywords
def t_ENVELOPE(t):
    r'ENVELOPE'
    return t

# Seven letter keywords
def t_ELLIPSE(t):
    r'ELLIPSE'
    return t

...

def t_OR(t):
    r'OR'
    return t

def t_TO(t):
    r'TO'
    return t

# Reserved identifiers
reserved = {
    'ENDWHILE' : 'ENDWHILE',
    'ENDCASE' : 'ENDCASE',
...
    'BY' : 'BY',
    'OF' : 'OF',
    'PI' : 'PI'
}

def t_ID(t):
    r'[@a-zA-Z_`][a-zA-Z_0-9`]*[$%&]?'
    t.type = reserved.get(t.value, 'ID') # Check for reserved identifiers
    return t

In this code, use of a dictionary is overkill, since all the keys and values are the same; a set would suffice.
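For illustration, the set-based version might look like this (a sketch with a hypothetical classify helper, not the posted code):

```python
# Reserved words which may legally begin an identifier, kept as a set
# since the token type is simply the keyword itself.
reserved = {'ENDWHILE', 'ENDCASE', 'BY', 'OF', 'PI'}

def classify(value):
    """Token type for an identifier-shaped lexeme."""
    return value if value in reserved else 'ID'

print(classify('ENDWHILE'))  # 'ENDWHILE'
print(classify('bottles'))   # 'ID'
```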

Finally, some keywords are ‘decorated’ by trailing symbols. A quick flick through the BBC Microcomputer System User Guide [2.6 MB PDF] suggests that the trailing hash of BGET# and the trailing dollar of CHR$ are integral parts of the keywords. However, some investigation on the command line will reveal that all is not consistent!

BBC BASIC keyword tokens

Here we can see that the first use of BGET#3 is understood, since the interpreter tells us that we have supplied an invalid channel number. In apparent contradiction to the User Guide, the second form BGET #3, with a space before the hash, is also understood, as again the interpreter complains about the channel number we supplied. So in this case, and indeed for all the other keywords ‘suffixed’ with a hash in the manual, we find that the hash is not considered to be part of the keyword. Interestingly, if we omit the hash completely (not shown in the session above), BASIC complains with an error, Missing #. This suggests that the hash is actually a prefix for the integer channel number: in effect, it is an operator which transforms an integer into a channel.

The example with CHR$(96) illustrates that the dollar is an integral part of the keyword, since if a space is inserted before the dollar, BASIC thinks CHR is a variable identifier. Since CHR$ is a function, rather than a statement, the parentheses here are optional.

The final example is the most surprising, and caused me a good deal of confusion for a while. As we can see, both TAB and SPC work with parentheses, and SPC will happily work without them. Nonetheless, TAB insists on the parentheses being present, and furthermore, will not permit any space before the opening parenthesis! Therefore, the opening parenthesis is effectively part of the TAB( keyword token.

Further investigation reveals that several other keywords fall into the same category, some of them being decorated with more than one trailing symbol: RIGHT$( , STRING$( , INSTR( , LEFT$( , POINT( , MID$( and of course TAB(.

All of this needs to be incorporated into the lexer definition, so that the correct rule for RIGHT$( becomes,

def t_RIGHT_STR_LPAREN(t):
    r'RIGHT\$\('
    return t

Smooth operator

The rules for operators are standard, and can be implemented as token strings. PLY will sort them by decreasing length in order to get >>> to match before >>. I’ve decided to implement the extended set of operators supported by BB4W, such as *=.

# Operators
t_QUERY = r'\?'
t_PLING = r'\!'
t_PIPE = r'\|'
t_HASH = r'\#'
...
t_LBRAC = r'\['
t_RBRAC = r'\]'
t_CARET = r'\^'
t_TILDE = r'~'
t_DOT = r'\.'

Note that the hash from BGET# also makes it into this list as a ‘channel’ operator.

No comment

Comments in BASIC start with REM and continue until the next carriage return. Here we just look for REM and discard everything up to the end of the line:

def t_COMMENT(t):
    r'REM[^\n]*'
    pass # returning nothing discards the token

I also decided to implement the line continuations supported by BB4W. These use a trailing backslash on one line, followed by a leading backslash on the subsequent line, in order to split long lines over several lines in the file. Again, this is easily dealt with using a regular expression.

# Define a rule so we can split lines with a
# trailing backslash and leading backslash
def t_CONTINUATION(t):
    r'\\[ \t]*[\r\n][ \t]*\\'
    t.lexer.lineno += 1
    pass # returning nothing discards the token
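As a sanity check, a continuation pattern of this shape can be exercised with Python's re module. Note that inside the raw string each literal backslash in the pattern must be written as \\, since a bare \ would instead escape the character that follows it:

```python
import re

# Backslash, optional whitespace, a line break, optional whitespace, backslash.
cont = re.compile(r'\\[ \t]*[\r\n][ \t]*\\')

# A BB4W-style continued line, split with a trailing and a leading backslash.
line = 'PRINT "Hello, " \\\n    \\"world"'

joined = cont.sub(' ', line)
print(joined)  # the two physical lines are joined into one
```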

Literally, how long is a piece of string?

The final pieces of the puzzle are the literals, which in the case of BBC BASIC means strings, floats, and integers in decimal, hex or binary.

Strings represent a particular challenge. Unlike most programming languages, which use a backslash to escape special characters in literal strings, BBC BASIC supports escaping of only one character, the double-quote (ASCII 34), using the convention of doubling: "Nat ""King"" Cole" gives the string Nat "King" Cole. This is identified using a cryptic regex, followed by replacement of doubled double-quotes with single double-quotes.

def t_LITERAL_STRING(t):
    r'"((?:[^"]+|"")*)"(?!")'
    t.value = t.value[1:-1].replace('""', '"')
    return t

This regex reads something like: A double-quote followed by zero or more non-double-quote characters or a pair of double-quotes, followed by a double quote, that is not followed by another double quote. Phew!
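We can check that the regex behaves as described by exercising it directly with Python's re module:

```python
import re

# The string-literal pattern from t_LITERAL_STRING.
literal = re.compile(r'"((?:[^"]+|"")*)"(?!")')

# The Nat King Cole example: doubled double-quotes inside the literal.
m = literal.match('"Nat ""King"" Cole"')
print(m.group(1).replace('""', '"'))  # Nat "King" Cole
```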

The literal numbers are straightforward:

def t_LITERAL_FLOAT(t):
    r'[+-]?\d*\.\d+(E([+-]?\d+))?'
    try:
        t.value = float(t.value)
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0.0
    return t

def t_LITERAL_INTEGER(t):
    r'\d+'
    try:
        t.value = int(t.value)
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t

def t_LITERAL_HEX_INTEGER(t):
    r'&[\dA-Fa-f]+'
    try:
        t.value = int(t.value[1:], 16)
        t.type = 'LITERAL_INTEGER'
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t

def t_LITERAL_BINARY_INTEGER(t):
    r'%[01]+'
    try:
        t.value = int(t.value[1:], 2)
        t.type = 'LITERAL_INTEGER'
    except ValueError:
        print "Number %s is too large!" % t.value
        t.value = 0
    return t

Note how the token types of the LITERAL_HEX_INTEGER and LITERAL_BINARY_INTEGER are converted to LITERAL_INTEGER.
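The conversions themselves are just Python's int with an explicit base, after stripping the & or % prefix; for illustration (the helper names here are hypothetical, not the lexer rules):

```python
# How the hex and binary rules above normalise their values.
def hex_value(lexeme):
    """Value of a lexeme matched by t_LITERAL_HEX_INTEGER, e.g. '&FF'."""
    return int(lexeme[1:], 16)

def binary_value(lexeme):
    """Value of a lexeme matched by t_LITERAL_BINARY_INTEGER, e.g. '%1010'."""
    return int(lexeme[1:], 2)

print(hex_value('&FF'))      # 255
print(binary_value('%1010')) # 10
```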

Putting it all together…

Putting all this together, with some standard error handling, ignorable whitespace characters and newline detection, we have a functioning lexer. Using some boilerplate to read the contents of a text file and feed it into the lexer, we can test our construction on some genuine code. Using the BBC BASIC implementation of “Ninety-nine bottles of beer” shown below,

REM BBC BASIC version
REM by Stelio Passaris
REM http://www.stelio.net
REM Spacing around commands is not necessary.
:
a$=" bottle"
s$="s"
b$=" of beer"
c$=" on the wall"
t$="Take one down and pass it around:"
:
VDU14:REM enables paging mode; press shift to page down.
REM Substitute above line with "VDU2" to send text to printer.
FOR i%=99 TO 1 STEP -1
IF i%=1 THEN s$=""
IF i%<99 THEN PRINT;i%;a$;s$;b$;c$;"."'
PRINT;i%;a$;s$;b$;c$;","'i%;a$;s$;b$;"."'t$
NEXT i%
PRINT"No more";a$;"s";b$;c$;"."
VDU15:REM disables paging mode
REM Substitute above line with "VDU3" to send text to printer.

yields the following tokens,

> python bbc_lexer.py ninety_nine_bottles.bas
LexToken(EOL,'\n',1,21)
LexToken(EOL,'\n',2,44)
LexToken(EOL,'\n',3,70)
LexToken(EOL,'\n',4,116)
LexToken(COLON,':',5,117)
LexToken(EOL,'\n',5,118)
LexToken(ID,'a$',6,119)
LexToken(EQ,'=',6,121)
LexToken(LITERAL_STRING,' bottle',6,122)
LexToken(EOL,'\n',6,131)
LexToken(ID,'s$',7,132)
LexToken(EQ,'=',7,134)
LexToken(LITERAL_STRING,'s',7,135)
LexToken(EOL,'\n',7,138)
LexToken(ID,'b$',8,139)
LexToken(EQ,'=',8,141)
LexToken(LITERAL_STRING,' of beer',8,142)
LexToken(EOL,'\n',8,152)
LexToken(ID,'c$',9,153)
LexToken(EQ,'=',9,155)
LexToken(LITERAL_STRING,' on the wall',9,156)
LexToken(EOL,'\n',9,170)
LexToken(ID,'t$',10,171)
LexToken(EQ,'=',10,173)
LexToken(LITERAL_STRING,'Take one down and pass it around:',10,174)
LexToken(EOL,'\n',10,209)
LexToken(COLON,':',11,210)
LexToken(EOL,'\n',11,211)
LexToken(VDU,'VDU',12,212)
LexToken(LITERAL_INTEGER,14,12,215)
LexToken(COLON,':',12,217)
LexToken(EOL,'\n',12,268)
LexToken(EOL,'\n',13,331)
LexToken(FOR,'FOR',14,332)
LexToken(ID,'i%',14,336)
LexToken(EQ,'=',14,338)
LexToken(LITERAL_INTEGER,99,14,339)
LexToken(TO,'TO',14,342)
LexToken(LITERAL_INTEGER,1,14,345)
LexToken(STEP,'STEP',14,347)
LexToken(MINUS,'-',14,352)
LexToken(LITERAL_INTEGER,1,14,353)
LexToken(EOL,'\n',14,354)
LexToken(IF,'IF',15,355)
LexToken(ID,'i%',15,358)
LexToken(EQ,'=',15,360)
LexToken(LITERAL_INTEGER,1,15,361)
LexToken(THEN,'THEN',15,363)
LexToken(ID,'s$',15,368)
LexToken(EQ,'=',15,370)
LexToken(LITERAL_STRING,'',15,371)
LexToken(EOL,'\n',15,373)
LexToken(IF,'IF',16,374)
LexToken(ID,'i%',16,377)
LexToken(LT,'<',16,379)
LexToken(LITERAL_INTEGER,99,16,380)
LexToken(THEN,'THEN',16,383)
LexToken(PRINT,'PRINT',16,388)
LexToken(SEMICOLON,';',16,393)
LexToken(ID,'i%',16,394)
LexToken(SEMICOLON,';',16,396)
LexToken(ID,'a$',16,397)
LexToken(SEMICOLON,';',16,399)
LexToken(ID,'s$',16,400)
LexToken(SEMICOLON,';',16,402)
LexToken(ID,'b$',16,403)
LexToken(SEMICOLON,';',16,405)
LexToken(ID,'c$',16,406)
LexToken(SEMICOLON,';',16,408)
LexToken(LITERAL_STRING,'.',16,409)
LexToken(APOSTROPHE,"'",16,412)
LexToken(EOL,'\n',16,413)
LexToken(PRINT,'PRINT',17,414)
LexToken(SEMICOLON,';',17,419)
LexToken(ID,'i%',17,420)
LexToken(SEMICOLON,';',17,422)
LexToken(ID,'a$',17,423)
LexToken(SEMICOLON,';',17,425)
LexToken(ID,'s$',17,426)
LexToken(SEMICOLON,';',17,428)
LexToken(ID,'b$',17,429)
LexToken(SEMICOLON,';',17,431)
LexToken(ID,'c$',17,432)
LexToken(SEMICOLON,';',17,434)
LexToken(LITERAL_STRING,',',17,435)
LexToken(APOSTROPHE,"'",17,438)
LexToken(ID,'i%',17,439)
LexToken(SEMICOLON,';',17,441)
LexToken(ID,'a$',17,442)
LexToken(SEMICOLON,';',17,444)
LexToken(ID,'s$',17,445)
LexToken(SEMICOLON,';',17,447)
LexToken(ID,'b$',17,448)
LexToken(SEMICOLON,';',17,450)
LexToken(LITERAL_STRING,'.',17,451)
LexToken(APOSTROPHE,"'",17,454)
LexToken(ID,'t$',17,455)
LexToken(EOL,'\n',17,457)
LexToken(NEXT,'NEXT',18,458)
LexToken(ID,'i%',18,463)
LexToken(EOL,'\n',18,465)
LexToken(PRINT,'PRINT',19,466)
LexToken(LITERAL_STRING,'No more',19,471)
LexToken(SEMICOLON,';',19,480)
LexToken(ID,'a$',19,481)
LexToken(SEMICOLON,';',19,483)
LexToken(LITERAL_STRING,'s',19,484)
LexToken(SEMICOLON,';',19,487)
LexToken(ID,'b$',19,488)
LexToken(SEMICOLON,';',19,490)
LexToken(ID,'c$',19,491)
LexToken(SEMICOLON,';',19,493)
LexToken(LITERAL_STRING,'.',19,494)
LexToken(EOL,'\n',19,497)
LexToken(VDU,'VDU',20,498)
LexToken(LITERAL_INTEGER,15,20,501)
LexToken(COLON,':',20,503)
LexToken(EOL,'\n',20,528)
LexToken(EOL,'\n',21,591)

which surely counts as a great first step towards our BBC BASIC compiler for the CLR. Next time we’ll take a look at the grammar of BBC BASIC, and begin constructing a parser.

Categories: .NET, computing, OWL BASIC, software Tags:

Writing a BBC BASIC compiler for the CLR

June 10th, 2007 29 comments

Of late, I’ve picked up an interest in the CLR through learning C#. After being impressed by the performance of the .NET runtime (less so for Mono, although that’s another story) I decided that an effective way to really get to grips with the system would be to write a compiler which targets the CLR.

During a nostalgic poke around the web relating to Acorn hardware, over 12 years since my last contact with BBC BASIC, I stumbled across Richard Russell’s BBC BASIC for Windows (BB4W) which reminded me of my formative years learning to code on the BBC Model B and later various Acorn Archimedes computers under RISC OS. This led me to start down the road of implementing a BBC BASIC compiler for the CLR, although the original BBC BASIC, and BB4W are interpreted. The original BBC BASIC II resided in only 16 kB of ROM, so it can’t be that much work. Can it?

Russell’s BBC BASIC for Windows implementation actually goes to great lengths to provide good backwards compatibility with software originally developed for the Beeb or RISC OS, such as providing a Teletext MODE 7 equivalent, and ‘emulated’ sound and graphics support.

I will be taking a different tack with this project. Rather than focus on getting old code reliant on obsolete hardware features to run, I will focus on making a viable BBC BASIC compiler for the CLR, which could be used to port existing programs to .NET/Mono. It is likely that OS- or machine-specific functionality such as sound and graphics will be implemented in significantly different ways compared to the originals, for example, by giving good language-level integration with the System.Console class in .NET, rather than trying to emulate screen modes from Acorn hardware.

The toolset I settled on is as follows:

  • IronPython in which to write the compiler
  • PLY – the Python Lex/Yacc-a-like
  • BeebEm – BBC Micro emulator with BBC BASIC II
  • Arculator – Acorn Archimedes emulator with BBC BASIC V
  • Documentation in the form of BBC Micro and Archimedes user manuals, and various other information and examples on the web

Mostly it’s an experiment, and for the fun of writing a compiler. I don’t expect anybody to start creating large software systems in BBC BASIC as a result of my efforts. Far from it; there are much better tools today. However, I must admit to a secret aim of producing the fastest BBC BASIC implementation yet!

Categories: .NET, 8 bit, computing, OWL BASIC, software Tags: