Defining Languages

We are reaching the point where we need to identify the exact instruction set we want our machine to process. Associated with that decision, is the designing of a language for the machine. Since humans are very bad at working with numbers, we will use simple codes to represent instructions, and notation that looks familiar to most programmers to design our language.

This language is very close to the machine, and must be translated into a form the machine can understand. We call the language Assembly Language and the tool we will use to process this language is called an Assembler.

Designing a language (and a processing tool) is a fascinating area of computer science. We will take a brief look at how languages are specified, then use these techniques to design our assembly language.

Note

We do not have time to get into the generation of an assembler in this class. Instead, our language will be simple enough that translating our code by hand will be fairly easy.

Formal Languages

The topic we are headed into is called “Formal Language Theory”, something you will learn more about when you take a compiler course. For our purposes, we need to explore hoe to define a simple language, and then process files containing those languages.

Note

In some circles, this is called building “little languages” to help in building a bigger project.

Defining a Language

There is a simple language we can use to define a new language. It is called Backus-Naur Form, a simple notation for defining exactly what constitutes a legal sentence in some language. We will us a variation of this notation, which called Extended BNF, or *EBNF, for our project.

Here is the definition of EBNF, written in EBNF:

grammar ::= { rule } ;

rule ::= identifier "::=" expression ";" ;

identifier ::== letter { character } ;

expression ::= term { "|" term } ;

term ::= factor { factor } ;

factor ::= identifier | terminal | group | repetition | optional ;

option ::= "[" expression "]" ;

group ::= "(" expression ") ;

repetition ::= "{" expression "}" ;

terminal ::= "'" character { character } "'"
           | '"' character { character } '"' ;

character ::= letter | digit | "_" ;

letter ::= "A" | "B" | "C" | "D" | "E" | "F" | "G"
    | "H" | "I" | "J" | "K" | "L" | "M" | "N"
    | "O" | "P" | "Q" | "R" | "S" | "T" | "U"
    | "V" | "W" | "X" | "Y" | "Z" | "a" | "b"
    | "c" | "d" | "e" | "f" | "g" | "h" | "i"
    | "j" | "k" | "l" | "m" | "n" | "o" | "p"
    | "q" | "r" | "s" | "t" | "u" | "v" | "w"
    | "x" | "y" | "z" ;

digit ::= "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9" ;

The notation formally defines the syntax of the language, but does not really explain what is going on here. We need to a add bit of extra detail to define the “semantics* of something written in this language. EBNF does not address this part of language design, and we will just use an informal scheme to explain things here:

  • ::= : is defined as
  • unquoted words : Non-terminal symbols (name another rule)
  • (…) : precedence grouping
  • […] : optional symbols
  • {…} : symbols repeated zero or more times
  • “|” : alternative (“or”)
  • Quoted text (“…” or ‘…’) : Terminal symbols (must appear exactly

Sometimes, you will see a graphical representation of a language defined using EBNF. Here is such a diagram for EBNF itself, produced using a neat online tool*railroad diagram* to see what a legal program looks like:

Here are the diagrams defining EBNF (from http://www.bottlecaps.de/rr/ui):

Note

This tool uses an alternative form of EBNF. Instead of “{“..”}”, it uses “(“..”)?”, and instead of “{“..”}”, it used “(“..”)*”. The end of the rule is not marked by a semicolon.

grammar:
grammar
rule:
rule
identifier:
identifier
expression:
expression
term:
term
factor:
factor
group:
group
option:
option
repetition:
repetition
terminal:
terminal
terminal:
character
letter:
letter
digit:
digit
symbol:
symbol

With a little practice, you should be able to use this nottion to define simple languages. We will use a simple scheme called recursive descent parsing to process our little languages.

Note

This technique is pretty old. In fact, one of my favorite books, dating back to 1976, showed how to build a full compiler for a Pascal-like language. The author was the designer of the Pascal Language: Niklaus Wirth ([Wir76]) This example was used almost directly to build many of the early compilers for the personal computer.

Processing Text Files

Once we have a formal definition of the syntax of a language, we can build a tool to process files written in that language. We will not be building a full compiler here, but we will introduce the basic parts of such a tool.

The basic processing of any program file follows this pattern:

  • source_file -> scanner -> lexer -> parser

Scanner

The scanner processes the source file as a stream of characters. It usually keeps track of things line the line number, and character position on a line for each character it handles. Scanners know nothing about the language being processed. They do know about the character set being used. (We will keep things simple, and just use ASCII).

Lexer

The lexer breaks up the stream of characters into identifiable chunks we will call tokens. Tokens include things like identifiers, numbers, special punctuation, etc.

In most programming languages whitespace is ignored. This includes spaces, tabs, and newlines. In a few languages, the white-space must be treated carefully. Python uses indentation as a significant part of the language, we we need to handle indents and dedents properly. This can get messy, so we will not be doing that in our work.

Parser

The parser’s job is to check the rules. The parser will accept a stream of tokens only if they fully obey all of the rules defined for the language. Writing a parser is something you need to do at some point. We will use a technique called recursive descent parsing to derive our parser code directly from the EBNF rules.

Warning

I am skipping a lot of things you will learn in that compiler course. Our languages will be very simple!

As the parser recognizes constructs from our language, it can be taught to generate output. In a real compiler, the parser will build an internal data structure representing the program processed, then analyze that structure to generate the final product of the compiling process. We will be integrating our parser into our simulator code, so it may not be producing any output directly. (We may generate some output for debugging purposes, though!)

EBNF Definition od a Machine

In our last lecture, we presented a simple machine definition file, written in a hardware description language that was not fully defined. Let’s fix that by setting up EBNF rules for that “language”. This will allow us to use some nice tools to process that system design file:

EBNF Definition of a Machine

We build a set of EBNF rules by working “top-down” from conceptial blocks to a fully defined set of ruless. This is exactly what we programmers are trained to do when we use “decomposition” to solve problems.

HDL ::=
    { component_block | connection_block } ;

component_block ::=
    "PART" { part_def } "END" ;

connection_block ::=
    "CONNECTION { connection_def { "END" ;

part_def ::=
    identifier ":" part_type ;

connection_def ::=
    identifier "." out_port "->"  identifier "." in_port ;

part_type ::= "Component" ;

out_port ::= "out" ;

in_port ::= "in" ;

It turns out that there are many tools available for taking a specification like this and turning it into a data file for a generator tool that can produce a compiler for your “language”. As an example, here is a Python program that will read this specification and analyze it. I am working with the Python Tatsu project to build an assembler for our simulator:

Here is the machine definition file we used in one of our lab projects:

machine.hdl
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
PARTS
    c1 : Component
    c2 : Component
END

CONNECTIONS
    c1.out -> c2.in
END


This is the EBNF I created to define the language used in constructing this file. This file has been set up to be processed by Tatsu.

hdl.ebnf
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
@@grammar::HDL


start 
    = 
    { component_block 
    | connection_block 
    }
    $
    ;


component_block 
    =
    'PARTS'
    { part_def }
    'END'
    ;


connection_block
    =
    'CONNECTIONS'
    { connection_def }
    'END'
    ;

part_def 
    =
    'PART" identifier
    'TYPE' component
    'PINS' { pins }
    'END'
    ;

pins
    =
    ( 'IN' | 'OUT' )
    identifier



connection_def
    =
    identifier
    '.'
    'out'
    '->'
    identifier
    '.'
    'in'
    ;

identifier                                                                 
    =                                                                   
    /[a-zA-Z][a-zA-Z0-9]*/ 
    ;


Finally, here is a simple example of a Python script that will generate a parser for our language, then run that parser and process our HDL file.

parser.py
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
from json import dumps
import pprint
import json

import tatsu
from tatsu.util import asjson

def simple_parse():
    grammar = open('hdl.ebnf').read()
    code = open('machine.hdl').read()

    parser = tatsu.compile(grammar, asmodel=True)
    ast = parser.parse(code)
    print('# PPRINT')
    pprint.pprint(ast, indent=2, width=20)
    print()

    print('# JSON')
    print(json.dumps(asjson(ast), indent=2))
    print()

def main():
    simple_parse()


if __name__ == '__main__':
    main()

The result is just a dump of what the parserproduced, souwing that it accepted the input file and there were no syntax errors.

$ python3 parser.py
# PPRINT
[ [ 'PARTS',
    [ [ 'c1',
        ':',
        'Component'],
      [ 'c2',
        ':',
        'Component']],
    'END'],
  [ 'CONNECTIONS',
    [ [ 'c1',
        '.',
        'out',
        '->',
        'c2',
        '.',
        'in']],
    'END']]

We could take this parser generator and teach it to generate our machine if we were building this simulator in Python. We will not be doing that here. Instead, I will show an EBNF definition for our assembly language and we will generate the program data file we need to loead into out simulator to make it run a real “program”.