CS 334
Programming Languages
Spring 2002

Lecture 6


Programming language creates a virtual machine for programmer

Dijkstra: Originally we were obligated to write programs so that a computer could execute them. Now we write the programs and the computer has the obligation to understand and execute them.

Progress in programming language design marked by increasing support for abstraction.

Computer at lowest level is set of charged particles racing through wires w/ memory locations set to on and off - very hard to deal with.

In computer organization look at higher level of abstraction: interpret sequences of on/off as data (reals, integers, char's, etc) and as instructions.

Computer looks at current instruction and contents of memory, then does something to another chunk of memory (incl. registers, accumulators, program counter, etc.)

When write Pascal (or other language) program - work with different virtual machine.

Language creates the illusion of more sophisticated virtual machine.

Pure translators




Execution of program w/ compiler:


We will speak of virtual machine defined by a language implementation.

Machine language of virtual machine is set of instructions supported by translator for language.

Layers of virtual machines on Mac: Bare PowerPC chip, OpSys virtual machine, Lightspeed Pascal machine, application program's virtual machine.

We will describe language in terms of virtual machine

Slight problem:

May lead to different implementations of same language - even on same machine.

Problem: How can you ensure different implementations result in same semantics?

Sometimes virtual machines made explicit:

Compilers and Interpreters

While a few special-purpose chips exist that execute high-level languages directly (e.g., LISP machines), most languages have to be translated into machine language.

Two extreme solutions:

Pure interpreter: Simulate virtual machine (our approach to run-time semantics)

    REPEAT
        Get next statement
        Determine action(s) to be executed
        Call routine to perform action
    UNTIL done
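The loop above can be sketched in Python. This is only an illustration under stated assumptions: the two-instruction mini-language (`set`, `add`) and the names `interpret`, `do_set`, and `do_add` are hypothetical and do not come from the course materials.

```python
def interpret(program):
    """Simulate a tiny virtual machine; statements are (op, name, value) tuples."""
    memory = {}                              # the virtual machine's store

    def do_set(name, value):
        memory[name] = value

    def do_add(name, value):
        memory[name] = memory[name] + value

    routines = {"set": do_set, "add": do_add}   # one routine per kind of action
    pc = 0                                   # index of the next statement
    while pc < len(program):                 # repeat until done
        op, name, value = program[pc]        # get next statement
        routine = routines[op]               # determine action to be executed
        routine(name, value)                 # call routine to perform action
        pc += 1
    return memory


print(interpret([("set", "x", 1), ("add", "x", 41)]))   # {'x': 42}
```

Note that a statement is decoded again every time it is reached, which is the cost the comparison below attributes to interpretation.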

Pure Compiler:

  1. Translate all units of program into object code (say, in machine language)

  2. Link into single relocatable machine code

  3. Load into memory

Comparison of Compilation vs Interpretation

Compiler                                   Interpreter
--------                                   -----------
Only translate each statement once         Translate only if executed
Speed of execution                         Error messages tied to source
                                           More supportive environment
Only object code in memory when            Must have interpreter in memory when
executing. May take more space             executing (but source may be more
because of expansion                       compact)

Rarely have pure compiler or interpreter.

Can go farther and compile into intermediate code (e.g., P-code and JVML) and then interpret.
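A minimal sketch of the mixed approach, assuming a hypothetical stack-based intermediate code with three instructions (`PUSH`, `ADD`, `MUL`). It is not P-code or JVML, only an illustration of translating once and then interpreting the result.

```python
def compile_expr(tree):
    """Translate a nested-tuple expression into postfix stack code (done once)."""
    if isinstance(tree, (int, float)):
        return [("PUSH", tree)]
    op, left, right = tree
    return compile_expr(left) + compile_expr(right) + [(op, None)]


def run(code):
    """Interpret the intermediate code on a simple stack machine."""
    stack = []
    for instr, arg in code:
        if instr == "PUSH":
            stack.append(arg)
        else:
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if instr == "ADD" else left * right)
    return stack.pop()


code = compile_expr(("ADD", 1, ("MUL", 2, 3)))   # 1 + 2 * 3
print(code)
print(run(code))   # 7
```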

In FORTRAN, Format statements (I/O) are always interpreted.

Overview of structure of a compiler

Two primary phases:

  1. Analysis: Break into lexical items, build parse tree, generate simple intermediate code (type checking).

  2. Synthesis: Optimization (look at instructions in context), code generation, linking and loading.

Lexical analysis:
Break source program into lexical items, e.g. identifiers, operation symbols, key words, punctuation, comments, etc. Enter id's into symbol table. Convert lexical items into internal form - often pair of kind of item and actual item (for id, symbol table reference)
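A rough Python sketch of this step. The keyword set, the token kinds (`NUM`, `KEYWORD`, `ID`, `OP`), and the function name `lex` are assumptions made for illustration. Each token is a (kind, item) pair, and for an identifier the item is its index in the symbol table.

```python
import re

KEYWORDS = {"if", "then", "else", "for", "do"}       # assumed keyword set
TOKEN_RE = re.compile(r"\s*(?:(\d+)|([A-Za-z_]\w*)|(:=|[-+*/()<>=;]))")


def lex(source):
    """Return (tokens, symbols); each token is a (kind, item) pair."""
    symbols = []          # symbol table: list position serves as the reference
    tokens = []
    pos = 0
    source = source.rstrip()
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        number, word, operator = match.groups()
        if number is not None:
            tokens.append(("NUM", int(number)))
        elif word in KEYWORDS:
            tokens.append(("KEYWORD", word))
        elif word is not None:
            if word not in symbols:
                symbols.append(word)                 # enter id into symbol table
            tokens.append(("ID", symbols.index(word)))
        else:
            tokens.append(("OP", operator))
        pos = match.end()
    return tokens, symbols


print(lex("x := x + 10"))
```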

Syntactical analysis:
Use formal grammar to parse program and build tree (either explicitly or implicitly through stack)

Semantic analysis:
Update symbol table (e.g., by adding type info). Insert implicit info (e.g., resolve overloaded ops like "+"). Error detection: type-checking, jumps into loops, etc. Traverse tree generating intermediate code.

Optimization:
Catch adjacent store-reload pairs, evaluate common sub-expressions only once, move loop-invariant (static) code out of loops, allocate registers, optimize array accesses, etc.
   for i := .. do ...
       for j:= 1 to n do
           A[i,j] := ....
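One way to picture the array-access optimization for a nested loop like this, sketched in Python. The assumptions are loud ones: the array is stored as a flat row-major list, and the value `i + j` is a stand-in for the elided right-hand side. The function names are hypothetical.

```python
def fill_naive(n_rows, n_cols):
    """Recompute the full offset i * n_cols + j on every access."""
    a = [0] * (n_rows * n_cols)        # 2-D array stored in row-major order
    for i in range(n_rows):
        for j in range(n_cols):
            a[i * n_cols + j] = i + j
    return a


def fill_hoisted(n_rows, n_cols):
    """Same loop after moving the loop-invariant product out of the inner loop."""
    a = [0] * (n_rows * n_cols)
    for i in range(n_rows):
        row_base = i * n_cols          # does not depend on j: compute once per row
        for j in range(n_cols):
            a[row_base + j] = i + j
    return a


print(fill_naive(3, 4) == fill_hoisted(3, 4))   # True
```

Both versions produce the same array; the second performs one multiplication per row instead of one per element.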

Code generation:
Generate real assembly or machine code (now sometimes generate C code)

Linking & loading:
Get object code for all pieces of program (incl separately compiled modules, libraries, etc.). Resolve all external references - get locations relative to beginning location of program. Load program into memory at some start location - do all addressing relative to base address.
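A toy sketch of these two steps in Python. The instruction format, the `CALL` convention, and the names `link` and `load` are hypothetical. The loader here adds the base address eagerly, whereas a real machine might instead add a base register at run time.

```python
def link(modules):
    """Concatenate modules and resolve external references to relative offsets."""
    offsets, position = {}, 0
    for name, instructions in modules.items():
        offsets[name] = position           # location relative to program start
        position += len(instructions)
    program = []
    for instructions in modules.values():
        for op, arg in instructions:
            if op == "CALL":               # external reference: name -> offset
                arg = offsets[arg]
            program.append((op, arg))
    return program


def load(program, base):
    """Place the program at `base`; relative addresses become base + offset."""
    return {base + offset: (op, arg + base if op == "CALL" else arg)
            for offset, (op, arg) in enumerate(program)}


modules = {"main": [("CALL", "sqrt"), ("HALT", 0)], "sqrt": [("RET", 0)]}
program = link(modules)
print(program)               # [('CALL', 2), ('HALT', 0), ('RET', 0)]
print(load(program, 1000))
```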

Symbol table: Contains all identifier names, kind of id (vble, array name, proc name, formal parameter), type of value, where visible, etc. Used to check for errors and generate code. Often thrown away at end of compilation, but may be held for error reporting or if names generated dynamically.
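A possible shape for such a table, sketched in Python. The class and field names are assumptions, and "where visible" is modeled simply as a stack of nested scopes.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    name: str
    kind: str        # e.g. "variable", "array", "procedure", "formal parameter"
    type: str        # type of value, e.g. "integer"


class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, entry):
        scope = self.scopes[-1]
        if entry.name in scope:            # error check: duplicate declaration
            raise NameError(f"{entry.name} declared twice in the same scope")
        scope[entry.name] = entry

    def lookup(self, name):
        for scope in reversed(self.scopes):    # innermost declaration wins
            if name in scope:
                return scope[name]
        raise NameError(f"{name} is not declared")


table = SymbolTable()
table.declare(Entry("x", "variable", "integer"))
table.enter_scope()
table.declare(Entry("x", "formal parameter", "real"))
print(table.lookup("x").type)   # real
table.exit_scope()
print(table.lookup("x").type)   # integer
```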

Like to have easily portable compilers

Formal Syntax and Parsing


Syntax should be: readable, writable, easy to translate, unambiguous, ...

Formal Grammars:

Backus & Naur, Chomsky

First used in ALGOL 60 Report - formal description

Generative description of language.

Language is set of strings. (E.g. all legal ALGOL 60 programs)


    <expression> ->  <term> | <expression> <addop> <term>   
    <term>       ->  <factor> | <term> <multop> <factor>   
    <factor>     ->  <identifier> | <literal> | (<expression>)   
    <identifier> ->  a | b | c | d   
    <literal>    ->  <digit> | <digit> <literal>   
    <digit>      ->  0 | 1 | 2 | ... | 9   
    <addop>      ->  + | - | or   
    <multop>     ->  * | / | div | mod | and
The items in pointy brackets (<...>) are called non-terminals of the grammar; the remaining symbols, such as a, +, and div, are terminals. Each non-terminal has one or more associated productions in which it appears to the left of the arrow. Where a non-terminal can be rewritten according to several different rules, the different right-hand sides may be separated by "|". To generate an expression, start with <expression> and replace it by one of the right-hand sides of its rule. Continue replacing each remaining non-terminal by a right-hand side of one of its productions until no non-terminals are left. The result is said to be generated from the initial non-terminal.

For example, <expression> generates: a + b * c + b. See the text for pictures of parse trees.

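The grammar determines operator precedence and the direction in which operators associate: a <multop> binds more tightly than an <addop> because <term> sits below <expression>, and the left-recursive productions make both kinds of operator associate to the left.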
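To see the precedence and associativity that the expression grammar encodes, here is a small Python sketch that builds a tree for a + b * c + b. It is not the course's ML parser: the function names are hypothetical, tokens are assumed to be separated by spaces, and the left-recursive productions are replaced by loops (a standard substitution for recursive descent) that still build left-leaning trees.

```python
ADDOPS = {"+", "-", "or"}
MULTOPS = {"*", "/", "div", "mod", "and"}


def parse_expression(tokens):
    """<expression> -> <term> | <expression> <addop> <term>, written as a loop."""
    tree = parse_term(tokens)
    while tokens and tokens[0] in ADDOPS:
        op = tokens.pop(0)
        tree = (op, tree, parse_term(tokens))     # tree so far becomes left operand
    return tree


def parse_term(tokens):
    """<term> -> <factor> | <term> <multop> <factor>, written as a loop."""
    tree = parse_factor(tokens)
    while tokens and tokens[0] in MULTOPS:
        op = tokens.pop(0)
        tree = (op, tree, parse_factor(tokens))
    return tree


def parse_factor(tokens):
    """<factor> -> <identifier> | <literal> | (<expression>)"""
    token = tokens.pop(0)
    if token == "(":
        tree = parse_expression(tokens)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected )")
        return tree
    return token                                  # identifier or literal


print(parse_expression("a + b * c + b".split()))
# ('+', ('+', 'a', ('*', 'b', 'c')), 'b')
```

The product b * c ends up as a subtree of the first sum, and the first sum is the left operand of the second.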

Extended BNF handy:

An item enclosed in square brackets is optional:

    <conditional> -> if <expression> then <statement> [ else <statement> ]

An item enclosed in curly brackets may occur zero or more times:

    <literal> -> <digit> { <digit> }
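The curly-bracket repetition corresponds directly to a loop in a hand-written recognizer. A minimal Python sketch for the <literal> rule, with a hypothetical function name:

```python
DIGITS = "0123456789"


def match_literal(text, pos=0):
    """Match <digit> { <digit> } starting at pos; return the end index or None."""
    if pos >= len(text) or text[pos] not in DIGITS:
        return None                                  # the first <digit> is required
    pos += 1
    while pos < len(text) and text[pos] in DIGITS:   # { <digit> }: zero or more
        pos += 1
    return pos


print(match_literal("2002+"))   # 4
print(match_literal("x"))       # None
```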

Syntax diagrams - alternative to BNF. See section 9.6 of Ullman for ML syntax diagrams.
Syntax diagrams are never recursive; they use "loops" instead.

Problems with Ambiguity

Suppose we are given the grammar:

    <statement>     -> <unconditional> | <conditional>
    <unconditional> -> <assignment> | <for loop> | "{" { <statement> } "}"
    <conditional>   -> if (<expression>) <statement>
                     | if (<expression>) <statement> else <statement>

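How do you parse:

  if (exp1) if (exp2) stat1 else stat2

Could be

  1.   if (exp1)
         if (exp2)
           stat1
         else
           stat2

  2.   if (exp1)
         if (exp2)
           stat1
       else
         stat2

I.e., what happens if exp1 is true and exp2 is false? Under reading 1, stat2 is executed; under reading 2, nothing is executed.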


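Pascal, C, and Java rule: else is attached to the nearest unmatched if (the nearest then in Pascal), giving the first form.

To get the second form, enclose the inner if in braces "{" and "}", as allowed by the grammar above.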
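The nearest-if rule falls out naturally if a recursive parser consumes an else as soon as it sees one. A simplified Python sketch, with the parentheses around conditions dropped and a hypothetical function name:

```python
def parse_statement(tokens):
    """Parse one statement; an else is consumed by the innermost open if."""
    token = tokens.pop(0)
    if token != "if":
        return token                               # a simple statement
    condition = tokens.pop(0)
    then_part = parse_statement(tokens)
    else_part = None
    if tokens and tokens[0] == "else":             # greedy: nearest if takes the else
        tokens.pop(0)
        else_part = parse_statement(tokens)
    return ("if", condition, then_part, else_part)


print(parse_statement("if exp1 if exp2 stat1 else stat2".split()))
# ('if', 'exp1', ('if', 'exp2', 'stat1', 'stat2'), None)
```

The inner if receives stat2 as its else branch and the outer if has none, which is the first form.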

MODULA-2 and ALGOL 68 require "end" to terminate conditional:

  1. if exp1 then if exp2 then stat1 else stat2 end end

  2. if exp1 then if exp2 then stat1 end else stat2 end

(Algol 68 actually uses fi instead of end)

Why isn't it a problem in ML?

Whether a given context-free grammar is ambiguous is undecidable in general.

Chomsky developed a mathematical theory of formal languages that applies to programming languages:

BNF grammars (or syntax diagrams) describe exactly the context-free languages, which can be recognized by push-down automata.
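As a small illustration of why a stack is what recognition of nested structure needs, here is a Python sketch that recognizes properly nested brackets, a classic context-free language. The function name is hypothetical, and this is a hand-written recognizer in the spirit of a push-down automaton rather than a formal construction.

```python
def balanced(text):
    """Recognize nested (), [] and {} using an explicit stack."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for ch in text:
        if ch in "([{":
            stack.append(ch)                   # push on an opener
        elif ch in closers:
            if not stack or stack.pop() != closers[ch]:
                return False                   # mismatched or unopened closer
    return not stack                           # accept only if the stack is empty


print(balanced("(a[b]{c})"))   # True
print(balanced("(]"))          # False
```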

Not all aspects of programming language syntax are context-free.

Formal description of syntax allows:
  1. programmer to generate syntactically correct programs

  2. parser to recognize syntactically correct programs

Parser generators (and generators for lexical analyzers): LEX and YACC (written for C, with versions available for many other languages, e.g., ML)


The formal specification of the syntax of a language can be used to help write lexical scanners and parsers for a programming language. The lexical scanner and parser that I provided for the last homework is an example of how this can be done. (Ignore the signature and structure information -- we will talk about that later.)

Evaluating the lex file first opens a file, reads its contents, and then explodes them into a list of characters. The function gettokens converts the list of characters into a list of tokens with type

datatype token = ID of string | NUM of int | Plus | Minus | Mult | Div | Neg
  | LParen | RParen | EOF;

The code for gettokens is straightforward, so I will simply let you take a look at it.

Section 4.6 of the text, starting on page 4-20, provides an extended discussion of recursive descent parsing, with an extended example of parsing arithmetic expressions in C. Please read that and compare the parser in ML with that given in C (note that the C program also evaluates the arithmetic expression, while you of course wrote a separate function to evaluate the expressions).

Parsing XML

Here we will examine parsing using XML as our example. XML is a "hot" topic these days. XML stands for Extensible Markup Language. HTML, used for making web pages, is an application of the Standard Generalized Markup Language (SGML), while XML can be seen as being more general than HTML, but not as rich as SGML. The key difference between XML and HTML is that XML is a meta-language (the tags are not fixed, either in syntax or semantics) that is designed so that richly structured documents can be sent over the web, whether or not those documents were designed to be displayed. Thus an XML document might be used to transfer information across the internet from one computer to another in such a way that a program on one computer could generate the data, while a program on the other could read it and operate on it.

The hope is that standards will be created for different kinds of information to be transmitted. Then anyone wishing to transfer that kind of information will use the same format.
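As a rough illustration of a program reading and operating on XML data, the following Python sketch uses the standard library's xml.etree.ElementTree. The order document, its tag names, and its attributes are invented for the example and do not follow any real standard.

```python
import xml.etree.ElementTree as ET

# A made-up document: the tags carry meaning agreed on by sender and receiver.
document = """
<order id="17">
  <item sku="A1" quantity="2">widget</item>
  <item sku="B7" quantity="1">gadget</item>
</order>
""".strip()

root = ET.fromstring(document)                     # parse text into a tree
total = sum(int(item.get("quantity")) for item in root.findall("item"))
names = [item.text for item in root.findall("item")]
print(root.tag, root.get("id"), total, names)
# order 17 3 ['widget', 'gadget']
```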
