辅导Inf2A、讲解Java程序语言、讲解Micro-Haskell留学生、辅导Java设计

2018.11.02 - 首页 >> Java编程

Inf2A 2018–19: Assignment 1

The Language Processing Pipeline for Micro-Haskell

Issued 12 October 2018

The deadline for this assignment is 4pm, Tuesday 30 October 2018 (the night before

Hallowe’en).

Overview

The objective of this practical is to illustrate the language processing pipeline in the

case of a small but interesting fragment of the Haskell programming language, which we

shall call Micro-Haskell (or MH). The practical is self-contained, and prior knowledge of

Haskell itself is not assumed; however, those with no previous experience of Haskell may

need to invest a little more time to understand what is required. The brief introduction

to Micro-Haskell to be given in Lecture 14 may also be helpful.

The practical illustrates all stages of the language processing pipeline for programming

languages, taking us from a source file, written in MH, to execution of the program.

In our case, the pipeline has four stages: lexing, parsing, typechecking and evaluation.

Your task is to provide language-specific material to assist with the first three stages:

lexing, covered in Part A of the practical; parsing, covered in Part B; and typechecking,

covered in Part C. A simple evaluator is provided for you.

The code implementing all four stages will itself be written in Java. (So you are

warned at the outset that you’ll need to think in two languages at once!) Several

general-purpose components are provided; your task is to supply the parts specific to

MH, by applying your understanding of the course material. Once you have finished,

you’ll be able to hook up your implementations to the given evaluator to obtain a

complete working implementation of MH.

Of course, many libraries of lexing and parsing tools in Java are available on the

Internet, but the point of this practical is to build things up from first principles in

order to understand how they work. You should therefore not attempt to use any

lexer or parser utilities you may find online, nor any tools such as StringTokenizer or

StreamTokenizer that might otherwise appear tempting.

Besides writing your own code, you are advised to spend some time understanding

what the provided parts of the code are doing, at least in broad terms. This will help

you to understand what you need to implement, and also to see how the various stages

of the pipeline all fit together.

The version of Java for this practical is 1.8.0 181 (the version currently installed on

the DICE machines). If you have another version installed on a machine of your own,

you may still be able to complete the practical with this, but it’s your responsibility

to check that the final version of your code compiles and runs under the above version

before you submit it.

Instructions

To begin, download the code file Inf2A_Prac1_Files.zip from the Informatics 2A

Assignments webpage. On a DICE machine, this can be unpacked using

unzip Inf2A_Prac1_Files.zip

This will build a subdirectory Inf2A_Prac1_Files inside your working directory (folder),

within which you will find all source files needed for the assignment. Look first at the

file MH_example.txt, which provides a small sample of Micro-Haskell.

The assignment comprises four parts: A–D. In each of Parts A–C, you have to

complete a template Java file that is provided for you. In doing this, fill in only the

parts of code that you are instructed to write. Do not make any other alterations

to the existing code — this will invalidate your solution! You are required to

submit your solutions from a DICE machine, using the submit command, as specified

below.

Part Submit command Marks

A submit inf2a cw1 MH_Lexer.java 35

B submit inf2a cw1 MH_Parser.java 30

C submit inf2a cw1 MH_Typechecker.java 25

D no further submission required 10

It is also possible to submit your solutions to parts A–C simultaneously:

submit inf2a cw1 MH_Lexer.java MH_Parser.java MH_Typechecker.java

Part D combines Parts A–C with the evaluator to create a full working implementation

of MH; it does not require any submission of its own.

Part A: Lexing in Micro-Haskell [35 marks]

In this part, we construct a lexical analyser for MH. A general-purpose longest-match

lexer is already provided. Your task is to supply deterministic finite-state machines that

serve as recognizers for the various lexical classes in the language.

Look at the provided file GenLexer.java. It begins with some Java functions that

define certain useful classes of characters: letter ,small, large, digit,symbolic, whitespace,

newline. Next comes a Java interface DFA which defines the functionality that any fi-

nite state machine has to provide. Some of this is provided in the class Acceptor which

follows, but notice that this class contains stubs for five ‘abstract’ methods whose implementation

will be specific to the particular DFA in question. There then follow three examples

of how to construct implementations of particular DFAs: EvenLetterAcceptor,

AndAcceptor and SpaceAcceptor. Later in the file, the class DemoLexer illustrates how

these DFAs may be combined to yield a lexer for a simple (and silly) language, and the

class LexerDemo gives you a simple way of trying this out (the comments in the source

file explain how).

Notice that states are represented by integers, with 0 as the initial state. Besides

the transition operation and the set of accepting states, our DFAs here must also be

equipped with a designated dead (or “garbage”) state: that is, a non-accepting state

from which no sequence of transitions can lead to an accepting state. Note also that our

DFAs must include the method String lexClass(), which provides the name of the

lexical class they are associated with. This is done because we wish our lexer to output

a stream of tokens each tagged with their lexical class.

Your objective in Part A is to implement DFAs in the same style corresponding to

the lexical classes of Micro-Haskell. This is to be done in the file MH_Lexer.java, which

currently provides a template containing some gaps for you to fill in. For the first six of

these gaps, follow the pattern of the examples in GenLexer.java to construct DFAs for

the following lexical classes defined by regular expressions. (These correspond closely to

lexical classes of actual Haskell, except that we’ve chosen a slightly different definition

of the class NUM.)

A class VAR of variables, defined by

small (small + large + digit+')

A class NUM of numeric literals, defined by

0 + nonZeroDigit digit ?

where nonZeroDigit means what you think it does.

A class BOOLEAN of boolean literals, defined by

True + False

A class SYM of symbolic tokens, defined by

symbolic symbolic?

A class of whitespace elements, defined by

whitespace whitespace?

A class of comments, defined by

- - -

(nonSymbolNewline nonNewline? + ε)

where nonSymbolNewline is the set of all characters except those of symbolic or

newline, and nonNewline is the set of all characters except those of newline. Note

that - - -

effectively means ‘two or more dashes’.

The names of the last two classes, implemented by the lexClass() method, should both

be the empty string. This will notify the lexer that tokens of these classes should be

discarded.

In addition to these classes, keywords such as if and special symbols such as ; will

require ‘singleton’ (i.e. one-element) lexical classes all to themselves. For this purpose,

we will provide a class tokAcceptor which, for any string tok that we supply, can

create a special DFA that accepts the string tok and no other strings. For instance, the

constructor call tokAcceptor("if") should create a DFA that accepts only the string

"if". Fill in the gap in the implementation of this class so as to achieve this behaviour.

(Note that this is quite different from the other classes above, in that we will be able to

create several objects of class tokAcceptor each implementing a different DFA.) Here

the name of the lexical class should be identical to the string itself — this will serve to

make the specific token we are dealing with visible to the parser.

The lexical classes we require for MH are now the six lexical classes listed above,

together with singleton classes for the five keywords Integer, Bool, if, then, else and

for the three special symbols (, ), ;.

Following the example of class DemoLexer in the file GenLexer.java, add a few

lines of code to construct acceptors for these fourteen classes, and put these together

in an array called MH_acceptors. The acceptors should be listed in order of priority,

with highest priority first, which should be sensibly chosen so that keywords like if are

assigned an appropriate lexical class (so as to emulate the behaviour of an actual Haskell

implementation).

The MH_acceptors array is fed to a general-purpose routine that performs longestmatch

lexing (also known as maximal munch) using the method described in Lecture

7. Take a brief look at the code for this in GenLexer.java, and check that you broadly

understand what it is doing.

You should now be able to compile GenLexer.java and your file MH_Lexer.java to

create a lexer for MH. To test your lexer, you might wish to adapt the LexerDemo code

in GenLexer.java; this will allow you to create a simple command-line driven lexical

analyser for MH. You are not required to submit this test code, however.

Before leaving the topic of lexing, take a quick glance at the code provided in

CheckedSymbolLexer.java. This performs some mild post-processing on the stream

of lexical tokens: symbolic tokens are checked to ensure that they are among the tokens

that feature in MH:

:: -> = == <= + -

If they are, then the lexical classname SYM is replaced with the token itself, just as for

keywords and (, ), ;.

Submission: Submit your answer to part A, from a DICE machine, using the

command: submit inf2a cw1 MH_Lexer.java

Part B: An LL(1) parser for Micro-Haskell [30 marks]

Take a look at the provided file GenParser.java. This begins with an interface TREE

and a class STree for representing syntax trees (for any context-free grammar). The

class GenParser then provides an implementation of the general LL(1) parsing algorithm

as described in lectures (again, check that you broadly understand it).

In order to complete this and obtain a working parser, some grammar-specific ingredients

must be provided: a parse table and a choice of start symbol. The class

EvenAndParser gives a simple example of how to do this, for an artificial language that

uses the lexical classes defined in GenLexer.java. Note in particular the convention that

the names of nonterminals are identified by adding the symbol # (we can get away with

this because # doesn’t itself feature in any lexical tokens of MH). You can try out this

parser on the sample input file EvenAnd_example.txt, by compiling GenParser.java

and then typing

Prog → | Decl Prog

Decl → TypeDecl TermDecl

TypeDecl → VAR :: Type ;

Type → Type0 TypeRest

TypeRest → | -> Type

Type0 → Integer | Bool | ( Type )

TermDecl → VAR Args = Exp ;

Args → | VAR Args

Exp → Exp0 | if Exp then Exp else Exp

Exp0 → Exp1 Rest0

Rest0 → | == Exp1 | <= Exp1

Exp1 → Exp2 Rest1

Rest1 → | + Exp2 Rest1 | - Exp2 Rest1

Exp2 → Exp3 Rest2

Rest2 → | Exp3 Rest2

Exp3 → VAR | NUM | BOOLEAN | ( Exp )

Figure 1: Grammar for Micro-Haskell

java ParserDemo EvenAnd_example.txt

Your task is to implement a similar working parser for the language MH, in the file

MH_Parser.java (which is discussed below), following the pattern of EvenAndParser.

Now we consider the grammar of MH itself. The terminal symbols are the names of

lexical classes in tokens output by CheckedSymbolLexer. The complete list of these is

as follows:

VAR NUM BOOLEAN Integer Bool if then else

( ) ; :: -> = == <= + -

The start symbol of the grammar is Prog, and the productions are as given in Figure 1.

(To reduce clutter, we omit the # symbol here, instead distinguishing nonterminals by

choice of font.)

If this grammar looks daunting at first, the following observations may be helpful:

The grammar for types (i.e. the rules for Type, TypeRest, Type0 ) is a self-contained

sub-grammar that can be understood in isolation; see Lecture 14.

The grammar for expressions (the rules for all nonterminals from Exp onwards) is

another self-contained sub-grammar, and is broadly similar in its workings to the

LL(1) grammar for arithmetic expressions from Lecture 13. Note that the productions

for Exp2 and Rest2 are intended to cater for multiple function applications,

such as f x y.

It may be helpful to study the sample program in MH_example.txt in conjunction

with the grammar rules.

Once you feel you have assimilated the grammar, find yourself a large sheet of paper

and work out the complete LL(1) parse table. (Most of the entries will be blank, so

don’t panic!) You may find that some calculations of First and Follow sets (as presented

in Lecture 12) help you to do this; however, you will not be required to submit these

calculations or the written-out parse table you construct.

Now open the file MH_Parser.java. You will see that the right hand sides of all the

grammar rules have already been constructed for your convenience, so all you have to do

is to supply an implementation of the parse table itself in the style of EvenAndParser.

You may make use of auxiliary definitions and other reasonable devices to reduce the

amount of code you need to write, provided that your code remains clearly readable and

its correspondence to the parse table you have drawn up remains transparent.

After completing and compiling this, you will now be able to try out your parser on

the sample source file provided:

java MH_ParserDemo MH_example.txt

If this reports successful parsing, it’s an encouraging sign that your parser is largely

correct and will obtain a reasonable mark. However, to ensure it is completely correct,

you will have to do some further testing of your own, since (a) there are possible parsing

scenarios not represented by this small example, and (b) you will also need to ensure

that your parser rejects incorrect programs and that the error reports it produces are

reasonable.

Submission: Submit your answer to part B, from a DICE machine, using the

command: submit inf2a cw1 MH_Parser.java

Part C: Typechecking for Micro-Haskell [25 marks]

In this section, you will implement critical parts of a typechecker for MH.

The LL(1) grammar we have been using serves to disambiguate inputs and make

them readily parseable; but once these issues have been got out of the way, it’s much

more convenient to work with simpler trees known as abstract syntax trees (ASTs) in

which extraneous detail has been stripped away. For example, as in Lecture 14, types

in MH are conceptually just trees for the grammar:

Type → Integer | Bool | Type -> Type

Look at the file MH_Type_Impl.java, which defines a Java representation of MH types

in this stripped-down form. The interface MH_TYPE declares various operations one can

perform on such types (check that you understand what they are intended to do), while

further down, the class MH_Type_Impl provides predefined constants for the MH types

Integer and Bool, as well as a constructor for building a function type (i.e. an arrow

type) from two previously existing MH types. In the typechecking code you will be

writing, these may be utilized as follows:

MH_Type_Impl.IntegerType ; // AST for Integer

MH_Type_Impl.BoolType; // AST for Bool

new MH_Type_Impl (t1,t2); // AST for (t1->t2)

Clearly, we will need a way to convert syntax trees as produced by the parser

into ASTs of this kind. This is done by the provided methods convertType and

convertType1 in MH_Type_Impl.java. A good warm-up to your own task would be

to try and understand the workings of convertType and convertType1 with the help

of the comments provided.

A similar notion of abstract syntax trees is also required for expressions (trees with

topmost label #Exp). In effect, ASTs for expressions are just trees for the simplified

grammar:

where infix ranges over ==, <=, +, -. Look in the file MH_Exp_Impl.java at the interface

MH_EXP, which declares various operations that can be performed on such trees.

The intended meanings of these operations are all you need to understand from this

file (and you can ignore isLAMBDA and isREF). Their workings are further explained

by commented examples in the file MH_Exp_Impl.java immediately below the MH_EXP

interface. However, you don’t need to get to grips with the class MH_Exp_Impl, which

contains (among other things) some code for converting trees returned by the parser

into ASTs for expressions.

Assuming the conversions to ASTs have already been done, your task is to write a

typechecker for ASTs, by completing the body of the method computeType in the file

MH_Typechecker.java. More precisely, your code should compute the MH type of an

expression given as an AST exp of Java type MH_EXP, returning the result as an AST of

Java type MH_TYPE. If the expression is not correctly typed, your code should flag up a

type error, which you can do by means of the command:

throw new TypeError ("blah blah") ;

Each time you include such a command, the string you supply should provide a brief

description of the nature of the type error in question. Such error messages should

be helpful to MH programmers who need to identify and correct type errors in their

programs.

There is one other important ingredient to be explained. The type of an expression

such as if x then y else z, or even whether it is well-typed at all, will depend on

the types ascribed to the variables x, y, z. In general, then, an expression exp will

have to be typechecked relative to a type environment which maps certain variables to

certain types associated with them. This is the purpose of env, the second argument to

computeType. You may access the type associated with the variable x, for instance, by

calling env.typeOf("x").

The definition of the type of an expression (if it has one) is given compositionally:

that is, it is computed from the types of its subexpressions. This will be reflected in the

recursive nature of your implementation of computeType: it should compute the type of

any subexpressions in order to obtain the type of the whole expression. Here are some

hints on how this should work:

1. The types of NUMs and BOOLEANs are what you think they are.

2. You should assume that each of the infix operations accepts only integer arguments;

however, the type of the resulting expression will depend on the infix

operation in question.

3. In an application expression e1 e2, the type of e2 should match the argument type

expected by e1, and you should think about what the type of the whole expression

will be.

4. An expression if e1 then e2 else e3 may in principle have any type; however, you

should consider what the types of e1, e2, e3 respectively will need to be in order for

the whole expression to have type t.

A final hint on the Java side. To test whether two given type ASTs are equal, you

should use the equals method from the interface MH_TYPE, not the == operator.

As a rough guideline, the model implementation of computeType consists of about

40 lines of Java code.

When you have finished, compile the files MH_Type_Impl.java, MH_Exp_Impl.java

and MH_Typechecker.java (in that order), and try executing

java MH_Typechecker MH_example.txt

Once your typechecker works, this will report that the parse, type conversion and typecheck

have all been successful. To see what it is doing, look again at MH_example.txt.

The system is using your code to check that the right hand side of each function defi-

nition has the expected type relative to an environment that can be inferred from the

types specified in the MH code. (All this is managed by the remaining code in the file

MH_Typechecker.java.) You should also try out your typechecker on other MH programs

— including some that contain type errors to check that your code catches them

correctly and supplies appropriate error messages.

Submission: Submit your answer to part C, from a DICE machine, using the

command: submit inf2a cw1 MH_Typechecker.java

Part D: Execution of Micro-Haskell [10 marks]

This part of the practical puts all the pieces together to obtain a full working implementation

of Micro-Haskell.

The file MH_Evaluator.java contains a rudimentary evaluator for Micro-Haskell.

It uses the other parts of the practical to lex, parse and type-check a program, after

which, the resulting abstract syntax tree for the program is executed using a small-step

operational semantics (as will be covered in Lecture 28). This results in an implementation

that’s pretty slow compared with a real-world implementation of Haskell,1 but

its purpose is to illustrate how the basic principles of language processing feed into the

construction of a compiler/interpreter/evaluator. You are encouraged to take a brief

look at the evaluator source code to get a rough idea of how it works (this may become

clearer after Lecture 28). You will notice that the code here is relatively short: the bulk

of the work has already been done at earlier stages of the language processing pipeline.

If you’re interested in how to produce a more efficient implementation, take the UG3 course on

Compiling Techniques next year.

To use the evaluator, once you have completed the rest of the practical, compile

MH_Evaluator.java and then run it on the source file of your choice, e.g.:

java MH_Evaluator MH_example.txt

This will load and typecheck the MH program, and display a prompt MH>. Type in an

expression you would like to evaluate, and hit return. The expression may involve the

functions declared in your MH program. Do this as many times as you like; e.g.:

MH> 3+6

...

MH> fib 7

...

MH> gcd 104 (fib 12)

...

To quit, hit CTRL-c.

This last part of the assignment does not require you to do any further coding. Your

solutions to Parts A–C will be combined with the evaluator, and the resulting complete

implementation of MH will be marked for the correctness of its behaviour on a test

suite of five MH programs (different from those in MH_example.txt). However, in order

to gain assurance that your solutions to A–C will indeed combine to yield a correct

implementation of Micro-Haskell, you should spend a little time testing your complete

system on some small MH programs of your own devising (there is a certain amount of

fun to be had from this in any case). You should also be sure to test your system on a

sample of incorrect programs, to ensure that an appropriate error message is raised by

the lexer, parser or typechecker as appropriate.

Support for this assignment

The optional lab sessions in Weeks 5 and 6 will offer support for this assignment: lab

demonstrators will be on hand to help you, particularly with any Java programming

issues.

You may also post queries to the Piazza forum, but we would ask you to do so rather

sparingly: in some previous years, the volume of queries has been overwhelming for staff

and students alike! Please respect the following guidelines:

1. Do not post a query to the forum unless you have already spent a significant time

(say 40 minutes) trying to answer it yourself — by studying this handout or the

lecture slides, by searching the Web for relevant help, etc.

2. Take care not to post any part of an actual solution — not even one line of code!

3. Feel free to answer queries from other students (whilst observing 2). This is much

appreciated by staff and will often mean the question gets answered more promptly.

4. Don’t post a query without first checking whether the same query has already

been posted and answered.2

5. Do keep your postings polite and respectful.3

Note on marking

Your submission will be marked by a human marker with the assistance of some autotesting

programs that will run your code on various test examples. If your code passes

all or most of these tests, the human marker’s job will be easy; but if not, the marker

may need to inspect your code to assess your level of understanding. It is therefore in

your interests to include at least a few comments in your code to show that you know

what you are doing (and commenting your code is good practice in any case).

Most critically, please, please ensure that your code compiles correctly (in conjunction

with the other source files provided) before you submit it! Very few marks will be

available for submissions that do not compile.

Mary Cryan, October 2018

It’s in everyone’s interests to keep the forum traffic manageable: in previous years, the number of

postings was so daunting as to discourage students from scanning them all — they would then post

duplicate queries and the problem would snowball.

3Even though you are posting anonymously to your classmates, Shay and I can request information

about posters if the facility is abused!