辅导Scala programs、讲解Java程序语言、辅导ListBuffers、讲解Java编程
- 首页 >> Java编程Coursework 7 (Scala)
This coursework is worth 10%. The first and second part are due on 22 November
at 11pm; the third, more advanced part, is due on 21 December at 11pm.
You are asked to implement Scala programs for measuring similarity in texts,
and for recommending movies according to a ratings list. Note the second part
might include material you have not yet seen in the first two lectures.
Important:
Make sure the files you submit can be processed by just calling
scala <<filename.scala>> on the commandline.1 Use the template files
provided and do not make any changes to arguments of functions or to
any types. You are free to implement any auxiliary function you might
need.
Do not leave any test cases running in your code because this might slow
down your program! Comment test cases out before submission, otherwise
you might hit a time-out.
Do not use any mutable data structures in your submissions! They are
not needed. This means you cannot create new Arrays or ListBuffers,
for example.
Do not use return in your code! It has a different meaning in Scala than
in Java.
Do not use var! This declares a mutable variable. Only use val!
Do not use any parallel collections! No .par therefore! Our testing and
marking infrastructure is not set up for it.
Also note that the running time of each part will be restricted to a maximum of
30 seconds on my laptop.
Disclaimer
It should be understood that the work you submit represents your own effort!
You have not copied from anyone else. An exception is the Scala code I showed
during the lectures or uploaded to KEATS, which you can freely use.
1All major OSes, including Windows, have a commandline. So there is no good reason to not
download Scala, install it and run it on your own computer. Just do it!
1
Reference Implementation
Like the C++ assignments, the Scala assignments will work like this: you push
your files to GitHub and receive (after sometimes a long delay) some automated
feedback. In the end we take a snapshot of the submi?ed files and apply an
automated marking script to them.
In addition, the Scala assignments come with a reference implementation
in form of a jar-file. This allows you to run any test cases on your own computer.
For example you can call Scala on the command line with the option -cp
docdiff.jar and then query any function from the template file. Say you want
to find out what the function occurrences produces: for this you just need to
prefix it with the object name CW7a (and CW7b respectively for danube.jar). If
you want to find out what these functions produce for the list List("a", "b",
"b"), you would type something like:
$ scala -cp docdiff.jar
scala> CW7a.occurrences(List("a", "b", "b"))
...
Hints
For Part 1: useful operations involving regular expressions:
reg.findAllIn(s).toList
finds all substrings in s according to a regular regular expression reg; useful
list operations: .distinct removing duplicates from a list, .count counts the
number of elements in a list that satisfy some condition, .toMap transfers a list
of pairs into a Map, .sum adds up a list of integers, .max calculates the maximum
of a list.
For Part 2 + 3: use .split(",").toList for spli?ing strings according to commas
(similarly \n), .getOrElse(..,..) allows to querry a Map, but also gives
a default value if the Map is not defined, a Map can be ‘updated’ by using +,
.contains and .filter can test whether an element is included in a list, and
respectively filter out elements in a list, .sortBy(_._2) sorts a list of pairs according
to the second elements in the pairs—the sorting is done from smallest
to highest, .take(n) for taking some elements in a list (takes fewer if the list
contains less than n elements).
2
Part 1 (4 Marks, file docdiff.scala)
It seems source code plagiarism—stealing and submi?ing someone else’s code—
is a serious problem at other universities.2 Detecting such plagiarism is timeconsuming
and disheartening for lecturers at those universities. To aid these
poor souls, let’s implement in this part a program that determines the similarity
between two documents (be they source code or texts in English). A document
will be represented as a list of strings.
Tasks
(1) Implement a function that ‘cleans’ a string by finding all (proper) words in
this string. For this use the regular expression \w+ for recognising word
characters and the library function findAllIn. The function should return
a document (a list of strings).
[1 Mark]
(2) In order to compute the overlap between two documents, we associate
each document with a Map. This Map represents the strings in a document
and how many times these strings occur in the document. A simple
(though slightly inefficient) method for counting the number of stringoccurrences
in a document is as follows: remove all duplicates from the
document; for each of these (unique) strings, count how many times they
occur in the original document. Return a Map associating strings with
occurrences. For example
occurrences(List("a", "b", "b", "c", "d"))
produces Map(a -> 1, b -> 2, c -> 1, d -> 1) and
occurrences(List("d", "b", "d", "b", "d"))
produces Map(d -> 3, b -> 2). [1 Mark]
(3) You can think of the Maps calculated under (2) as memory-efficient representations
of sparse “vectors”. In this subtask you need to implement
the product of two such vectors, sometimes also called dot product of two
vectors.3
For this dot product, implement a function that takes two documents
(List[String]) as arguments. The function first calculates the (unique)
strings in both. For each string, it multiplies the corresponding occurrences
in each document. If a string does not occur in one of the documents,
then the product for this string is zero. At the end you need to
2Surely, King’s students, after all their instructions and warnings, would never commit such an
offence. Yes?
3https://en.wikipedia.org/wiki/Dot_product
3
add up all products. For the two documents in (2) the dot product is 7,
because
1 ? 0
|{z}
”a”
+ 2 2
|{z}
”b”
+ 1 0
|{z}
”c”
+ 1 3
|{z}
”d”
= 7
[1 Mark]
(4) Implement first a function that calculates the overlap between two documents,
say d1 and d2, according to the formula
overlap(d1, d2) = d1 · d2
max
You can expect this function to return a Double between 0 and 1. The
overlap between the lists in (2) is 0.5384615384615384.
Second, implement a function that calculates the similarity of two strings,
by first extracting the substrings using the clean function from (1) and
then calculating the overlap of the resulting documents.
[1 Mark]
Part 2 (2 Marks, file danube.scala)
You are creating Danube.co.uk which you hope will be the next big thing in
online movie renting. You know that you can save money by anticipating what
movies people will rent; you will pass these savings on to your users by offering
a discount if they rent movies that Danube.co.uk recommends.
Your task is to generate two movie recommendations for every movie a user
rents. To do this, you calculate what other renters, who also watched this
movie, suggest by giving positive ratings. Of course, some suggestions are
more popular than others. You need to find the two most-frequently suggested
movies. Return fewer recommendations, if there are fewer movies suggested.
The calculations will be based on the small datasets which the research lab
GroupLens provides for education and development purposes.
https://grouplens.org/datasets/movielens/
The slightly adapted CSV-files should be downloaded in your Scala file from
the URLs:
https://nms.kcl.ac.uk/christian.urban/ratings.csv (940 KByte)
https://nms.kcl.ac.uk/christian.urban/movies.csv (280 KByte)
4
The ratings.csv file is organised as userID, movieID, and rating (which is between
0 and 5, with positive ratings being 4 and 5). The file movie.csv is organised
as movieID and full movie name. Both files still contain the usual CSV-file
header (first line). In this part you are asked to implement functions that process
these files. If bandwidth is an issue for you, download the files locally, but
in the submi?ed version use Source.fromURL instead of Source.fromFile.
Tasks
(1) Implement the function get_csv_url which takes an URL-string as argument
and requests the corresponding file. The two URLs of interest are
ratings_url and movies_url, which correspond to CSV-files mentioned
above. The function should return the CSV-file appropriately broken up
into lines, and the first line should be dropped (that is omit the header of
the CSV-file). The result is a list of strings (the lines in the file). In case
the url does not produce a file, return the empty list.
[1 Mark]
(2) Implement two functions that process the (broken up) CSV-files from (1).
The process_ratings function filters out all ratings below 4 and returns
a list of (userID, movieID) pairs. The process_movies function returns a
list of (movieID, title) pairs.
[1 Mark]
Part 3 (4 Marks, file danube.scala)
Tasks
(3) Implement a kind of grouping function that calculates a Map containing
the userIDs and all the corresponding recommendations for this user (list
of movieIDs). This should be implemented in a tail recursive fashion using
a Map as accumulator. This Map is set to Map() at the beginning of
the calculation. For example
val lst = List(("1", "a"), ("1", "b"),
("2", "x"), ("3", "a"),
("2", "y"), ("3", "c"))
groupById(lst, Map())
returns the ratings map
Map(1 -> List(b, a), 2 -> List(y, x), 3 -> List(c, a)).
In which order the elements of the list are given is unimportant.
[1 Mark]
5
(4) Implement a function that takes a ratings map and a movieID as