讲解CS2230、辅导Java编程、讲解Query processor、辅导Java语言
- 首页 >> Java编程CS2230 Computer Science II: Data
Structures
Homework 5
Query processor using iterators
35 points
Goals for this assignment
Learn how to write a variety of iterators using the Iterator<T> interface
Learn how to use higher order functions in Java
Learn how to use Java generic types
Debug programs using JUnit tests
Prep materials: reading, knowledge checks, lectures about generic types, iterators, interfaces,
prelabs/labs on interfaces, iterators, and git/github.
Description
In this project, you will build a Query Processor that can answer questions about data from
Lists, text files, and CSV files (that is, spreadsheets). A "query" is sequence of Iterators chained
together to process the input data.
This project has a setup part and 7 parts. The best way to approach the project is in the order of
these parts. You should pass all the tests specified in Part 1 before continuing to Part 2, and so
forth.
A submission that passes all the tests in Part 3 and doesn’t
implement anything else will receive a higher grade than a
submission that attempts all the parts and passes zero tests.
Part 0: Setup the project in IntelliJ
1. Follow all of the directions in getting_hw5.pdf.
When you are done with the directions, you should see something like the following in IntelliJ:
2. Run some tests to make sure the project setup is working. Right click AddXTests.java |
Run file (AddXTests.java is in Test Packages/iterators).
JUnit should finish running and report failed tests.
Background on Queries
A query is a question (or a transformation) on some input data. Input data might consist of a list
of numbers, a list of strings, a text file, or a spreadsheet. The way we will implement a query is
with a chain of Iterators, each one doing something to transform the input in a particular way.
Here is an example.
Let's suppose we have a list of numbers [10,50,1,400]. This list will be our input data.
A query might say "what are the first 2 elements of the input?". The chain of Iterators to
answer this question would be
The Limit(2) Iterator reads one value at a time from List.iterator by calling the List.Iterator's
next() method. Whoever wants to read the output of the query will call the next() method on
the last Iterator in the chain; in this case, the last Iterator is Limit(2) Iterator.
Limit(2) Iterator List.iterator
Here is the sequence of next() and hasNext() calls that would occur to process this query.
Limit(2) Iterator List.iterator
hasNext()...
hasNext() returns true
returns true
next()...
next() returns 10
returns 10
hasNext()...
hasNext() returns true
returns true
next()...
next() returns 50
returns 50
hasNext() returns false
Notice that for Limit(2) Iterator to return a value from its next(), it must call its input Iterator's
next() method. Also notice that the Iterators process just one element each time next() is
called.
Run LimitTest.java. All the tests should pass without any changes (the above example can be
found in moreTest). Take a look at Limit.java to see the implementation of the Limit Iterator.
Answer the following question for yourself: How does Limit produce the next() element? You
will use Limit.java as an example to help you build additional Iterators to run more interesting
queries.
Part 1: AddX
End result of this part:
pass all tests in these files
o AddXTest.java
Here is a motivating query. Let's suppose we have a list of numbers [10,50,1,400]. This list will
be our input data. The query is "add 1 to each element of the input". The chain of Iterators to
answer this question would be
AddX(1)
Iterator
List.iterator
Here is the sequence of next() calls that would occur to process this query (we are omitting the
hasNext() calls).
Add1Iterator List.iterator
next()...
next() returns 10
returns 11
next()...
next() returns 50
returns 51
next()...
next() returns 1
returns 2
next()...
next() returns 400
returns 401
What you need to do
Fill in the implementation of the AddX class in AddX.java. We've already provided the fields and
the constructor to get you started. You need to fill in the hasNext() and next() methods. Keep in
mind: according to the Iterator interface, hasNext() may be called 0 or more times between
every call to next(). You'll see that the tests check for this property.
Testing your code
Run the tests in AddXTest.java by right clicking that file and choosing "Run file". As in HW3,
don't be alarmed by lots of failing tests. Start by trying to fix the simpler tests first, then move
on to the more complicated ones.
Remember, the code inside of one of these Test methods is not magical. It is creating an input
list, building a query, then getting outputs from the query by calling next and hasNext.
Part 2: IntApply
End result of this part:
pass all tests in these files
o IntApplyTest.java
Here is a motivating query. Let's suppose we have a list of numbers [10,50,1,400]. This list will
be our input data. The query is "multiply each element of the input by 2". The chain of Iterators
to answer this question would be
The TimesTwoIterator reads one value at a time from List.iterator by calling the List.Iterator's
next() method. Whoever wants to read the output of the query will call the next() method on
the last Iterator in the chain, which is TimesTwoIterator.
Here is the sequence of next() calls that would occur to process this query (we are omitting the
hasNext() calls).
TimesTwoIterator List.iterator
next()...
next() returns 10
returns 20
next()...
next() returns 50
returns 100
next()...
next() returns 1
returns 2
next()...
next() returns 400
returns 800
Background: Higher Order Functions
To make the Iterators in this query engine more reusable, you are going to incorporate the idea
of higher order functions. Functions that take arguments as functions or return functions are
called higher-order functions. This sounds abstract, but it is very useful in practice as we will see
in later questions.
Refer to course materials for examples of higher order functions (abbreviated sometimes as
“HOFs”) and how to use them. And, here's another resource with some examples
https://flyingbytes.github.io/programming/java8/functional/part1/2017/01/23/Java8-
Part1.html.
TimesTwoIterator List.iterator
The IntApply iterator
A common operation in query processing is to simply call some function, ??, on the input
element to get an output element. This operation is known as "Apply". In fact, the
TimesTwoIterator above is a special case of Apply, where ??just multiplies the input by two.
Here is our example query written in Java. We've used IntApply instead of TimesTwoIterator.
Integer[] inputValues = {10,50,1,400};
List<Integer> input = Arrays.asList(inputValues);
IntApply op = new IntApply(new TimesTwo(), input.iterator());
The IntApply constructor takes two arguments: the first is an object of type IntApplyFunction
and the second is the input iterator.
IntApplyFunction is an interface with one method, apply(). Any class that "implements"
IntApplyFunction must provide an implementation of apply().
public interface IntApplyFunction {
public int apply(int x);
}
In IntApplyTest.java you can see an example of how IntApply is used. We define a class
TimesTwo that defines an apply() that multiplies its input by 2.
private class TimesTwo implements IntApplyFunction {
@Override
public int apply(int x) {
return x*2;
}
}
What you need to do
Fill in the implementation of the IntApply class in IntApply.java. You need to complete the
constructor, hasNext, and next.
Testing your code
Run the tests in IntApply.java.
Part 3: Apply
End result of this part:
pass all tests in these files
o ApplyTest.java
run the following Queries
o TextQuery1a.java
o TextQuery1b.java
The Apply iterator
Wouldn't it be nice if the IntApply operator worked for data that wasn't just integers? In fact, at
the end of this part, you will run a query on some Text data. To get there, you will write a
generic version of IntApply that can deal with any type of data.
Take a look at Apply.java. You'll see that it looks very similar to IntApply.java except that there
are generic types InT and OutT. InT indicates the type of the input data. OutT indicates the type
of the output data.
Notice that instead of an IntApplyFunction, Apply uses an ApplyFunction<InT, OutT>. This
interface provides the generic apply() method that can take any input type and return any
output type.
public interface ApplyFunction<InT, OutT> {
public OutT apply(InT x);
}
Finally, when we want to use Apply, we will implement ApplyFunction. In ApplyTest.java you
will see a few examples of generic apply functions. One of them is the TimesTwo class rewritten
to implement ApplyFunction. Notice that both generic types are Integer to say that our
multiply-by-2 operation takes an integer as input and produces an integer as output.
private class TimesTwo implements ApplyFunction<Integer, Integer> {
@Override
public Integer apply(Integer x) {
return x*2;
}
}
What you need to do
Fill in the implementation of the Apply class in Apply.java. We've already provided the fields
and the constructor to get you started.
Testing your code
Run the tests in ApplyTest.java.
Try queries on real data!
Once you pass all those tests, try running TextQuery1a.java, which runs a query on text files in
the provided sci.space/ folder. This data comes from a collection of newsgroup discussions from
19971
.
The query uses an iterator we've provided called TextFileReader (in src/readers) that reads all
text files in the provided folder of text files. TextFileReader.next() returns objects of type
Pair<String,String>. The "left" element in the Pair is the filename and the "right" element in the
Pair is the entire contents of that file. The default query just uses Apply to take the "right"
element from a Pair, that is, return the file contents.
The while loop calls next() on the last Iterator and prints out the elements (i.e., the contents of
all the text files).
Here is an illustration of the Iterators in TextQuery1a
Now, you need to implement a query yourself, in TextQuery1b.java. This query is very similar,
except instead of the Limit, it takes the 12th word from every file. All you need to do to finish
the query is fill in the TwelfthWord class. When you run the query, you'll see just the twelfth
word from every file.
For your reference, you can find the expected output of the query in
expected/TextQuery1b.txt.
1 http://qwone.com/~jason/20Newsgroups/
While loop Limit(1) Apply(TakeRight) TextFileReader
While loop Apply(TwelfthWord) Apply(TakeRight) TextFileReader
Part 4: FlatApply
End result of this part:
pass all tests in these files
o FlatApplyTest.java
run the following Queries
o TextQuery2.java
Here is a motivating query. Let's suppose we have a list of numbers [10,50,1,400]. This list will
be our input data. The query is "repeat each element 2 times".
This query will produce eight output elements: 10,10,50,50,1,1,400,400. At first glance, it seems
like we may be able to accomplish this with Apply and a well chosen ??. However, Apply only
allows us to output exactly 1 element for each input element. We could certainly define an ??
whose return type is List<Integer>, but that would give us an output of 4 elements
[10,10],[50,50],[1,1],[400,400], which is not quite what we want.
To write the query, we need a new Iterator called FlatApply. FlatApply is a generalization of
Apply. It produces 0 or more output elements for each input element. This fact makes the
implementation of FlatApply a bit more complicated than Apply. Since a single input may create
many outputs, you'll need to keep a queue of pending elements that future calls to next() will
return. When this queue becomes empty, you must fill it up by calling next() again on the input
Iterator.
What you must do
Fill in the FlatApply class. Notice that its constructor takes a FlatApplyFunction. This interface
has an apply method very similar to ApplyFunction, except it returns a List, which may have any
number of elements in it.
IMPORTANT HINT: we suggest maintaining the following invariant in your FlatApply class.
Between calls to FlatApply.next() either:
a) the queue is not empty
b) or, the queue is empty and input.hasNext() is false.
By "between calls", we mean this invariant should always be true after the constructor finishes,
except that it may be violated while executing the FlatApply.next() method.
Testing your code
Run FlatApplyTest.java. As before, try to work one at a time, starting with the simpler tests
(emptyTest and oneTest).
Try queries on real data!
TextQuery2.java contains the query "return all the words longer than 24 characters". You must
complete the query by implementing the LongerThan class (see code for details). HINT: the
String.length method will be useful.
Part 5: Filter
End result of this part:
pass all tests in these files
o FilterTest.java
run the following Queries
o TextQuery3.java
Here is a motivating query. Let's suppose we have a list of numbers [10,50,1,400]. This list will
be our input data. The query is "return the elements greater than 40". The output for this
example would be the elements 50, 400.
As you saw in TextQuery2.java, we can use FlatApply to remove elements that don't pass a
check (e.g., only return words longer than 12 characters). This removal of data based on a
checking some property is commonly known as "filtering". Filtering is so common that it is
worth implementing another Iterator called Filter.
What you need to do
Implement the Filter class. Make sure you uncomment the “implements Iterator” before you
start.
The first argument to the Filter constructor is a Predicate<T>. The Predicate interface's one
method, Predicate.check, is like ApplyFunction.apply, except it returns a boolean value. True
indicates return the element; false indicates ignore the element.
Try queries on real data!
TextQuery3.java runs the query "return all filenames that contain the word 'Mars' " (now things
are getting interesting!). Your job is to create a chain of Apply and Filter Iterators to implement
this query.
Part 6: Fold (you're almost there!)
End result of this part:
pass tests in these files
o Fold.java
run the following Queries
o TextQuery4.java
Here is a motivating query. Let's suppose we have a list of numbers [10,50,1,400]. This list will
be our input data. The query is "return the sum of the elements". The output of this query is
always just one element long; for our example list, the output is 461.
We'll define an Iterator called Fold that takes all the input and combines it to produce one
output.
Fold is a bit different from the other Iterators in the sense that it waits until it has seen all of
the input before it returns an output. Also, a Fold Iterator will return exactly one element
before hasNext starts returning false.
Fold depends on two functions defined by the FoldFunction interface.
combine: takes the accumulated value (soFar) and a new input (x) and returns a new
accumulated value.
initialValue: returns the initial accumulated value.
So for example, in the case of our sum example above, we would define a class that implements
FoldFunction, where the method initialValue returns 0 and method combine returns
soFar+x.
NOTE: the type of soFar and x could be different. FoldTest.java provides an example in the class
MaxAsString.
Try queries on real data!
TextQuery4.java contains the query "count the number of occurrences of the word 'Mars' ".
Borrow what you need from previous queries.
HINT: Fold should be the last Iterator in this query. Create an inner class that implements
FoldFunction, defined to count inputs.
Part 7: Processing spreadsheets
End result of this part:
? run the following Queries
o FlightsQuery.java
We have provided a comma separated values (CSV) file (flights-tiny.csv2
) that contains
spreadsheet data. Each line in the file represents one airline flight. A line contains several fields,
which are integer and string values separated by commas. Each field is like a column in a
spreadsheet.
The names of the fields in the csv are:
year month day of month airline flight
number
origin city dest city cancelled time
In FlightsQueries.java, you'll find a partially written query, where the Iterator called records
returns elements which are FlightRecords, a simple object with the fields above.
Your job is to finish the query so that it computes "the number of flights that occurred in the
year 2015".
Optionally run on more data
If you get the right answer on flights-tiny.csv, you can try running the query on flights-small.csv,
Download flights-small.csv from the “Homework 5” Assignment on ICON. Put the file in the
same directory where flights-tiny.csv is. Make sure to comment out the correct line in
FlightsQuery.java:
//Iterator<String> lines = new LineFileReader("flights-small.csv"); // expects answer: 520718
Iterator<String> lines = new LineFileReader("flights-tiny.csv"); // expects answer: 5
Helpful tips
You will need to define classes that implement the ApplyFunction, FlatApplyFunction,
Predicate, or FoldFunction interfaces in the queries. Make them inner classes within the
same Java file and make them "static". See TextQuery1a.java for an example.
Various problems related to Iterators
o hasNext() crashing when called multiple times
o calling hasNext() multiple times causes the Iterator to skip elements
o next() crashing or returning bad values even when hasNext() returned true
o hasNext() returns false too early / never returns false
Try to identify an invariant for the Iterator you are working on. Then make sure that
invariant holds before and after calls to hasNext() and next().
2
from Bureau of Transportation statistics
https://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time
? Test xyz isn't passing! As in HW3, do some investigation to narrow down the source of
the problem. Which line(s) of code are doing something different than you expected?
Then, if you are still stuck, bring your question (and 1-3 hypotheses about what might be
wrong) to a peer, Piazza discussion board, or a staff member in person.
Extra credit (up to 2 points)
You may only attempt this section, if you've finished the rest of the assignment.
These queries are challenging but should not require new Iterator classes.
Turn in an additional file named ExtraCreditFlightsQuery2.java where the main method
runs the following query.
o Return the percentage (as a decimal number < 1) of flights between Los Angeles
CA and New York NY that were cancelled.
Turn in an additional file named ExtraCreditFlightsQuery3.java where the main method
runs the following query.
o Return the airline, origin city, dest city, and time (as a String[] with 3 elements)
for the flight with the highest time.
Extra credit (up to 2 points)
You may only attempt this section, if you've finished the rest of the assignment.
Write an interesting query on 1 of the 2 CSV files from Application Activity 2: Lists. To read the
CSV file, you can either use FlightsQuery as an example or you can use the CSV library used in
the Application Activity. If you use the CSV library, you should write a new reader and put it in
the readers/ folder. Put the query, itself, in the queries/ folder and call it CensusQuery.java. You
must also write the query in English as a comment in your java file.
What is an interesting query?
The query should have a small answer that a human could understand. That means that
it will require at least one Filter, FlatApply, or Fold iterator, not just Apply.