

Assignment 2: Apache Spark Programming and Optimisation
Group Assignment (15%) 10/05/2023
Introduction
This is the second part of the practical assignment of DATA3404, in which you have to translate some
SQL queries and data analysis tasks into Apache Spark Python programs, and optimise
them in Databricks.
You will find links to online documentation, data, and hints on the tools and schema needed for this
assignment in the 'Assignment' section of the Canvas modules.
Data Set Description
This assignment is based on the same Aviation On-time scenario as Assignment 1, which includes
data about airports, airlines, aircraft, and flights. We will provide a slightly cleaner data set
this time, but otherwise the schema is the same (the primary key of each relation is listed first):
Airports (airport_code, airport_name, city, state, country)

Aircrafts (tail_number, manufacturer, model, aircraft_type, year)

Airlines (carrier_code, name, country)

Flights (flight_id, carrier_code, flight_number, flight_date, origin, destination, tail_number,
scheduled_departure_time, scheduled_arrival_time, actual_departure_time, actual_arrival_time, distance)
You will be provided with scaffold Jupyter notebook files for setting up your Databricks workspace,
similar to Assignment 1. Note that this scaffold will download and prepare new files which have
been pre-processed to have fewer data anomalies. As such, you can apply a more stringent schema.
Downloading Datasets
As part of the tutorial in Week 11, we have provided an Assignment 2 Bootstrap notebook to
handle downloading data into DBFS. Note that this dataset has been cleaned, so it is different from
the Assignment 1 data.
Questions
There are two group tasks to be solved:
- In the first task, you are provided with some SQL code that is similar to an Assignment 1
question. You are asked to rewrite it as Python Spark code and then optimise it.
- In the second task, you are provided with a question which you must answer with some
Python Spark code, which you then also must optimise.
Task 1: SQL Code conversion
Rewrite the following SQL query as Python Spark code, using the Python DataFrame API. Then
optimise your code using Spark's explain() functionality.
1. Determine the names of the top-5 US airlines with the most Boeing aircraft.
SELECT A.name, COUNT(B.tail_number)
FROM Airlines A JOIN Flights F USING (carrier_code)
     JOIN Aircrafts B USING (tail_number)
WHERE A.country = 'USA' AND B.manufacturer = 'Boeing'
GROUP BY A.name
ORDER BY COUNT(B.tail_number) DESC
LIMIT 5
Conversion: Rewrite as Apache Spark Python code (PySpark).
Optimisation: Optimise this query. Compare the optimisations on small/medium/large
datasets, both in terms of the query plan and in terms of runtime (you can use the sparkmeasure
package for this, cf. Tutorial Week 11).
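For illustration only, a minimal sketch of such a translation follows. It assumes three DataFrames named airlines, flights and aircrafts have already been loaded from the provided datasets with the schema above (these names are placeholders, not part of the scaffold); it is a starting point, not a prescribed solution.

from pyspark.sql import functions as F

# Sketch only: `airlines`, `flights`, `aircrafts` are assumed to be DataFrames
# loaded from the cleaned datasets with the schema shown above.
top5 = (airlines
        .join(flights, "carrier_code")     # USING (carrier_code)
        .join(aircrafts, "tail_number")    # USING (tail_number)
        .where((F.col("country") == "USA") & (F.col("manufacturer") == "Boeing"))
        .groupBy("name")
        .agg(F.count("tail_number").alias("boeing_count"))
        .orderBy(F.col("boeing_count").desc())
        .limit(5))

top5.explain(True)        # inspect the logical and physical plans before/after your changes
top5.show(truncate=False)

One typical line of investigation is whether the optimiser already pushes the country and manufacturer filters below the joins; explain() lets you verify this on each dataset size.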
Task 2: PySpark Code to determine "Airports with largest total delay, and their
top-5 delayed airlines"
In this second task, write an Apache Spark program that determines, for a given year and country
(both user-specified), the top-10 airports with the largest total annual departure delay (in minutes)
and lists them in descending order of the total delay. For each of those airports, further list the
five airlines contributing most to this overall departure delay in the given year, together with the
percentage of how much they contributed to this delay. List those airlines as a comma-separated
string enclosed in '[' and ']' (indicating a list), with each airline listed with both its name and
the percentage of its delay contribution, in descending order of the percentage.
The output should have the following tab-delimited (\t) format:
airport name \t total departure delay \t [(airline1, percentage), (airline2, percentage), ...,
(airline5, percentage)]
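As a hedged illustration of one way to assemble this format (not the required approach), the sketch below assumes a DataFrame airline_delays with one row per (airport, airline) pair and placeholder columns airport_name, total_delay, airline_name and pct, already restricted to the top-5 airlines per airport:

from pyspark.sql import functions as F

# Sketch only: DataFrame and column names are placeholders.
output = (airline_delays
          .withColumn("entry", F.concat(F.lit("("), F.col("airline_name"),
                                        F.lit(", "), F.col("pct").cast("string"), F.lit(")")))
          .groupBy("airport_name", "total_delay")
          .agg(F.concat_ws(", ", F.collect_list("entry")).alias("airlines"))
          .select(F.concat_ws("\t",
                              F.col("airport_name"),
                              F.col("total_delay").cast("string"),
                              F.concat(F.lit("["), F.col("airlines"), F.lit("]")))
                  .alias("line")))

Note that collect_list does not guarantee ordering after a groupBy, so sorting the five airlines by percentage needs additional care (for example, collecting (pct, name) structs and applying sort_array in descending order).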
Deliverable: write as Apache Spark Python code (PySpark).
Optimisation: Optimise your program. Compare the optimisations on small/medium/large datasets,
both in terms of the query plan and in terms of runtime performance (e.g. using StageMetrics from
the sparkmeasure package).
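For runtime comparisons, a minimal sparkmeasure usage pattern is sketched below, assuming the package has been installed on the cluster as in the Week 11 tutorial; result stands for whatever DataFrame your program produces:

from sparkmeasure import StageMetrics

stagemetrics = StageMetrics(spark)
stagemetrics.begin()
result.collect()              # the action whose stages you want to measure
stagemetrics.end()
stagemetrics.print_report()   # aggregated stage metrics: elapsed time, shuffle, I/O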
Optimisation Recommendations and Report contents
You may use the following approaches as guidelines for your analysis and report:
- Analyse the operation of your queries using the techniques covered for SQL and Spark.
- Identify something that can be improved, and change the query to support that and thus make it
more efficient.
- Ensure the output is unchanged – and that you anticipate it would be unchanged for any valid
inputs, i.e. do not rely on the specific data you have available for testing.
- If you are unable to identify an improvement, you should write something that is equivalent
in output but involves a different execution.
- Observe an improvement from the SQL execution plan, the Spark DAG, and the sparkmeasure
analysis.
- You may wish to use the larger flights datasets to check this, particularly the large dataset –
this will make it clear whether there is a speed improvement. Note that these versions have
not been cleaned as the flights small cleaned dataset has. To achieve similar results, dropna
should be applied to the dataframe when it is loaded, the column names should be corrected,
and inner joins should be used when connecting with airline/airport/aircraft data (a loading
sketch is given after this list).
- In the event that any bugs in the provided queries are discovered, they should be treated as
working – that is, your modification should retain any bugged behaviour rather than try to
correct it to match the stated question.
- The cache clearing in the Spark code should be retained – do not use caching that requires
the queries to be run multiple times to achieve efficiency. The efficiency improvement should
come from the query being run for the first time. Caching within the query itself is fine.
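A minimal sketch of the loading pattern described above for the larger, uncleaned flights files follows; the DBFS path, the renamed column, and the airlines DataFrame are placeholders and must be adapted to the files prepared by the bootstrap notebook:

# Sketch only: path and column names are placeholders.
flights_large = (spark.read.csv("dbfs:/FileStore/flights_large.csv",
                                header=True, inferSchema=True)
                 .withColumnRenamed("tailnum", "tail_number")   # correct column names as needed
                 .dropna())                                     # drop rows with missing values

# use inner joins (the default join type) when connecting to airline/airport/aircraft data
joined = flights_large.join(airlines, "carrier_code", "inner")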
Note that each task should be approached by the group as a whole. Each task answer should have
the authors who contributed to that version indicated in a comment.
Note also that there will be quite a lot that you might identify and change – you do not need
to optimise these queries to be perfectly efficient! All you need to do is identify one aspect of
inefficiency in each question and correct it. However, there are a couple of things that are the same
for both tasks, particularly with Spark – for your improvement to count, you will need to use a
different approach for each task rather than cutting and pasting an improvement that applies to both.
Deliverables and Submission Details
There are three deliverables per group:
1. a brief report/documentation outlining your outputs, optimisations and observations;
2. the source code: a Jupyter notebook as a single .DBC archive or notebook source file that answers
the given two tasks; and
3. a demo in Week 12/13 where you will be quizzed on the contribution of each member.
Here are the specifics associated with each deliverable item.
Report
Filename recommendation: data3404_y23s1_assignment2_tutgroupname_assignmentgroupnum.pdf
- Your group name, including tutorial code and group number.
- The answers (output) you receive for each question when executing against all three
(small/medium/large) datasets.
- A short explanation of 'what is going on', in the general sense, in the Spark code of your
solutions for each task.
- A group contribution statement with the following headings:
  – Your group members' names and SIDs
  – Whether they contributed meaningfully to tasks and the report (yes or no)
This does not have a strict page limit, but you should keep it relevant to ensure feedback can
be useful. In particular:
– Do not include large code dumps. You are already submitting all of your code. Use 1
or 2 line snippets if you absolutely must. Remember that including your code is not
explaining your code.
– Do not keep writing more in the hope that some of what you include will be correct. You
are more likely to run into issues including incorrect information than you are to gain
by including correct information.
Jupyter notebook or DBC Archive
Filename recommendation: data3404_y23s1_assignment2_tutgroupname_assignmentgroupnum.ipynb
A single Jupyter notebook file (or .DBC archive) that contains all of your completed tasks. This file
must be able to run when attached to a Databricks Community cluster that has had the Assignment
Bootstrap notebook run on it, with no other configuration made.
Due Date: All deliverables are due in Week 13, no later than Sunday 28th May. Late submission
penalty: -5% of the marks per day late. The marking rubric is in Canvas.
Students must retain electronic copies of their submitted assignment files and databases, as the
unit coordinator may request to inspect these files before marking of an assignment is completed. If
these assignment files are not made available to the unit coordinator when requested, the marking
of this assignment may not proceed.
All the best!
Group member participation
This is a group assignment. The mark awarded for your assignment is conditional on you being
able to explain any of your answers to your tutor or the lecturers if asked.
If your group is experiencing difficulties with the content, you should ask on Ed (use a private
post if you need to discuss code or report writing directly).
Level of contribution                                                      Proportion of final grade received
No participation                                                           0%
Minor contributor, but at least some understanding of group submission     50%
Major contributor to the group's submission                                100%
