RQ4留学生讲解、辅导Java，Python编程设计、讲解c++，Java 讲解留学生Processing|讲解留学生Processing

2019.10.25 - 首页 >> 其他

ADM / 2019 / Homework_2 / README.MD
tlancian [upd] RQ4 - Clarification
a9dc986 4 days ago
4 contributors
Branch: master Find file Copy path
130 lines (73 sloc) 11.3 KB
Homework 2 - Soccer analytics
Soccer analytics is attracting an increasing interest of academia
and industry, thanks to the availability of sensing technologies
that provide high-fidelity data streams extracted from every
match.
The goal of this assignment is to perform an analysis on the
largest open collection of soccer-logs ever released, collected
by [Wyscout] (https://wyscout.com/) containing all the spatiotemporal
events (passes, shots, fouls, etc.) that occur during all
matches of the entire season 2017-2018 of seven competitions
(La Liga, Serie A, Bundesliga, Premier League, Ligue 1, FIFA
World Cup 2018, UEFA Euro Cup 2016). A match event contains
information about its position, time, outcome, player and
characteristics.
In particular, we are curious to answer to some specific
research questions (RQs) that may help us discover and
interpret meaningful patterns in data.
Raw Blame History
Before starting
Among all numerous things and good practises a data scientist
needs to do before running any analysis, there is one the is of
uttermost importance: get data and understand it!
Here you find the list of tasks you need to perform before
digging into the rich world of soccer.
Get your data! Go to this website and download the files
related to Coaches, Players, Events, Teams and Matches.
Throughout the analysis we focus only in club teams
information. So, there is no need to download/use the
files relative to the European Cup and the World Cup.
Understand your data. Read the legend of each column to
understand what it refers to. Additional information about
the labels can be found here: Coaches, Players, Events,
Teams, Matches. Please, be sure that you've understood
the data before start coding.
Handling data. The data are provided in multiple .json
files, with some of the columns present in more than one
file. For this reason, in order to answer the RQs, we kindly
suggest you to import the .json files as pandas
DataFrame object and then, based on what you want to
analyze, perform joins among the DataFrames. Here you
can find a quick useful guide. Remember, Google is your
best friend!
VERY VERY IMPORTANT
]. !!! Read the entire homework before coding anything!!!
^. My solution it's not better than yours and yours is not
better than mine. In any data analysis task, there is not a
unique way to answer to RQs. For this reason it is crucial
(necessary and mandatory) that you describe any single
decision you take and all the steps you do.
_. Once performed any exercise, comments about the
obtained results are mandatory. We are not always explicit
where to focus your comments, but we will always want
some brief sentences about your discoveries.
Research questions
Exploratory Data Analysis
General Setup: All the analysis requested from RQ1 to RQ5,
must be performed only over the Premier League dataset.
]. [RQ1] Who wants to be a Champion? During a season could
happen that a team has bad periods. For example, more
than three consecutive games lost, or it could have a
positive trend where it seems to be unbeatable. Let's
visualize this trends!
Create a plot where each point (x,y) represents the number
of points obtained by team x at game week y. In order to
show the trends, points related to the same team must be
connected to each other. Remind: in soccer each team gets
3 points for a win, 1 point for a tied game, and 0 for a loss.
Highlight the two teams that got the longest winning streak
(# of consecutive wins), and the two teams that got the
longest losing streak (# of consecutive losses).
Below you can see a similar example of what we would like
you to show us. Keep in mind that you must create this plot
for all the entire season (38 game weeks).
^. [RQ2] Is there a home-field advantage? It is generally
believed that there is an underlying home field advantage in
sport, i.e. an highest probability of winning of the home
team. Let's check for this, and see whether the outcome of
the game (win, draw, lose) is correlated to the playing side
(home or away). For 5 different teams of Premier League,
show the contingency table (outcome x side). Therefore,
perform an "overall" Chi-squared test in the following way:
build a unique contingency table, that contains all the
matches in which only one of the 5 teams previously
selected is involved, to see whether there is home field
advantage. State clearly the tested hypothesis and whether
it is accepted or rejected.
_. [RQ3] Which teams have the youngest coaches? Rank all
the teams by the age of their coach and show the 10 teams
with the youngest coaches. Remember that during a
season a team could have more coaches, in that case pick
the younger of them. Additionally, show the distirbutions of
the ages of all coaches in Premier League, using a boxplot.
(Hint: There's an attribute birthDate).
f. [RQ4] Find the top 10 players with the highest ratio
between completed passes and attempted passes. For this
task, consider all the different types of passes, and as
specified in the website, a completed pass has tag 1801
(accurate event).
In order to avoid meaningless results (e.g. players who
played few minutes, and completed 2 passes over 2,
achieving 100% ratio), select an arbitrary threshold of
minimum attempted passes, in order to consider only the
subset of players that played enough. Justify the choices
you make.
i. [RQ5] Does being a tall player mean winning more air
duels? Soccer is a physical game, and it happens often in a
match that players are involved in air duels (i.e. when two
players are contending for the ball while it is not on the
ground). Make a plot that shows the dependency between
height of the player and the ratio of air duels won with air
duels attempted. The visualization should be a scatterplot,
where each point (x,y) represent a player whose height is
equal to x, and that has a ratio of winning air duels equal to
y. Furthermore, color any point according an arbitrary
selection of categories of height (e.g. yellow: 160-165cm,
orange: 165-170cm, etc.)
Remember that the "Air Duel" is a subevent of the event
"Duel" and that an air duel is said to be won if it has the tag
"1801". Same as in RQ4, choose a threshold of minimum air
duels attempted, in order filter your data, get reliable
results, and justify your choice.
j. [RQ6] Free your mind! Go further with the EDA (Exploratory
Data Analysis) showing a new interesting result about the
dataset that you found.
Core Research Questions
[CRQ1] What are the time slots of the match with more
goals? Let's analyse and visualise the goals distribution into
9-minutes sets for all the matches. I.e., let's transform the
minute of a goal from a continuous variable in a discrete
variable (e.g. A goal scored in 5th minute, will end up in the
interval [0-9)). Remind that every match goes usually from
minute 0, to minute 90, but in football it is always added an
arbitary amount of extra-time to every half of the match,
thus consider also the intervals "45+" and "90+".
i. Make a barplot with the absolute frequency of goals in
all the time slots.
ii. Find the top 10 teams that score the most in the
interval "81-90".
iii. Show if there are players that were able to score at
least one goal in 8 different intervals.
[CRQ2] Visualize movements and passes on the pitch! Here
we try to focus our attention on the zones that a player
covers during a match. For each event, we have a pair of
coordinates, that are respectively the starting and ending
point of that event. It can be helpful to follow this link.
Knowing all the different positions where events happen, let us
be able to create different types of visualizations:
]. Considering only the match Barcelona - Real Madrid
played on the 6 May 2018:
visualize with a heatmap the zones where Cristiano
Ronaldo was more active. The events to be considered
are: passes, shoots, duels, free kicks.
compare his map with the one of Lionel Messi.
Comment the results and point out the main
differences (we are not looking for deep and technique
analysis, just show us if there are some clear
differences between the 2 plots).
Here's an example of heatmap where are shown all the starting
positions of the goals of Arsenal during the entire season.
^. Considering only the match Juventus - Napoli played on
the 22 April 2018:
visualize with arrows the starting point and ending
point of each pass done during the match by Jorginho
and Miralem Pjanic. Is there a huge difference
between the map with all the passes done and the one
with only accurate passes? Comment the results and
point out the main differences.
Here there's an example of a map with arrows.
Theoretical Question
You are given the recursive function splitSwap, which accepts
an array a, an index i, and a length n.
function splitSwap(a, l, n):
if n <= 1:
return
splitSwap(a, l, n/2)
splitSwap(a, l+ n /2, n/2)
swapList(a, l, n)
The subroutine swapList is described here:
function swapList(a, l, n):
for i = 1 to n/2:
tmp = a[l + i]
a[l + i] = a[l + n/2 + i]
a[l + n/2 + i] = tmp
]. How much running time does it take to execute splitSwap(a,
0, n)? (We want a Big O analysis.)
^. What does this algorithm do? Is it optimal? Describe the
mechanism of the algorithm in details, we do not want to
know only its final result.
Bonus
]. Repeat the entire analysis for other leagues (La Liga, Serie
A, Bundesliga and Ligue 1), aggregating the results and
highlighting the differences you find among the leagues.
^. Make nice visualization using libraries like Bokeh and
Seaborn.