CE634 Assignment 1&2
Preprocessing and Exploratory Data Analysis of Large-Scale Taxi GPS Traces
1. Introduction
In all three assignments of this subject, you will work with a large-scale taxi GPS dataset. The
dataset records millions of taxi trips in Manhattan, New York in a given year. This dataset has been used
extensively to study the dynamics of urban taxi flow. For example, it has been used by a group of
researchers at MIT to evaluate the ride-sharing potential of the city (Santi et al., 2014) and to estimate the
minimum taxi fleet that is able to serve all the travel demand in the city (Vazifeh et al., 2018).
In this assignment, you will be asked to preprocess the dataset, explore it, and derive meaningful statistics
through exploratory data analysis. To start, you are provided with the following two files:
taxi_id.csv.bz2
intersections.csv
The first compressed file (taxi_id.csv.bz2) records the origin and destination of the taxi trips along with the
timestamps. For simplicity, the origin and destination of the actual trips have been matched to the nearest
road intersections. The format of this file is as follows:
taxi_id, pick_up_time, drop_off_time, pick_up_intersection, drop_off_intersection
The taxi_id is a numerical value that uniquely identifies each taxi. pick_up_time and drop_off_time are
expressed in Unix epoch time, and pick_up_intersection, drop_off_intersection are the indices of the
intersections (numbers from 1 to 4091).
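The record format above can be read directly from the compressed file. The following is a minimal sketch using only the Python standard library, not a prescribed method. It assumes that the timestamps are in seconds, that taxi_id and the intersection indices are integers, and that the file may or may not begin with a header row; none of these details is confirmed by the brief. The names Trip, parse_trips and read_trips are hypothetical.

```python
import bz2
import csv
from datetime import datetime, timezone
from typing import Iterable, Iterator, NamedTuple


class Trip(NamedTuple):
    taxi_id: int
    pick_up_time: datetime       # timezone-aware, UTC
    drop_off_time: datetime      # timezone-aware, UTC
    pick_up_intersection: int    # index from 1 to 4091
    drop_off_intersection: int   # index from 1 to 4091


def parse_trips(lines: Iterable[str]) -> Iterator[Trip]:
    """Yield Trip records from an iterable of CSV text lines."""
    for row in csv.reader(lines, skipinitialspace=True):
        # Skip blank lines and a possible header row (assumption: the
        # first field of every data row is an integer taxi_id).
        if not row or not row[0].strip().isdigit():
            continue
        taxi, t0, t1, src, dst = row[:5]
        yield Trip(
            int(taxi),
            # Assumption: Unix epoch time is given in seconds.
            datetime.fromtimestamp(float(t0), tz=timezone.utc),
            datetime.fromtimestamp(float(t1), tz=timezone.utc),
            int(src),
            int(dst),
        )


def read_trips(path: str) -> Iterator[Trip]:
    """Stream trips from the bz2-compressed CSV without unpacking it to disk."""
    with bz2.open(path, mode="rt", newline="") as handle:
        yield from parse_trips(handle)
```

A call such as `for trip in read_trips("taxi_id.csv.bz2"): ...` streams the trips one at a time, which keeps memory use low for a file with millions of rows. The timestamps are kept in UTC here; any conversion to New York local time is left to the analysis step.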
The second file (intersections.csv) lists the street intersections to which the pick-up and drop-off points
were snapped. The format of the file is:
id, latitude, longitude
where id is a sequential identifier from 1 to 4091 and latitude and longitude are the GPS coordinates of
the intersection. Below are two screenshots of these road intersections:
2. Tasks
In this section, you will be asked to analyze the dataset, using any software or programming language that
you prefer, and then provide answers to the following research questions:
(1) How many unique taxis are there in this dataset, and how many trips are recorded?
(2) What is the distribution of the number of trips per taxi? Which taxis are the top performers?
(3) How does the daily trip count (i.e., number of trips per day) change throughout the year? Any rhythm
or seasonality?
(4) What is the distribution of the number of departure trips at different locations (i.e., intersections)? What
about the distribution of arrival trips? What can you conclude from these two distributions?
(5) How does the number of trips change over the course of a day? (You will be given three dates randomly
selected from the dataset, and you should plot the hourly variation in trips for each date in local time.)
(6) What is the probability distribution of the trip distance (measured as straight-line distance)? What about
travel time (i.e., trip duration)? What can you conclude from these two distributions?
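For question (6), one possible way to obtain a straight-line distance from two latitude/longitude pairs is the haversine (great-circle) formula. The sketch below illustrates that choice under stated assumptions and is not the required method: a planar approximation would also be reasonable at the scale of Manhattan. The Earth radius constant, the function names, and the coords mapping (intersection id to a (latitude, longitude) tuple built from intersections.csv) are assumptions, as is the use of seconds for epoch timestamps.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius in kilometres


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before sqrt/asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def trip_distance_km(coords, pick_up_intersection, drop_off_intersection):
    """Distance between two intersections; coords maps id -> (latitude, longitude)."""
    lat1, lon1 = coords[pick_up_intersection]
    lat2, lon2 = coords[drop_off_intersection]
    return haversine_km(lat1, lon1, lat2, lon2)


def trip_duration_min(pick_up_time, drop_off_time):
    """Trip duration in minutes, assuming both epoch timestamps are in seconds."""
    return (drop_off_time - pick_up_time) / 60.0
```

Applying these two functions to every trip gives the raw values from which a histogram or an empirical probability density can be drawn. Trips with a zero or negative duration, if any exist, would need to be handled during preprocessing.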
For questions (2) to (6), you are required to provide figures along with your answers. Note that some of the
above questions are open-ended, and the answers could vary among students.
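Questions (1) and (2) reduce to counting. As a hedged illustration, the sketch below uses collections.Counter from the Python standard library; the function name summarize_taxis, the keys of the returned dictionary, and the choice to report the top three taxis are all hypothetical and are not part of the assignment.

```python
from collections import Counter


def summarize_taxis(taxi_ids, top_n=3):
    """Summarize an iterable of taxi_id values, one value per recorded trip."""
    trips_per_taxi = Counter(taxi_ids)
    return {
        # Question (1): number of unique taxis and total number of trips.
        "n_taxis": len(trips_per_taxi),
        "n_trips": sum(trips_per_taxi.values()),
        # Question (2): taxis with the most trips, as (taxi_id, trip_count) pairs.
        "top": trips_per_taxi.most_common(top_n),
        # Question (2): how many taxis made exactly k trips, as {k: number_of_taxis}.
        "distribution": dict(Counter(trips_per_taxi.values())),
    }
```

The "distribution" entry maps a trip count to the number of taxis with that count, which is the quantity a histogram for question (2) would display. The same counting pattern, keyed by intersection index instead of taxi_id, could serve for question (4).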
3. What to submit
A Word document or PDF file with answers to questions (1) to (6).
The computer code used in this assignment. If specific software is used, please describe the
procedure by which it was used to derive the answers.
The submission due date is November 9th, 2018.
4. Access to the dataset
The dataset used in this assignment can be downloaded via the following link:
https://polyuitmy.sharepoint.com/:f:/g/personal/yangxu_polyu_edu_hk/EtCz10QsyxZMhY8_Z3bF9xYBmhb_2CLZK9G6QtlgO0jtg?e=YePkJ0
Please contact the subject instructor if the link does not work.
References
Santi, P., Resta, G., Szell, M., Sobolevsky, S., Strogatz, S. H., & Ratti, C. (2014). Quantifying the benefits
of vehicle pooling with shareability networks. Proceedings of the National Academy of Sciences, 111(37),
13290-13294.
Vazifeh, M. M., Santi, P., Resta, G., Strogatz, S. H., & Ratti, C. (2018). Addressing the minimum fleet
problem in on-demand urban mobility. Nature, 557(7706), 534-538.