Course Work – Level 4
Event Detection
Total Marks – 100 marks & Weightage – 20%
Course work Deadline – Monday 19th November 2018 4:30PM
INTRODUCTION
The objective of this course work is to develop a Twitter crawler to collect
data (English only). Further to this, perform geo-tagging and conduct
basic data analytics. We recommend that students use the Python or Java
programming languages and MongoDB for data storage. It is very
important that students provide a working version of the software, as we
need to run it. Use the Twitter API for accessing data.
Students submit their code and report on or before the specified
deadline. In addition, students provide a sample of the data set. Submission is
through the Moodle page for the Web Science course.
For more details, you need to attend the lecture on Monday, 24th
September 2018 (Tuesday 25th for Singapore) and also on 1st and 12th October
2018.
The coursework will be marked out of 100. Coursework will have 20%
weight of the final marks. As is the usual practice across the school,
numerical marks will be appropriately converted into bands. The final
written exam, which will be in April/May 2019, will have 80% weightage.
Collect data for 1 hour of any day. In addition, collect geo-coded data for
Glasgow / Singapore for the same time period. (P.S. Singapore students
should collect geo-tagged data for Singapore, whereas Glasgow students
focus on Glasgow-specific geo-tagged data.)
Specific tasks to do
1. Develop a crawler to access as much Twitter data as possible (Total 25
marks)
a. Use streaming API (gardenhose api) for collecting 1% data (5
marks)
b. Enhance the crawling using Streaming & REST API (10 marks)
i. For example topic based or user based streaming (provide
justification for why you chose certain words or users to
follow)
ii. Keyword based and/or user based REST probes
c. Grab as much geo-tagged data for Glasgow/Singapore for the same
period (5 marks). It turned out that you cannot run this in
parallel with the above, so run it sequentially (for the subsequent
1 hour). Depending on the way you access the Twitter API, you
may get data for the same period as well. For example, the Streaming
API provides only real-time data; however, if you are using the
REST API then you will get past data as well.
d. Discuss your data access strategies and how you addressed
Twitter data access restrictions (5 marks)
Clearly specify the Twitter API specific restrictions you
encountered and how you addressed these restrictions for
collecting as much Twitter data as possible.
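As an illustration only, the geo-tagged collection in step 1c could be checked with a small bounding-box filter such as the sketch below. It assumes tweets are held as dictionaries in the standard Twitter JSON layout, where exact coordinates appear as a GeoJSON point in longitude, latitude order. The bounding-box values and the helper names (`BOUNDING_BOXES`, `tweet_point`, `in_bounding_box`) are assumptions made for this example, not official city boundaries or part of any API.

```python
# Approximate bounding boxes as (west_lon, south_lat, east_lon, north_lat).
# These values are illustrative assumptions, not official city boundaries.
BOUNDING_BOXES = {
    "glasgow": (-4.40, 55.78, -4.07, 55.93),
    "singapore": (103.60, 1.20, 104.05, 1.48),
}


def tweet_point(tweet):
    """Return (lon, lat) from a tweet's exact coordinates, or None."""
    coords = tweet.get("coordinates")
    if not coords or coords.get("type") != "Point":
        return None
    lon, lat = coords["coordinates"]  # GeoJSON order: longitude first
    return lon, lat


def in_bounding_box(tweet, box):
    """True if the tweet has exact coordinates that fall inside the box."""
    point = tweet_point(tweet)
    if point is None:
        return False
    west, south, east, north = box
    lon, lat = point
    return west <= lon <= east and south <= lat <= north


# Hand-made sample tweets (hypothetical data, trimmed to the fields used).
sample = [
    {"id": 1, "coordinates": {"type": "Point", "coordinates": [-4.25, 55.86]}},
    {"id": 2, "coordinates": {"type": "Point", "coordinates": [103.82, 1.35]}},
    {"id": 3, "coordinates": None},
]
glasgow_ids = [
    t["id"] for t in sample if in_bounding_box(t, BOUNDING_BOXES["glasgow"])
]
print(glasgow_ids)
```

The same west/south/east/north ordering is what the Streaming API location filter expects, so one box definition can serve both the crawl and the later counting. Tweets that only carry a `place` and no exact point are ignored by this sketch.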
2. Develop basic data analytics (Total 20 marks)
a. Count the amount of data collected. Specify the amount of geo-tagged
data in this data set. (5 marks) Consider all the data you
collected when counting this.
b. Count the amount of geo-tagged data from Glasgow / Singapore.
Measure if there is any overlap with the 1% data. (5 marks)
c. Count redundant data present in the collection (you may end up
collecting the same tweets again through various APIs) (5 marks).
Redundancy can be counted using the tweet id: two records with the same id are the same tweet.
d. Count the re-tweets and quotes (5 marks)
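The counts in step 2 could be produced along the following lines. This is a minimal sketch, assuming tweets are dictionaries in the standard Twitter JSON layout (fields `id_str`, `coordinates`, `retweeted_status`, `is_quote_status`); the function name `summarise` and the sample records are hypothetical.

```python
def summarise(tweets):
    """Count collected, unique, redundant, geo-tagged, retweet and quote tweets."""
    unique = {}
    for tweet in tweets:
        # Keep only the first copy of each tweet id; later copies are redundant.
        unique.setdefault(tweet["id_str"], tweet)
    distinct = list(unique.values())
    return {
        "collected": len(tweets),
        "unique": len(distinct),
        "redundant": len(tweets) - len(distinct),
        "geo_tagged": sum(1 for t in distinct if t.get("coordinates")),
        "retweets": sum(1 for t in distinct if "retweeted_status" in t),
        "quotes": sum(1 for t in distinct if t.get("is_quote_status")),
    }


# Hand-made sample (hypothetical data, trimmed to the fields used).
point = {"type": "Point", "coordinates": [-4.25, 55.86]}
tweets = [
    {"id_str": "1", "coordinates": None},
    {"id_str": "2", "coordinates": point},
    {"id_str": "2", "coordinates": point},          # same tweet seen twice
    {"id_str": "3", "retweeted_status": {"id_str": "1"}},
    {"id_str": "4", "is_quote_status": True},
]
stats = summarise(tweets)
print(stats)
```

One design choice to be aware of: the geo-tagged, retweet and quote counts here are taken over unique tweets. If the counts are meant to cover every record collected, iterate over `tweets` instead of `distinct`.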
3. Enhance the geo-tagged data (Total marks 30)
The idea is to enhance the geo-location information of tweets. Less than 6% of tweets
have geo-location information. You should group similar tweets, and then assign
a tweet location to each member of the group. Any grouping method can be used;
however, an LSH based approach is a good option. More grouping techniques will be
discussed in the Event Detection lecture on 12th October 2018.
Locality sensitive hashing (LSH) can be used for grouping similar tweets. LSH is
an algorithm for grouping similar documents into a single bucket. LSH is a data
independent hash method. Finding the nearest neighbor of an item is easy with
traditional approaches like linear search, but if the
database is big and the items are complicated, this leads to high cost and
computation time. LSH can be used for clustering, nearest neighbor
search, and detecting near or exact duplicates. LSH is implemented in Python in
LSHash by Kayzhu (https://github.com/kayzhu/LSHash). Any other grouping
technique can also be used.
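To make the idea of bucketing concrete, here is a minimal sketch of one common LSH variant, random-hyperplane hashing, written with the standard library only. It is not the LSHash library's API: the names `vectorise`, `make_hyperplanes`, `signature` and `group_tweets`, and the parameter values, are assumptions for this example. Each tweet is turned into a hashed bag-of-words vector, and its bucket key is one bit per hyperplane.

```python
import random
import re
import zlib
from collections import defaultdict

DIM = 256      # size of the hashed bag-of-words vector (assumed value)
NUM_BITS = 12  # number of random hyperplanes, i.e. bits per signature


def vectorise(text, dim=DIM):
    """Hash each lowercase token into a fixed-size count vector."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9#@']+", text.lower()):
        # crc32 is stable across runs, unlike Python's built-in hash().
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    return vec


def make_hyperplanes(num_bits=NUM_BITS, dim=DIM, seed=42):
    """Draw the random hyperplanes once; they do not depend on the data."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """One bit per hyperplane: the side of the plane the vector lies on."""
    return "".join(
        "1" if sum(p * v for p, v in zip(plane, vec)) >= 0 else "0"
        for plane in planes
    )


def group_tweets(texts, planes):
    """Map each signature to the indices of the texts that share it."""
    buckets = defaultdict(list)
    for index, text in enumerate(texts):
        buckets[signature(vectorise(text), planes)].append(index)
    return dict(buckets)


planes = make_hyperplanes()
texts = [
    "Fire on Sauchiehall Street",
    "fire on sauchiehall street",
    "Great pasta tonight",
]
groups = group_tweets(texts, planes)
for key, members in groups.items():
    print(key, members)
```

Fewer bits give larger, looser groups and more bits give smaller, tighter ones, so `NUM_BITS` is the main knob to report when describing the grouping. Texts with identical token counts always share a bucket; similar texts share one only with some probability.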
Next, assign a geo-location to each tweet in each of the groups. We will discuss a
strategy on 1st October 2018.
a. Grouping of tweets (10 marks). Provide statistics like how many groups
and the number of tweets per group. This can be given as a histogram or
graph (for example, groups on the x-axis and size of the group on the y-axis). Enrich
this information with another graph showing the number of geo-tagged
tweets and also profile-based geo-information.
b. Geo-location assignment (10 marks). Explain your method and justify it.
Enhance the above graph information by providing the number of additional
tweets with geo-information. Also comment on the number of tweets
with no geo-information (you won't be able to assign geo-information if
none of the tweets in a given group (created above) have geo-information).
You should say how many groups fall in this category.
c. Conduct an evaluation of the method (10 marks). A method will be
discussed on Monday 1st October 2018. One option would be to take 50% of
the geo-tagged tweets and assume that they have no geo-information. Assign
a geo-location to each of these using your above method. Measure the
differences between the assigned location and the actual location. Plot them!
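Steps 3b and 3c could be prototyped as in the sketch below. It is one possible strategy, not the method discussed in the lecture: each untagged tweet receives the component-wise median of the known locations in its group, and the hold-out evaluation measures the error with the haversine great-circle distance. The data layout (`groups` as group id to tweet ids, `known` as tweet id to latitude/longitude) and all function names are assumptions for this example.

```python
import math
import random
from statistics import median


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def group_location(points):
    """Component-wise median of the known (lat, lon) points in a group."""
    return (median(p[0] for p in points), median(p[1] for p in points))


def assign_locations(groups, known):
    """Return (assigned locations, ids of groups with no geo-tagged tweet)."""
    assigned, groups_without_geo = {}, []
    for group_id, members in groups.items():
        points = [known[m] for m in members if m in known]
        if not points:
            groups_without_geo.append(group_id)  # nothing to propagate
            continue
        location = group_location(points)
        for member in members:
            if member not in known:
                assigned[member] = location
    return assigned, groups_without_geo


def evaluate(groups, known, fraction=0.5, seed=0):
    """Hide a fraction of known locations, re-assign them, return errors in km."""
    rng = random.Random(seed)
    ids = sorted(known)
    hidden = set(rng.sample(ids, int(len(ids) * fraction)))
    visible = {i: loc for i, loc in known.items() if i not in hidden}
    assigned, _ = assign_locations(groups, visible)
    # Hidden tweets whose group lost all its locations cannot be scored.
    return {i: haversine_km(known[i], assigned[i]) for i in hidden if i in assigned}


# Hand-made example (hypothetical ids and coordinates).
groups = {"g1": ["a", "b", "c"], "g2": ["d"]}
known = {"a": (55.86, -4.25), "b": (55.87, -4.26)}
assigned, groups_without_geo = assign_locations(groups, known)
errors = evaluate(groups, known)
print(assigned, groups_without_geo, errors)
```

The values in `errors` are the per-tweet differences to plot, and `groups_without_geo` gives the number of groups for which no assignment is possible. The median is used rather than the mean so that a single far-away tweet does not drag the group location; this is a judgment call that should be justified in the report.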
4. Open Ended Problem (Total 25 marks)
a. Develop a crawler for any of the following social media sites
(Facebook, Instagram, Google Plus, Tumblr, Flickr, any other);
b. Conduct data analysis similar to Twitter
c. We are not giving any more specific directions, as we want
students to be creative.
d. Analysis similar to Steps 1 and 2 expected.
Report structure & Mark Distribution
The report should be organised in the following way, with marks distributed as shown:
1. Section 1: Introduction
a. Describe the software developed with appropriate details; if
you have used code from elsewhere please specify it
b. Specify the time and duration of the data collected;
2. Section 2: Data crawl
a. Use streaming API for collecting 1% data (5 marks)
i. Specify the APIs used
1. Please do not include the entire code here; just the main
description of the function
2. Along with a short description/justification
b. Enhance the crawling using Streaming and REST API (10
marks)
i. Specify the APIs used
1. Please do not include the entire code here; just the main
description of the function
2. Along with a short description/justification
c. Grab as much geo-tagged data for Glasgow/Singapore for the
same period (5 marks)
i. Specify the APIs used
1. Please do not include the entire code here
d. Discuss your data access strategies and how you addressed
Twitter data access restrictions (5 marks)
i. Discuss how creative you were in collecting as much data as possible
3. Basic data analytics (Total 20 marks)
a. Count the amount of data collected (5 marks)
i. Show a histogram of 10-minute periods (x-axis:
10-minute intervals; y-axis: count)
b. Count the amount of geo-tagged data from Glasgow / Singapore
(5 marks)
i. Show a histogram of 10-minute periods (x-axis:
10-minute intervals; y-axis: count)
c. Count redundant data present in the collection (you may end
up collecting the same tweets again through various APIs) (5
marks)
i. Show a histogram of 10-minute periods (x-axis:
10-minute intervals; y-axis: count) for both collected
data and redundant data for the same period. Redundancy
can be measured based on the Tweet ID. (In cases where
you use APIs sequentially, just treat all the data
as one repository.)
d. Count the re-tweets and quotes (5 marks)
i. Show a histogram of 10-minute periods (x-axis:
10-minute intervals; y-axis: count) for collected
data, re-tweets, and quotes
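The 10-minute histograms requested in this section could be computed as in the sketch below. It assumes each tweet dictionary carries the standard Twitter `created_at` string (for example `Mon Nov 19 16:03:10 +0000 2018`); the function names and sample records are hypothetical. The same function can be applied to the full collection, the redundant records, the re-tweets, or the quotes to obtain each series.

```python
from collections import Counter
from datetime import datetime

# Layout of the created_at field in Twitter JSON.
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def ten_minute_bin(created_at):
    """Round a created_at string down to the start of its 10-minute slot."""
    ts = datetime.strptime(created_at, TWITTER_TIME_FORMAT)
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def histogram(tweets):
    """Return a time-ordered list of (slot_start, count) pairs."""
    counts = Counter(ten_minute_bin(t["created_at"]) for t in tweets)
    return sorted(counts.items())


# Hand-made sample (hypothetical data, trimmed to the field used).
tweets = [
    {"created_at": "Mon Nov 19 16:03:10 +0000 2018"},
    {"created_at": "Mon Nov 19 16:09:59 +0000 2018"},
    {"created_at": "Mon Nov 19 16:10:00 +0000 2018"},
]
hist = histogram(tweets)
for slot_start, count in hist:
    # A text bar per slot; a plotting library can draw the same pairs.
    print(slot_start.strftime("%H:%M"), "#" * count)
```

Slots with no tweets do not appear in the result, so for a one-hour crawl it may be worth filling in the missing slots with zero before plotting.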
4. Enhance the geo-tagged data (Total marks 30)
a. Grouping of tweets (10 marks). Provide statistics like how
many groups and the number of tweets per group. This can be
given as a histogram or graph (for example, groups on the x-axis
and size of the group on the y-axis). Enrich this information with
another graph showing the number of geo-tagged tweets and also
profile-based geo-information.
b. Geo-location assignment (10 marks). Explain your method and
justify it. Enhance the above graph information by providing the
number of additional tweets with geo-information. Also
comment on the number of tweets with no geo-information
(you won't be able to assign geo-information if none of the
tweets in a group have geo-information).
c. Conduct an evaluation of the method (10 marks). A method will
be discussed on Monday 1st October 2018. One option would be to
take 50% of the geo-tagged tweets and assume that they have no
geo-information. Assign a geo-location to each of these
using your above method. Measure the differences between
the assigned location and the actual location. Plot them!
5. Crawling data from another social media
a. Develop a crawler for any of the following social media sites
(Facebook, Instagram, Google Plus, Tumblr, Flickr, any other);
i. Describe access restrictions and the approach
developed
b. Conduct data analysis similar to the Twitter data analysis
i. What level of comparative analysis is given
c. We are not giving any more specific directions, as we want
students to be creative.
What to submit
Non-conformity to the submission instructions will lead to a reduction in marks.
1) Report as a PDF file. (Please submit this through the link provided just for the report.)
2) A zip file (it should be less than 100 MB; please submit this through
a separate link) containing:
a. Software (runnable version, readme info, and properly
commented). It is important that the software is runnable with
minimum effort for the markers
b. Data: provide sample data for about 5 minutes. You can decide
the format (like JSON, or plain data file). Importantly, your
software should be able to run on this sample data without much
hassle.
c. Make sure that a and b together are less than 100 MB, as Moodle doesn't
allow uploading files over 100 MB
Where to submit
1) Through Moodle link (given)