Course Work – Level 4

Event Detection

Total Marks – 100 marks & Weightage – 20%

Course work Deadline – Monday 19th November 2018 4:30PM

INTRODUCTION

The objective of this course work is to develop a Twitter crawler to collect
data (English only). Further to this, perform geo-tagging and conduct basic
data analytics. We recommend that students use the Python or Java programming
languages and also MongoDB for data storage. It is very important that
students provide a working version of the software, as we need to run it.
Use the Twitter API for accessing data.

Students submit their code and report on or before the specified deadline.
In addition, students provide a sample of the data set. Submission is through
the Moodle page for the Web Science course.

For more details, you need to attend the lecture on Monday, 24th September
2018 (Tuesday 25th for Singapore) & also 1st and 12th October 2018.

The coursework will be marked out of 100. The coursework will have 20% weight
of the final marks. As is the usual practice across the school, numerical
marks will be appropriately converted into bands. The final written exam,
which will be in April/May 2019, will have 80% weightage.

Collect data for 1 hour of any day. In addition, collect geo-coded data for
Glasgow / Singapore for the same time period. (ps. Singapore students should
collect geo-tagged data for Singapore, whereas Glasgow students focus on
Glasgow specific geo-tagged data)

Specific tasks to do

1. Develop a crawler to access as much Twitter data as possible (Total 25
   marks)
   a. Use streaming API (gardenhose api) for collecting 1% data (5 marks)
   b. Enhance the crawling using Streaming & REST API (10 marks)
      i. For example topic based or user based streaming (provide
         justification for why you chose certain words or users to follow)
      ii. Keyword based and/or user based REST probes
   c. Grab as much geo-tagged data for Glasgow/Singapore for the same period
      (5 marks). It turned out that you cannot run this in parallel with the
      above, so just run it sequentially (for the subsequent 1 hour).
      Depending on the way you access the Twitter API, you may get data for
      the same period as well. For example, the Streaming API provides only
      real-time data; however, if you are using the REST API then you will
      get past data as well.
   d. Discuss your data access strategies and how you addressed Twitter data
      access restrictions (5 marks). Clearly specify the Twitter API specific
      restrictions you encountered and how you addressed these restrictions
      for collecting as much Twitter data as possible.
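For task 1c, one way to decide whether a collected tweet counts as geo-tagged for Glasgow (or Singapore) is to test its location against a bounding box. The sketch below assumes tweets are stored as dicts in the Twitter v1.1 JSON layout, where `coordinates` is a GeoJSON point in [longitude, latitude] order and `place.bounding_box.coordinates[0]` lists the corners of a polygon. The box values are rough, illustrative assumptions, and `GLASGOW_BOX`, `tweet_point` and `in_box` are hypothetical names, not part of any library.

```python
# Rough bounding box: (west_lon, south_lat, east_lon, north_lat).
# These values are approximate and only for illustration.
GLASGOW_BOX = (-4.40, 55.78, -4.07, 55.93)


def tweet_point(tweet):
    """Return (lon, lat) for a tweet dict, or None if it has no geo data.

    Prefers the exact GeoJSON point; falls back to the centre of the
    place bounding box when only a place is attached.
    """
    coords = tweet.get("coordinates")
    if coords and coords.get("coordinates"):
        lon, lat = coords["coordinates"]
        return lon, lat
    place = tweet.get("place")
    if place and place.get("bounding_box"):
        corners = place["bounding_box"]["coordinates"][0]
        lons = [c[0] for c in corners]
        lats = [c[1] for c in corners]
        return sum(lons) / len(lons), sum(lats) / len(lats)
    return None


def in_box(tweet, box=GLASGOW_BOX):
    """True if the tweet's point lies inside the bounding box."""
    point = tweet_point(tweet)
    if point is None:
        return False
    lon, lat = point
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north
```

The box is written south-west corner first with longitude before latitude, which, as far as I know, is also the order the streaming `locations` filter expects. Please check this against the current API documentation before relying on it.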

2. Develop basic data analytics (Total 20 marks)
   a. Count the amount of data collected. Specify the amount of geo-tagged
      data in this data set. (5 marks) Consider all the data you collected
      when counting this.
   b. Count the amount of geo-tagged data from Glasgow / Singapore. Measure
      if there is any overlap with the 1% data. (5 marks)
   c. Count redundant data present in the collection (you may end up
      collecting the same tweets again through various APIs) (5 marks).
      Redundancy can be counted using the tweet id: two records with the
      same id are the same tweet.
   d. Count the re-tweets and quotes (5 marks)
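The counts in task 2 can be sketched in a few lines. This is a minimal illustration, assuming each tweet is a dict in the Twitter v1.1 JSON layout (`id_str`, `coordinates`, `place`, `retweeted_status`, `is_quote_status`). The function name `basic_counts` is hypothetical, and treating a tweet with either `coordinates` or `place` as geo-tagged is an assumption you may want to change.

```python
def basic_counts(tweets):
    """Count totals, redundant copies, geo-tagged tweets, retweets and quotes."""
    unique = {}
    for tweet in tweets:
        # Keep only the first copy seen for each tweet id.
        unique.setdefault(tweet["id_str"], tweet)

    distinct = list(unique.values())
    return {
        "total": len(tweets),
        "unique": len(distinct),
        "redundant": len(tweets) - len(distinct),
        "geo_tagged": sum(
            1 for t in distinct if t.get("coordinates") or t.get("place")
        ),
        "retweets": sum(1 for t in distinct if "retweeted_status" in t),
        "quotes": sum(1 for t in distinct if t.get("is_quote_status")),
    }
```

Geo-tagged tweets, retweets and quotes are counted on the de-duplicated set, so a tweet collected twice is not counted twice. A retweet of a quote tweet may carry both markers, so the two counts can overlap.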

3. Enhance the geo-tagged data (Total marks 30)

The idea is to enhance the geo-location information of tweets. Less than 6%
of tweets have geo-location information. You should group similar tweets, and
then assign a tweet location to each member of the group. Any grouping method
can be used; however, an LSH based approach is a good option. More grouping
techniques will be discussed in the Event detection lecture on 12th October
2018.

Locality sensitive hashing (LSH) can be used for grouping similar tweets. LSH
is an algorithm for grouping similar documents into a single bucket. LSH is a
data independent hash method. Finding the nearest neighbor of an item is
simple with traditional approaches like linear search, but if the database is
big and the items are complicated, this becomes costly in computation time.
LSH can be used for clustering, nearest neighbor search, and detecting near
or exact duplicates. A Python implementation of LSH is available as LSHash by
Kayzhu (https://github.com/kayzhu/LSHash); any grouping technique can be
used.
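To make the bucketing idea concrete, here is a minimal, library-free sketch of one LSH variant: MinHash signatures over word shingles, split into bands. It does not use the LSHash library mentioned above, and it is only one of many possible grouping methods. All function names and the parameter values (`k=3`, 32 hash functions, 8 bands) are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def shingles(text, k=3):
    """Set of word k-grams (falls back to single words for short texts)."""
    words = text.lower().split()
    if len(words) < k:
        return set(words)
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(seed, token):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(tokens, num_hashes=32):
    """One minimum per seeded hash; similar sets share many minima."""
    return [min(_hash(seed, tok) for tok in tokens) for seed in range(num_hashes)]


def lsh_buckets(texts, num_hashes=32, bands=8):
    """Map (band index, band values) -> list of indices into texts."""
    rows = num_hashes // bands
    buckets = defaultdict(list)
    for idx, text in enumerate(texts):
        tokens = shingles(text)
        if not tokens:
            continue  # empty text cannot be hashed
        sig = minhash_signature(tokens, num_hashes)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].append(idx)
    return buckets


def candidate_pairs(buckets):
    """All index pairs that share at least one bucket."""
    pairs = set()
    for members in buckets.values():
        for i in range(len(members)):
            for j in range(i + 1, len(members)):
                pairs.add((members[i], members[j]))
    return pairs
```

Tweets whose signatures agree on every row of at least one band land in the same bucket and become candidates for the same group. More bands with fewer rows each make the grouping more permissive; fewer bands with more rows make it stricter.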

Next, assign a geo-location to each tweet in each of the groups. We will
discuss a strategy on 1st October 2018.

   a. Grouping of tweets (10 marks). Provide statistics like how many groups
      and number of tweets per group. This can be given as a histogram or
      graph (for example, groups on the x-axis and size of the group on the
      y-axis). Enrich this information with another graph showing the number
      of geo-tagged tweets and also profile based geo-information.

   b. Geo-location assignment (10 marks). Explain your method and justify it.
      Enhance the above graph information by providing the number of
      additional tweets with geo-information. Also comment on the number of
      tweets with no geo-information (you won't be able to assign
      geo-information if none of the tweets in a given group (created above)
      have geo-information). You should say how many groups fall in this
      category.

   c. Conduct an evaluation of the method (10 marks). A method will be
      discussed on Monday 1st October 2018. One option would be to take 50%
      of the geo-tagged tweets and assume that they have no geo-information.
      Assign geo-information to each of these using your above method.
      Measure the differences between the assigned location and the actual
      location. Plot them!
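The hold-out evaluation described in 3c can be sketched as follows. This is one possible reading, not the method from the lecture. It assumes each group is a list of (latitude, longitude) tuples, with `None` for tweets without geo-information, and it assumes a group's location is simply the mean of its known coordinates. The names `haversine_km`, `group_location` and `evaluate` are hypothetical.

```python
import math
import random


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def group_location(locations):
    """Mean latitude/longitude of the known locations (assumed strategy)."""
    lats = [p[0] for p in locations]
    lons = [p[1] for p in locations]
    return sum(lats) / len(lats), sum(lons) / len(lons)


def evaluate(groups, holdout=0.5, seed=0):
    """Hide a share of known locations, re-assign them, return errors in km.

    groups: dict of group id -> list of (lat, lon) tuples or None.
    """
    rng = random.Random(seed)
    errors = []
    for members in groups.values():
        known = [p for p in members if p is not None]
        rng.shuffle(known)
        n_hidden = int(len(known) * holdout)
        hidden, visible = known[:n_hidden], known[n_hidden:]
        if not hidden or not visible:
            continue  # nothing to test, or nothing to assign from
        guess = group_location(visible)
        errors.extend(haversine_km(guess, actual) for actual in hidden)
    return errors
```

The returned list of distances can be plotted directly, for example as a histogram of error in km. Averaging raw latitude and longitude is a crude centroid that is reasonable at city scale but breaks down for groups spanning large distances or the 180° meridian.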

4. Open Ended Problem (Total 25 marks)
   a. Develop a crawler for any of the following social media sites
      (Facebook, Instagram, Google Plus, Tumblr, Flickr, any other);
   b. Conduct data analysis similar to Twitter
   c. We are not giving any more specific directions, as we want students to
      be creative.
   d. Analysis similar to Steps 1 and 2 expected.

Report structure & Mark Distribution

The report should be organised in the following way, with the mark
distribution as shown.

1. Section 1: Introduction
   a. Describe the software developed with appropriate details; if you have
      used code from elsewhere please specify it
   b. Specify the time and duration of data collected;

2. Section 2: Data crawl
   a. Use streaming API for collecting 1% data (5 marks)
      i. Specify the APIs used
         1. Please do not include the entire code here; just the main
            description of the function
         2. Along with a short description/justification
   b. Enhance the crawling using Streaming and REST API (10 marks)
      i. Specify the APIs used
         1. Please do not include the entire code here; just the main
            description of the function
         2. Along with a short description/justification
   c. Grab as much geo-tagged data for Glasgow/Singapore for the same period
      (5 marks)
      i. Specify the APIs used
         1. Please do not include the entire code here;
   d. Discuss your data access strategies and how you addressed Twitter data
      access restrictions (5 marks)
      i. Discuss how creative you are in collecting as much data as possible

3. Basic data analytics (Total 20 marks)
   a. Count the amount of data collected (5 marks)
      i. Show a histogram over 10-minute periods (x-axis: 10-minute
         intervals; y-axis: count)
   b. Count the amount of geo-tagged data from Glasgow / Singapore (5 marks)
      i. Show a histogram over 10-minute periods (x-axis: 10-minute
         intervals; y-axis: count)
   c. Count redundant data present in the collection (you may end up
      collecting the same tweets again through various APIs) (5 marks)
      i. Show a histogram over 10-minute periods (x-axis: 10-minute
         intervals; y-axis: count) for both collected data and redundant
         data for the same period. Redundancy can be measured based on the
         Tweet ID. (In cases where you use APIs sequentially, just treat all
         the data as one repository.)
   d. Count the re-tweets and quotes (5 marks)
      i. Show a histogram over 10-minute periods (x-axis: 10-minute
         intervals; y-axis: count) for collected data, re-tweets and quotes
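The 10-minute histograms above all rest on the same step: rounding each tweet's timestamp down to the start of its 10-minute window and counting per window. The sketch below is a minimal illustration, assuming tweets are dicts whose `created_at` field uses the Twitter v1.1 string format (for example "Mon Nov 12 10:07:31 +0000 2018"). The names `ten_minute_bucket` and `histogram` are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Format of the created_at field, e.g. "Mon Nov 12 10:07:31 +0000 2018".
TWITTER_TIME = "%a %b %d %H:%M:%S %z %Y"


def ten_minute_bucket(created_at):
    """Round a created_at string down to the start of its 10-minute window."""
    ts = datetime.strptime(created_at, TWITTER_TIME)
    return ts.replace(minute=ts.minute - ts.minute % 10, second=0, microsecond=0)


def histogram(tweets):
    """Return a sorted list of (window start, tweet count) pairs."""
    counts = Counter(ten_minute_bucket(t["created_at"]) for t in tweets)
    return sorted(counts.items())
```

Running `histogram` separately on all tweets, on the redundant copies, on re-tweets and on quotes gives the series needed for each plot. Windows with no tweets are simply absent from the result and need to be filled with zero before plotting.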

4. Enhance the geo-tagged data (Total marks 30)
   a. Grouping of tweets (10 marks). Provide statistics like how many groups
      and number of tweets per group. This can be given as a histogram or
      graph (for example, groups on the x-axis and size of the group on the
      y-axis). Enrich this information with another graph showing the number
      of geo-tagged tweets and also profile based geo-information.
   b. Geo-location assignment (10 marks). Explain your method and justify it.
      Enhance the above graph information by providing the number of
      additional tweets with geo-information. Also comment on the number of
      tweets with no geo-information (you won't be able to assign
      geo-information if none of the tweets in a group have
      geo-information).
   c. Conduct an evaluation of the method (10 marks). A method will be
      discussed on Monday 1st October 2018. One option would be to take 50%
      of the geo-tagged tweets and assume that they have no geo-information.
      Assign geo-information to each of these using your above method.
      Measure the differences between the assigned location and the actual
      location. Plot them!

5. Crawling data from another social media site
   a. Develop a crawler for any of the following social media sites
      (Facebook, Instagram, Google Plus, Tumblr, Flickr, any other);
      i. Describe the access restrictions and the approach developed
   b. Conduct data analysis similar to the Twitter data analysis
      i. What level of comparative analysis is given
   c. We are not giving any more specific directions, as we want students to
      be creative.

What to submit

Non-conformity to the submission instructions will lead to a reduction in
marks.

1) Report as a pdf file. (Please submit this through the link that is just
   for the report.)
2) A zip file (it should be less than 100 MB; please submit this through a
   separate link) containing:
   a. Software (runnable version, readme info, and also properly commented).
      It is important that the software is runnable with minimum effort for
      the markers.
   b. Data: provide a sample of data for about 5 minutes. You can decide the
      format (like JSON, or plain data file). Importantly, your software
      should be able to run on this sample data without much hassle.
   c. Make sure that a and b together are less than 100 MB, as Moodle
      doesn't allow uploading files over 100 MB.

Where to submit

1) Through Moodle link (given)