Towards Personal Twitter Ad Analysis

X or Twitter?
In this exercise, we will use Twitter for what is now known as X.
Imagine you are an employee of the "personal analytics" company, Scryer.
Scryer's mission statement is "to give people the tools to understand their data". As part of that mission, Scryer has developed various applications to help people understand their social media presences. Until now, Scryer has mostly focused on Facebook. But your team was formed to tackle Twitter!
The first goal of your team is to help people understand why Twitter shows them certain advertisements (ads). We are starting from a download of a user's personal Twitter data, as described here:
https://help.x.com/en/managing-your-account/how-to-download-your-x-archive
If you have a Twitter account, feel free to download yours and use that one, too. You have been tasked with exploring the relevant data (given in JSON format), migrating it to a SQL database, and developing key queries to drive a dashboard.
The Data
You have been given one obviously relevant file in a volunteered bundle from a user: ad-impressions.js.
From the associated readme.txt we have the following descriptions:
ad-impressions.js
- ad: Promoted Tweets the account has viewed and any associated metadata.
- deviceInfo: Information about the device where the impression was viewed such as its ID and operating system.
- displayLocation: Location where the ad was viewed on Twitter.
- promotedTweetInfo: Information about the associated tweet such as unique identifier, text, URLs and media when applicable.
- advertiserInfo: Advertiser name and screen name.
- matchedTargetingCriteria: Targeting criteria that were used to run the campaign.
- impressionTime: Date and time when the ad was viewed.
This file is technically a JavaScript file, but it is essentially a big blob of JSON data assigned to a variable.
Some preliminary analysis suggests that, while there is a pretty decent structure, there are also some irregularities, missing values, and general strangeness.
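To get a feel for the file, it helps to peek at a record or two. Here is a minimal sketch for recovering the JSON payload, assuming the file wraps a JSON array in a single JavaScript assignment; the variable name shown in the comment is an assumption based on how Twitter archives are commonly laid out, so check your own file:

```python
import json

# A minimal sketch. It assumes ad-impressions.js wraps a JSON array in a
# single JavaScript assignment, roughly:
#   window.YTD.ad_impressions.part0 = [ { "ad": { ... } }, ... ]
# (the exact variable name is an assumption and may differ in your file).
with open("ad-impressions.js", encoding="utf-8") as f:
    raw = f.read()

# Drop everything up to and including the first "=", leaving pure JSON.
impressions = json.loads(raw[raw.index("=") + 1:])

print(len(impressions), "impression records")
if impressions:
    # Pretty-print the first record to inspect its shape.
    print(json.dumps(impressions[0], indent=2))
```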
The Database
One of your colleagues has created a SQL database schema which captures what they understand of the structure of the file. They made a "mostly normalised" schema with some integrity constraints. You have a copy of the generated database in the SQLite database `twitterads.db` (the raw SQL is available in the file `twitterads.sql`).
Task 1: Draw a conceptual model of the impressions data
To familiarise yourself with the data, draw an ER diagram of the data in ad-impressions.js and double check your understanding of this data and its shape. Also, familiarise yourself with the shape of the given database.
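One way to familiarise yourself with the database's shape is to list its tables and columns programmatically. Here is a small sketch using only Python's standard sqlite3 module; no table names are assumed, since everything is read from the database itself:

```python
import sqlite3

# List every table and its columns in twitterads.db so you can compare the
# schema against the shape of the JSON data. Standard library only.
conn = sqlite3.connect("twitterads.db")
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
for table in tables:
    print(table)
    # PRAGMA table_info yields (cid, name, type, notnull, default, pk).
    for _, name, col_type, _, _, pk in conn.execute(
            f"PRAGMA table_info({table})"):
        print(f"    {name} {col_type}{' PRIMARY KEY' if pk else ''}")
conn.close()
```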
Task 2: Populate twitterads.db
Next, write a Python script that will do an Extract-Transform-Load (ETL) operation from the JavaScript/JSON file into twitterads.db. In principle, we should be able to round-trip data from JSON to JSON through twitterads.db: while we don't require you to write a script that transforms your twitterads.db back into a JSON file, this should be possible, and it should result in essentially the same file as you started with.
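The overall ETL flow might look something like the sketch below. All table and column names here (advertiser, impression, and their columns) are purely illustrative assumptions; the real names come from twitterads.sql, and your actual submission must live in the provided stub load_ad_json.py:

```python
import json
import sqlite3

# ETL sketch only: the JSON shape and the table/column names below are
# ASSUMPTIONS for illustration; check ad-impressions.js and twitterads.sql
# for the real ones, and do your work in the provided stub load_ad_json.py.
def load(json_path="ad-impressions.js", db_path="twitterads.db"):
    # Extract: strip the JavaScript assignment prefix and parse the JSON.
    with open(json_path, encoding="utf-8") as f:
        raw = f.read()
    records = json.loads(raw[raw.index("=") + 1:])

    conn = sqlite3.connect(db_path)
    with conn:  # one transaction: either every row loads, or none do
        for record in records:
            # Transform: pull out the fields we need (shape assumed).
            ad = record.get("ad", {})
            adv = ad.get("advertiserInfo", {})
            # Load: hypothetical tables "advertiser" and "impression".
            conn.execute(
                "INSERT OR IGNORE INTO advertiser (name, screen_name) "
                "VALUES (?, ?)",
                (adv.get("advertiserName"), adv.get("screenName")))
            conn.execute(
                "INSERT INTO impression (advertiser_name, impression_time) "
                "VALUES (?, ?)",
                (adv.get("advertiserName"), ad.get("impressionTime")))
    conn.close()

if __name__ == "__main__":
    load()
```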
Task 3: Craft some queries
Once we have data in our database, we need to see what analytical understanding we can give our users! You need to write SQLite queries for each of the following information requests: you want to know
1. how many ads have been sent to the user, i.e., the number of all ads from the current dataset
2. the number of different advertisers that have targeted the user's Twitter account
3. how the advertisers target the user; i.e., you want to see the top ten targeting types and the number of ads of that type
4. when the advertisers target the user; that is, for each hour of the day, the number of ads the user has been sent
5. how exactly the advertisers target the user; that is, you want to see the top ten advertisers and, for each of these ten advertisers, their top ten combinations of targeting type and targeting value. Warning: this one is really hard - and not worth very much. Only attempt it out of interest!
Consider the stubs for each of the queries carefully: they contain more relevant information, including information about the required output. All queries are read-only, i.e., they should not change the content of the database, and each of them should consist of a single SQL statement.
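For flavour, here is a hedged sketch in the spirit of information request 4, wrapped in a small Python test harness. The table name impression, the column impression_time, and the ISO-like timestamp format are all assumptions; adapt them to the real schema in twitterads.sql, and put only the single SQL statement into the query stub:

```python
import sqlite3

# Hour-of-day breakdown in the spirit of request 4. Table "impression",
# column "impression_time", and a textual "YYYY-MM-DD HH:MM:SS" timestamp
# are all assumptions; the real names come from twitterads.sql.
QUERY = """
SELECT strftime('%H', impression_time) AS hour,
       COUNT(*)                        AS num_ads
FROM impression
GROUP BY hour
ORDER BY hour;
"""

conn = sqlite3.connect("twitterads.db")
for hour, num_ads in conn.execute(QUERY):
    print(hour, num_ads)
conn.close()
```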
Important:
• You can draw the ER diagram by hand on a piece of paper. In fact, you should do that.
• Use the associated stub load_ad_json.py for Task 2 and Python 3.
• For Task 3, you can use DB Browser for SQLite (https://sqlitebrowser.org/) for testing your queries, and the associated stubs for each query (and of course do not change filenames or edit parts of the stub where it says "do not edit").
• Do not import any libraries other than those already imported.
• Do not modify the given database schema.
• Look at the data carefully while working on Task 1, and test your assumptions carefully. We have given you an already populated DB, test_handloaded_ads.db, in case you want to start with Task 3 or find Task 2 too daunting (though you will need to do more Python coding in other coursework tasks); note that this DB may not fully reflect the Twitter data.
• Ask in the lab or on the discussion board if you have any further questions.
Marking:
Task 2 is worth 9 marks and Task 3 is worth 6; each query is worth between 1 and 1.5 marks (queries 3 and 4 are worth 1.5 marks each; queries 1, 2, and 5 are worth 1 mark each). We autograde your answers: a Python script that does not run is worth 0 marks; the more tables it populates correctly, the more marks you get.
In the instructions below, we ask for a zipped directory: this means you should use zip for this, and not 7zip or rar or other compression mechanisms.
If you ignore these points, you risk scoring 0 marks.
Submissions
You will hand in up to six files in a zipped directory named following the pattern yourcentralusername_CW1.zip. Your directory should contain one file, load_ad_json.py, and up to five files qj.sql, where j is the number of the corresponding query, i.e., between 1 and 5. You should not include the database (neither empty nor, especially, populated), nor should you include the JSON file (we already have those).