COMP2420辅导、SQL编程设计辅导
- 首页 >> Web COMP2420/COMP6420 - Introduction to Data Management, Analysis and Security
Lab 08 - Introduction to SQL
Learning Outcomes
L01: Demonstrate a conceptual understanding of database systems and architecture, data models and declarative query languages L02: Define, query and manipulate a relational database
L06: Apply their knowledge to a given problem domain
Preparation
Before starting this lab, we suggest you complete the following:
Watch all lectures covering data types, database models, relational model (especially relational algebra) and DML commands
The following functions may be useful for this lab:
DML Commands and clause keywords(Query part)
SELECT DISTINCT GROUP BY
COUNT, AVG, MIN, MAX, SUM HAVING ORDER BY
(We will be using a lot in this lab)
Used to retrieve data from the database.
For removing duplicate tuples in the query result.
Groups tuples for each value combination.
Total number/average/smallest/biggest/sum of argument values. Add the condition on the groups.
Sort the tuples in a query result.
Aim
Our aim in this lab is to strengthen understanding of key concepts from lecture including data types, DB models, relational model and DBMS. A SQLite database will be the practical instance used to learn to create a database, query and display entires in relation in a meaningful and informative manner using filters and creating calculated fields. We will then practise drawing inferences from this data.
In [1]:
Question 1: Data types and database models
While our previous work in data analysis and machine learning in this course has provided us with a very loose interpretation of data structuring, we won't be able to apply the same rules for defining data going forward. As discussed in the lectures, data can be classified based on it's characteristics to aid in storage, retrevial and other operations. In the case of Databases, you must learn to first crawl before walk, and walk before run. Therefore, we will be starting with determining the classification of data, before moving onto the more code heavy questions.
Q1.1: Data types: warm-up quick classification
Like starting gym work with stretches, let's make our brain start spinning! Please review the lecture slides regarding the characteristics of data, then complete the exercise below as fast as possible. No cheating by looking back at the slides!
Please indicate which data type each example below belongs to:
Your options are:
# Code Imports
import numpy as np
import pandas as pd import sqlite3
from sqlite3 import Error
Option Identifier
Structured 1 Semi-Structured 2 Unstructured 3
# Enter Your Number in the `type` column
| Example |type
| --------------------- |------------- |
|
|
| |
| `Spreadsheets`
| `Server logs`
| `Tweets`
| `Postgre SQL`
| `titanic.csv dataset` | 2.
| `JSON documents` | 2. | `Mongodb`
| `Email`
| `Echo360 recordings` |
| `Chiefley catalogues` |
| `XML documents` | | `ANU homepage content`|
There will always be the existential question of "why?". Check with some of your fellow students and your tutor, and see if there are any inconsistencies. Make your case with your colleagues and see if you can come to a conclusion.
Q1.2: Data types: Existence is reason
Though structured data is organised and the easiest to query and analyse, the other two data types have their own use case. Come up with a scenario where a data analyst would prefer semi-structured data over structured data in his/her daily job. You are more than welcome to discuss this with your neighbours.
Q1.3: The Data Runway: Database models
Move over Gigi Hadid and Kendall Jenner, data is the new queen of the runway.
Modelling a database is one of the key ways of ensuring your DB is fit for purpose and meets your requirements. This can be performed at many stages of the development process for any software development and integration (this is discussed in COMP3120 (https://programsandcourses.anu.edu.au/course/comp3120)), and it is therefore vital to be able to visualise and model the various aspects of your data.
We are interested in 3 key points of data modelling, which we want you to discuss here. Your tasks are as follows:
Compare and Contrast the characteristics of a Conceptual , Logical and Physical data model.
Find an image or set of images online that highlight the differences between these items and discuss how these images show the unique characteristics of each model.
You are welcome to discuss this with your tutor.
By this point, you should have some basic idea of data management in the bag. Now we can move onto the underlying logic behind database systems such as SQL.
Question 2: Relational Algebra: why and how
Relational Algebra (https://www.tutorialspoint.com/dbms/relational_algebra.htm) (RA), first proposed by Edgar F. Codd (https://en.wikipedia.org/wiki/Edgar_F._Codd), is considered the basis for database query languages. In Question 2, we are going to dicuss why it is important, why we need it, and we'll practice writing relational algebra statements.
Q2.1: SQL not enough?
Before we get stuck into the depths of Relational Algebra (RA), lets discuss the high level ideas behind why we might consider relational algebra to be useful. Write down some notes for the following questions:
What is the difference between procedural and declarative languages? Which type does SQL belong to and why?
If we hand over the SQL statement directly to machine, what are some possible problems?
### Your answer here
### Your answer here
How can RA address these problem?
Discuss with your neighbours and put your ideas down below.
Q2.2: Pencil & eraser ready? Here comes RA exercise
After discussing Q2.1, you should be ready to feel the power of RA. The below exercise is aimed at increasing your confidence with writing RA statements. First consider a database with below schemas:
### Your ideas here
relation schema
Person ( name, age, gender ) Frequents ( name, pizzeria ) Eats ( name, pizza ) Serves ( pizzeria, pizza, price )
description
name is a key
(name, pizzeria) is a key (name, pizza) is a key (pizzeria, pizza) is a key
Secondly, recall the function items from the lecture slides. While we aren't going to fill in everything for you, the following symbols may be of use:
Π σ ρ ?
(You will find the meaning behind these symbols in the lecture slides)
Using these symbols and the database above, answer the following questions: Note: You may write these down instead of putting them in the notebook. Up to you!
1. Find all pizzerias frequented by at least one person under the age of 18.
2. Find the names of all females who eat either mushroom or pepperoni pizza (or both).
3. Find all pizzerias that serve at least one pizza that Amy eats for less than $10.00.
4. For each person, find all pizzas the person eats that are not served by any pizzeria the person frequents. List all such person (name) / pizza pairs.
Question 3: Welcome the SQLite
The lecture slides provide a good description of a DBMS
-Database Management System (DBMS) Consists of interrelated data and software for analysing the data
There's so much more to say about DBMS. It implements one of the database model, does the heavy job of storing 1s and 0s and provides user convinient interface to do creation, updating, quering etc. Let's put all these buzzes in a context - SQLite!
Detour: SQLite
You may be wondering how we are planning on teaching SQL concepts without actually using a "real" database. While many DBMS such as PostgreSQL (https://www.postgresql.org) exist, there is additional overhead of installing items such as this, getting you used to them and getting them to interact with a Jupyter Notebook (or Python script, etc). While other courses (such as COMP2400 (https://programsandcourses.anu.edu.au/course/comp2400)) will use PostgreSQL or other items, we are going to go with the simple option as this is an introductory course. Therefore, enter SQLite!
# Your answer here
# Your answer here
# Your answer here
# Your answer here
What is SQLite ?
SQLite is a C-language library that implements a small (https://www.sqlite.org/footprint.html), fast (https://www.sqlite.org/fasterthanfs.html), self- contained (https://www.sqlite.org/selfcontained.html), high-reliability (https://www.sqlite.org/hirely.html), full-featured (https://www.sqlite.org/fullsql.html), SQL database engine. SQLite is the most used (https://www.sqlite.org/mostdeployed.html) database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. (reference (https://www.sqlite.org/index.html))
Why SQLite?
The key difference from other DBMS: architecture
1. Normally, a DBMS such as MySQL, PostgreSQL, etc., requires a separate server process to operate. The applications that want to access the database server use TCP/IP protocol to send and receive requests. This is called client/server architecture.
image.png
2. SQLite does NOT work this way. SQLite does NOT require a server to run. SQLite database is integrated with the application that accesses the database.
The applications interact with the SQLite database read and write directly from the database files stored on disk. image.png
Other reasons we choose SQLite
1. It's so light-weight that we do NOT need to install or configure anything since we are using its python interface (if you want to access the DB through command line like other DBMS, you need to install the commandline tool)
2. PySQLite(sqlite3 below): It is a python standard lib! We do not need to install any additional libraries!
3. PySQLite(sqlite3 below): A standardized Python DBI API 2.0 compliant interface (i.e. same API for PSQL, MySQL etc.);
Okay, back to the actual work.
Creating and Playing with a Database
All the tables must reside in a database, even if it is a fake one in SQLite.
The first step is to connect to the database, below we have provided some boilerplate code (https://en.wikipedia.org/wiki/Boilerplate_code) to handle the
database connection. Do not modify it, otherwise things will break. (If the database does not exist, the connection operation will create it for us)
If you check the data directory after running the below box, you will find the database file sqliteQ3.db . Initially, we also provide you with toy tables with a few entries inside sqliteQ3.db to play with. For now, you do NOT need to know the details of defining schemas or inserting records. It is the content of the next lab.
In [2]:
# Our code here, do NOT modify
def create_connection(db_file):
""" Connect to the specified SQLite database, if not exist, create a new one; :db_file: location of db to connect to
:return: Connection object or None
"""
conn = None
try:
conn = sqlite3.connect(db_file)
print("Connection established!") except Error as e:
print(e) return conn
db_file_loc = "./data/sqliteQ3.db"
conn = create_connection(db_file_loc) # remember to close the connection when everything done
Connection established!
How do I extract information out of the database just created?
It's really simple! Since you already have a Connection, you can us pd.read_sql_query and execute the query you wish. You will have a pandas dataframe as a return. See the below example which query all records in table tasks.
In [3]:
# Our code here, do NOT modify
query = "SELECT * FROM projects;" df_projects = pd.read_sql_query(query, conn) df_projects.set_index('id', inplace=True) display(df_projects)
name begin_date end_date 1 First data analysis job in career 2020-01-01 2020-01-30
Now it is your turn.
Display all the entries in table tasks
Once you have been able to display them, try making them prettier to look at. This will involve looking at the documentation, so have fun!
In [4]:
Question 4: SQLite and ... the Premier League
Data scientists don't only play with example databases! They are good at extracting whatever they want from a real-life (or real-life-like) database. Since the aim of this course is to get you to the point of being a data scientist, let's see what you can get out of from the fragmentated Premier League match statistics.
Note: This Database is in the data directory in the premier.db file. For your convinence, the columns of relation are here:
relation schema
team name, city
player player_id, name, position, height, weight, team
game game_id, home_team, away_team, homescore, awayscore game_stats player_id, game_id, goals, assists, passes
You can use the following command to find the names of the tables in the database.
id
# Your code here
In [5]:
name 0 projects 1 tasks
sql
CREATE TABLE projects (\n ... CREATE TABLE tasks (\n ...
# do NOT modify
query = "SELECT name, sql FROM sqlite_master WHERE type='table';" df_result = pd.read_sql_query(query, conn)
display(df_result)
Q4.1: How's every table like?
By answering the below questions by quering the database, you can get a rough idea of what is in the Premier League database. In [6]:
Q4.1.1: What're the names of all the teams from London?
# Don't forget to create a connection to the database # Your Code here
In [7]:
# Your Code Here
Q4.1.2: For the tallest player in the database, what is their height?
In [8]:
Q4.1.3: What's the homescore and awayscore for the only recorded Liverpool (home) VS Chelsea (away) game? In [9]:
Q4.1.4: For all the players that their performance is recorded in game_stats, what is the average goal per game for them (list their player_id and the average)?
In [10]:
Q4.2: What if we want information from 2 (or more) tables?
As will be discussed in next week's lecture, it is a bad design to include too many fields within one table. The following question requires to connect 2 tables in the database. (That's not that hard, we have the join operation, don't we?)
Question: Who is our least strong (assuming only weight matters) player from London? List his weight only.
If you cannot complete this task by single SQL, try multiple first.
In [11]:
That is it from us! Nice work on making it this far, and have a go at the extensions if you have time.
Lab 08 - Introduction to SQL
Learning Outcomes
L01: Demonstrate a conceptual understanding of database systems and architecture, data models and declarative query languages L02: Define, query and manipulate a relational database
L06: Apply their knowledge to a given problem domain
Preparation
Before starting this lab, we suggest you complete the following:
Watch all lectures covering data types, database models, relational model (especially relational algebra) and DML commands
The following functions may be useful for this lab:
DML Commands and clause keywords(Query part)
SELECT DISTINCT GROUP BY
COUNT, AVG, MIN, MAX, SUM HAVING ORDER BY
(We will be using a lot in this lab)
Used to retrieve data from the database.
For removing duplicate tuples in the query result.
Groups tuples for each value combination.
Total number/average/smallest/biggest/sum of argument values. Add the condition on the groups.
Sort the tuples in a query result.
Aim
Our aim in this lab is to strengthen understanding of key concepts from lecture including data types, DB models, relational model and DBMS. A SQLite database will be the practical instance used to learn to create a database, query and display entires in relation in a meaningful and informative manner using filters and creating calculated fields. We will then practise drawing inferences from this data.
In [1]:
Question 1: Data types and database models
While our previous work in data analysis and machine learning in this course has provided us with a very loose interpretation of data structuring, we won't be able to apply the same rules for defining data going forward. As discussed in the lectures, data can be classified based on it's characteristics to aid in storage, retrevial and other operations. In the case of Databases, you must learn to first crawl before walk, and walk before run. Therefore, we will be starting with determining the classification of data, before moving onto the more code heavy questions.
Q1.1: Data types: warm-up quick classification
Like starting gym work with stretches, let's make our brain start spinning! Please review the lecture slides regarding the characteristics of data, then complete the exercise below as fast as possible. No cheating by looking back at the slides!
Please indicate which data type each example below belongs to:
Your options are:
# Code Imports
import numpy as np
import pandas as pd import sqlite3
from sqlite3 import Error
Option Identifier
Structured 1 Semi-Structured 2 Unstructured 3
# Enter Your Number in the `type` column
| Example |type
| --------------------- |------------- |
|
|
| |
| `Spreadsheets`
| `Server logs`
| `Tweets`
| `Postgre SQL`
| `titanic.csv dataset` | 2.
| `JSON documents` | 2. | `Mongodb`
| `Email`
| `Echo360 recordings` |
| `Chiefley catalogues` |
| `XML documents` | | `ANU homepage content`|
There will always be the existential question of "why?". Check with some of your fellow students and your tutor, and see if there are any inconsistencies. Make your case with your colleagues and see if you can come to a conclusion.
Q1.2: Data types: Existence is reason
Though structured data is organised and the easiest to query and analyse, the other two data types have their own use case. Come up with a scenario where a data analyst would prefer semi-structured data over structured data in his/her daily job. You are more than welcome to discuss this with your neighbours.
Q1.3: The Data Runway: Database models
Move over Gigi Hadid and Kendall Jenner, data is the new queen of the runway.
Modelling a database is one of the key ways of ensuring your DB is fit for purpose and meets your requirements. This can be performed at many stages of the development process for any software development and integration (this is discussed in COMP3120 (https://programsandcourses.anu.edu.au/course/comp3120)), and it is therefore vital to be able to visualise and model the various aspects of your data.
We are interested in 3 key points of data modelling, which we want you to discuss here. Your tasks are as follows:
Compare and Contrast the characteristics of a Conceptual , Logical and Physical data model.
Find an image or set of images online that highlight the differences between these items and discuss how these images show the unique characteristics of each model.
You are welcome to discuss this with your tutor.
By this point, you should have some basic idea of data management in the bag. Now we can move onto the underlying logic behind database systems such as SQL.
Question 2: Relational Algebra: why and how
Relational Algebra (https://www.tutorialspoint.com/dbms/relational_algebra.htm) (RA), first proposed by Edgar F. Codd (https://en.wikipedia.org/wiki/Edgar_F._Codd), is considered the basis for database query languages. In Question 2, we are going to dicuss why it is important, why we need it, and we'll practice writing relational algebra statements.
Q2.1: SQL not enough?
Before we get stuck into the depths of Relational Algebra (RA), lets discuss the high level ideas behind why we might consider relational algebra to be useful. Write down some notes for the following questions:
What is the difference between procedural and declarative languages? Which type does SQL belong to and why?
If we hand over the SQL statement directly to machine, what are some possible problems?
### Your answer here
### Your answer here
How can RA address these problem?
Discuss with your neighbours and put your ideas down below.
Q2.2: Pencil & eraser ready? Here comes RA exercise
After discussing Q2.1, you should be ready to feel the power of RA. The below exercise is aimed at increasing your confidence with writing RA statements. First consider a database with below schemas:
### Your ideas here
relation schema
Person ( name, age, gender ) Frequents ( name, pizzeria ) Eats ( name, pizza ) Serves ( pizzeria, pizza, price )
description
name is a key
(name, pizzeria) is a key (name, pizza) is a key (pizzeria, pizza) is a key
Secondly, recall the function items from the lecture slides. While we aren't going to fill in everything for you, the following symbols may be of use:
Π σ ρ ?
(You will find the meaning behind these symbols in the lecture slides)
Using these symbols and the database above, answer the following questions: Note: You may write these down instead of putting them in the notebook. Up to you!
1. Find all pizzerias frequented by at least one person under the age of 18.
2. Find the names of all females who eat either mushroom or pepperoni pizza (or both).
3. Find all pizzerias that serve at least one pizza that Amy eats for less than $10.00.
4. For each person, find all pizzas the person eats that are not served by any pizzeria the person frequents. List all such person (name) / pizza pairs.
Question 3: Welcome the SQLite
The lecture slides provide a good description of a DBMS
-Database Management System (DBMS) Consists of interrelated data and software for analysing the data
There's so much more to say about DBMS. It implements one of the database model, does the heavy job of storing 1s and 0s and provides user convinient interface to do creation, updating, quering etc. Let's put all these buzzes in a context - SQLite!
Detour: SQLite
You may be wondering how we are planning on teaching SQL concepts without actually using a "real" database. While many DBMS such as PostgreSQL (https://www.postgresql.org) exist, there is additional overhead of installing items such as this, getting you used to them and getting them to interact with a Jupyter Notebook (or Python script, etc). While other courses (such as COMP2400 (https://programsandcourses.anu.edu.au/course/comp2400)) will use PostgreSQL or other items, we are going to go with the simple option as this is an introductory course. Therefore, enter SQLite!
# Your answer here
# Your answer here
# Your answer here
# Your answer here
What is SQLite ?
SQLite is a C-language library that implements a small (https://www.sqlite.org/footprint.html), fast (https://www.sqlite.org/fasterthanfs.html), self- contained (https://www.sqlite.org/selfcontained.html), high-reliability (https://www.sqlite.org/hirely.html), full-featured (https://www.sqlite.org/fullsql.html), SQL database engine. SQLite is the most used (https://www.sqlite.org/mostdeployed.html) database engine in the world. SQLite is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. (reference (https://www.sqlite.org/index.html))
Why SQLite?
The key difference from other DBMS: architecture
1. Normally, a DBMS such as MySQL, PostgreSQL, etc., requires a separate server process to operate. The applications that want to access the database server use TCP/IP protocol to send and receive requests. This is called client/server architecture.
image.png
2. SQLite does NOT work this way. SQLite does NOT require a server to run. SQLite database is integrated with the application that accesses the database.
The applications interact with the SQLite database read and write directly from the database files stored on disk. image.png
Other reasons we choose SQLite
1. It's so light-weight that we do NOT need to install or configure anything since we are using its python interface (if you want to access the DB through command line like other DBMS, you need to install the commandline tool)
2. PySQLite(sqlite3 below): It is a python standard lib! We do not need to install any additional libraries!
3. PySQLite(sqlite3 below): A standardized Python DBI API 2.0 compliant interface (i.e. same API for PSQL, MySQL etc.);
Okay, back to the actual work.
Creating and Playing with a Database
All the tables must reside in a database, even if it is a fake one in SQLite.
The first step is to connect to the database, below we have provided some boilerplate code (https://en.wikipedia.org/wiki/Boilerplate_code) to handle the
database connection. Do not modify it, otherwise things will break. (If the database does not exist, the connection operation will create it for us)
If you check the data directory after running the below box, you will find the database file sqliteQ3.db . Initially, we also provide you with toy tables with a few entries inside sqliteQ3.db to play with. For now, you do NOT need to know the details of defining schemas or inserting records. It is the content of the next lab.
In [2]:
# Our code here, do NOT modify
def create_connection(db_file):
""" Connect to the specified SQLite database, if not exist, create a new one; :db_file: location of db to connect to
:return: Connection object or None
"""
conn = None
try:
conn = sqlite3.connect(db_file)
print("Connection established!") except Error as e:
print(e) return conn
db_file_loc = "./data/sqliteQ3.db"
conn = create_connection(db_file_loc) # remember to close the connection when everything done
Connection established!
How do I extract information out of the database just created?
It's really simple! Since you already have a Connection, you can us pd.read_sql_query and execute the query you wish. You will have a pandas dataframe as a return. See the below example which query all records in table tasks.
In [3]:
# Our code here, do NOT modify
query = "SELECT * FROM projects;" df_projects = pd.read_sql_query(query, conn) df_projects.set_index('id', inplace=True) display(df_projects)
name begin_date end_date 1 First data analysis job in career 2020-01-01 2020-01-30
Now it is your turn.
Display all the entries in table tasks
Once you have been able to display them, try making them prettier to look at. This will involve looking at the documentation, so have fun!
In [4]:
Question 4: SQLite and ... the Premier League
Data scientists don't only play with example databases! They are good at extracting whatever they want from a real-life (or real-life-like) database. Since the aim of this course is to get you to the point of being a data scientist, let's see what you can get out of from the fragmentated Premier League match statistics.
Note: This Database is in the data directory in the premier.db file. For your convinence, the columns of relation are here:
relation schema
team name, city
player player_id, name, position, height, weight, team
game game_id, home_team, away_team, homescore, awayscore game_stats player_id, game_id, goals, assists, passes
You can use the following command to find the names of the tables in the database.
id
# Your code here
In [5]:
name 0 projects 1 tasks
sql
CREATE TABLE projects (\n ... CREATE TABLE tasks (\n ...
# do NOT modify
query = "SELECT name, sql FROM sqlite_master WHERE type='table';" df_result = pd.read_sql_query(query, conn)
display(df_result)
Q4.1: How's every table like?
By answering the below questions by quering the database, you can get a rough idea of what is in the Premier League database. In [6]:
Q4.1.1: What're the names of all the teams from London?
# Don't forget to create a connection to the database # Your Code here
In [7]:
# Your Code Here
Q4.1.2: For the tallest player in the database, what is their height?
In [8]:
Q4.1.3: What's the homescore and awayscore for the only recorded Liverpool (home) VS Chelsea (away) game? In [9]:
Q4.1.4: For all the players that their performance is recorded in game_stats, what is the average goal per game for them (list their player_id and the average)?
In [10]:
Q4.2: What if we want information from 2 (or more) tables?
As will be discussed in next week's lecture, it is a bad design to include too many fields within one table. The following question requires to connect 2 tables in the database. (That's not that hard, we have the join operation, don't we?)
Question: Who is our least strong (assuming only weight matters) player from London? List his weight only.
If you cannot complete this task by single SQL, try multiple first.
In [11]:
That is it from us! Nice work on making it this far, and have a go at the extensions if you have time.