代做INFS3200 Advanced Database Systems
- 首页 >> DatabaseINFS3200 Advanced Database Systems
Prac 2: Data Warehousing (5%)
Semester 1, 2023
Due time: 4:00pm, Friday, 21 April 2023 (Week 8)
Submission: Submit your work online (INFS3200 Course Website)
1.Introduction
1.1 Learning objectives
Learn how to create a cube and its dependent components in Oracle OLAP. The tasks
include:
To identify and build the dimensions,
To define measures (both stored and calculated facts).
Learn how to map an OLAP model to source data and load data into the dimensions.
Learn how to view an OLAP dataset. The tasks include:
To query through hierarchies of dimensions,
To perform DW operations such as roll-up, drill-down, and pivot.
Understand how data is organized and stored in dimensions and data cubes,
Learn how to query a data cube using SQL queries.
Understand how materialized views are used in cube query optimization.
1.2 Assessment
This Prac carries 5 marks for 5% assessment of four (4) tasks.
Marking Scheme:
1 mark: Complete the cube creation in Task 1.
1 mark: Maintain the cube successfully in Task 2.
2 marks: Complete Task 3, each bullet point is worth 1 mark.
1 mark: Complete Task 4.
You need to provide the screenshots of the data cubes you have created as well as the results
of your OLAP operations. So the submission must include your screenshots, SQL queries and
necessary descriptions in your submitted document. Please make sure your screenshots contain
your student ID as the proof of originality.
Put all your content in a word/pdf document or leave scripts in separate files and pack all files
into a zip/rar package. The file name should contain your name and student ID. Please format
your document nicely to help tutor’s marking process. A poorly formatted document may
receive a reduced mark. Submit your work to the Blackboard Course Website site by 4:00pm,
Friday, 21 April 2023.
Late Penalties (from ECP):
“Where an assessment item is submitted after the original deadline, without an approved extension, a
late penalty will apply. The late penalty shall be 10% of the maximum possible mark for the assessment
item will be deducted per calendar day (or part thereof) late, up to a maximum of seven (7) days.
After seven days, no marks will be awarded for the item. A day is a 24-hour block from the assessment
item due time. Negative marks will not be awarded.”
Page 2 of 24
2. Preparation: Data Warehouse Setup
2.1 System overview
Oracle OLAP: Oracle OLAP is a multidimensional analytic engine embedded in Oracle
database system. It supports online analytics based on data warehousing techniques. In Oracle
OLAP, a cube provides a convenient way of collecting stored and calculated measures with
similar characteristics, including dimensionality, aggregation rules and so on. A particular
analytic workspace (AW) may contain more than one cube, and each cube may describe a
different dimensional shape. Multiple cubes in the same AW may share one or more
dimensions (Fact constellation, See Week 04 Lecture Notes).
Analytic Workspace Manager (AWM): In Oracle OLAP, AWM is an easy-to-use GUI tool
for creating, developing, and managing multidimensional data in an Oracle data warehouse.
Please refer to more online materials at: https://www.oracle.com/database/technologies/olap12101b-readme.html .
2.2 Description of the Data Warehouse
You are given a Data Warehouse, namely OLAPTRAIN which has a star schema.
OLAPTRAIN was sourced from a transactional database system, which contains data for a
fictional electronics store. The followings are the description of the tables (including four
dimension tables and one fact table), intended for analysis that have gone through the ETL
(Extraction, Transformation, and Loading of heterogeneous data) process:
Table Description
CHANNELS This table contains distribution channels for customers’ purchases.
CUSTOMERS This shows who purchased products, and where products were sold for
Geographical dimensions of locations.
PRODUCTS This table contains product categories sold by the company.
TIMES This table contains time periods when products were sold.
SALES_FACT This table stores purchases in dollars, quantity and unit price, by the channel of
distribution, product item, day, and customer.
2.3 Data import
Download the “P2.rar” package from Assessment on Blackboard course website and extract
it to a local directory, for example “c:\app\P2\”. The given package contains the AWM software,
the OLAPTRAIN schema, cube templates and query scripts used in the following tasks.
Before analyzing the data, we need firstly import the OLAPTRAIN schema into the Oracle
database. Open a Command Prompt by searching ‘command prompt’ in the search window
(Please run it as administrator to avoid privilege issues), shown as below, and complete the
following steps:
Page 3 of 24
/*Enter the directory that has the installation files, in
my case it is “cd c:\app\P2\olaptrain_install”*/
> cd YOUR_P2_FOLDER\olaptrain_install
/*Login to SQL*Plus*/
> sqlplus sys as sysdba
Enter password: Password1!
/*Set a system parameter to avoid a future error
when creating users*/
SQL> alter session set "_ORACLE_SCRIPT"=true;
/*Run the OLAPTRAIN installation script*/
SQL> @install_olaptrain_student
/*Enter the install directory and password*/
Directory: YOUR_P2_FOLDER\olaptrain_install
Password: w
StudentID: S1234567
Note that the directory should be the same as above and you can choose your own password
for user “OLAPTRAIN_S1234567” (Please DO NOT include ‘-‘ or other special characters
in your password), which will be the main user throughout this practical. Please change
“S1234567” to your student ID and make sure “S” is upper case, this is crucial. The processes
are shown below:
Page 4 of 24
2.4 Troubleshooting:
A successful installation should end up with 6 errors, shown as follows:
However, if the installation ends quickly with the following messages, it is usually caused by
specifying an incorrect directory.
2.5 Connect to data warehouse using AWM
Click the “awm.bat” in “awm122010_Standalone” folder to open the AWM (if there is no
response after a click, please check if you have Java JDK installed in your system, Java 8
recommended). Create a new connection to the Oracle database and set the Description as
your student ID and the Connection Information as ‘localhost:1521:orcl’. Click the
connection and log in using the username “OLAPTRAIN_S1234567” and the password you
set.
Page 5 of 24
Part 1: Create Logical Data Model
1. Understand the data
Before designing the data model, it is highly suggested to first understand the OLAPTRAIN
schema. Use SQL developer to connect to the database and check the following tables:
channels, customers, products, times and sales_fact. The connection name should include your
student ID, like “DW_S1234567”.
After examining the OLAPTRAIN schema, we need to identify the dimensions, hierarchies
and attributes of our data model. In addition, we define various measures based on business
interests.
Page 6 of 24
(1) Identifying Dimensions
Using the source data tables as the primary input, the following dimensions are identified as
requirements for the OLAP data model:
Channel
Geography
Product
Time
Besides, each of the dimensions contains hierarchical structure, for example, in CHANNELS
table, shown as follows, we can identify four hierarchies: the actual channels
(channel_name/channel_key) -> channel type (channel_type) -> channel class
(class_name/class_key) -> ∅ (not chosen in group-by query, refer to Tutorial5-Q2, here we
name it as all_channel).
(2) Identifying Measures
The measures are defined based on common business interests, each of which is equivalent to
an SQL aggregation query. The measures include both stored and calculated measures. Stored
measures are facts acquired from the fact table directly, while the calculated measures require
complicated calculations over one or multiple facts. In this dataset, we focus on the following
measures:
Stored Measures
Sales
Quantity
Calculated Measures
Sales Year-to-Date
Sales Year-to-Date Prior Year
The measures will be defined during the creation of cubes, which will be introduced later.
Page 7 of 24
2. Create analytic workspace
Right-click Analytic Workspaces and select Create Analytic Workspace to create a new
analytic workspace under the name of your student ID.
3. Create dimensions using templates
We provide templates for all four dimensions so that you do not need to define them manually.
The template feature in Analytic Workspace Manager saves the definition of the OLAP data
objects as an XML file. Using a saved template, you can create a new analytic workspace,
dimension, cube, and measure exactly like an existing object, with or without mappings.
Templates do not include the data, only the definition of the object.
In order to import a template, right-click the Dimensions folder, then select Create Dimension
from Template. The templates are stored in the templates folder in the extracted folder.
Import Channel, Geography, Product and Time dimensions in the same way. Check the
settings of these dimensions and make sure all members are mapped to the data source (no
modification needed). For example, in Channel dimension, the template contains the following
settings:
Click the Dimensions folder and select Channel. In the General tab, we can see the dimension
name (Channel) and the dimension type (User Dimension).
Page 8 of 24
In the Levels tab, we can see three levels are defined, which correspond to three of the four
possible dimensions mentioned above (channel type not included):
ALL_CHANNELS = ∅
CLASS = channel class
CHANNEL = the actual channels
Under the Hierarchies in Channel dimension, we can see the SALES_CHANNEL hierarchy.
These three levels are ordered in the following way, which is consistent with our understanding
of the data:
Under the Attribute dialogue, we can see that channel type has been defined as an attribute
attached to CHANNEL level, instead of being another hierarchy level, which means it is out
of the business interest and cannot be queried on.
Page 9 of 24
After defining a dimension, those defined components should also be mapped to the existing
tables and views in Oracle Database. Specifically, the Member attributes in the OLAP
dimension should be mapped to the key columns in the dimension tables, while the attribute
columns should also be mapped to the appropriate OLAP dimension attributes.
Click the Mappings in Channel dimension and make sure the source columns (left) appears in
the mapping pane (right) correctly. Note that there is no such source data column for
ALL_CHANNELS level, therefore, for "All/Total" hierarchy levels, the descriptions are typed
manually:
Perform the same inspection to the other imported dimensions and understand how they are
defined.
Page 10 of 24
4. Create Cube
After all dimensions are defined, we are ready to create our cube. Right-click the Cubes folder,
then click Create Cube. In the General tab of the Create Cube window, specify the following:
Name: SALES_CUBE
Use the Add button (>) to select dimensions in the following order:
o CHANNEL
o TIME
o GEOGRAPHY
o PRODUCT
Notes: The order in which the dimensions are listed in a cube may affect performance because
it determines the way the data is stored on disk.
Next, select the Storage tab. The Storage tab helps you manage the data compression strategy.
By default, we choose Use compression, and then enable the Sparse option for all dimensions,
as shown below:
Finally, click Create to finish the dialogue.
Page 11 of 24
5. Create measures
You can create two types of measures in a cube: Stored (or Base) measures, and Calculated
measures. Every measure that belongs to a particular cube shares the characteristics that were
defined for the cube.
(1) Stored Measures
Stored measures store the facts collected about your business. When you create stored measures
in your OLAP data model, you will map them to the source data just like what you have done
with dimensions.
(2) Calculated Measures
One of the powerful features of the Oracle OLAP technology is the ability to efficiently and
easily generate business calculations of data held in the database. In any OLAP implementation,
the number of calculated measures greatly exceeds the number of stored measures.
OLAP calculated measures are derived from stored measures or other calculated measures.
These calculations are computed dynamically when users query the data. Calculations are
automatically exposed as columns in a cube view, just like the facts.
According to our design, we define two measures, i.e., Sales and Quantity, and two calculated
measures. Two measures are created as follows:
And we create the first calculated measures by entering or selecting the following:
Name = SALES_YTD (the Label and Description fields are auto-filled)
Calculation Type = Period to Date
In the Calculation inputs section, select the following:
First hyperlink = Ancestor At Level
Second hyperlink = TIME.CALENDAR_YEAR
Finally, the window should look like this:
Page 12 of 24
Similarly, we create a YTD calculation for the previous year. This measure facilitates year-toyear comparisons. The settings are shown below:
Name = SALES_YTD_PY
Calculation Type = Parallel Period
In the Calculation inputs section, select the following:
Second hyperlink = SALES_YTD
Fifth hyperlink = TIME.CALENDAR_YEAR
6. Map the cube
Same as the dimension mapping, we need to map our cube to the existing data source. In a data
cube, we need to map the following fields:
The stored measures that are defined within the cube.
The lowest level of detail for each dimension hierarchy.
The Join Condition field. This field associates the foreign key column from the fact
table to the primary key column from the dimension table.
Note that the mapping is done by dragging the corresponding column from the source columns
(left) to the correct spot in the mapping pane (right). A Join condition can be achieved by
dragging both joining columns to that slot and the “=” will be added automatically. Please DO
NOT type those values manually as it will cause unexpected problems in the future. Eventually,
the mapping result should look like the follows. Click Apply to complete the mapping.
Page 13 of 24
Task 1: Complete all steps introduced above until you successfully create and map the cube.
Take three screenshots of the cube you have created, including the mapping, the definition of
two calculated measures. Include those screenshots in your document submission.
NOTE: Please make sure your student ID appears in every screenshot you take to show the
originality of your work. This rule applies to ALL screenshots taken throughout this prac
unless specified. Screenshots without student ID may be regarded as invalid and receive
reduced marks. An example screenshot of the mapping should look like the follows:
Page 14 of 24
Part 2: Load and View Cube Data
The Maintenance Wizard loads and aggregates the data in a single step. We can load all mapped
objects in the analytic workspace, or individual dimensions and measures. We can also choose
to run the job immediately, enter it in the Oracle Job Queue, or save it as a SQL script. The
materialized views, calculated measures are preprocessed during this maintenance phase. After
the maintenance, we can use SQL queries to retrieve the result. Also, we can view the cube
data using operators like Roll-up, Drill-down and Pivot.
1. Maintain the cube
The cube maintenance will load data to the cube. By default, the dimensions of that cube are
also processed before the cube. If you have already loaded dimension data, you can specify
only to load measure data.
Task 2: We start to maintain the cube. It will take quite a while to finish the maintenance.
Please take a screenshot of your maintenance result, same as below, and include it in your
document.
2. View the cube data
Page 15 of 24
Click the View Data SALES_CUBE to start the Data Viewer.
Task 3: Now you can play with this tool to explore the data. Meanwhile, you are required to
achieve a few goals:
Perform roll-up, drill-down and pivot operations in Data Viewer, respectively
(Lecture5-P36, Tutorial4-Q2). Each operation includes three parts: (1) the screenshot
of the view before the operation, (2) a sentence describing which operation you are
performing on which columns/levels; (3) the screenshot of the view after the operation.
Therefore, there are 6 screenshots and 3 lines of description in total. Include the
screenshots and descriptions in your document.
Adjust your Data Viewer window until it is identical (the value can be different) to the
following two views (Hint: your viewer windows must be identical to get full mark
and you may need help from the Query Builder, shown below). Take a screenshot for
each view you made and put them in your document.
Page 16 of 24
Part 3: Understand Data Warehouse Design Mechanism
We have learnt how to create dimensions and cubes using Oracle OLAP and AWM. Now, we
will try to understand the mechanism behind. Specifically, there is NO mark in this part, but
it will help you understand the whole process of how data cubes are built, queried and managed
by Oracle OLAP, which will make the subsequent tasks easier to follow. Therefore, this part
is optional but recommended. Please follow the instructions below.
1. View the dimension tables and fact table
Connect to the olaptrain schema via SQL Developer using your “OLAPTRAIN_S1234567”
(Once again, S1234567 refers to your student number) user, and then check the tables
“CHANNELS” and “SALES_FACT”.
Page 17 of 24
Just like what we have learnt from the lecture and tutorial, the dimension table “CHANNELS”
contains dimension key (Channel_Key), hierarchy (Channel –> Class -> All Channels) and
attributes (Channel_Name, Channel_Type), while the fact table consists of the dimension key
for each dimension (Channel, Product, Customer, etc.) and the facts (Quantity, Price and Sales).
The dimension tables and fact table constitute a star schema as follows:
2. Compare the views with dimension tables and fact table
Now, let us view the data in the “CHANNEL” dimension in our analytic workspace. Connect
to our analytic workspace through AWM and open the views in the “CHANNEL” dimension.
Here we have two views: the dimension view “CHANNEL_VIEW” and the hierarchy view
“CHANNEL_SALES_CHANNEL_VIEW”.
Page 18 of 24
In addition to the raw dimension table “CHANNELS”, the “CHANNEL_VIEW” stores the
hierarchy information, and the “CHANNEL_SALES_CHANNEL_VIEW” contains additional
columns indicating the hierarchy level of each record.
Page 19 of 24
By storing the hierarchy information as separate records in dimension tables, we can easily
store the summarization of each hierarchy level when maintaining a cube. Check the
materialized view “SALES_CUBE_VIEW” in the cube “SALES_CUBE” and see such results:
Therefore, given above materialized views, we can answer the aggregation queries quickly.
Assuming the user is viewing the total sales on “All_Channels”, “All_Years”, “All_Regions”
and “All_Products”, we can easily perform a drill-down operation from “All_Channels” to
“Class” by SELECT the Sales from “SALES_CUBE_VIEW” where:
Time = “All_Years”
Geography = “All_Regions”
Product = “All_Products”
Channel in {-3,-4}
as we can identify that -3 and -4 in “CHANNEL_VIEW” refer to the two possible classes. With
the help of the materialized view “SALES_CUBE_VIEW” and the dimension tables, the
complex aggregation queries are converted to simple selections.
3. Cube optimization
With the help of the “SALES_CUBE_VIEW”, we can easily get the results of the aggregation
queries. However, maintaining such a cube view is not easy. If you open an SQL Plus terminal
and type in the following two queries, you will find out that the number of records in the fact
table and the cube are 2,811,097 and 24,440,506 respectively (since the second query may take
more than one hour, it is not mandatory to run these queries).
/* Check the count of records in the fact table. */
SELECT count(*) FROM SALES_FACT;
/* Check the count of records in the cube view*/
SELECT count(*) FROM SALES _CUBE_VIEW;
Therefore, a cube view contains much more records than the original table since the cube size
is determined by the size of dimensions. In our case, although each dimension only contains
tens to thousands of rows, the cube size already exceeds the million level. Hence, it is
impractical to pre-compute all 24 million of aggregation results in the maintenance step.
Page 20 of 24
In order to achieve a balance between maintenance cost and query performance, the analytic
workspace provides a bunch of configurations to control the cost of maintenance and query:
(1) Data compression
When we built our cube in Part 1, we leveraged the data compression feature and specified the
sparse dimensions as shown below:
The data sparsity is very common in dimensional data models. When there are a large number
of empty cells in a cube, the cube is said to be “sparse”. In our data cube, if we define multiple
dimensions as “sparse” dimensions, the Oracle OLAP will create a special index for the cube
so as to automatically manage sparsity. Compression on sparse dimensions can also
significantly reduce the size of cubes and improve the performance of both data loads and
queries. Do not try to uncheck the “Use compression” and maintain the cube, it will take for
days.
(2) Cost-based aggregation
Cost-based aggregation enables you to balance the maintenance cost and query performance
directly. Specify a percentage value and the database will pre-compute and store the most costly
aggregate values based on your input. That is to say, if we set the value to 30%, then 30% of
the aggregate values will be calculated and stored during data maintenance, and 70% will be
calculated in response to a query. Try to change the value into something between 1% to 100%,
and you will see a time difference in maintaining the cube. Please maintain the cube again to
apply the changes you made each time. However, the query performance varies slightly as the
cube is relatively small. (Note: “Partition Cube” should be unchecked to enable cost-based
aggregation)
Page 21 of 24
(3) Dimension partitioning and ordering
Partitioning is a method of physically storing the measures in a cube. It improves the
performance of large measures in the following ways:
Improves scalability by keeping data structures small. Each partition function is like a
smaller measure.
Keeps the working set of data smaller both for queries and maintenance, since the
relevant data are stored together.
Enables parallel aggregation during data maintenance. Each partition can be aggregated
by a separate process.
Simplifies removal of old data from storage. Old partitions can be dropped, and new
partitions can be added.
You can activate the partitioning as below, and choose a proper partitioning strategy to
accelerate your maintenance and queries.
Also, the order of the dimensions in a cube may affect performance. In general, when you
dimension a cube, the first dimension in a cube should have the smallest cardinality and the
last dimension has the largest.
Part 4: Query Data Cube via SQL
As mentioned above, the views provided by Oracle data cube is similar to traditional tablebased star models. However, there are two key differences:
Fact tables in a star schema store detailed data, while the cube views reveal many
summary levels.
In addition to the facts in fact tables, the cube provides additional measures and
calculations, which are calculated and materialized as columns in the cube view.
These differences impact the way you query data. With star queries, you aggregate the data by
combining aggregation functions (such as sum) and the GROUP BY clause. With cube queries,
if the cube has been fully calculated (cube view is fully materialized), you simply select the
data you want (either stored or calculated measures) as a column. Typically, no aggregation
function is necessary since the data has already been summarized by the cube.
Since the cube data is made directly accessible to SQL by a set of relational views, in this
part, we will write some SQL queries to query the cube views.
Page 22 of 24
1. Simple Cube Queries
Get back to the SQL Developer and connect to the olaptrain schema. Check the views that
were created in AWM.
Select File > Open and open the “cube_queries.sql” file in the queries folder from the extracted
files. Check the following query:
Query Notes:
“Sales” is simply selected as a column. There is no SQL aggregation operation applied.
A level within the Product dimension hierarchy -- DEPARTMENT -- is used to filter
product members.
All of the dimensions are qualified in the WHERE clause, even though only the Product
dimension is selected. In OLAP cube queries, dimensions that are not selected in the
query require an “ALL” condition -- which specifies the top-level hierarchy value for
each of the dimension columns -- in order to leverage summaries that are already
computed by the cube.
Execute the query using your connection (“DW_S1234567” in the above example). The query
should return three rows almost instantaneously. The results should look like this:
Page 23 of 24
2. Advanced aggregation query
Now you are asked to write a similar query as above. In the previous query, a "Level" Condition
was used for the Product dimension (which was the only dimension selected). Here, you need
to apply level conditions to multiple dimensions in a cube query.
Every hierarchy and dimension view contains a “LEVEL_NAME” column. The value in this
column is the name of the OLAP hierarchy Level object that you created when modelling the
dimension in AWM. By simply specifying a value for this column in the WHERE clause, you
filter the data to include only those dimension members at the specified level in the hierarchy.
Since the Oracle OLAP truncates column names at 24 characters, in order to get the exact name
of a column, simply drill on the view that you want to examine by using SQL Developer’s
Connections navigator. You can also view the data to see the exact values in a column.
Task 4: Now, complete the following SQL query:
SELECT c.class_short_description as class,
p.department_long_descript as dept,
t.calendar_quarter_long_de as qtr,
round(s.sales) as sales
FROM ?
WHERE ?
In this query we need to use c, p, g, t, s to denote the the views from channel, product,
geography, time dimensions and sales_cube, respectively. Please perform a summarization on
dimension levels channel.CLASS, product.DEPARTMENT, geography.ALL_REGION and
Page 24 of 24
time.CALENDAR_QUARTER, and the time is filtered for “CY2009” only. The query result
should look like this (the record order doesn’t matter):
Include your query and a screenshot of the results in your submitted document.
---ooo000OOO000ooo---