Unsupervised Data Mining Workbook
Prof Emel Aktas
8 March 2021
1 A Simple Affinity Example
A typical application of data mining is in marketing: we can ask a customer who buys a product
whether they would like to buy another, similar product. This type of analysis is called ‘affinity
analysis’; it studies things that appear together. It is a correlation analysis, so do not mistake
correlation for causation.
The example we will investigate is from a hardware wholesaler. For the purposes of class demonstration
we processed the raw data from the wholesaler and shortened it to 100 observations and
five products:
1. Adjustable Wristband
2. Portable Counterweight
3. Canvas Pouch
4. Rescue Harness
5. Cellphone Holster
In the exercise, we will investigate which of these products were ordered together by 100 customers.
A practical application of this analysis could inform where in the warehouse the products
should be located to minimise the picking time.
Disclaimer: Product images from
https://www.3m.co.uk/3M/en_GB/company-uk/3m-products/~/All-3M-Products/Safety/DBI-SALA/?N=5002385+8709322+8711017+8734872&rt=r3
are used as representations of the products. The data are realistic but have been processed to
disguise any commercial information. No conclusions about any company or volumes of sales
should be inferred from the data.
[1]: # import the numpy package
import numpy as np
# import seaborn for graphs
import seaborn as sn
sn.set(color_codes = True)
# import matplotlib for plotting
import matplotlib.pyplot as plt
# import pandas for data analysis
import pandas as pd
# define the dataset file name
dataset_filename = "affinity_dataset.txt"
# load the dataset
X = np.loadtxt(dataset_filename)
[2]: X.shape
[2]: (100, 5)
[3]: X[:5]
[3]: array([[0., 1., 0., 0., 0.],
[1., 1., 0., 0., 0.],
[0., 0., 1., 0., 1.],
[1., 1., 0., 0., 0.],
[0., 0., 1., 1., 1.]])
[4]: # assign number of observations and number of variables
n_samples, n_features = X.shape
[5]: # names of variables (features)
features = ["Wristband", "Counterweight", "Pouch", "Harness", "Holster"]
[6]: # support and confidence for the rule: "if a customer orders Holster, they also buy X"
# third row of X as an example sample
sample = X[2]
[7]: # take a look at what's in sample and compare with X[:5]
sample
[7]: array([0., 0., 1., 0., 1.])
[8]: # the fifth element of sample (corresponds to Holster)
sample[4]
[8]: 1.0
[9]: # create a default dictionary to capture valid and invalid rules
from collections import defaultdict
valid_rules = defaultdict(int)
invalid_rules = defaultdict(int)
num_occurences = defaultdict(int)
[10]: # check the entire dataset for each feature as a premise and check the conclusion.
# when the premise is true, if the conclusion is also true, the rule is valid.
for sample in X:
    for premise in range(n_features):
        if sample[premise] == 0:
            continue
        # Record that the premise was bought in another transaction
        num_occurences[premise] += 1
        for conclusion in range(n_features):
            if premise == conclusion:
                # It makes little sense to measure if X -> X.
                continue
            if sample[conclusion] == 1:
                # This person also bought the conclusion item
                valid_rules[(premise, conclusion)] += 1
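The nested loops above can also be expressed as a single matrix product. The following is a minimal sketch rather than the workbook's method: it assumes the data is a 0/1 matrix and uses only the five rows printed by X[:5] as a stand-in for the full dataset. The names X_small, co_occurrence, item_counts, and confidence_matrix are illustrative.

```python
import numpy as np

# First five rows of the dataset, as printed by X[:5] above
X_small = np.array([
    [0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1],
])

# Entry (i, j) counts the orders that contain both product i and product j
co_occurrence = X_small.T @ X_small

# The diagonal holds the number of orders that contain each product
item_counts = np.diag(co_occurrence)

# Confidence of "i -> j" = orders containing both / orders containing i
confidence_matrix = co_occurrence / item_counts[:, None]

print(co_occurrence[0, 1])      # Wristband and Counterweight ordered together
print(confidence_matrix[1, 0])  # confidence of Counterweight -> Wristband
```

In this sketch every product appears at least once, so the division is safe; a product that never appears would have a count of zero and the division would need guarding.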
[11]: # how many times each product is bought
num_occurences
[11]: defaultdict(int, {1: 52, 0: 28, 2: 39, 4: 57, 3: 43})
[12]: # how many times each pair of products was ordered together
valid_rules
[12]: defaultdict(int,
{(0, 1): 13,
(1, 0): 13,
(2, 4): 20,
(4, 2): 20,
(2, 3): 22,
(3, 2): 22,
(3, 4): 27,
(4, 3): 27,
(1, 3): 18,
(3, 1): 18,
(1, 4): 27,
(4, 1): 27,
(0, 2): 5,
(2, 0): 5,
(0, 4): 16,
(4, 0): 16,
(1, 2): 11,
(2, 1): 11,
(0, 3): 9,
(3, 0): 9})
[13]: # support of the rule
support = valid_rules
[14]: # confidence calculation
confidence = defaultdict(float)
for premise, conclusion in valid_rules.keys():
    rule = (premise, conclusion)
    confidence[rule] = valid_rules[rule] / num_occurences[premise]
[15]: # confidence of the rule: the fraction of times the rule applies when the premise applies
confidence
[15]: defaultdict(float,
{(0, 1): 0.4642857142857143,
(1, 0): 0.25,
(2, 4): 0.5128205128205128,
(4, 2): 0.3508771929824561,
(2, 3): 0.5641025641025641,
(3, 2): 0.5116279069767442,
(3, 4): 0.627906976744186,
(4, 3): 0.47368421052631576,
(1, 3): 0.34615384615384615,
(3, 1): 0.4186046511627907,
(1, 4): 0.5192307692307693,
(4, 1): 0.47368421052631576,
(0, 2): 0.17857142857142858,
(2, 0): 0.1282051282051282,
(0, 4): 0.5714285714285714,
(4, 0): 0.2807017543859649,
(1, 2): 0.21153846153846154,
(2, 1): 0.28205128205128205,
(0, 3): 0.32142857142857145,
(3, 0): 0.20930232558139536})
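In symbols, the two measures computed above can be written as follows. The notation count(·) is introduced here for illustration only. Note that this workbook reports support as a raw count of orders rather than as a fraction of all orders. The worked line uses the counts for Harness (43 orders) and for Harness together with Holster (27 orders) from the outputs above.

```latex
\[
\operatorname{support}(X \rightarrow Y) = \operatorname{count}(X \text{ and } Y),
\qquad
\operatorname{confidence}(X \rightarrow Y)
  = \frac{\operatorname{count}(X \text{ and } Y)}{\operatorname{count}(X)}
\]
\[
\operatorname{confidence}(\text{Harness} \rightarrow \text{Holster})
  = \frac{27}{43} \approx 0.628
\]
```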
[16]: for premise, conclusion in confidence:
    premise_name = features[premise]
    conclusion_name = features[conclusion]
    print("Rule: If a customer orders {0} they will also order {1}".format(
        premise_name, conclusion_name))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
Rule: If a customer orders Wristband they will also order Counterweight
- Confidence: 0.464
- Support: 13
Rule: If a customer orders Counterweight they will also order Wristband
- Confidence: 0.250
- Support: 13
Rule: If a customer orders Pouch they will also order Holster
- Confidence: 0.513
- Support: 20
Rule: If a customer orders Holster they will also order Pouch
- Confidence: 0.351
- Support: 20
Rule: If a customer orders Pouch they will also order Harness
- Confidence: 0.564
- Support: 22
Rule: If a customer orders Harness they will also order Pouch
- Confidence: 0.512
- Support: 22
Rule: If a customer orders Harness they will also order Holster
- Confidence: 0.628
- Support: 27
Rule: If a customer orders Holster they will also order Harness
- Confidence: 0.474
- Support: 27
Rule: If a customer orders Counterweight they will also order Harness
- Confidence: 0.346
- Support: 18
Rule: If a customer orders Harness they will also order Counterweight
- Confidence: 0.419
- Support: 18
Rule: If a customer orders Counterweight they will also order Holster
- Confidence: 0.519
- Support: 27
Rule: If a customer orders Holster they will also order Counterweight
- Confidence: 0.474
- Support: 27
Rule: If a customer orders Wristband they will also order Pouch
- Confidence: 0.179
- Support: 5
Rule: If a customer orders Pouch they will also order Wristband
- Confidence: 0.128
- Support: 5
Rule: If a customer orders Wristband they will also order Holster
- Confidence: 0.571
- Support: 16
Rule: If a customer orders Holster they will also order Wristband
- Confidence: 0.281
- Support: 16
Rule: If a customer orders Counterweight they will also order Pouch
- Confidence: 0.212
- Support: 11
Rule: If a customer orders Pouch they will also order Counterweight
- Confidence: 0.282
- Support: 11
Rule: If a customer orders Wristband they will also order Harness
- Confidence: 0.321
- Support: 9
Rule: If a customer orders Harness they will also order Wristband
- Confidence: 0.209
- Support: 9
[17]: from operator import itemgetter
sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
[18]: sorted_support
[18]: [((3, 4), 27),
((4, 3), 27),
((1, 4), 27),
((4, 1), 27),
((2, 3), 22),
((3, 2), 22),
((2, 4), 20),
((4, 2), 20),
((1, 3), 18),
((3, 1), 18),
((0, 4), 16),
((4, 0), 16),
((0, 1), 13),
((1, 0), 13),
((1, 2), 11),
((2, 1), 11),
((0, 3), 9),
((3, 0), 9),
((0, 2), 5),
((2, 0), 5)]
[19]: sorted_support = sorted(support.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_support[index][0]
    print("Rule: If a customer orders {0} they will also order {1}".format(
        features[premise], features[conclusion]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a customer orders Harness they will also order Holster
- Confidence: 0.628
- Support: 27
Rule #2
Rule: If a customer orders Holster they will also order Harness
- Confidence: 0.474
- Support: 27
Rule #3
Rule: If a customer orders Counterweight they will also order Holster
- Confidence: 0.519
- Support: 27
Rule #4
Rule: If a customer orders Holster they will also order Counterweight
- Confidence: 0.474
- Support: 27
Rule #5
Rule: If a customer orders Pouch they will also order Harness
- Confidence: 0.564
- Support: 22
[20]: sorted_confidence = sorted(confidence.items(), key=itemgetter(1), reverse=True)
for index in range(5):
    print("Rule #{0}".format(index + 1))
    premise, conclusion = sorted_confidence[index][0]
    print("Rule: If a customer orders {0} they will also order {1}".format(
        features[premise], features[conclusion]))
    print(" - Confidence: {0:.3f}".format(confidence[(premise, conclusion)]))
    print(" - Support: {0}".format(support[(premise, conclusion)]))
    print("")
Rule #1
Rule: If a customer orders Harness they will also order Holster
- Confidence: 0.628
- Support: 27
Rule #2
Rule: If a customer orders Wristband they will also order Holster
- Confidence: 0.571
- Support: 16
Rule #3
Rule: If a customer orders Pouch they will also order Harness
- Confidence: 0.564
- Support: 22
Rule #4
Rule: If a customer orders Counterweight they will also order Holster
- Confidence: 0.519
- Support: 27
Rule #5
Rule: If a customer orders Pouch they will also order Holster
- Confidence: 0.513
- Support: 20
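The same three print statements are repeated in several cells. One way to avoid the repetition is a small helper function. The sketch below is a suggestion rather than part of the workbook, and the name format_rule is hypothetical. It uses two of the rules reported above as sample data.

```python
features = ["Wristband", "Counterweight", "Pouch", "Harness", "Holster"]

# Two rules copied from the results above: (premise, conclusion) -> value
support = {(3, 4): 27, (0, 4): 16}
confidence = {(3, 4): 27 / 43, (0, 4): 16 / 28}


def format_rule(premise, conclusion, support, confidence, features):
    """Return a printable description of the rule premise -> conclusion."""
    rule = (premise, conclusion)
    return (
        "Rule: If a customer orders {0} they will also order {1}\n"
        " - Confidence: {2:.3f}\n"
        " - Support: {3}"
    ).format(features[premise], features[conclusion], confidence[rule], support[rule])


# Print the rules from highest to lowest confidence
for premise, conclusion in sorted(confidence, key=confidence.get, reverse=True):
    print(format_rule(premise, conclusion, support, confidence, features))
    print("")
```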
[21]: from matplotlib import pyplot as plt
plt.plot([confidence[rule[0]] for rule in sorted_confidence])
plt.ylabel('Confidence')
plt.xlabel('Rule') # possibly use the first five rules
[21]: Text(0.5, 0, 'Rule')
2 Cluster Analysis
Cluster analysis groups data into categories without knowing the categories in advance. For this
exercise we have a customer dataset with four features and 3184 customers. The features are:
1. Transportation Cost
2. Warehouse Cost
3. Overhead
4. Profit
Without knowing how many customer groups exist in the data, we will read, plot, and cluster the
customers into an acceptable number of groups using the k-means clustering algorithm.
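For orientation, k-means alternates between two steps: assign each point to its nearest centre, then move each centre to the mean of its assigned points. The sketch below is a minimal NumPy illustration of that idea on synthetic data. It is not scikit-learn's implementation, which the workbook uses. The function name simple_kmeans, the initialisation from randomly chosen data points, and the toy data are all assumptions made for the example.

```python
import numpy as np


def simple_kmeans(points, k, n_iter=100, seed=0):
    """Minimal Lloyd-style k-means; returns (centres, labels)."""
    rng = np.random.default_rng(seed)
    # Initialise the centres with k distinct data points chosen at random
    centres = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every centre
        distances = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centre to the mean of its points
        # (an empty cluster keeps its previous centre)
        new_centres = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return centres, labels


# Synthetic data: two well-separated groups of 50 two-dimensional points
rng = np.random.default_rng(42)
group_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
group_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
toy_points = np.vstack([group_a, group_b])

centres, labels = simple_kmeans(toy_points, k=2)
print(centres)
print(np.bincount(labels, minlength=2))
```

Because the result depends on the random starting centres, library implementations typically repeat the procedure from several starts and keep the best run; this is what the n_init parameter used later in this workbook controls.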
[59]: customer = pd.read_csv("cluster_dataset.csv")
[60]: customer.head()
[60]: transportation_cost  warehouse_cost    overhead      profit
0            18890.063700     1510.207094  683.145441  59675.1824
1              242.019333      716.213643  122.229621  58091.2485
2              119.276458      358.106821  122.229621  47502.0468
3            33730.627370     1510.207094  683.145441  38209.7592
4             7591.826018     1168.491532  325.997463  38197.7255
[61]: customer.shape
[61]: (3184, 4)
[62]: customer.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3184 entries, 0 to 3183
Data columns (total 4 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   transportation_cost  3184 non-null   float64
 1   warehouse_cost       3184 non-null   float64
 2   overhead             3184 non-null   float64
 3   profit               3184 non-null   float64
dtypes: float64(4)
memory usage: 99.6 KB
[63]: customer.describe()
[63]: transportation_cost  warehouse_cost     overhead          profit
count         3184.000000     3184.000000  3184.000000     3184.000000
mean          2107.769227     1101.700615   462.705894     3993.600789
std           2894.888155      439.223708   262.686723     5284.903650
min              0.094115      226.138945   122.229621  -137377.462000
25%            513.508606      810.384711   325.997463     1377.042925
50%           1374.763639     1168.491532   325.997463     3126.758200
75%           2704.144414     1510.207094   683.145441     5214.488175
max          49106.959210     2028.357506   915.429857    59675.182400
[64]: # box and whisker plots
customer.plot(kind='box', sharex=False, sharey=False)
[65]: # box and whisker plots
customer["transportation_cost"].plot(kind='box', sharex=False, sharey=False)
[66]: # box and whisker plots
customer["warehouse_cost"].plot(kind='box', sharex=False, sharey=False)
[67]: # box and whisker plots
customer["overhead"].plot(kind='box', sharex=False, sharey=False)
[68]: # box and whisker plots
customer["profit"].plot(kind='box', sharex=False, sharey=False)
[70]: # histograms
customer.hist(edgecolor='white', linewidth=1.1)
[70]: [Figure: 2 x 2 grid of histograms, one per feature]
[71]: from pandas.plotting import scatter_matrix
# scatter plot matrix
scatter_matrix(customer,figsize=(10,10))
plt.show()
[73]: # updating the diagonal elements in a pairplot to show a kernel density estimation (kde)
sn.pairplot(customer, diag_kind="kde")
[74]: # import KMeans
from sklearn.cluster import KMeans
[78]: # create kmeans object
kmeans = KMeans(n_clusters=4)
# create np array for data points
points = np.array(customer)
# fit kmeans object to data
kmeans.fit(points)
# print location of clusters learned by kmeans object
print(kmeans.cluster_centers_)
# save cluster labels for the chart; reuse the labels from the fit above,
# because calling fit_predict would refit the model and could reorder the clusters
y_km = kmeans.labels_
[[ 1.23540923e+03 1.00672039e+03 4.26418588e+02 2.18647926e+03]
[ 3.67597186e+03 1.34914009e+03 5.54594706e+02 7.63349887e+03]
[ 2.52645291e+03 1.07432046e+03 1.22229621e+02 -1.37377462e+05]
[ 1.17679998e+04 1.40823540e+03 6.09630815e+02 2.19542255e+04]]
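The third centre printed above has a profit of -1.37377462e+05, which matches the minimum profit in the describe() output, so that cluster may contain only a single extreme customer. Counting the members of each cluster is a quick way to check. The sketch below uses synthetic data (one tight group plus one far-away point) rather than the workbook's dataset; toy_points and sizes are illustrative names.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the customer data: 50 ordinary points and one extreme point
rng = np.random.default_rng(0)
ordinary = rng.normal(loc=0.0, scale=1.0, size=(50, 2))
extreme = np.array([[1000.0, 1000.0]])
toy_points = np.vstack([ordinary, extreme])

# n_init and random_state are set explicitly so the run is reproducible
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_points)

# Number of points assigned to each cluster label
sizes = np.bincount(kmeans.labels_, minlength=2)
print(sizes)
```

On the workbook's data, the equivalent check would be np.bincount(y_km); a cluster of size one suggests the outlier is worth inspecting before clustering.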
[79]: plt.scatter(points[y_km ==0,0], points[y_km == 0,1], s=100, c='red')
plt.scatter(points[y_km ==1,0], points[y_km == 1,1], s=100, c='black')
plt.scatter(points[y_km ==2,0], points[y_km == 2,1], s=100, c='blue')
plt.scatter(points[y_km ==3,0], points[y_km == 3,1], s=100, c='cyan')
[88]: # create kmeans object
kmeans = KMeans(n_clusters=5)
# create np array for data points
points = np.array(customer)
# fit kmeans object to data
kmeans.fit(points)
# print location of clusters learned by kmeans object
print(kmeans.cluster_centers_)
# save cluster labels for the chart; reuse the labels from the fit above,
# because calling fit_predict would refit the model and could reorder the clusters
y_km = kmeans.labels_
[[ 2.46353548e+03 1.28705755e+03 5.53547056e+02 4.61479988e+03]
[ 1.08885603e+04 1.35750884e+03 5.62405013e+02 2.99614482e+04]
[ 2.52645291e+03 1.07432046e+03 1.22229621e+02 -1.37377462e+05]
[ 6.84868290e+02 8.42693534e+02 3.47143638e+02 1.19178766e+03]
[ 5.71651402e+03 1.40609811e+03 5.67251782e+02 1.08938860e+04]]
[91]: plt.scatter(points[y_km ==0,0], points[y_km == 0,1], s=100, c='orange')
plt.scatter(points[y_km ==1,0], points[y_km == 1,1], s=100, c='blue')
plt.scatter(points[y_km ==2,0], points[y_km == 2,1], s=100, c='green')
plt.scatter(points[y_km ==3,0], points[y_km == 3,1], s=100, c='black')
plt.scatter(points[y_km ==4,0], points[y_km == 4,1], s=100, c='red')
[96]: from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaled_features = scaler.fit_transform(customer)
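Scaling matters here because k-means relies on Euclidean distances, and the describe() output shows features on very different scales (for example, the standard deviation of profit is about 5285 while that of overhead is about 263). StandardScaler subtracts each column's mean and divides by its standard deviation. The sketch below checks this on a small made-up array; the values in toy are illustrative and are not taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two columns on very different scales
toy = np.array([
    [100.0, 1.0],
    [200.0, 2.0],
    [300.0, 3.0],
    [400.0, 6.0],
])

scaled = StandardScaler().fit_transform(toy)

# Manual equivalent: population standard deviation (ddof=0), column by column
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, so no single feature dominates the distance calculation.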
[107]: kmeans_kwargs = {
    "init": "random",
    "n_init": 10,
    "max_iter": 300,
    "random_state": 42,
}
# A list holds the SSE values for each k
sse = []
for k in range(1, 50):
    kmeans = KMeans(n_clusters=k, **kmeans_kwargs)
    kmeans.fit(scaled_features)
    sse.append(kmeans.inertia_)
[108]: plt.plot(range(1, 50), sse)
plt.xticks(range(1, 50))
plt.xlabel("Number of Clusters")
plt.ylabel("SSE")
plt.show()
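The cell that imports StandardScaler also imports silhouette_score, but the workbook does not go on to use it. As a complement to the SSE curve, the silhouette score (between -1 and 1, higher is better) can be computed for each k from 2 upwards. The sketch below is an assumption about how it might be applied, shown on synthetic blobs rather than the customer data; the centres, cluster_std, and range of k are illustrative choices.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three clearly separated groups
blobs, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.5,
    random_state=42,
)

# The silhouette score needs at least two clusters, so k starts at 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(blobs)
    scores[k] = silhouette_score(blobs, labels)

# The k with the highest silhouette score is a candidate number of clusters
best_k = max(scores, key=scores.get)
print(best_k)
```

On the workbook's data, the same loop could be run on scaled_features, ideally over a smaller range of k than the 1 to 49 used for the SSE plot, since the silhouette score is more expensive to compute.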