Learning OWL classes from RDF data
with OWL-MINER
David Ratcliffe¹,² & Dr. Kerry Taylor¹
¹ANU CECS, ²CSIRO Data61
June 28, 2017
Lab Outline
In this lab, we will explore several concrete machine learning problems
over RDF data with OWL. For each problem, we will:
Understand the problem domain, and what we intend to solve;
Use Protégé to explore the graph-based RDF data and how it is
described with OWL properties and classes;
Use OWL-MINER to learn OWL classes as solution hypotheses over the
RDF data;
Use Protégé to inspect the OWL classes generated by OWL-MINER.
We will focus on five problem domains:
1. Michalski’s Trains: Classifying eastbound from westbound trains.
2. Mushrooms: Classifying edible vs. poisonous mushrooms.
3. Mutagenesis: Classifying mutagenic vs. non-mutagenic chemical compounds.
4. Carcinogenesis: Classifying carcinogenic vs. non-carcinogenic chemical
compounds.
5. Poker: Classifying poker hands (two pair, straight, royal flush, etc.).
1 Michalski’s Trains
Michalski’s Trains is a well-studied problem for rule learning over structured
data [4]. In this problem, we have five examples of eastbound trains,
and five examples of westbound trains. Our goal here is to learn a simple
description of the features which correctly identify only eastbound or only
westbound trains.
Figure 1: Michalski Trains. Trains numbered 1-5 are classified as belonging
to the class ‘eastbound’, and those numbered 6-10 are labelled ‘westbound’.
1. Using Protégé, open the file:
owlminer-lab/examples/data/trains.owl
Inspect the OWL classes and properties describing the various features
of trains, such as the types of car, load shapes and occurrences,
wheel counts, and so on. Do not save any changes to this file.
Given these features in the data together with Figure 1, can you identify
which features describe only eastbound or westbound trains?
2. OWL-MINER is a software program which takes RDF data together
with an OWL ontology which describes it, and can automatically generate
OWL classes which describe some subset of the data. For example,
OWL-MINER can be instructed to find OWL classes which describe
only eastbound or westbound trains with a certain accuracy, such
as 100%. To run OWL-MINER for this problem:
./owlminer --config examples/trains.xml
OWL-MINER will report that it is trying to solve two tasks: the
first is to find classes which describe only eastbound trains with 100%
accuracy, and the second is to do the same for westbound trains. The results
are saved as new OWL ontology files:
owlminer-lab/results/trains_East.owl
owlminer-lab/results/trains_West.owl
3. Use Protégé to open each of the output files (above). OWL-MINER
has added a number of classes, C0, ..., C9, which are the hypotheses
describing features of trains travelling either eastbound or westbound.
Annotations on each class, such as comment, describe various
measures of how well the class performs on the given
problem. For example, C0 may read:
C0 ≡ Train and (hasCar some (Closed and Short))
This class describes eastbound trains as only those which have at least
one (some) Car which is both Closed and Short. Inspect the other generated
classes to see there are several solutions to this problem.
In the comments on classes, a variety of quality measure values are
reported:
Util: The utility score for this class (a measure of quality attributed
by OWL-MINER based on the chosen measure function,
and the simplicity/readability of the class);
Acc: The accuracy;
Prec: The precision or positive predictive value (PPV);
FDR: The false discovery rate: 1-PPV;
Spec: The specificity or true negative rate (TNR);
Sens: The sensitivity or recall;
P: p/P, and N: n/N: The number of positive (p) or negative (n)
examples covered out of all positive (P) or negative (N) examples.
WRAcc: The weighted relative accuracy.
See https://en.wikipedia.org/wiki/Confusion_matrix for definitions.
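To make the class semantics and the measures above concrete, here is a minimal Python sketch. The four toy trains, their car descriptions and the function names are invented for illustration; they are not the contents of trains.owl and this is not how OWL-MINER is implemented. The sketch checks the example class Train and (hasCar some (Closed and Short)) against each toy train, then derives the reported measures from the coverage counts p, n, P and N.

```python
# Toy data: each train maps to a list of cars; each car is a set of class names.
# These trains are hypothetical and are NOT the ten trains in trains.owl.
TRAINS = {
    "east1": [{"Closed", "Short"}, {"Open", "Long"}],
    "east2": [{"Closed", "Short"}],
    "west1": [{"Open", "Short"}, {"Closed", "Long"}],
    "west2": [{"Open", "Long"}],
}
POSITIVE = {"east1", "east2"}  # trains labelled eastbound in this toy set


def covered_by_c0(cars):
    """Train and (hasCar some (Closed and Short)): some car is both Closed and Short."""
    return any({"Closed", "Short"} <= car for car in cars)


def measures(p, n, P, N):
    """Measures from covered positives p (of P) and covered negatives n (of N)."""
    tn = N - n                 # negatives correctly left uncovered
    covered = p + n
    return {
        "Acc": (p + tn) / (P + N),
        # Prec and FDR are set to 0 when nothing is covered (a convention of this sketch).
        "Prec": p / covered if covered else 0.0,
        "FDR": n / covered if covered else 0.0,
        "Spec": tn / N,
        "Sens": p / P,
    }


covered = {name for name, cars in TRAINS.items() if covered_by_c0(cars)}
p = len(covered & POSITIVE)
n = len(covered - POSITIVE)
P = len(POSITIVE)
N = len(TRAINS) - P
print(sorted(covered), measures(p, n, P, N))
```

On this toy set the class covers both eastbound trains and neither westbound train, so every measure except FDR is 1.0 and FDR is 0.0.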
2 Mushrooms
The mushroom dataset [5] consists of 8,124 hypothetical examples of the
characteristics of mushrooms from the agaricus and lepiota families, with
roughly half labelled as being edible (4,208) and the other half as being poisonous
(3,916). We have created a new dataset using RDF and OWL consisting
of a random sample of 200 mushrooms from each label.
Figure 2: The mushroom dataset [5] contains feature descriptions of thousands
of mushrooms, labelled as being either edible or poisonous (image
from [7]).
1. Open and explore the ontology with Protégé:
owlminer-lab/examples/data/mushroom_kb.rdf
Under the class Mushroom are two classes, Edible and Poisonous. Explore
the various features which describe these mushrooms. Clearly,
with so many examples, manually deducing the features which indicate
if a mushroom is poisonous or edible is not feasible.
2. Run OWL-MINER which will attempt to compute classes with at least
99% accuracy for both edible and poisonous mushrooms:
./owlminer --config examples/mushrooms.xml
Note that because of the larger size of this problem, computation of
the classes takes longer than finding solutions to Michalski’s Trains.
The results are saved as:
owlminer-lab/results/mushroom_Edible.owl
owlminer-lab/results/mushroom_Poisonous.owl
3. Explore the generated classes with Protégé for each type of
mushroom, edible or poisonous.
By modifying the configuration file for OWL-MINER, we can change
the quality measure used to score the generated classes.
In the example above, accuracy was the chosen quality measure. Another
measure is Weighted Relative Accuracy (WRAcc), which is suited to
learning subgroups of examples whose label distribution is unusual relative
to the full set. Here, we have an even split of 200 edible and 200 poisonous
mushrooms. A class with a WRAcc score close to -1.0 mainly describes
edible mushrooms as opposed to poisonous ones, and a score close to 1.0
indicates that the class mainly describes poisonous mushrooms as opposed to
edible ones. To test the use of WRAcc as a quality measure, run:
./owlminer --config examples/mushrooms_wracc.xml
The output can be inspected with Protégé over the result:
./owlminer --config examples/mushrooms_wracc_Poisonous.xml
Note how there are many classes which describe different features of one
set of mushrooms over another (edible, poisonous).
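The behaviour of WRAcc on an even split can be sketched in a few lines of Python. This assumes the standard subgroup-discovery definition, coverage multiplied by (precision minus prior); OWL-MINER's exact formula and scaling may differ. Under the standard definition a 200/200 split bounds the score at ±0.25, so the hypothetical `wracc_normalised` helper divides by the maximum attainable value to reach the ±1.0 range described above. That rescaling is an assumption of this sketch.

```python
def wracc(p, n, P, N):
    """Standard weighted relative accuracy: coverage * (precision - prior)."""
    covered = p + n
    if covered == 0:
        return 0.0  # an empty class carries no information
    return (covered / (P + N)) * (p / covered - P / (P + N))


def wracc_normalised(p, n, P, N):
    """WRAcc divided by its maximum, P*N/(P+N)^2, giving a score in [-1, 1].

    This rescaling is assumed here, not taken from OWL-MINER.
    """
    return wracc(p, n, P, N) / (P * N / (P + N) ** 2)


# Poisonous mushrooms as positives (P = 200), edible as negatives (N = 200).
print(wracc_normalised(200, 0, 200, 200))    # covers only poisonous: 1.0
print(wracc_normalised(0, 200, 200, 200))    # covers only edible: -1.0
print(wracc_normalised(100, 100, 200, 200))  # same mix as the full set: 0.0
```

A class that covers edible and poisonous mushrooms in the same proportion as the whole sample scores 0, which is why WRAcc rewards classes that single out one label.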
3 Mutagenesis
Mutagenesis is a well-known benchmark problem in machine learning [2].
The mutagenesis dataset contains examples of various chemical compounds
and their characteristics, such as the atomic structure including functional
groups, and various real-valued measures such as a water/octanol partition
coefficient, log P. The full dataset contains 230 example compounds,
125 of which are labelled positive for mutagenicity, and the remaining 105
are labelled negative for non-mutagenicity. Figure 3 shows a sample of
three compounds which appear in the mutagenesis dataset.
Figure 3: Various small molecules from the mutagenesis dataset:
(a) 2-bromo-4,6-dinitroaniline, (b) 5-nitroisatin, (c) 6-nitroquinoline.
1. Inspect the mutagenesis dataset with Protégé:
owlminer-lab/examples/data/mutagenesis.owl
Note that the OWL ontology contains significantly more classes than
our earlier examples. In this problem, we aim to generate classes
which differentiate mutagenic compounds from non-mutagenic compounds
with high accuracy. The size of the ontology makes this
challenging, as there are many more potential classes to generate
and test.
2. Run OWL-MINER over this dataset to generate a single OWL class
which maximises the accuracy of classifying mutagenic compounds
over others:
./owlminer --config examples/mutagenesis.xml
Note that the expected runtime for this problem is around 2 minutes,
as OWL-MINER searches through the very large space of possible
classes.
3. The resulting OWL class C0 which describes mutagenic compounds
can be inspected with Protégé by opening:
owlminer-lab/results/mutagenesis_Pos.owl
C0 is a complex OWL class which describes several features of mutagenic
compounds, including:
The range of a datatype property over numerical values;
A particular known molecular sub-structure;
The use of negation to compactly describe a large subset of data.
This highlights the expressivity of OWL as a powerful hypothesis language
over data which combines categorical, numerical and structural
features.
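The three kinds of feature listed above can be mimicked in plain Python to show how they combine in a single membership test. Everything in this sketch is hypothetical: the `Compound` record, the `log_p` threshold and the sub-structure names are invented for illustration and do not reproduce the actual C0 or the vocabulary of mutagenesis.owl.

```python
from dataclasses import dataclass, field


@dataclass
class Compound:
    name: str
    log_p: float                                      # a numeric datatype value
    substructures: set = field(default_factory=set)   # named sub-structures present


def in_toy_class(c: Compound) -> bool:
    """Toy class combining a numeric range, a sub-structure and a negation."""
    in_range = c.log_p >= 2.0                          # datatype range restriction
    has_structure = "NitroGroup" in c.substructures    # existential restriction
    lacks_structure = "Halide" not in c.substructures  # negation
    return in_range and has_structure and lacks_structure


compounds = [
    Compound("a", 2.5, {"NitroGroup"}),
    Compound("b", 2.5, {"NitroGroup", "Halide"}),
    Compound("c", 1.0, {"NitroGroup"}),
]
print([c.name for c in compounds if in_toy_class(c)])
```

The negated condition excludes every compound containing the named sub-structure with one short clause, which is the compactness the text attributes to negation.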
A variety of machine learning techniques have been applied to the mutagenesis
dataset to construct classification models, including ILP, neural networks,
support vector machines and other kernel-based methods [3]. However, the
OWL/DL learning method outperforms all these techniques, with OWL-MINER
achieving the strongest known result for this problem with a 10-fold
cross-validation accuracy of 97.62 ± 3.31%.
4 Carcinogenesis
The carcinogenesis dataset is another long-standing benchmark problem
in machine learning [6]. Similar to the mutagenesis problem, carcinogenesis¹
contains examples of chemical compounds together with the results
of bioassays, each labelled as being carcinogenic or not. The full dataset
contains 337 example compounds, 182 of which are labelled positive for
carcinogenicity, and the remaining 155 are labelled negative as being
non-carcinogenic. The OWL ontology for this dataset contains 142 classes, 18
object properties, 1 datatype property and 22,374 instances, along with
many class axioms describing the subsumption hierarchy of atom, bond
and functional group structural classes, such as various types of halides or
ring structures.
1. Use Protégé to inspect the data and ontology:
owlminer-lab/examples/data/carcinogenesis.owl
Note that the carcinogenesis dataset has more classes, properties and
examples than the mutagenesis problem, and represents an even larger
search space of possible concepts.
2. Run OWL-MINER for this problem to generate OWL classes which
maximise the accuracy of classifying carcinogenic compounds:
./owlminer --config examples/carcinogenesis.xml
Note that the expected runtime for this problem is around 5 to 7
minutes, as OWL-MINER searches through the huge space of possible
classes. Figure 4 plots the expected trajectory of the OWL-MINER
system in solving this problem: classes of around 65% accuracy are
reached quickly, but it takes OWL-MINER much longer to locate
higher-quality classes with around 69% to 71% accuracy (the strongest
known results for this problem to date).
3. The resulting OWL classes which describe carcinogenic compounds
can be inspected with Protégé by opening:
owlminer-lab/results/carcinogenesis_Pos.owl
¹ http://www.cs.ox.ac.uk/activities/machlearn/cancer.html
[Plot: accuracy of the best concept found (y-axis, 0.54 to 0.72) against the number of concepts searched (x-axis, log scale, 10^0 to 10^7) for OWL-MINER.]
Figure 4: The trajectory of OWL-MINER in learning classes for the carcinogenesis
problem, plotting the number of classes searched versus the
accuracy of the best performing candidate.
5 Poker
The Poker dataset captures a structured representation of various five-card
hands which are labelled with their corresponding poker hand type, such
as nothing, straight, pair, flush, etc. [1]. The dataset presents a classification
problem where a learning system must infer the definition of each of the
poker hands to the exclusion of all others.
This problem is interesting because it has proven to be very difficult for
many learning systems and methods, primarily because of the scale of the
problem in the number of examples in the dataset, but also because it requires
learned theories to be expressive in order to describe poker hands
with high accuracy. The original dataset consists of a large number of examples
of five-card poker hands, where each example is described only
with the rank and suit of each card, for example:
A K Q J 10 (royal flush)        3 4 5 6 7 (straight flush)
7 7 7 7 3 (four of a kind)      Q Q Q 9 9 (full house)
J 10 8 3 2 (flush)              6 5 4 3 2 (straight)
5 5 5 K 7 (three of a kind)     4 4 K K 3 (two pair)
9 9 10 4 2 (one pair)           K Q 6 7 3 (nothing)
The OWL ontology describing this problem includes background knowledge
which has been added to aid in classification. Specifically, the roles
sameSuit, sameRank and nextRank are used to assert whether cards within an
individual hand have the same suit (e.g., 4 sameSuit 6), the same rank
(e.g., A sameRank A), or the next rank (e.g., 10 nextRank J).
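How such role assertions could be derived for one hand, and how they make a hand type easy to state, can be sketched in Python as follows. The card representation, the function names and the choice to treat the Ace as high only are assumptions of this sketch; the text does not say how the ontology handles a low Ace or whether every role is asserted in both directions.

```python
from itertools import permutations

# Rank order with the Ace high only (an assumption of this sketch).
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]


def role_assertions(hand):
    """Return (card, role, card) triples for a hand of (rank, suit) tuples."""
    triples = set()
    for a, b in permutations(hand, 2):  # every ordered pair of distinct cards
        if a[1] == b[1]:
            triples.add((a, "sameSuit", b))
        if a[0] == b[0]:
            triples.add((a, "sameRank", b))
        if RANKS.index(b[0]) - RANKS.index(a[0]) == 1:
            triples.add((a, "nextRank", b))  # b is one rank above a
    return triples


def is_four_of_a_kind(hand):
    """Some card is related by sameRank to at least three other cards."""
    triples = role_assertions(hand)
    return any(
        sum(1 for s, role, _ in triples if s == card and role == "sameRank") >= 3
        for card in hand
    )


four = [("7", "H"), ("7", "D"), ("7", "S"), ("7", "C"), ("3", "H")]
full_house = [("Q", "H"), ("Q", "D"), ("Q", "S"), ("9", "C"), ("9", "H")]
print(is_four_of_a_kind(four), is_four_of_a_kind(full_house))
```

Once the pairwise roles are available, a hand type reduces to a short condition over them, which is the kind of compact description a learned OWL class can express.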
For a single deck of 52 cards, there are 311,875,200 possible ordered
five-card hands, each with a type as listed above. In the ontology, we have
selected a stratified random sample of hands of each type, 603 example
hands in total, a tiny fraction of all possibilities.
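The figure of 311,875,200 counts hands in which the order of the five cards matters, that is 52 × 51 × 50 × 49 × 48. A short Python check of that arithmetic, and of how small the 603-hand sample is by comparison, is given below (the variable names are illustrative; `math.perm` and `math.comb` require Python 3.8 or later).

```python
import math

ordered_hands = math.perm(52, 5)    # 52 * 51 * 50 * 49 * 48
unordered_hands = math.comb(52, 5)  # the same hands when card order is ignored
sample_fraction = 603 / ordered_hands

print(ordered_hands)    # 311875200
print(unordered_hands)  # 2598960
print(f"{sample_fraction:.2e}")
```

The sample therefore covers roughly two in every million ordered hands.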
1. Use Protégé to inspect the data and ontology:
owlminer-lab/examples/data/poker_kb.owl
Note the completeness of assertions of properties such as sameSuit,
sameRank and nextRank for each hand. As we will see, these properties
will give us the language OWL-MINER will need to classify
hands effectively and compactly.
2. Run any of the following to invoke OWL-MINER to solve the respective
problems:
./owlminer --config examples/pokerFourOfAKind.xml
./owlminer --config examples/pokerStraight.xml
./owlminer --config examples/pokerStraightFlush.xml
./owlminer --config examples/pokerRoyalFlush.xml
./owlminer --config examples/pokerTwoPair.xml
The expected runtime for each of these problems is from seconds to
around 5 minutes.
3. For any of the above, inspect the result with Protégé as:
owlminer-lab/results/poker_FourOfAKind.owl
owlminer-lab/results/poker_Straight.owl
owlminer-lab/results/poker_StraightFlush.owl
owlminer-lab/results/poker_RoyalFlush.owl
owlminer-lab/results/poker_TwoPair.owl
References
[1] Robert Cattral, Franz Oppacher, and Dwight Deugo. “Evolutionary
data mining with automatic rule generalization”. In: Recent Advances
in Computers, Computing and Communications. 2002, pp. 296–300.
[2] Asim Kumar Debnath et al. “Structure-activity relationship of mutagenic
aromatic and heteroaromatic nitro compounds. Correlation with
molecular orbital energies and hydrophobicity”. In: Journal of Medicinal
Chemistry 34.2 (Feb. 1991), pp. 786–797. ISSN: 0022-2623.
DOI: 10.1021/jm00106a046. URL: http://dx.doi.org/10.1021/jm00106a046
(visited on 04/26/2016).
[3] Huma Lodhi and Stephen Muggleton. “Is mutagenesis still challenging?”
In: Proceedings of the 15th International Conference on Inductive
Logic Programming, ILP 2005, Late-Breaking Papers. 2005, pp. 35–40.
[4] R Michalski and J Larson. “Inductive inference of VL decision rules”.
In: Proceedings of the Workshop in Pattern-Directed Inference Systems (Published
in SIGART Newsletter ACM). 1977, pp. 38–44.
[5] Jeffrey Curtis Schlimmer. “Concept Acquisition Through Representational
Adjustment”. AAI8724747. PhD thesis. University of California,
Irvine, 1987.
[6] A. Srinivasan et al. “Carcinogenesis predictions using ILP”. In: Inductive
Logic Programming. Ed. by Nada Lavrač and Sašo Džeroski.
Lecture Notes in Computer Science 1297. DOI: 10.1007/3540635149_56.
Springer Berlin Heidelberg, Sept. 1997, pp. 273–287.
ISBN: 978-3-540-63514-7 978-3-540-69587-5.
URL: http://link.springer.com/chapter/10.1007/3540635149_56 (visited on 06/29/2016).
[7] Thomas Taylor. Twelve Edible Mushrooms of the United States, USDA.
1894. URL: https://commons.wikimedia.org/w/index.php?curid=9098324.