Learning OWL classes from RDF data

with OWL-MINER

David Ratcliffe (1,2) & Dr. Kerry Taylor (1)

(1) ANU CECS, (2) CSIRO Data61

June 28, 2017

Lab Outline

In this lab, we will explore several concrete machine learning problems

over RDF data with OWL. For each problem, we will:

Understand the problem domain, and what we intend to solve;

Use Protégé to explore the graph-based RDF data and how it is described
with OWL properties and classes;

Use OWL-MINER to learn OWL classes as solution hypotheses over the RDF data;

Use Protégé to inspect the OWL classes generated by OWL-MINER.

We will focus on five problem domains:

1. Michalski’s Trains: Classifying eastbound from westbound trains.

2. Mushrooms: Classifying edible vs. poisonous mushrooms.

3. Mutagenesis: Classifying mutagenic vs. non-mutagenic chemical compounds.

4. Carcinogenesis: Classifying carcinogenic vs. non-carcinogenic chemical

compounds.

5. Poker: Classifying poker hands (two pair, straight, royal flush, etc.).

1 Michalski’s Trains

Michalski’s Trains is a well-studied problem for rule learning over structured

data [4]. In this problem, we have five examples of eastbound trains,

and five examples of westbound trains. Our goal here is to learn a simple

description of the features which correctly identify only eastbound or only

westbound trains.

Figure 1: Michalski Trains. Trains numbered 1-5 are classified as belonging

to the class ‘eastbound’, and those numbered 6-10 are labelled ‘westbound’.

1. Using Protégé, open the file:

owlminer-lab/examples/data/trains.owl

Inspect the OWL classes and properties describing the various features

of trains, such as the types of car, load shapes and occurrences,

wheel counts, and so on. Do not save any changes to this file.

Given these features in the data together with Figure 1, can you identify

which features describe only eastbound or westbound trains?

2. OWL-MINER is a software program which takes RDF data together

with an OWL ontology which describes it, and can automatically generate

OWL classes which describe some subset of the data. For example,

OWL-MINER can be instructed to find OWL classes which describe

only eastbound or westbound trains with a certain accuracy, such

as 100%. To run OWL-MINER for this problem:

./owlminer --config examples/trains.xml

OWL-MINER will report that it is solving two tasks: the first is to find
classes which describe only eastbound trains with 100% accuracy, and the
second is to do the same for westbound trains. The results are saved as new
OWL ontology files:

owlminer-lab/results/trains_East.owl

owlminer-lab/results/trains_West.owl

3. Use Protégé to open each of the output files (above). OWL-MINER has
added a number of classes, C0, ..., C9, which are the hypotheses describing
features of trains travelling either eastbound or westbound. Annotations on
each class, such as comment, report various measures of the class's
performance on the given problem. For example, C0 may read:

C0 ≡ Train and (hasCar some (Closed and Short))

This class describes eastbound trains as only those which have at least

one (some) Car which is both Closed and Short. Inspect the other generated

classes to see there are several solutions to this problem.

In the comments on classes, a variety of quality measure values are

reported:

Util: The utility score for this class (a measure of quality attributed

by OWL-MINER based on the chosen measure function,

and the simplicity/readability of the class);

Acc: The accuracy;

Prec: The precision or positive predictive value (PPV);

FDR: The false discovery rate: 1-PPV;

Spec: The specificity or true negative rate (TNR);

Sens: The sensitivity or recall;

P: p/P, and N: n/N: The number of positive (p) or negative (n)

examples covered out of all positive (P) or negative (N) examples.

WRAcc: The weighted relative accuracy, defined as: p

See https://en.wikipedia.org/wiki/Confusion_matrix for definitions.
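As a rough illustration of how a class expression and these measures fit together, the sketch below evaluates a class of the same shape as C0 over a few toy trains and derives Acc, Prec, FDR, Spec and Sens from the coverage counts p, n, P and N. The toy trains, the set-of-class-names encoding of a car and the function names are assumptions made for illustration; they are not OWL-MINER's data model or API.

```python
from typing import Dict, List, Set

# A train is a list of cars; each car is a set of class names (toy encoding).
Train = List[Set[str]]


def c0(train: Train) -> bool:
    """Train and (hasCar some (Closed and Short)):
    true when at least one car is both Closed and Short."""
    return any({"Closed", "Short"} <= car for car in train)


def quality_measures(p: int, n: int, P: int, N: int) -> Dict[str, float]:
    """Derive confusion-matrix measures from coverage counts.

    p, n: positive / negative examples covered by the class.
    P, N: all positive / negative examples (both assumed non-zero).
    """
    tp, fp = p, n          # covered positives / covered negatives
    tn = N - n             # negatives correctly left uncovered
    covered = tp + fp
    prec = tp / covered if covered else 0.0
    return {
        "Acc": (tp + tn) / (P + N),
        "Prec": prec,
        "FDR": 1.0 - prec if covered else 0.0,
        "Spec": tn / N,
        "Sens": tp / P,
    }


# Hypothetical toy examples, not the contents of trains.owl.
eastbound = [[{"Closed", "Short"}, {"Open", "Long"}], [{"Closed", "Short"}]]
westbound = [[{"Open", "Short"}], [{"Closed", "Long"}, {"Open", "Short"}]]

p = sum(c0(t) for t in eastbound)   # positives covered
n = sum(c0(t) for t in westbound)   # negatives covered
measures = quality_measures(p, n, len(eastbound), len(westbound))
print(f"P: {p}/{len(eastbound)}, N: {n}/{len(westbound)}", measures)
```

On the toy data the class covers every eastbound train and no westbound train, which is the situation a 100% accurate hypothesis describes.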

2 Mushrooms

The mushroom dataset [5] consists of 8,124 hypothetical examples of the
characteristics of mushrooms from the Agaricus and Lepiota families, with
roughly half labelled as edible (4,208) and the other half as poisonous
(3,916). We have created a new dataset using RDF and OWL consisting of a
random sample of 200 mushrooms from each label.
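How the sample was drawn is not described beyond "200 from each label"; one plausible way to draw such a per-label random sample is sketched below. The function name, the fixed seed and the placeholder identifiers standing in for the real records are all assumptions.

```python
import random
from typing import Dict, Hashable, List, Sequence, Tuple

Example = Tuple[Hashable, str]  # (identifier, label)


def stratified_sample(
    examples: Sequence[Example], per_label: int, seed: int = 0
) -> List[Example]:
    """Draw `per_label` examples uniformly at random from each label."""
    rng = random.Random(seed)            # fixed seed makes the draw repeatable
    by_label: Dict[str, List[Example]] = {}
    for example in examples:
        by_label.setdefault(example[1], []).append(example)
    sample: List[Example] = []
    for label in sorted(by_label):       # sorted for a deterministic order
        sample.extend(rng.sample(by_label[label], per_label))
    return sample


# Placeholder identifiers standing in for the 4,208 edible and
# 3,916 poisonous records of the full dataset.
full = [(f"e{i}", "edible") for i in range(4208)]
full += [(f"p{i}", "poisonous") for i in range(3916)]

subset = stratified_sample(full, per_label=200)
print(len(subset))
```

Sampling the same number from each label yields a balanced set even though the full dataset is slightly skewed towards edible examples.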

Figure 2: The mushroom dataset [5] contains feature descriptions of thousands

of mushrooms, labelled as being either edible or poisonous (image

from [7]).

1. Open and explore the ontology with Protégé:

owlminer-lab/examples/data/mushroom_kb.rdf

Under the class Mushroom are two classes, Edible and Poisonous. Explore

the various features which describe these mushrooms. Clearly,

with so many examples, manually deducing the features which indicate

if a mushroom is poisonous or edible is not feasible.

2. Run OWL-MINER which will attempt to compute classes with at least

99% accuracy for both edible and poisonous mushrooms:

./owlminer --config examples/mushrooms.xml

Note that because of the larger size of this problem, computation of

the classes takes longer than finding solutions to Michalski’s Trains.

The results are saved as:

owlminer-lab/results/mushroom_Edible.owl

owlminer-lab/results/mushroom_Poisonous.owl

3. Explore the generated classes with Protégé for each type of mushroom,
edible or poisonous.

By modifying the configuration file for OWL-MINER, we can change the quality
measure used to score the generated classes. In the example above, accuracy
was the chosen quality measure. Another measure is Weighted Relative Accuracy
(WRAcc), which is suited to finding subgroups of examples whose label
distribution is unusual relative to the full set. Here, the full set is an
even split of 200 edible and 200 poisonous mushrooms. A class with a WRAcc
score close to -1.0 mainly describes edible mushrooms as opposed to poisonous
ones, and a score close to 1.0 indicates that the class mainly describes
poisonous mushrooms as opposed to edible ones. To test the use of WRAcc as a
quality measure, run:

./owlminer --config examples/mushrooms_wracc.xml

The output can be inspected with Protégé over the result:

./owlminer --config examples/mushrooms_wracc_Poisonous.xml

Note how there are many classes which describe different features of one

set of mushrooms over another (edible, poisonous).
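To see how the sign of WRAcc separates the two kinds of class, the sketch below computes it for two made-up coverage counts on the 200/200 split, treating poisonous as the positive label. It assumes the commonly used definition, coverage × (precision − base rate), which lies in [-0.25, 0.25] for a balanced set; dividing by its largest attainable value, P·N/(P+N)², gives p/P − n/N and spans [-1, 1]. Whether OWL-MINER normalises in exactly this way is an assumption, as are the function names and the example counts.

```python
def wracc(p: int, n: int, P: int, N: int) -> float:
    """Weighted relative accuracy: coverage * (precision - base rate).

    p, n: positive / negative examples covered by the class.
    P, N: all positive / negative examples.
    """
    covered, total = p + n, P + N
    if covered == 0:
        return 0.0
    return (covered / total) * (p / covered - P / total)


def wracc_scaled(p: int, n: int, P: int, N: int) -> float:
    """WRAcc divided by its maximum P*N/(P+N)^2, so the result is in [-1, 1].

    This rescaling is an assumed normalisation; it equals p/P - n/N.
    """
    total = P + N
    return wracc(p, n, P, N) / (P * N / total ** 2)


P, N = 200, 200  # poisonous (positive) and edible (negative) examples

# Hypothetical classes: one covering mostly poisonous mushrooms,
# one covering mostly edible mushrooms.
mostly_poisonous = wracc_scaled(p=180, n=10, P=P, N=N)
mostly_edible = wracc_scaled(p=10, n=180, P=P, N=N)
print(mostly_poisonous, mostly_edible)
```

The first class scores well above zero and the second well below zero, while a class covering both labels in equal proportion scores zero under either scaling.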

3 Mutagenesis

Mutagenesis is a well-known benchmark problem in machine learning [2].

The mutagenesis dataset contains examples of various chemical compounds

and their characteristics, such as the atomic structure including functional

groups, and various real-valued measures such as a water/octanol partition

coefficient, log P. The full dataset contains 230 example compounds, 125 of
which are labelled positive (mutagenic) and the remaining 105 negative
(non-mutagenic). Figure 3 shows a sample of three compounds which appear in
the mutagenesis dataset.

(a) 2-bromo-4,6-dinitroaniline (b) 5-nitroisatin (c) 6-nitroquinoline

Figure 3: Various small molecules from the mutagenesis dataset.

1. Inspect the mutagenesis dataset with Protégé:

owlminer-lab/examples/data/mutagenesis.owl

Note that the OWL ontology contains significantly more classes than our
earlier examples. In this problem, we aim to generate classes which
differentiate mutagenic compounds from non-mutagenic compounds with high
accuracy. The size of the ontology makes this challenging, as there are many
more potential classes to generate and test.

2. Run OWL-MINER over this dataset to generate a single OWL class

which maximises the accuracy of classifying mutagenic compounds

over others:

./owlminer --config examples/mutagenesis.xml

Note that the expected runtime for this problem is around 2 minutes,

as OWL-MINER searches through the very large space of possible

classes.

3. The resulting OWL class C0, which describes mutagenic compounds, can be
inspected with Protégé by opening:

owlminer-lab/results/mutagenesis_Pos.owl

C0 is a complex OWL class which describes several features of mutagenic

compounds, including:

The range of a datatype property over numerical values;

A particular known molecular sub-structure;

The use of negation to compactly describe a large subset of data.

This highlights the expressivity of OWL as a powerful hypothesis language

over data which combines categorical, numerical and structural

features.
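To make these three kinds of feature concrete, the sketch below writes a made-up hypothesis of the same general shape as a Python predicate: a numeric range test, membership of a structural class, and a negated condition. It is not the class that OWL-MINER learns; the `Compound` record, the threshold and the structure names `RingX` and `GroupY` are placeholders invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Compound:
    log_p: float                                        # a real-valued measure
    structures: Set[str] = field(default_factory=set)   # structural classes present


def hypothesis(compound: Compound) -> bool:
    """Placeholder hypothesis mixing numerical, structural and negated features."""
    in_range = compound.log_p >= 2.0                    # numeric range (invented threshold)
    has_structure = "RingX" in compound.structures      # a known sub-structure
    lacks_group = "GroupY" not in compound.structures   # negation
    return in_range and has_structure and lacks_group


print(hypothesis(Compound(log_p=3.1, structures={"RingX"})))
print(hypothesis(Compound(log_p=3.1, structures={"RingX", "GroupY"})))
```

The negated condition is what keeps such a description compact: it excludes every compound containing the unwanted group without having to enumerate the alternatives.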

A variety of machine learning techniques have been applied to the mutagenesis
dataset to construct classification models, including inductive logic
programming (ILP), neural networks, support vector machines and other
kernel-based methods [3]. However, the OWL/DL learning method outperforms all
these techniques, with OWL-MINER achieving the strongest known result for
this problem with a 10-fold cross-validation accuracy of 97.62 ± 3.31%.

4 Carcinogenesis

The carcinogenesis dataset is another long-standing benchmark problem in
machine learning [6]. Like the mutagenesis problem, the carcinogenesis
dataset (see footnote 1) contains examples of chemical compounds together
with the results of bioassays, each labelled as being carcinogenic or not.
The full dataset contains 337 example compounds, 182 of which are labelled
positive (carcinogenic) and the remaining 155 negative (non-carcinogenic).
The OWL ontology for this dataset contains 142 classes, 18 object properties,
1 datatype property and 22,374 instances, along with many class axioms
describing the subsumption hierarchy of atom, bond and functional group
structural classes, such as various types of halides or ring structures.

1. Use Protégé to inspect the data and ontology:

owlminer-lab/examples/data/carcinogenesis.owl

Note that the carcinogenesis dataset has more classes, properties and

examples than the mutagenesis problem, and represents an even larger

search space of possible concepts.

2. Run OWL-MINER for this problem to generate OWL classes which

maximise the accuracy of classifying carcinogenic compounds:

./owlminer --config examples/carcinogenesis.xml

Note that the expected runtime for this problem is around 5 to 7 minutes, as
OWL-MINER searches through the huge space of possible classes. Figure 4 plots
the expected trajectory of the OWL-MINER system in solving this problem:
classes of around 65% accuracy are reached quickly, but it takes OWL-MINER
much longer to locate higher quality classes with around 69% to 71% accuracy
(the strongest known results for this problem to date).

3. The resulting OWL classes which describe carcinogenic compounds can be
inspected with Protégé by opening:

owlminer-lab/results/carcinogenesis_Pos.owl

Footnote 1: http://www.cs.ox.ac.uk/activities/machlearn/cancer.html

[Figure 4 plot: x-axis "Concepts searched" (logarithmic, 10^0 to 10^7); y-axis "Accuracy of best concept found" (0.54 to 0.72); series: OWL-MINER]

Figure 4: The trajectory of OWL-MINER in learning classes for the
carcinogenesis problem, plotting the number of classes searched versus the
accuracy of the best performing candidate.

5 Poker

The Poker dataset captures a structured representation of various five-card

hands which are labelled with their corresponding poker hand type, such

as nothing, straight, pair, flush, etc. [1]. The dataset presents a classification

problem where a learning system must infer the definition of each of the

poker hands to the exclusion of all others.

This problem is interesting because it has proven to be very difficult for

many learning systems and methods, primarily because of the scale of the

problem in the number of examples in the dataset, but also because it requires

learned theories to be expressive in order to describe poker hands

with high accuracy. The original dataset consists of a large number of examples

of five-card poker hands, where each example is described only

with the rank and suit of each card, for example:

A K Q J 10 (royal flush)
3 4 5 6 7 (straight flush)
7 7 7 7 3 (four of a kind)
Q Q Q 9 9 (full house)
J 10 8 3 2 (flush)
6 5 4 3 2 (straight)
5 5 5 K 7 (three of a kind)
4 4 K K 3 (two pair)
9 9 10 4 2 (one pair)
K Q 6 7 3 (nothing)

The OWL ontology describing this problem includes background knowledge
which has been added to aid in classification. Specifically, the roles
sameSuit, sameRank and nextRank are used to assert whether cards within an
individual hand have the same suit (e.g., 4 sameSuit 6), the same rank
(e.g., A sameRank A), or the next rank (e.g., 10 nextRank J).
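A minimal sketch of how such assertions could be derived for one hand is given below. The (rank, suit) tuple encoding, the single-letter suit codes, the choice to assert the symmetric roles in both directions, and treating the ace only as the highest rank are all assumptions; the ontology's actual encoding may differ.

```python
from itertools import permutations
from typing import List, Set, Tuple

RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K", "A"]

Card = Tuple[str, str]            # (rank, suit), e.g. ("10", "H")
Triple = Tuple[Card, str, Card]   # (subject card, role name, object card)


def role_assertions(hand: List[Card]) -> Set[Triple]:
    """Derive sameSuit, sameRank and nextRank assertions between the cards of a hand."""
    triples: Set[Triple] = set()
    for a, b in permutations(hand, 2):     # every ordered pair of distinct cards
        if a[1] == b[1]:
            triples.add((a, "sameSuit", b))
        if a[0] == b[0]:
            triples.add((a, "sameRank", b))
        if RANKS.index(b[0]) == RANKS.index(a[0]) + 1:
            triples.add((a, "nextRank", b))   # b is one rank above a
    return triples


royal_flush = [("10", "H"), ("J", "H"), ("Q", "H"), ("K", "H"), ("A", "H")]
for subject, role, obj in sorted(role_assertions(royal_flush)):
    if role == "nextRank":
        print(subject, role, obj)
```

With relations like these asserted, a hand type such as a straight can be described as a chain of nextRank links rather than by enumerating concrete ranks.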

For a single deck of 52 cards, there are 311,875,200 possible ordered hands,
each with a type as listed above. In the ontology, we have selected a random
stratified sample of hands of each type, consisting of a total of 603 example
hands, a tiny fraction of all possibilities.
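The figure of 311,875,200 corresponds to drawing five cards in order without replacement, 52 × 51 × 50 × 49 × 48. The short check below reproduces it and shows how small the 603-hand sample is in comparison; the variable names are illustrative only.

```python
import math

ordered_hands = math.perm(52, 5)     # 52 * 51 * 50 * 49 * 48
unordered_hands = math.comb(52, 5)   # the same hands with card order ignored
sample_fraction = 603 / ordered_hands

print(ordered_hands)
print(unordered_hands)
print(f"{sample_fraction:.2e}")
```

Each unordered hand appears 5! = 120 times in the ordered count, which is why the two totals differ by exactly that factor.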

1. Use Protégé to inspect the data and ontology:

owlminer-lab/examples/data/poker_kb.owl

Note the completeness of assertions of properties such as sameSuit,

sameRank and nextRank for each hand. As we will see, these properties

will give us the language OWL-MINER will need to classify

hands effectively and compactly.

2. Run any of the following to invoke OWL-MINER to solve the respective

problems:

./owlminer --config examples/pokerFourOfAKind.xml

./owlminer --config examples/pokerStraight.xml

./owlminer --config examples/pokerStraightFlush.xml

./owlminer --config examples/pokerRoyalFlush.xml

./owlminer --config examples/pokerTwoPair.xml

The expected runtime for each of these problems is from seconds to

around 5 minutes.

3. For any of the above, inspect the result with Protégé as:

owlminer-lab/results/poker_FourOfAKind.owl

owlminer-lab/results/poker_Straight.owl

owlminer-lab/results/poker_StraightFlush.owl

owlminer-lab/results/poker_RoyalFlush.owl

owlminer-lab/results/poker_TwoPair.owl
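If you would rather queue all five poker problems than start them one at a time, a small driver along the following lines could be used. It only wraps the `./owlminer --config <file>` invocation shown in step 2 and assumes it is run from the directory containing the `owlminer` launcher; the helper names and the `dry_run` switch are illustrative, not part of OWL-MINER.

```python
import subprocess
from typing import Iterable, List

POKER_CONFIGS = [
    "examples/pokerFourOfAKind.xml",
    "examples/pokerStraight.xml",
    "examples/pokerStraightFlush.xml",
    "examples/pokerRoyalFlush.xml",
    "examples/pokerTwoPair.xml",
]


def build_command(config: str) -> List[str]:
    """Argument list for one OWL-MINER run, as invoked in the lab steps."""
    return ["./owlminer", "--config", config]


def run_all(configs: Iterable[str], dry_run: bool = True) -> None:
    """Run the problems one after another; with dry_run, only print the commands."""
    for config in configs:
        command = build_command(config)
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)   # stop at the first failing run


if __name__ == "__main__":
    run_all(POKER_CONFIGS)   # pass dry_run=False to actually launch the runs
```

Running the problems sequentially keeps each run's console output separate, which makes it easier to match the reported tasks to the result files.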

References

[1] Robert Cattral, Franz Oppacher, and Dwight Deugo. “Evolutionary data
mining with automatic rule generalization”. In: Recent Advances in Computers,
Computing and Communications. 2002, pp. 296–300.

[2] Asim Kumar Debnath et al. “Structure-activity relationship of mutagenic
aromatic and heteroaromatic nitro compounds. Correlation with molecular
orbital energies and hydrophobicity”. In: Journal of Medicinal Chemistry 34.2
(Feb. 1991), pp. 786–797. ISSN: 0022-2623. DOI: 10.1021/jm00106a046. URL:
http://dx.doi.org/10.1021/jm00106a046 (visited on 04/26/2016).

[3] Huma Lodhi and Stephen Muggleton. “Is mutagenesis still challenging?” In:
Proceedings of the 15th International Conference on Inductive Logic
Programming, ILP 2005, Late-Breaking Papers. 2005, pp. 35–40.

[4] R. Michalski and J. Larson. “Inductive inference of VL decision rules”.
In: Proceedings of the Workshop in Pattern-Directed Inference Systems
(Published in SIGART Newsletter ACM). 1977, pp. 38–44.

[5] Jeffrey Curtis Schlimmer. “Concept Acquisition Through Representational
Adjustment”. AAI8724747. PhD thesis. University of California, Irvine, 1987.

[6] A. Srinivasan et al. “Carcinogenesis predictions using ILP”. In:
Inductive Logic Programming. Ed. by Nada Lavrač and Sašo Džeroski. Lecture
Notes in Computer Science 1297. DOI: 10.1007/3540635149_56. Springer Berlin
Heidelberg, Sept. 1997, pp. 273–287. ISBN: 978-3-540-63514-7,
978-3-540-69587-5. URL: http://link.springer.com/chapter/10.1007/3540635149_56
(visited on 06/29/2016).

[7] Thomas Taylor. Twelve Edible Mushrooms of the United States, USDA. 1894.
URL: https://commons.wikimedia.org/w/index.php?curid=9098324.

