The Impact of Semantic Handshakes
This site provides all information about the
“Impact of Semantic Handshakes”. The term names the phenomenon that,
alongside the top-down approach of terminological standardisation using ontologies,
a bottom-up approach using distributed, local terminological standardisation
decisions yields excellent results. The advantage of semantic handshakes
is that no centralised authority has to establish and defend one universal
vocabulary.
Last Change: Monday, 12 June 2006
Author: Lutz Maicher (maicher add informatik
dot uni-leipzig dot de)
The impact of semantic handshakes is
described and discussed in the TMRA 2006 paper “The Impact of Semantic
Handshakes” (draft available here; submitted to TMRA 2006).
Please read this paper before you start using the software. The paper
refers to parts of this website, so that all implementation details are
documented while space is saved in the paper.
This site provides all further information
about the experiments, including the software for the simulation experiments.
The section Software describes the simulation software, including how to
download and customise it. The Experiment Design is
described in its own section.
[TODO: describe
the purpose of the software]. The simulation software is a Java application.
The simulation software is available as
jar-files. Please download the file semantic_handshakes.zip
(available here) to get the
simulation software. This file contains the package org.semports.handshakes
with the following jar-files (No further lib-files are needed):
· Simulation.java (available here) is the main class providing the simulation
functionality. Each simulation has to be configured in the Simulation.java file
(see section Experiment Series
Configuration).
· Proxy.java (available here) is a class representing a proxy, its
identity identifiers and the set of all proxies which have the identical
identity.
· Helper.java (available here) is a class with some helper methods for
output purposes.
The full javadoc for the simulation software
is available here.
An experiment series can be configured by the
following variables in the source code of Simulation.java. The parameter
values used are documented in the protocols of the experiment series:
| Variable | Type | Description |
| path | String | Represents the path where the results of the experiment series are saved as expname+".dat" and expname+".plt" (see section Output Configuration). |
| expname | String | Represents the name of the experiment series. It is used for output purposes. |
| header | String | Represents a header for the experiment series. It is used for output purposes. |
| xlabel | String | Represents the label for the x-axis in the output diagram. See the gnuplot documentation for more details. |
| xrange | String | Defines the range of the x-axis in gnuplot notation. See the gnuplot documentation for more details. |
| nbrOfTests | int | Represents the number of tests of an experiment. |
| nbrOfMergeRoundtrips | int | Represents the number of merge roundtrips within a test. |
| it_cardE | boolean | Defines whether iteration over card(E) has to be done in the experiment series. |
| cardE | int | Represents the number of different Proxies which have to be created. |
| it_nbrOfDifferentII | boolean | Defines whether iteration over nbrOfDifferentII has to be done in the experiment series. |
| nbrOfDifferentII | int | Represents the number of different Identity Identifiers which are "known" in the world. If nbrOfDifferentII is 5, only five different Identity Identifiers can be assigned to the Proxies. |
| it_distribution_nbrOfII | boolean | Defines whether iteration over distributionNbrOfII has to be done in the experiment series. |
| distributionNbrOfII | double[] | Represents D of the distribution of the number of initial Identity Identifiers which should be assigned to a proxy. The upper limit b is distributionNbrOfII.length. (See section Defining Distributions.) |
| it_distribution_II | boolean | Defines whether iteration over distributionII has to be done in the experiment series. |
| distributionII | double[] | Represents D of the distribution of the values of the Identity Identifiers of proxies. The upper limit b is nbrOfDifferentII. (See section Defining Distributions.) |
| differentDistributionII2 | boolean | Defines whether the second Identity Identifier of a proxy should be assigned according to a different distribution than the first Identity Identifier. |
| distributionII2 | double[] | Represents D of the distribution of the values of the second Identity Identifier. The upper limit b is nbrOfDifferentII. (See section Defining Distributions.) |
| i_start | int | Represents the starting point of the iteration in executeExperimentSeries. |
| i_loop | int | Represents the number of steps which have to be done in the iteration in executeExperimentSeries. |
| i_step | int | Represents the width of a step in the iteration in executeExperimentSeries. |
| output_testresults | Boolean | Defines whether the results of all tests should be printed to the output files. |
The iteration over the parameters (distributionNbrOfII,
cardE, nbrOfDifferentII, distributionII) has to
be fine-tuned in executeExperimentSeries(). See the examples in the code.
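As a rough illustration of how i_start, i_loop and i_step might interact with one of the it_* switches, the following minimal sketch enumerates the values an iterated parameter would take. It assumes that the iteration visits i_start, i_start + i_step, and so on for i_loop steps; the class IterationSketch, the method iteratedValues and the concrete numbers are hypothetical and are not taken from Simulation.java.

```java
/** Sketch only: one possible reading of i_start, i_loop and i_step. */
final class IterationSketch {

    /** Returns iLoop values: iStart, iStart + iStep, iStart + 2 * iStep, ... */
    static int[] iteratedValues(int iStart, int iLoop, int iStep) {
        int[] values = new int[iLoop];
        for (int k = 0; k < iLoop; k++) {
            values[k] = iStart + k * iStep;
        }
        return values;
    }

    public static void main(String[] args) {
        // Hypothetical settings: iterate nbrOfDifferentII over 10, 20, ..., 50.
        boolean it_nbrOfDifferentII = true;
        int nbrOfDifferentII = 10; // used unchanged when the switch is false
        int[] values = it_nbrOfDifferentII
                ? iteratedValues(10, 5, 10)
                : new int[] {nbrOfDifferentII};
        for (int value : values) {
            // One experiment would be executed per value.
            System.out.println("experiment with nbrOfDifferentII = " + value);
        }
    }
}
```

Under this reading, each printed value corresponds to one experiment of the series and to one point on the x-axis of the resulting plot.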
The output of the experiment series is
printed to the console and to two files. These files can be used to
plot the result of the experiment series using gnuplot (see the gnuplot website). The location of these files
is defined by the variable path. The plotting configuration is written to
the file expname+".plt" (see example.plt)
and the data for the plot is written to the file expname+".dat" (see example.dat). See the gnuplot documentation for
further details about configuring gnuplot.
The content of expname+".plt" is defined in
the source code of Simulation(String, String). The content of expname+".dat" is
defined in the source code of executeExperimentSeries() and, if output_testresults
= true, executeTest().
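To make the relation between the two files concrete, here is a minimal sketch that writes a data file and a matching gnuplot script. It is not the code from Simulation.java: the class GnuplotOutputSketch, the two-column data layout, the plot title and the sample numbers are assumptions made for illustration. Only standard gnuplot commands (set title, set xlabel, set xrange, plot ... using) are emitted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Sketch only: writes expname.dat and expname.plt for gnuplot. */
final class GnuplotOutputSketch {

    /** Writes both files into dir and returns the path of the .plt file. */
    static Path write(Path dir, String expname, String header, String xlabel,
                      String xrange, double[][] rows) throws IOException {
        List<String> dat = new ArrayList<>();
        dat.add("# " + header);                 // gnuplot skips lines starting with '#'
        for (double[] row : rows) {
            dat.add(row[0] + " " + row[1]);     // assumed columns: x value, measure
        }
        Files.write(dir.resolve(expname + ".dat"), dat);

        List<String> plt = new ArrayList<>();
        plt.add("set title \"" + header + "\"");
        plt.add("set xlabel \"" + xlabel + "\"");
        plt.add("set xrange " + xrange);        // gnuplot notation, e.g. [0:30]
        plt.add("plot \"" + expname + ".dat\" using 1:2 with linespoints"
                + " title \"mean of card(T)\"");
        Path pltFile = dir.resolve(expname + ".plt");
        Files.write(pltFile, plt);
        return pltFile;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("handshakes");
        double[][] rows = {{10, 42.5}, {20, 17.0}}; // made-up numbers, not real results
        System.out.println(write(dir, "example", "Example series",
                "nbrOfDifferentII", "[0:30]", rows));
    }
}
```

The script refers to the data file by its bare name, so gnuplot would have to be started in the directory that contains both files, for example with `gnuplot -persist example.plt`.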
If you have any questions concerning the
simulation software or if you want to report a bug, please send an email to the
contact address given in the header.
Within this document the following four terms
are used:
Experiment Series. An experiment series is a sequence of experiments. Usually, one parameter (for example, the number of different identity identifiers which are “known” in the world) is iterated over a given range.
Experiment. An experiment is a sequence of tests. Because the setup of a test environment is a stochastic process, the results of an experiment are means of measures observed in a sequence of tests.
Test. A test is one process as described below. According to the given parameters, all proxies are created and identity identifiers assigned. Within a test a specified number of merge roundtrips is executed.
Merge roundtrip. A merge roundtrip is the following process: for each proxy in E it is decided whether there are other proxies available in E which have to be merged with the given proxy.
We will define E
as a set of proxies ei
which have by definition the identical identity. For example, E might be the set of all available proxies for the type
“persons” or E might be the set of all available proxies for the individual
“Johann Sebastian Bach”. Each proxy ei
has a unique proxy identifier which is used to refer to this proxy[1]. Additionally, each proxy ei defines its identity by a non-empty set Ii of identity identifiers (strings). The set Ti of a proxy ei
consists of the proxy identifiers of all proxies for which the same identity
as ei has been detected.
Two proxies ei
and ej will be considered as
equal (identity equality holds) if
If proxy ei
is equal to proxy ej,
merging will create two proxies ei’
and ej’ in E’:
The premise of the simulation design is that
all proxies in E have the same identity. But
information systems can only exploit this globally if identity
equality is detected between all proxies in E. In terms of
the simulation design, this “best of all worlds” situation is reached if Ti of all proxies ei
in E is equal to E[2]:
Distributions
are defined in the experiments as a tuple [D,b]: a sorted
set D of numbers between 0 and 1 (the last
number has to be 1) and the upper limit b. The numbers in D act as cumulative
probabilities, and the range [1,b] is split into card(D) intervals of equal size.
Example. The distribution [{0.5, 0.7, 0.9, 1.0}, 100] defines the following lottery: with a probability of 50% a value in [1,25] is drawn, with a probability of 20% a value in [26,50] is drawn, with a probability of 20% a value in [51,75] is drawn and with a probability of 10% a value in [76,100] is drawn.
In general,
the lottery can be described as follows. Draw a uniformly distributed double
value u in the range [0,1[. Determine the rank
r of the first value in D that is bigger than u.
Determine the size s of an interval by dividing the upper limit b by card(D).
Draw a uniformly distributed integer i in [0,s[. The
result of the lottery is i+(r-1)*s+1.
This lottery is implemented by Simulation.getDistributionValue(D,b).
Note. Adding 1 to the result shifts the domain from [0,b-1] to [1,b].
Example. Given the distribution definition above, the result is 64 in the following scenario. The value u=0.8 is drawn as the uniformly distributed double in the range [0,1[. The value of r is 3, because 0.9 is the first value in D that is bigger than 0.8. The size s of an interval is 100/4=25. The value 13 is drawn as the uniformly distributed integer i in the range [0,25[. The result of the lottery is 13+(3-1)*25+1=64.
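The lottery steps can be written down in a few lines of Java. The following is a minimal sketch and not the source of Simulation.getDistributionValue(D,b): the class DistributionLottery and its methods valueFor and draw are hypothetical names, and the sketch assumes that s is computed by integer division and that b is at least card(D). Passing u and i in as arguments makes the worked example easy to reproduce.

```java
import java.util.Random;

/** Sketch only: the lottery over a distribution [D, b]. */
final class DistributionLottery {

    /**
     * Applies the lottery formula to two draws that are passed in.
     *
     * @param d sorted cumulative probabilities, last entry 1.0
     * @param b upper limit of the result range [1, b]
     * @param u uniform draw in [0, 1[
     * @param i uniform integer draw in [0, s[ with s = b / d.length
     */
    static int valueFor(double[] d, int b, double u, int i) {
        int s = b / d.length;       // interval size (integer division, an assumption)
        int r = 1;                  // 1-based rank
        while (r < d.length && u >= d[r - 1]) {
            r++;                    // advance until d[r - 1] is bigger than u
        }
        return i + (r - 1) * s + 1; // +1 shifts [0, b-1] to [1, b]
    }

    /** Draws u and i from the given random source and applies valueFor. */
    static int draw(double[] d, int b, Random rnd) {
        int s = b / d.length;
        return valueFor(d, b, rnd.nextDouble(), rnd.nextInt(s));
    }

    public static void main(String[] args) {
        double[] d = {0.5, 0.7, 0.9, 1.0};
        System.out.println(valueFor(d, 100, 0.8, 13)); // worked example: prints 64
        System.out.println(draw(d, 100, new Random())); // some value in [1, 100]
    }
}
```

With integer division, values above s*card(D) are never drawn when b is not a multiple of card(D); whether the original implementation behaves the same way is not stated here.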
For a test, E
has to be initialised. The variable cardE defines the number of proxies which
have to be created.[3] Each proxy is represented by an object of the
class Proxy. A unique proxy identifier is assigned to each proxy. This proxy
identifier is the value of the variable id. The proxy identifiers are increasing numbers,
running from 1 to cardE.
The variable distributionNbrOfII defines D of the distribution of
the number of identity identifiers each
proxy will get. (This variable is described in prose in the different
simulation settings below.) According to this distribution, the number of
identity identifiers which have to be assigned to each ei
is calculated. For each identity identifier a value is drawn according to the variable distributionII. This variable defines
the distribution of the numeric values which are assigned to the identity identifiers. The
upper limit of this distribution is nbrOfDifferentII.
Example. The distribution for the values of the identity identifiers is defined as [{0.8, 1.0}, 6]. This is equivalent to the following lottery: with a probability of 80% an identity identifier gets the value 1, 2 or 3, and with a probability of 20% it gets the value 4, 5 or 6. This means that half of the six possible identity identifiers are widely used and the other half are rarely used.
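The initialisation of E can be sketched as follows. This is an illustration under stated assumptions, not the original code: the classes InitSketch and SketchProxy and the methods lottery and initialise are hypothetical, identity identifiers are modelled as integers rather than strings, and duplicate draws for one proxy are assumed to collapse into a single identifier.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: creating cardE proxies with randomly drawn identity identifiers. */
final class InitSketch {

    /** Hypothetical stand-in for the Proxy class. */
    static final class SketchProxy {
        final int id;                                          // proxy identifier
        final Set<Integer> identityIdentifiers = new TreeSet<>();
        SketchProxy(int id) { this.id = id; }
    }

    /** Lottery over [D, b] as described in the section on distributions. */
    static int lottery(double[] d, int b, Random rnd) {
        int s = b / d.length;
        double u = rnd.nextDouble();
        int r = 1;
        while (r < d.length && u >= d[r - 1]) {
            r++;
        }
        return rnd.nextInt(s) + (r - 1) * s + 1;
    }

    static List<SketchProxy> initialise(int cardE, double[] distributionNbrOfII,
            double[] distributionII, int nbrOfDifferentII, Random rnd) {
        List<SketchProxy> e = new ArrayList<>();
        for (int id = 1; id <= cardE; id++) {                  // identifiers 1..cardE
            SketchProxy p = new SketchProxy(id);
            // The upper limit b for the number of identifiers is the array length.
            int n = lottery(distributionNbrOfII, distributionNbrOfII.length, rnd);
            for (int k = 0; k < n; k++) {
                // A repeated value is stored once (assumption of this sketch).
                p.identityIdentifiers.add(lottery(distributionII, nbrOfDifferentII, rnd));
            }
            e.add(p);
        }
        return e;
    }

    public static void main(String[] args) {
        List<SketchProxy> e = initialise(5, new double[] {0.5, 1.0},
                new double[] {0.8, 1.0}, 6, new Random());
        for (SketchProxy p : e) {
            System.out.println(p.id + " -> " + p.identityIdentifiers);
        }
    }
}
```

In this call every proxy receives one or two identity identifiers with equal probability, and the values follow the [{0.8, 1.0}, 6] example above.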
A test
is a sequence of merge roundtrips (the number of merge roundtrips is defined by
nbrOfMergeRoundtrips). In a merge roundtrip, for each proxy ei in E identity equality
to all other proxies in E is
decided according to (1). If identity equality holds, ei’ and ej’ will be created in E’ according to (2) and (3).
At the end of the merge roundtrip all ei in E which have
a counterpart in E’ will be replaced by this ei’.[4]
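The following sketch shows one way a merge roundtrip could look in code. Because formulas (1) to (3) are not reproduced in this text, the sketch rests on two loudly stated assumptions: identity equality holds when two proxies share at least one identity identifier, and merging gives a proxy the union of both identifier sets and of both T sets. The classes MergeRoundtripSketch and SketchProxy and all method names are hypothetical, identifiers are modelled as integers, and each T set is assumed to start with the proxy's own identifier.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: one merge roundtrip under the assumptions named in the lead-in. */
final class MergeRoundtripSketch {

    static final class SketchProxy {
        final int id;
        final Set<Integer> identifiers; // I_i: identity identifiers
        final Set<Integer> same;        // T_i: proxies detected as identical
        SketchProxy(int id, Set<Integer> identifiers, Set<Integer> same) {
            this.id = id;
            this.identifiers = identifiers;
            this.same = same;
        }
    }

    static SketchProxy proxy(int id, Integer... identifiers) {
        return new SketchProxy(id, new TreeSet<Integer>(Arrays.asList(identifiers)),
                new TreeSet<Integer>(Collections.singleton(id)));
    }

    static Map<Integer, SketchProxy> world(SketchProxy... proxies) {
        Map<Integer, SketchProxy> e = new LinkedHashMap<>();
        for (SketchProxy p : proxies) {
            e.put(p.id, p);
        }
        return e;
    }

    /** Builds E' from the state of E at the start of the roundtrip. */
    static Map<Integer, SketchProxy> roundtrip(Map<Integer, SketchProxy> e) {
        Map<Integer, SketchProxy> ePrime = new LinkedHashMap<>();
        for (SketchProxy pi : e.values()) {
            Set<Integer> identifiers = new TreeSet<>(pi.identifiers);
            Set<Integer> same = new TreeSet<>(pi.same);
            for (SketchProxy pj : e.values()) {
                // Assumed equality rule: at least one shared identity identifier.
                if (pj.id != pi.id && !Collections.disjoint(pi.identifiers, pj.identifiers)) {
                    identifiers.addAll(pj.identifiers); // assumed merge rule: unions
                    same.addAll(pj.same);
                }
            }
            ePrime.put(pi.id, new SketchProxy(pi.id, identifiers, same));
        }
        return ePrime; // replaces E for the next roundtrip
    }

    public static void main(String[] args) {
        // Proxy 2 shares one identifier with proxy 1 and another with proxy 3.
        Map<Integer, SketchProxy> e = world(proxy(1, 1), proxy(2, 1, 2), proxy(3, 2));
        e = roundtrip(e);
        System.out.println(e.get(1).same); // prints [1, 2]
        e = roundtrip(e);
        System.out.println(e.get(1).same); // prints [1, 2, 3]
    }
}
```

In the usage example proxies 1 and 3 have no identifier in common, yet after two roundtrips all three proxies know each other, because proxy 2 passes identifiers on in both directions.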
To get statistically valid measures, each
experiment is a sequence of tests with the same instantiation parameters. This
is necessary due to the stochastic nature of the initialisation process. The
number of tests in an experiment is defined by nbrOfTests.
To compare the results of different
experiments within an experiment series, different measures have to be
calculated. These measures specify the size and nature of the “integration
clouds” which emerge after the tests. An “integration cloud” is a set of
proxies where identity equality is detected. The global integration point is
reached if there exists only one “integration cloud” with the size card(E).
After each test the following measures are
calculated:
Mean of card(Ti). This measure depicts the average size of an “integration cloud” in E after a test. Formally, it is the average cardinality of Ti of all ei in E. The algorithm is implemented in Simulation.getAverageCardT().
Note. This measure favours large integration clouds because the size of a cloud is counted once for every Ti that is a member of the cloud. Given three clouds (one of size 98 and two of size 1), the mean of card(T) is (98*98+1+1)/100 = 96.06, which reflects the high integration density.
Number of cluster(E). This measure depicts the number of different clouds in E. Formally, it is the maximal number of Ti in E which have pairwise empty intersections. The algorithm is implemented in Simulation.getNbrOfClouds().
To evaluate an experiment, the appropriate measures are the mean over all
tests of the mean of card(Ti) and the mean over all
tests of the number of cluster(E).
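Both measures can be sketched in Java as follows. This is not the code of Simulation.getAverageCardT() or Simulation.getNbrOfClouds(): the class MeasuresSketch and its methods are hypothetical, the T sets are passed in as a map from proxy identifier to set of proxy identifiers, and clouds are counted as groups of proxies connected through their T sets. That count is assumed to coincide with the formal definition once the T sets are either identical or disjoint.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Sketch only: mean of card(T_i) and number of clouds after a test. */
final class MeasuresSketch {

    static double meanCardT(Map<Integer, Set<Integer>> t) {
        long sum = 0;
        for (Set<Integer> ti : t.values()) {
            sum += ti.size();
        }
        return (double) sum / t.size();
    }

    /** Counts groups of proxies that are linked through their T sets. */
    static int nbrOfClouds(Map<Integer, Set<Integer>> t) {
        Map<Integer, Integer> parent = new HashMap<>();
        for (Integer id : t.keySet()) {
            parent.put(id, id);               // every proxy starts as its own group
        }
        for (Map.Entry<Integer, Set<Integer>> entry : t.entrySet()) {
            for (Integer other : entry.getValue()) {
                int a = find(parent, entry.getKey());
                int b = find(parent, other);  // assumes T only holds elements of E
                if (a != b) {
                    parent.put(a, b);         // join the two groups
                }
            }
        }
        Set<Integer> roots = new HashSet<>();
        for (Integer id : t.keySet()) {
            roots.add(find(parent, id));
        }
        return roots.size();
    }

    private static int find(Map<Integer, Integer> parent, int x) {
        while (parent.get(x) != x) {
            x = parent.get(x);
        }
        return x;
    }

    /** The example from the note: one cloud of size 98 and two of size 1. */
    static Map<Integer, Set<Integer>> example() {
        Set<Integer> big = new TreeSet<>();
        for (int id = 1; id <= 98; id++) {
            big.add(id);
        }
        Map<Integer, Set<Integer>> t = new HashMap<>();
        for (int id = 1; id <= 98; id++) {
            t.put(id, big);
        }
        t.put(99, new TreeSet<>(java.util.Collections.singleton(99)));
        t.put(100, new TreeSet<>(java.util.Collections.singleton(100)));
        return t;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> t = example();
        System.out.println(meanCardT(t));   // (98*98 + 1 + 1) / 100
        System.out.println(nbrOfClouds(t)); // prints 3
    }
}
```

The value for an experiment would then be the plain average of these per-test results over nbrOfTests tests.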
Within an experiment series, these measures
for experiments with different parameters are compared.
Protocols for the following experiment series
are available:
[1] For clarity, the value of the index i will be
the value of the proxy identifier. For example, eid1 is the proxy
with proxy identifier id1. The same
holds for all indexed variables, like Ii and Ti.
[2] This holds iff the proxy ei
is in its own Ti (otherwise Ti always contains at least one entity
fewer than E). Because Ti
consists only of elements of E, the comparison
of the cardinality of both sets is allowed.
[3] Experiments have shown that cardE partially influences the result. If cardE is less than a threshold, both measures card(T) and cluster(E) change if cardE changes. If cardE exceeds this threshold, both values are not influenced by changes of cardE. Experiments have shown that this threshold depends at least on distributionNbrOfII. But in all cases, the threshold is below or near card(E)=100.
[4] Put more simply, each proxy ei and its counterpart ei’
are merged in E.