The Impact of Semantic Handshakes

This site provides all information about the “Impact of Semantic Handshakes”. The term names the following phenomenon: besides the top-down approach of terminological standardisation using ontologies, a bottom-up approach based on distributed, local terminological standardisation decisions also yields excellent results. The advantage of semantic handshakes is that no centralised authority has to establish and defend one universal vocabulary.

Last Change: Monday, 12 June 2006

Author: Lutz Maicher (maicher add informatik dot uni-leipzig dot de)

Table of Contents

Introduction

Software

Download

Javadoc

Experiment Series Configuration

Output Configuration

Questions and Bug Report

Experiment Design

Terminological Specifications

Defining Distributions

Initialisation of a Test

Executing a Merge Roundtrip

Analysing an Experiment Series

Protocols of Experiments Series

Introduction

The impact of semantic handshakes is described and discussed in the paper “The Impact of Semantic Handshakes” (draft available here; submitted to TMRA 2006). Please read this paper before you start using the software. To save space, the paper refers to parts of this website for the implementation details.

This site provides all further information about the experiments, including the software for the simulation experiments. The section Software describes the simulation software, including how to download and customise it. The experiment design is described in the section Experiment Design.

Software

[TODO: describe the purpose of the software]. The simulation software is a Java application.

Download

The simulation software is available as Java source files. Please download the file semantic_handshakes.zip (available here) to get the simulation software. This file contains the package org.semports.handshakes with the following files (no further libraries are needed):

·      Simulation.java (available here) is the main class providing the simulation functionality. Each simulation has to be configured in the Simulation.java file (see section Experiment Series Configuration).

·      Proxy.java (available here) is a class representing a proxy, its identity identifiers and the set of all proxies which have the identical identity.

·      Helper.java (available here) is a class with some helper methods for output purposes.

Javadoc

The full javadoc for the simulation software is available here.

Experiment Series Configuration

An experiment series is configured through the following variables in the source code of Simulation.java. The parameter values used are documented in the protocols of the experiment series:

path (String). Represents the path where the results of the experiment series are saved as expname+".dat" and expname+".plt" (see section Output Configuration).

expname (String). Represents the name of the experiment series. It will be used for output purposes.

header (String). Represents a header for the experiment series. It will be used for output purposes.

xlabel (String). Represents the label for the x-axis in the output diagram. See the gnuplot documentation for more details.

xrange (String). Defines the range of the x-axis in gnuplot notation. See the gnuplot documentation for more details.

nbrOfTests (int). Represents the number of tests of an experiment.

nbrOfMergeRoundtrips (int). Represents the number of merge roundtrips within a test.

it_cardE (boolean). Defines whether iteration over card(E) has to be done in the experiment series.

cardE (int). Represents the number of different proxies which have to be created.

it_nbrOfDifferentII (boolean). Defines whether iteration over nbrOfDifferentII has to be done in the experiment series.

nbrOfDifferentII (int). Represents the number of different identity identifiers which are "known" in the world. If nbrOfDifferentII is 5, only five different identity identifiers can be assigned to the proxies.

it_distribution_nbrOfII (boolean). Defines whether iteration over distributionNbrOfII has to be done in the experiment series.

distributionNbrOfII (double[]). Represents D of the distribution of the number of initial identity identifiers which should be assigned to a proxy. The upper limit b is distributionNbrOfII.length. (See section Defining Distributions.)

it_distribution_II (boolean). Defines whether iteration over distributionII has to be done in the experiment series.

distributionII (double[]). Represents D of the distribution of the values of the identity identifiers of proxies. The upper limit b is nbrOfDifferentII. (See section Defining Distributions.)

differentDistributionII2 (boolean). Defines whether the second identity identifier of a proxy should be assigned according to a different distribution than the first identity identifier.

distributionII2 (double[]). Represents D of the distribution of the values of the second identity identifier. The upper limit b is nbrOfDifferentII. (See section Defining Distributions.)

i_start (int). Represents the starting point of the iteration in executeExperimentSeries.

i_loop (int). Represents the number of steps which have to be done in the iteration in executeExperimentSeries.

i_step (int). Represents the width of a step in the iteration in executeExperimentSeries.

output_testresults (Boolean). Defines whether the results of all tests should be printed to the output files.


The iteration over the parameters (distributionNbrOfII, cardE, nbrOfDifferentII, distributionII) has to be fine-tuned in executeExperimentSeries(). See the examples in the code.
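As a rough illustration of how i_start, i_loop and i_step might interact, the following Java sketch lists the parameter values an experiment series would visit. It assumes that the iterated parameter starts at i_start and is increased by i_step for each of the i_loop experiments. The class and method names are hypothetical and are not taken from Simulation.java.

```java
/** Hypothetical helper, not part of the original Simulation class. */
class IterationSketch {

    /** Returns the values visited by the iteration (assumed reading of i_start, i_loop, i_step). */
    static int[] iteratedValues(int iStart, int iLoop, int iStep) {
        int[] values = new int[iLoop];
        for (int k = 0; k < iLoop; k++) {
            values[k] = iStart + k * iStep;   // one experiment per value
        }
        return values;
    }

    public static void main(String[] args) {
        // Example: iterate nbrOfDifferentII over 10, 15, 20, 25.
        for (int v : iteratedValues(10, 4, 5)) {
            System.out.println("experiment with nbrOfDifferentII = " + v);
        }
    }
}
```

Which parameter receives the iterated value would be decided by the it_* flags; that wiring is left out here because it depends on the actual code of executeExperimentSeries().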

Output Configuration

The output of the experiment series is printed to the console and to two files. These files can be used to plot the result of the experiment series using gnuplot (see the gnuplot website). The location of these files is defined by the variable path. The plotting configuration is written to the file expname+".plt" (see example.plt) and the data for the plot is written to the file expname+".dat" (see example.dat). See the gnuplot documentation for further details about configuring plots.

The content of expname+".plt" is defined in the source code of Simulation(String, String). The content of expname+".dat" is defined in the source code of executeExperimentSeries() and executeTest() (if output_testresults = true).

Questions and Bug Report

If you have any questions concerning the simulation software or if you want to report a bug, please send an email to the contact address given in the header.

Experiment Design

Terminological Specifications

Within this document the following four terms are used:

Experiment Series. An experiment series is a sequence of experiments. Usually, one parameter (for example the number of different identity identifiers which are “known” in the world) iterates over a given range.

Experiment. An experiment is a sequence of tests. Because the setup of a test environment is a stochastic process, the results of an experiment are means of measures observed in a sequence of tests.

Test. A test is one process as described below. According to the given parameters, all proxies are created and identity identifiers are assigned. Within a test a specified number of merge roundtrips is executed.

Merge roundtrip. A merge roundtrip is the following process: for each proxy in E it is decided whether there are other proxies available in E which have to be merged with the given proxy.

We define E as a set of proxies ei which have by definition the identical identity. For example, E might be the set of all available proxies for the type “persons”, or E might be the set of all available proxies for the individual “Johann Sebastian Bach”. Each proxy ei has a unique proxy identifier which is used to refer to this proxy[1]. Additionally, each proxy ei defines its identity by a non-empty set Ii of identity identifiers (strings). The set Ti of a proxy ei consists of the proxy identifiers of all proxies which are known to have the same identity as ei.

Two proxies ei and ej will be considered as equal (identity equality holds) if

(1)  $I_i \cap I_j \neq \emptyset$

If proxy ei is equal to proxy ej, merging will create two proxies ei’ and ej’ in E’:

(2)  $I_i' = I_i \cup I_j, \quad T_i' = T_i \cup T_j$

(3)  $I_j' = I_i \cup I_j, \quad T_j' = T_i \cup T_j$

The premise of the simulation design is that all proxies in E have the same identity. But information systems can only exploit this globally if identity equality is detected between all proxies in E. In terms of the simulation design, this “best of all worlds” situation is reached if Ti of every proxy ei in E is equal to E[2]:

(4)  $\forall e_i \in E: \; \mathrm{card}(T_i) = \mathrm{card}(E)$
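The definitions above can be sketched in Java as follows. This is a minimal sketch under stated assumptions, not the code of Proxy.java: it assumes that identity equality holds when two proxies share at least one identity identifier, and that merging unites the identity identifier sets and the T sets of both proxies. The class ProxySketch and all of its members are hypothetical names, and integers stand in for the identity identifier strings.

```java
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Illustrative stand-in for a proxy; the real Proxy.java may differ. */
class ProxySketch {
    final int id;                                        // proxy identifier
    final Set<Integer> identifiers = new TreeSet<>();    // I_i (numbers stand in for strings)
    final Set<Integer> sameIdentity = new TreeSet<>();   // T_i (proxy identifiers)

    ProxySketch(int id, int... identityIdentifiers) {
        this.id = id;
        for (int ii : identityIdentifiers) {
            identifiers.add(ii);
        }
        sameIdentity.add(id);                            // a proxy is in its own T_i
    }

    /** Assumed rule: identity equality holds if both proxies share an identifier. */
    boolean identityEquals(ProxySketch other) {
        return !Collections.disjoint(identifiers, other.identifiers);
    }

    /** Assumed rule: the merged proxy keeps this id and unites both I and both T sets. */
    ProxySketch mergedWith(ProxySketch other) {
        ProxySketch merged = new ProxySketch(id);
        merged.identifiers.addAll(identifiers);
        merged.identifiers.addAll(other.identifiers);
        merged.sameIdentity.addAll(sameIdentity);
        merged.sameIdentity.addAll(other.sameIdentity);
        return merged;
    }

    /** "Best of all worlds": every T_i has as many members as E. */
    static boolean fullyIntegrated(List<ProxySketch> e) {
        for (ProxySketch p : e) {
            if (p.sameIdentity.size() != e.size()) {
                return false;
            }
        }
        return true;
    }
}
```

For example, a proxy with the identity identifiers {1, 2} and a proxy with {2, 3} would be considered equal under this assumption, and each merged proxy would carry {1, 2, 3}.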

Defining Distributions

Distributions are defined in the experiments as a tuple [D,b]: a sorted set D of numbers between 0 and 1 (the last number has to be 1) and the upper limit b.

Example. The distribution [{0.5, 0.7, 0.9, 1.0}, 100] defines the following lottery: with a probability of 50% a value in [1,25] is drawn, with a probability of 20% a value in [26,50] is drawn, with a probability of 20% a value in [51,75] is drawn and with a probability of 10% a value in [76,100] is drawn.

In general, the lottery can be described as follows. Draw a uniformly distributed double value u in the range [0,1[. Determine the rank r of the first value in D that is larger than u. Determine the size s of an interval by dividing the upper limit b by card(D). Draw a uniformly distributed integer i in [0,s[. The result of the lottery is i+(r-1)*s+1. This lottery is implemented by Simulation.getDistributionValue(D,b).

Note. Adding 1 to the result shifts the range of results from [0,b-1] to [1,b].

Example. Given the distribution definition above, the result is 64 in the following scenario. As uniformly distributed double in the range [0,1[ the value u=0.8 is drawn. The value of r is 3, because 0.9, the third value in D, is the first value larger than 0.8. The size s of an interval is 100/4=25. As uniformly distributed integer i in the range [0,25[ the value 13 is drawn. The result of the lottery is 13+(3-1)*25+1=64.
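The lottery can be sketched in Java as follows. This is not the code of Simulation.getDistributionValue(D,b), only a minimal sketch of the steps described above. The class LotterySketch and its methods are hypothetical, and the sketch assumes that b is at least card(D) and that s is obtained by integer division.

```java
import java.util.Random;

/** Sketch of the lottery described above; not the original Simulation code. */
class LotterySketch {

    /**
     * Deterministic core: maps a given u in [0,1) and offset i in [0,s) to the result.
     * d is the sorted array D (last entry 1.0), b is the upper limit.
     */
    static int value(double[] d, int b, double u, int i) {
        int r = 1;                                // 1-based rank
        while (r < d.length && u >= d[r - 1]) {   // advance to the first entry larger than u
            r++;
        }
        int s = b / d.length;                     // interval size (integer division)
        return i + (r - 1) * s + 1;               // the +1 shifts [0,b-1] to [1,b]
    }

    /** Draws u and i at random and delegates to value(). */
    static int draw(double[] d, int b, Random rnd) {
        int s = b / d.length;
        return value(d, b, rnd.nextDouble(), rnd.nextInt(s));
    }

    public static void main(String[] args) {
        double[] d = {0.5, 0.7, 0.9, 1.0};
        System.out.println(value(d, 100, 0.8, 13));        // the worked example: 64
        System.out.println(draw(d, 100, new Random()));    // a random value in [1,100]
    }
}
```

Splitting the method into a deterministic core and a random wrapper is a design choice of this sketch; it makes the worked example reproducible without fixing a random seed.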

Initialisation of a Test

For a test, E has to be initialised. The variable cardE defines the number of proxies which have to be created.[3] Each proxy is represented by an object of the class Proxy. To each proxy a unique proxy identifier is assigned. This proxy identifier is the value of the variable id. The proxy identifiers are consecutive numbers from 1 to cardE.

The variable distributionNbrOfII defines D of the distribution of the number of identity identifiers each proxy will get. (This variable will be described in prose in the different simulation settings below.) According to this distribution, the number of identity identifiers which have to be assigned to each ei is drawn. For each identity identifier, a value is drawn according to the variable distributionII. This variable defines D of the distribution of the values which the identity identifiers take. The upper limit of this distribution is nbrOfDifferentII.

Example. The distribution for the values of the identity identifiers is defined as [{0.8, 1.0}, 6]. This is equivalent to the following lottery: with a probability of 80% an identity identifier gets the value 1, 2 or 3, and with a probability of 20% it gets the value 4, 5 or 6. This means that half of the six possible identity identifiers are widely used and the other half are rarely used.
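The initialisation can be sketched in Java as follows. The sketch is an illustration under assumptions, not the original code: the class InitSketch and its methods are hypothetical, each proxy is reduced to its set of identity identifier values, and values drawn twice for the same proxy simply collapse into one set entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the initialisation of E; names and structure are assumptions. */
class InitSketch {

    /** Lottery over the distribution [d, b] as described in Defining Distributions. */
    static int lottery(double[] d, int b, Random rnd) {
        double u = rnd.nextDouble();
        int r = 1;
        while (r < d.length && u >= d[r - 1]) {
            r++;                                   // rank of the first entry larger than u
        }
        int s = b / d.length;                      // interval size
        return rnd.nextInt(s) + (r - 1) * s + 1;
    }

    /** Entry k of the result holds the identity identifiers of the proxy with id k+1. */
    static List<Set<Integer>> initialise(int cardE, double[] distributionNbrOfII,
                                         double[] distributionII, int nbrOfDifferentII,
                                         Random rnd) {
        List<Set<Integer>> e = new ArrayList<>();
        for (int id = 1; id <= cardE; id++) {
            // number of identity identifiers; the upper limit is the array length
            int n = lottery(distributionNbrOfII, distributionNbrOfII.length, rnd);
            Set<Integer> identifiers = new TreeSet<>();
            for (int k = 0; k < n; k++) {
                identifiers.add(lottery(distributionII, nbrOfDifferentII, rnd));
            }
            e.add(identifiers);
        }
        return e;
    }

    public static void main(String[] args) {
        // 50% of the proxies get one identifier, 50% get two; values as in the example above
        List<Set<Integer>> e = initialise(5, new double[] {0.5, 1.0},
                                          new double[] {0.8, 1.0}, 6, new Random());
        for (int k = 0; k < e.size(); k++) {
            System.out.println("proxy " + (k + 1) + ": " + e.get(k));
        }
    }
}
```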

Executing a Merge Roundtrip

A test is a sequence of merge roundtrips (the number of merge roundtrips is defined by nbrOfMergeRoundtrips). In a merge roundtrip, identity equality between each proxy ei in E and all other proxies in E is decided according to (1). If identity equality holds, ei’ and ej’ are created in E’ according to (2) and (3).

At the end of the merge roundtrip, every ei in E which has a counterpart in E’ is replaced by this ei’.[4]
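A merge roundtrip can be sketched in Java as follows. This is a minimal sketch and not the original implementation. It assumes that identity equality means sharing at least one identity identifier, that merging unites the identifier sets and the T sets, and that a proxy which is equal to several other proxies accumulates all of them in its counterpart in E’. All names (RoundtripSketch, P, proxy, roundtrip) are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of one merge roundtrip over E; names and merge rule are assumptions. */
class RoundtripSketch {

    /** Minimal proxy: identity identifiers (ii) and known equal proxies (t). */
    static final class P {
        final Set<Integer> ii = new HashSet<>();
        final Set<Integer> t = new HashSet<>();
    }

    static P proxy(int id, int... identifiers) {
        P p = new P();
        p.t.add(id);                               // a proxy is in its own T
        for (int x : identifiers) {
            p.ii.add(x);
        }
        return p;
    }

    /** Returns the state of E after one roundtrip; the input list is not modified. */
    static List<P> roundtrip(List<P> e) {
        List<P> next = new ArrayList<>();          // plays the role of E'
        for (P p : e) {
            P copy = new P();
            copy.ii.addAll(p.ii);
            copy.t.addAll(p.t);
            next.add(copy);
        }
        for (int i = 0; i < e.size(); i++) {
            for (int j = 0; j < e.size(); j++) {
                // equality is decided on the old state E, not on E'
                if (i == j || Collections.disjoint(e.get(i).ii, e.get(j).ii)) {
                    continue;
                }
                // e_i' receives the union; e_j' is handled when the loop reaches (j, i)
                next.get(i).ii.addAll(e.get(j).ii);
                next.get(i).t.addAll(e.get(j).t);
            }
        }
        return next;
    }
}
```

As a usage example, take four proxies with the identity identifiers {1}, {1,2}, {2,3} and {3}. After one roundtrip of this sketch the first proxy knows only itself and the second proxy; after a second roundtrip every proxy knows all four, because the proxies with two identifiers act as handshakes between otherwise unconnected identifiers.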

Analysing an Experiment Series

To get statistically valid measures, each experiment is a sequence of tests with the same instantiation parameters. This is necessary due to the stochastic nature of the initialisation process. The number of tests in an experiment is defined by nbrOfTests.

To compare the results of different experiments within an experiment series, different measures have to be calculated. These measures specify the size and nature of the “integration clouds” which emerge after the tests. An “integration cloud” is a set of proxies between which identity equality is detected. The global integration point is reached if there exists only one “integration cloud”, with the size card(E).

After each test the following measures are calculated:

Mean of card(Ti). This measure depicts the average size of an “integration cloud” in E after a test. Formally, it is the average cardinality of Ti of all ei in E. The algorithm is implemented in Simulation.getAverageCardT().

Note. This measure favours large integration clouds, because the size of a cloud is counted once for every Ti that belongs to the cloud. Given three clouds in a set of 100 proxies (one of size 98 and two of size 1), the mean of card(T) is (98*98 + 1 + 1)/100 = 96.06, which reflects the high integration density.

Number of cluster(E). This measure depicts the number of different clouds in E. Formally, it is the maximal number of Ti in E which have pairwise empty intersections. The algorithm is implemented in Simulation.getNbrOfClouds().

To evaluate an experiment with respect to the first measure, the mean of all tests’ mean of card(Ti) is the appropriate measure.

To evaluate an experiment with respect to the second measure, the mean of all tests’ number of cluster(E) is the appropriate measure.

Within an experiment series, these measures for experiments with different parameters are compared.
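Both measures can be sketched in Java as follows. This is not the code of Simulation.getAverageCardT() or Simulation.getNbrOfClouds(). The class MeasuresSketch is hypothetical, and counting the clouds as connected groups of proxies (via union-find) is a simplification: it agrees with the formal definition once every Ti equals its whole cloud, but it may report fewer clouds while the T sets of a cloud still only partially overlap.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Sketch of the two measures; entry k of the list is T of the proxy with id k+1. */
class MeasuresSketch {

    /** Mean of card(T_i) over all proxies in E. */
    static double meanCardT(List<Set<Integer>> t) {
        long sum = 0;
        for (Set<Integer> ti : t) {
            sum += ti.size();
        }
        return (double) sum / t.size();
    }

    /** Number of clouds, counted as connected groups of proxies (simplification). */
    static int numberOfClouds(List<Set<Integer>> t) {
        int n = t.size();
        int[] parent = new int[n + 1];             // union-find over proxy ids 1..n
        for (int id = 1; id <= n; id++) {
            parent[id] = id;
        }
        for (int id = 1; id <= n; id++) {
            for (int other : t.get(id - 1)) {
                parent[find(parent, id)] = find(parent, other);
            }
        }
        int clouds = 0;
        for (int id = 1; id <= n; id++) {
            if (find(parent, id) == id) {
                clouds++;                          // each root represents one cloud
            }
        }
        return clouds;
    }

    private static int find(int[] parent, int x) {
        while (parent[x] != x) {
            parent[x] = parent[parent[x]];         // path halving
            x = parent[x];
        }
        return x;
    }

    public static void main(String[] args) {
        // The example from the note: one cloud of size 98 and two clouds of size 1.
        List<Set<Integer>> t = new ArrayList<>();
        Set<Integer> big = new TreeSet<>();
        for (int id = 1; id <= 98; id++) {
            big.add(id);
        }
        for (int id = 1; id <= 98; id++) {
            t.add(big);
        }
        t.add(new TreeSet<>(List.of(99)));
        t.add(new TreeSet<>(List.of(100)));
        System.out.println(meanCardT(t) + " " + numberOfClouds(t));
    }
}
```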

Protocols of Experiments Series

Protocols for the following experiment series are available:

exp01

exp02

exp03

exp04

exp05

exp06

exp07

exp08

[TODO] licence for source code



[1] For clarity, the value of the index i will be the value of the proxy identifier. For example, eid1 is the proxy with proxy identifier id1. The same holds for all indexed variables, like Ii and Ti.

[2] This holds iff the proxy ei is in its own Ti (otherwise Ti always contains at most one element fewer than E). Because Ti consists only of elements of E, it is sufficient to compare the cardinalities of both sets.

[3] Experiments have shown that cardE partially influences the result. If cardE is below a threshold, both measures card(T) and cluster(E) change when cardE changes. If cardE exceeds this threshold, neither value is influenced by changes of cardE. Experiments have shown that this threshold depends at least on distributionNbrOfII. In all cases, however, the threshold is below or near card(E)=100.

[4] More conveniently, each proxy ei and its counterpart ei’ can be merged into E.