Entity Resolution and Clustering Challenge

FAst Multi-source Entity Resolution system (FAMER)

Database group of the University of Leipzig

Entity Clustering Challenge

We provide a dataset of product specification data. It is an adapted subset of the DEXTER dataset [1] and contains product information that was extracted from different websites, e.g. online shops. Your task is to use FAMER (or any other ER system of your choice) to perform entity resolution and entity clustering on this dataset.

Your result should consist of groups (clusters) of matching product entities from different sources (websites) that represent the same product. All data, tools and necessary documentation can be found online at [2].

The challenge runs in two stages:

Stage 1:

You need to find a good configuration of the FAMER system, which is provided as a command-line tool, for the given data consisting of 20,000 entities from 213 sources (entitiesCompactBig.json). Some entities carry a gold-standard cluster ID. Entities with the same ID should end up in the same cluster.

The configuration options of FAMER are described in detail in the provided documentation (documentation.pdf).

To evaluate your result, we provide gold-standard information for 20% of the matches in our dataset. It describes which entities should be clustered together. The clusters produced with your configuration are compared to the true clusters, and precision, recall and F-measure are calculated. This evaluation is done automatically, and the result is printed when the Famer-runner has finished the clustering.
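For participants who use other tools and want a comparable local check, the following Python sketch computes pairwise precision, recall and F-measure against a partial gold standard. It is only an illustration: the exact measure computed by the Famer-runner is not specified here, so the pairwise definition, the function names and the toy data are all assumptions.

```python
from itertools import combinations


def pairs_from_clusters(assignment):
    """Return the set of unordered entity pairs that share a cluster.

    `assignment` maps entity ID -> cluster ID.
    """
    clusters = {}
    for entity_id, cluster_id in assignment.items():
        clusters.setdefault(cluster_id, []).append(entity_id)
    pairs = set()
    for members in clusters.values():
        # Sorting makes each pair canonical, so (a, b) and (b, a) coincide.
        pairs.update(combinations(sorted(members), 2))
    return pairs


def pairwise_scores(predicted, gold):
    """Pairwise precision, recall and F-measure (assumed definition).

    Only entities that carry a gold cluster ID are considered, because
    nothing is known about the remaining entities.
    """
    labelled = {e: c for e, c in predicted.items() if e in gold}
    predicted_pairs = pairs_from_clusters(labelled)
    gold_pairs = pairs_from_clusters(gold)
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical toy data: five labelled entities, one unlabelled ("f").
gold = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
predicted = {"a": "x", "b": "x", "c": "y", "d": "y", "e": "z", "f": "z"}
precision, recall, f_measure = pairwise_scores(predicted, gold)
print(precision, recall, f_measure)
```

In the toy data, only the pair (a, b) is both predicted and correct, out of two predicted pairs and four gold pairs among the labelled entities.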

You can also use other tools for the task. The partial gold-standard matches, together with about the same number of non-matches, can be used to train machine learning methods.

The Famer-runner also generates a CSV file containing a list of (entityID,clusterID) pairs.
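If you produce your clusters with another tool, a file of (entityID,clusterID) pairs can be written along the lines of the Python sketch below. The details are assumptions made for illustration (comma delimiter, no header row, one pair per line, hypothetical entity IDs); please compare with a file actually generated by the Famer-runner before submitting.

```python
import csv
import io


def write_cluster_csv(assignment, out):
    """Write one "entityID,clusterID" row per entity to the file object `out`.

    Assumed layout: comma-separated, no header row. Check this against a
    CSV file produced by the Famer-runner.
    """
    writer = csv.writer(out, lineterminator="\n")
    for entity_id, cluster_id in sorted(assignment.items()):
        writer.writerow([entity_id, cluster_id])


# Hypothetical assignment of three entities to two clusters.
assignment = {"e2": 7, "e1": 7, "e3": 9}
buffer = io.StringIO()
write_cluster_csv(assignment, buffer)
print(buffer.getvalue(), end="")
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `out`.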

Stage 2:

You need to send us (saeedi@informatik.uni-leipzig.de) the generated CSV file and the identified configuration (if FAMER was used).

If you use other tools, please adhere to the format of the CSV file generated by FAMER. To be fair, please do not create cluster pairs manually. We will evaluate your results (and run your FAMER configuration) against the complete gold standard and report the results. Finally, a winner will be selected.

References

[1] Disheng Qiu, Luciano Barbosa, Xin Luna Dong, Yanyan Shen, and Divesh Srivastava. Dexter: large-scale discovery and extraction of product specifications on the web. Proceedings of the VLDB Endowment, 8(13):2194–2205, 2015.

[2] http://www.informatik.uni-leipzig.de/EDBTChallenge2019/