Space and time scalability of duplicate detection in graph data

Herschel, Melanie; Naumann, Felix

Duplicate detection consists in determining different representations of real-world objects in a database. Recent research has considered the use of relationships among object representations to improve duplicate detection. In the general case where relationships form a graph, research has mainly focused on duplicate detection quality/effectiveness. Scalability has been neglected so far, even though it is crucial for large real-world duplicate detection tasks. In this paper we scale up duplicate detection in graph data (DDG) to large amounts of data and pairwise comparisons, using the support of a relational database system. To this end, we first generalize the process of DDG. We then present how to scale algorithms for DDG in space (amount of data processed with limited main memory) and in time. Finally, we explore how complex similarity computation can be performed efficiently. Experiments on data an order of magnitude larger than data considered so far in DDG clearly show that our methods scale to large amounts of data not residing in main memory.

Metadata
Authors:	Melanie Herschel, Felix Naumann
URN:	urn:nbn:de:kobv:517-opus-32851
ISBN:	978-3-940793-46-1
Series (volume number):	Technische Berichte des Hasso-Plattner-Instituts für Digital Engineering an der Universität Potsdam (25)
Publication type:	Monograph/edited volume
Language:	English
Year of publication:	2008
Publishing institution:	Universität Potsdam
Release date:	07.07.2009
RVK (Regensburger Verbundklassifikation):	ST 230
Organizational units:	Affiliated institutes / Hasso-Plattner-Institut für Digital Engineering gGmbH
DDC classification:	0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
License (German):	No public license: protected by copyright
