By all these lovely tokens... Merging conflicting tokenizations

Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, concurrent tokenizations of the same text are a pervasive problem in everyday NLP practice and pose a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.

Metadata
Authors:	Christian Chiarcos, Julia Ritz, Manfred Stede
DOI:	https://doi.org/10.1007/s10579-011-9161-0
ISSN:	1574-020X
Title of parent work (English):	Language Resources and Evaluation
Publisher:	Springer
Place of publication:	Dordrecht
Publication type:	Scientific article
Language:	English
Year of first publication:	2012
Year of publication:	2012
Release date:	26 March 2017
Free keyword / tag:	Conflicting tokenizations; Corpus linguistics; Linguistic annotation; Multi-layer annotation; Tokenization alignment
Volume:	46
Issue:	1
Number of pages:	22
First page:	53
Last page:	74
Funding institution:	Deutsche Forschungsgemeinschaft (DFG) [(SFB) 632]
Organizational units:	Humanwissenschaftliche Fakultät / Strukturbereich Kognitionswissenschaften / Department Linguistik
Peer review:	Refereed
Name of institution at time of publication:	Humanwissenschaftliche Fakultät / Institut für Linguistik / Allgemeine Sprachwissenschaft