By all these lovely tokens... Merging conflicting tokenizations

  • Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text is a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.

Metadata
Author details:Christian Chiarcos, Julia Ritz, Manfred Stede
DOI:https://doi.org/10.1007/s10579-011-9161-0
ISSN:1574-020X
Title of parent work (English):Language resources and evaluation
Publisher:Springer
Place of publishing:Dordrecht
Publication type:Article
Language:English
Year of first publication:2012
Publication year:2012
Release date:2017/03/26
Tag:Conflicting tokenizations; Corpus linguistics; Linguistic annotation; Multi-layer annotation; Tokenization alignment
Volume:46
Issue:1
Number of pages:22
First page:53
Last Page:74
Funding institution:Deutsche Forschungsgemeinschaft (DFG) [(SFB) 632]
Organizational units:Humanwissenschaftliche Fakultät / Strukturbereich Kognitionswissenschaften / Department Linguistik
Peer review:Refereed
Institution name at the time of the publication:Humanwissenschaftliche Fakultät / Institut für Linguistik / Allgemeine Sprachwissenschaft