By all these lovely tokens... Merging conflicting tokenizations
Given the contemporary trend toward modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text is a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.
Author details: | Christian Chiarcos, Julia Ritz, Manfred Stede |
---|---|
DOI: | https://doi.org/10.1007/s10579-011-9161-0 |
ISSN: | 1574-020X |
Title of parent work (English): | Language resources and evaluation |
Publisher: | Springer |
Place of publishing: | Dordrecht |
Publication type: | Article |
Language: | English |
Year of first publication: | 2012 |
Publication year: | 2012 |
Release date: | 2017/03/26 |
Tag: | Conflicting tokenizations; Corpus linguistics; Linguistic annotation; Multi-layer annotation; Tokenization alignment |
Volume: | 46 |
Issue: | 1 |
Number of pages: | 22 |
First page: | 53 |
Last page: | 74 |
Funding institution: | Deutsche Forschungsgemeinschaft (DFG) [(SFB) 632] |
Organizational units: | Humanwissenschaftliche Fakultät / Strukturbereich Kognitionswissenschaften / Department Linguistik |
Peer review: | Refereed |
Institution name at the time of the publication: | Humanwissenschaftliche Fakultät / Institut für Linguistik / Allgemeine Sprachwissenschaft |