30597
2008
2008
eng
article
1
--
--
--
A flexible framework for integrating annotations from different tools and tag sets
We present a general framework for integrating annotations from different tools and tag sets. When annotating corpora at multiple linguistic levels, annotators may use different expert tools for different phenomena or types of annotation. These tools employ different data models and accompanying approaches to visualization, and they produce different output formats. For the purposes of uniformly processing these outputs, we developed a pivot format called PAULA, along with converters to and from tool formats. Different annotations are not only integrated at the level of data format, but are also joined at the level of conceptual representation. For this purpose, we introduce OLiA, an ontology of linguistic annotations that mediates between alternative tag sets that cover the same class of linguistic phenomena. All components are integrated in the linguistic information system ANNIS: annotation tool output is converted to the pivot format PAULA and read into a database where the data can be visualized, queried, and evaluated across multiple layers. For cross-tag set querying and statistical evaluation, ANNIS uses the ontology of linguistic annotations. Finally, ANNIS is also tied to a machine learning component for semiautomatic annotation.
http://www.atala.org/A-Flexible-Framework-for
1248-9433
allegro:1991-2014
10106482
Traitement automatique des langues. - ISSN 1248-9433. - 49 (2008), 2, S. 217 - 246
Christian Chiarcos
Stefanie Dipper
Michael Götze
Ulf Leser
Anke Lüdeling
Julia Ritz
Manfred Stede
Nicht referiert
Department Linguistik
36073
2012
2012
eng
53
74
22
1
46
article
Springer
Dordrecht
1
--
--
--
By all these lovely tokens... Merging conflicting tokenizations
Given the contemporary trend towards modular NLP architectures and multiple annotation frameworks, the existence of concurrent tokenizations of the same text represents a pervasive problem in everyday NLP practice and poses a non-trivial theoretical problem for the integration of linguistic annotations and their interpretability in general. This paper describes a solution for integrating different tokenizations using a standoff XML format, and discusses the consequences from a corpus-linguistic perspective.
Language resources and evaluation
10.1007/s10579-011-9161-0
1574-020X
wos:2011-2013
WOS:000302289400004
Ritz, J (reprint author), Univ Potsdam, Sonderforschungsbereich 632 Informat Struct,Karl, D-14476 Potsdam, Germany., chiarcos@uni-potsdam.de; jritz@uni-potsdam.de; stede@uni-potsdam.de
Deutsche Forschungsgemeinschaft (DFG) [(SFB) 632]
Christian Chiarcos
Julia Ritz
Manfred Stede
eng
uncontrolled
Linguistic annotation
eng
uncontrolled
Multi-layer annotation
eng
uncontrolled
Conflicting tokenizations
eng
uncontrolled
Tokenization alignment
eng
uncontrolled
Corpus linguistics
Referiert
Department Linguistik
Institut für Linguistik / Allgemeine Sprachwissenschaft
43137
2016
2016
eng
599
617
article
Oxford University Press
Oxford
1
--
--
--
Corpus Linguistics and Information Structure Research
The Oxford handbook of information structure
978-0-19-964267-0
false
true
Anke Lüdeling
Julia Ritz
Manfred Stede
Amir Zeldes
Sprache
Department Linguistik
Institut für Linguistik / Allgemeine Sprachwissenschaft
6831
2013
eng
doctoralthesis
0
2014-07-01
--
2013-11-01
Discourse-givenness of noun phrases : theoretical and computational models
Diskursgegebenheit von Nominalphrasen : theoretische und komputationelle Modelle
This thesis gives formal definitions of discourse-givenness, coreference and reference, and reports on experiments with computational models of discourse-givenness of noun phrases for English and German. Definitions are based on Bach's (1987) work on reference, Kibble and van Deemter's (2000) work on coreference, and Kamp and Reyle's Discourse Representation Theory (1993). For the experiments, the following corpora with coreference annotation were used: MUC-7, OntoNotes and ARRAU for English, and TueBa-D/Z for German. The classification algorithms used are J48 decision trees, the rule-based learner Ripper, and linear support vector machines. New features representing the noun phrase's specificity and its context are proposed; they lead to a significant improvement of classification quality.
Die vorliegende Arbeit gibt formale Definitionen der Konzepte Diskursgegebenheit, Koreferenz und Referenz. Zudem wird über Experimente berichtet, Nominalphrasen im Deutschen und Englischen hinsichtlich ihrer Diskursgegebenheit zu klassifizieren. Die Definitionen basieren auf Arbeiten von Bach (1987) zu Referenz, Kibble und van Deemter (2000) zu Koreferenz und der Diskursrepräsentationstheorie (Kamp und Reyle, 1993). In den Experimenten wurden die koreferenzannotierten Korpora MUC-7, OntoNotes und ARRAU (Englisch) und TüBa-D/Z (Deutsch) verwendet. Sie umfassen die Klassifikationsalgorithmen J48 (Entscheidungsbäume), Ripper (regelbasiertes Lernen) und lineare Support Vector Machines. Mehrere neue Klassifikationsmerkmale werden vorgeschlagen, die die Spezifizität der Nominalphrase messen, sowie ihren Kontext abbilden. Mit Hilfe dieser Merkmale kann eine signifikante Verbesserung der Klassifikation erreicht werden.
urn:nbn:de:kobv:517-opus-70818
7081
ES 965
ES 900
Keine öffentliche Lizenz: Unter Urheberrechtsschutz
Julia Ritz
deu
uncontrolled
Diskursgegebenheit
deu
uncontrolled
Klassifikator
deu
uncontrolled
Koreferenz
deu
uncontrolled
Kontext
deu
uncontrolled
tf-idf
eng
uncontrolled
discourse-givenness
eng
uncontrolled
classifier
eng
uncontrolled
coreference
eng
uncontrolled
context
eng
uncontrolled
tf-idf
Sprache
open_access
Department Linguistik
Institut für Linguistik / Allgemeine Sprachwissenschaft
Universität Potsdam
Universität Potsdam
https://publishup.uni-potsdam.de/files/6831/ritz_diss.pdf
36667
2011
2011
eng
361
374
14
3
45
article
Springer
Dordrecht
1
--
--
--
Information structure in African languages: corpora and tools
In this paper, we describe tools and resources for the study of African languages developed at the Collaborative Research Centre 632 "Information Structure". These include deeply annotated data collections of 25 sub-Saharan languages that are described together with their annotation scheme, as well as the corpus tool ANNIS, which provides unified access to a broad variety of annotations created with a range of different tools. With the application of ANNIS to several African data collections, we illustrate its suitability for the purpose of language documentation, distributed access, and the creation of data archives.
Language resources and evaluation
10.1007/s10579-011-9153-0
1574-020X
wos:2011-2013
WOS:000293709900007
Ritz, J (reprint author), Univ Potsdam, Karl Liebknecht Str 24-25, D-14476 Potsdam, Germany., chiarcos@uni-potsdam.de; ines.fiedler@rz.hu-berlin.de; grubic@uni-potsdam.de; k.hartmann@rz.hu-berlin.de; jritz@uni-potsdam.de; anne.schwarz@jcu.edu.au; amir.zeldes@rz.hu-berlin.de; mazimmer@uni-potsdam.de
German Research Foundation
Christian Chiarcos
Ines Fiedler
Mira Grubic
Katharina Hartmann
Julia Ritz
Anne Schwarz
Amir Zeldes
Malte Zimmermann
eng
uncontrolled
African language resources
eng
uncontrolled
Pragmatics
eng
uncontrolled
Corpus search infrastructure
Referiert
Department Linguistik
Institut für Linguistik / Allgemeine Sprachwissenschaft