Document Type: Doctoral Thesis
Year of publication: 2011
Language: English
Institute: Department Linguistik
Keywords: annotation projection, computational linguistics, dependency parsing, historical text, orthography, parallel corpora, partial annotations, spelling correction, weakly supervised learning techniques
Does it have to be trees? Data-driven dependency parsing with incomplete and noisy training data
(2011)
We present a novel approach to training data-driven dependency parsers on incomplete annotations. Our parsers are simple modifications of two well-known dependency parsers, the transition-based Malt parser and the graph-based MST parser. While previous work on parsing with incomplete data has typically couched the task in frameworks of unsupervised or semi-supervised machine learning, we essentially treat it as a supervised problem. In particular, we propose what we call agnostic parsers which hide all fragmentation in the training data from their supervised components. We present experimental results with training data that was obtained by means of annotation projection. Annotation projection is a resource-lean technique which allows us to transfer annotations from one language to another within a parallel corpus. However, the output tends to be noisy and incomplete due to cross-lingual non-parallelism and error-prone word alignments. This makes the projected annotations a suitable test bed for our fragment parsers. Our results show that (i) dependency parsers trained on large amounts of projected annotations achieve higher accuracy than the direct projections, and that (ii) our agnostic fragment parsers perform roughly on a par with the original parsers which are trained only on strictly filtered, complete trees. Finally, (iii) when our fragment parsers are trained on artificially fragmented but otherwise gold standard dependencies, the performance loss is moderate even with up to 50% of all edges removed.
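The projection step described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions (a 1-to-1 word alignment, heads encoded as token indices with 0 for the root); the function and variable names are illustrative, not taken from the thesis.

```python
def project_dependencies(src_heads, alignment):
    """Project source-side dependency heads onto the target sentence.

    src_heads: dict mapping source token index -> source head index
               (0 denotes the artificial root).
    alignment: dict mapping target token index -> aligned source index;
               unaligned target tokens are simply absent.

    Returns a partial head assignment for the target sentence. Tokens
    whose head cannot be projected are left out, producing exactly the
    kind of fragmented trees the agnostic parsers are trained on.
    """
    # Invert the alignment (valid under the 1-to-1 assumption).
    src_to_tgt = {s: t for t, s in alignment.items()}
    tgt_heads = {}
    for t, s in alignment.items():
        src_head = src_heads.get(s)
        if src_head == 0:                 # aligned source token is the root
            tgt_heads[t] = 0
        elif src_head in src_to_tgt:      # the head also has an aligned image
            tgt_heads[t] = src_to_tgt[src_head]
        # otherwise: no projection -> a fragment boundary in the output
    return tgt_heads
```

Noisy or missing alignment links simply leave tokens headless, which is how cross-lingual non-parallelism surfaces as fragmentation in the projected training data.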
This work addresses issues in the automatic preprocessing of historical German text for use by conventional natural language processing techniques. Such techniques cannot adequately handle historical text because, on the one hand, their tools rely on a fixed application-specific lexicon keyed by contemporary orthographic surface form, while, on the other, historical text lacks consistent orthographic conventions. Historical spelling variation is treated here as an error-correction problem or "canonicalization" task: an attempt to automatically assign each (historical) input word a unique extant canonical cognate, thus allowing direct application-specific processing (tagging, parsing, etc.) of the returned canonical forms without need for any additional application-specific modifications. In the course of the work, various methods for automatic canonicalization are investigated and empirically evaluated, including conflation by phonetic identity, conflation by lemma instantiation heuristics, canonicalization by weighted finite-state rewrite cascade, and token-wise disambiguation by a dynamic Hidden Markov Model.
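The first of the listed methods, conflation by phonetic identity, can be illustrated with a toy example: historical and modern spellings are mapped to a shared key by orthographic rewrite rules, and a historical word is conflated with any modern lexicon entry sharing its key. The rules below are simplified placeholders, not the actual phonetic component evaluated in the thesis.

```python
import re

# Toy rewrite rules for common historical German spelling patterns,
# applied in order. Illustrative only.
RULES = [
    (r"th", "t"),   # "thun"  -> "tun"
    (r"ey", "ei"),  # "seyn"  -> "sein"
    (r"vn", "un"),  # "vnd"   -> "und"
]

def phonetic_key(word):
    """Map a spelling to its (toy) phonetic equivalence key."""
    w = word.lower()
    for pattern, repl in RULES:
        w = re.sub(pattern, repl, w)
    return w

def conflate(historical_word, modern_lexicon):
    """Return modern lexicon entries sharing a key with the input word."""
    key = phonetic_key(historical_word)
    return [m for m in modern_lexicon if phonetic_key(m) == key]
```

Because the key function is many-to-one, conflation can return several candidates per historical form; the thesis's token-wise Hidden Markov Model disambiguation addresses exactly that ambiguity.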