A weight normalization procedure, commonly called pushing, is introduced for weighted tree automata (wta) over commutative semifields. The normalization preserves the recognized weighted tree language even for nondeterministic wta, but it is most useful for bottom-up deterministic wta, where it can be used for minimization and equivalence testing. In both applications a careful selection of the weights to be redistributed followed by normalization allows a reduction of the general problem to the corresponding problem for bottom-up deterministic unweighted tree automata. This approach was already successfully used by Mohri and Eisner for the minimization of deterministic weighted string automata. Moreover, the new equivalence test for two wta M and M′ runs in time O((|M|+|M′|)⋅log(|Q|+|Q′|)), where Q and Q′ are the states of M and M′, respectively, which improves the previously best run-time O(|M|⋅|M′|).
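For intuition, here is a minimal sketch of Mohri-style weight pushing on a bottom-up deterministic weighted *string* automaton over the real semifield, the simpler setting the abstract credits to Mohri and Eisner; the function name, data layout, and the acyclic toy automaton are illustrative assumptions, not the paper's wta construction.

```python
# Illustrative sketch (not the paper's algorithm): weight pushing for a
# deterministic weighted string automaton over the real semifield (+, *).
# Assumes the automaton is acyclic and every state reaches a final state,
# so the potential d(q) is finite and nonzero.
from functools import lru_cache

def push_weights(transitions, final, start):
    # transitions: state -> {symbol: (next_state, weight)}
    # final: state -> final weight (absent = non-final)

    @lru_cache(maxsize=None)
    def d(q):
        # Potential of q: combined weight of all paths from q to a final state.
        total = final.get(q, 0.0)
        for _sym, (r, w) in transitions.get(q, {}).items():
            total += w * d(r)
        return total

    # Reweight each transition: w'(q,a,r) = d(q)^{-1} * w * d(r),
    # and each final weight: final'(q) = d(q)^{-1} * final(q).
    pushed = {
        q: {a: (r, w * d(r) / d(q)) for a, (r, w) in arcs.items()}
        for q, arcs in transitions.items()
    }
    pushed_final = {q: f / d(q) for q, f in final.items()}
    # The start weight absorbs d(start), so every string keeps its weight.
    return pushed, pushed_final, d(start)
```

After pushing, the outgoing weights (plus any final weight) at each state sum to one, i.e. the weights have been normalized locally while the recognized weighted language is unchanged; this redistribution is what makes a subsequent reduction to the unweighted minimization or equivalence problem possible.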
The lexical database dlexDB provides, in the form of an online database, frequency-based norms for numerous processing-related word properties for psychological and linguistic research. These values include well-known variables such as the printed frequencies of word forms and lemmas, as also documented in CELEX (Baayen, Piepenbrock and Gulikers, 1995). In addition, we compute new values such as frequencies of syllables and morphemes, as well as frequencies of character strings and multi-word combinations. The statistics are based on the Kernkorpus des Digitalen Wörterbuchs der deutschen Sprache (DWDS), comprising over 100 million running words. We illustrate the validity of these norms with new results on fixation durations in sentence reading.
With the lexical database dlexDB, we provide psychological and linguistic research with online statistical norms for a large number of processing-related word properties. These norms include the variables known from CELEX (Baayen, Piepenbrock and Gulikers, 1995): the frequencies of word forms and lemmas in written-language texts. Beyond that, we compute a number of new norms such as the frequencies of syllables, morphemes, character strings, and multi-word combinations, as well as word similarity measures. The data basis is the Kernkorpus des Digitalen Wörterbuchs der deutschen Sprache (DWDS), with over 100 million running words. We illustrate the validity of these norms with new results on their influence on fixation durations in sentence reading.
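The corpus-derived norms described in this abstract can be sketched in miniature: counting word forms and character strings in a token stream and normalizing to occurrences per million running words, the customary unit for printed-frequency norms. The function, the toy corpus, and the choice of character bigrams as the "character string" unit are illustrative assumptions; dlexDB's actual pipeline over the DWDS Kernkorpus is not reproduced here.

```python
# Illustrative sketch (not dlexDB's pipeline): per-million frequency norms
# for word forms and character bigrams from a tokenized toy corpus.
from collections import Counter

def frequency_norms(tokens):
    n = len(tokens)
    word_counts = Counter(tokens)
    # Character bigrams within each token stand in for the "character
    # string" frequencies mentioned in the abstract.
    bigram_counts = Counter(b for t in tokens for b in zip(t, t[1:]))

    def per_million(count):
        # Normalize a raw count to occurrences per million running words.
        return count * 1_000_000 / n

    word_norms = {w: per_million(c) for w, c in word_counts.items()}
    bigram_norms = {''.join(b): per_million(c) for b, c in bigram_counts.items()}
    return word_norms, bigram_norms
```

Syllable, morpheme, and multi-word norms would follow the same count-then-normalize pattern, differing only in the segmentation that produces the counting units.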