Variation in coreference patterns : analyses across language modes and genres

Aktas, Berfin

doi:10.25932/publishup-59608

This thesis explores the variation in coreference patterns across language modes (i.e., spoken and written) and text genres. The significance of research on variation in language use has been emphasized in a number of linguistic studies. For instance, Biber and Conrad [2009] state that “register/genre variation is a fundamental aspect of human language” and “Given the ubiquity of register/genre variation, an understanding of how linguistic features are used in patterned ways across text varieties is of central importance for both the description of particular languages and the development of cross-linguistic theories of language use.” [p. 23] We examine the variation across genres with the primary goal of contributing to the body of knowledge on the description of language use in English.
On the computational side, we believe that incorporating linguistic knowledge into learning-based systems can boost the performance of automatic natural language processing systems, particularly for non-standard texts. Therefore, in addition to their descriptive value, the linguistic findings we provide in this study may prove helpful for improving the performance of automatic coreference resolution, which is essential for good text understanding and beneficial for several downstream NLP applications, including machine translation and text summarization. In particular, we study a genre of texts formed of conversational interactions on the well-known social media platform Twitter. Two factors motivate us: First, Twitter conversations are realized in written form but resemble spoken communication [Scheffler, 2017], and therefore they form an atypical genre for the written mode. Second, while Twitter texts are a complicated genre for automatic coreference resolution, their widespread use in the digital sphere makes them highly relevant for applications that seek to extract information or sentiments from users’ messages. Thus, we are interested in discovering more about the linguistic and computational aspects of coreference in Twitter conversations. For this purpose, we first created a corpus of such conversations and annotated it for coreference. We are interested not only in the coreference patterns but also in the overall discourse behavior of Twitter conversations. To address this, in addition to the coreference relations, we also annotated the coherence relations in the corpus we compiled. The corpus is available online in a newly developed form that allows the tweets to be separated from their annotations. This study consists of three empirical analyses in which we independently apply corpus-based, psycholinguistic and computational approaches to investigate variation in coreference patterns in a complementary manner.
(1) We first carry out a descriptive analysis of variation across genres through a corpus-based study. We investigate the linguistic aspects of nominal coreference in Twitter conversations and determine how this genre relates to other text genres in the spoken and written modes. In addition to variation across genres, the differences between the spoken and written modes have also been a focus of linguistic research since Woolbert [1922]. (2) In order to investigate whether the language mode alone has any effect on coreference patterns, we carry out a crowdsourced experiment and analyze the patterns in the same genre for both spoken and written modes. (3) Finally, we explore the potential of domain adaptation of automatic coreference resolution (ACR) for the conversational Twitter data. In order to answer the question of how the genre of Twitter conversations relates to other genres in spoken and written modes with respect to coreference patterns, we employ a state-of-the-art neural ACR model [Lee et al., 2018] to examine whether ACR on Twitter conversations will benefit from mode-based separation in out-of-domain training data.…
This dissertation investigates the variation of coreference patterns across language modes (i.e., spoken and written) and text genres. The relevance of research on variation in language use has been emphasized in a whole range of linguistic studies. For example, Biber and Conrad [2009] state: "register/genre variation is a fundamental aspect of human language" and "Given the ubiquity of register/genre variation, an understanding of how linguistic features are used in patterned ways across text varieties is of central importance for both the description of particular languages and the development of cross-linguistic theories of language use." [p. 23] We investigate the variation between genres with the primary goal of contributing to the state of knowledge on the description of language use in English.
On the technical side, we believe that incorporating linguistic knowledge into machine learning approaches can improve the performance of language processing systems, especially for texts in non-standard varieties. Beyond their descriptive value, the linguistic findings we provide in this study may therefore prove useful for improving systems for automatic coreference resolution; this is indispensable for deep text understanding and potentially helpful for various downstream NLP applications such as machine translation and text summarization. In particular, we examine a text genre formed of conversational interactions on the well-known social media platform Twitter. Two factors motivate this: First, Twitter conversations are realized in written form yet resemble spoken communication [Scheffler, 2017] and thus constitute a genre that is atypical for the written mode. Second, although Twitter texts are a complicated genre for automatic coreference resolution, their wide distribution in the digital sphere makes them highly relevant for applications that aim to extract information or sentiment from users' messages. We are therefore interested in finding out more about the linguistic and computational aspects of coreference in Twitter conversations. To this end, we first compiled a corpus of such conversations and annotated it for coreference relations. However, we are interested not only in the coreference patterns but also, more generally, in the discourse-structural properties of Twitter conversations. We therefore annotated, in addition to the coreference relations, the semantic/pragmatic coherence relations in the corpus we compiled.
The corpus is available online in a newly developed form that allows the tweets to be represented separately from their annotations. This study consists of three empirical analyses in which we independently apply corpus-based, psycholinguistic and computational-linguistic approaches to the complementary investigation of variation in coreference patterns. (1) First, we conduct a descriptive analysis of variation between genres by means of a corpus-based study. We examine linguistic aspects of nominal coreference in Twitter conversations and determine how this genre relates to other text genres in the spoken and written modes. Besides variation between genres, the study of differences between oral and written forms has also been a focus of linguistic research, beginning with Woolbert [1922]. (2) To investigate whether the language mode alone also exerts an influence on coreference patterns, we conduct a crowdsourcing experiment and analyze the patterns that emerge within the same genre for the spoken and the written mode. (3) Finally, we examine possibilities for domain adaptation of automatic coreference resolution for the Twitter conversation data. To answer the question of how the genre of Twitter conversations relates to other genres in the spoken and written modes with regard to coreference patterns, we use a state-of-the-art neural coreference resolution model [Lee et al., 2018] to examine whether resolution on Twitter conversations benefits from a mode-based separation of the training data from external domains.…

Metadata
Author details:	Berfin Aktas
URN:	urn:nbn:de:kobv:517-opus4-596086
DOI:	https://doi.org/10.25932/publishup-59608
Subtitle (English):	analyses across language modes and genres
Reviewer(s):	Tatjana Scheffler, Nils Reiter
Supervisor(s):	Manfred Stede, Tatjana Scheffler
Publication type:	Doctoral Thesis
Language:	English
Publication year:	2023
Publishing institution:	Universität Potsdam
Granting institution:	Universität Potsdam
Date of final exam:	2023/04/24
Release date:	2023/06/22
Tag:	Koreferenzmustern; Textgenre; Variation; geschrieben; gesprochen; coreference; genre; spoken; variation; written
Number of pages:	xviii, 195
RVK - Regensburg classification:	ES 146, ET 800, ET 860
Organizational units:	Humanwissenschaftliche Fakultät / Strukturbereich Kognitionswissenschaften / Department Linguistik / Applied Computational Linguistics
DDC classification:	4 Language / 41 Linguistics / 410 Linguistics
License:	CC-BY - Attribution 4.0 International
