TY  - JOUR
A1  - Arnold, Taylor
A1  - Ballier, Nicolas
A1  - Lisson, Paula
A1  - Tilton, Lauren
T1  - Beyond lexical frequencies: using R for text analysis in the digital humanities
JF  - Language resources and evaluation
N2  - This paper presents a combination of R packages (user-contributed toolkits written in a common core programming language) to facilitate the humanistic investigation of digitised, text-based corpora. Our survey of text analysis packages includes those of our own creation (cleanNLP and fasttextM) as well as packages built by other research groups (stringi, readtext, hyphenatr, quanteda, and hunspell). By operating on generic object types, these packages unite research innovations in corpus linguistics, natural language processing, machine learning, statistics, and digital humanities. We begin by extrapolating on the theoretical benefits of R as an elaborate gluing language for bringing together several areas of expertise and compare it to linguistic concordancers and other tool-based approaches to text analysis in the digital humanities. We then showcase the practical benefits of an ecosystem by illustrating how R packages have been integrated into a digital humanities project. Throughout, the focus is on moving beyond the bag-of-words, lexical frequency model by incorporating linguistically-driven analyses in research.
KW  - Digital humanities
KW  - Text mining
KW  - R
KW  - Text interoperability
Y1  - 2019
U6  - https://doi.org/10.1007/s10579-019-09456-6
SN  - 1574-020X
SN  - 1574-0218
VL  - 53
IS  - 4
SP  - 707
EP  - 733
PB  - Springer
CY  - Dordrecht
ER  - 

TY  - JOUR
A1  - Lisson, Paula
A1  - Ballier, Nicolas
T1  - Investigating Lexical Progression through Lexical Diversity Metrics in a Corpus of French L3
JF  - Discours : revue de linguistique, psycholinguistique et informatique
N2  - This article presents a corpus-based evaluation of 13 lexical diversity metrics as measures of longitudinal progression in written productions of learners of French as a third language (L3).
Our case study (24 learners, 3 productions per learner in the course of 3 months) deals with a semi-longitudinal corpus, where each of the productions is supposed to be more complex than the previous one. Random forests (Breiman, 2001; Hothorn et al., 2019) are used to test whether lexical diversity metric scores capture enough vocabulary diversity progression to predict the production wave. We report that lexical diversity metrics capture lexical progression through the three productions of each student. In particular, two metrics appear to be the most informative for lexical progression: Herdan’s C and Yule’s K.
KW  - lexical diversity
KW  - learner corpora
KW  - L3 French
Y1  - 2018
U6  - https://doi.org/10.4000/discours.9950
SN  - 1963-1723
IS  - 23
PB  - Université de Paris-Sorbonne, Maison Recherche
CY  - Paris
ER  - 