TY  - GEN
A1  - Risch, Julian
A1  - Krestel, Ralf
T1  - My Approach = Your Apparatus?
BT  - Entropy-Based Topic Modeling on Multiple Domain-Specific Text Collections
T2  - Libraries
N2  - Comparative text mining extends from genre analysis and political bias detection to the revelation of cultural and geographic differences, through to the search for prior art across patents and scientific papers. These applications use cross-collection topic modeling for the exploration, clustering, and comparison of large sets of documents, such as digital libraries. However, topic modeling on documents from different collections is challenging because of domain-specific vocabulary. We present a cross-collection topic model combined with automatic domain term extraction and phrase segmentation. This model distinguishes collection-specific and collection-independent words based on information entropy and reveals commonalities and differences of multiple text collections. We evaluate our model on patents, scientific papers, newspaper articles, forum posts, and Wikipedia articles. In comparison to state-of-the-art cross-collection topic modeling, our model achieves up to 13% higher topic coherence, up to 4% lower perplexity, and up to 31% higher document classification accuracy. More importantly, our approach is the first topic model that ensures disjunct general and specific word distributions, resulting in clear-cut topic representations.
KW  - Topic modeling
KW  - Automatic domain term extraction
KW  - Entropy
Y1  - 2018
SN  - 978-1-4503-5178-2
U6  - https://doi.org/10.1145/3197026.3197038
SN  - 2575-7865
SN  - 2575-8152
SP  - 283
EP  - 292
PB  - Association for Computing Machinery
CY  - New York
ER  - 
TY  - GEN
A1  - Repke, Tim
A1  - Krestel, Ralf
A1  - Edding, Jakob
A1  - Hartmann, Moritz
A1  - Hering, Jonas
A1  - Kipping, Dennis
A1  - Schmidt, Hendrik
A1  - Scordialo, Nico
A1  - Zenner, Alexander
T1  - Beacon in the Dark
BT  - A System for Interactive Exploration of Large Email Corpora
T2  - Proceedings of the 27th ACM International Conference on Information and Knowledge Management
N2  - The sheer amount of heterogeneous data in large email corpora makes manual investigation by experts infeasible. Auditors and journalists, for example, who look for irregular or inappropriate content or suspicious patterns, are in desperate need of computer-aided exploration tools to support their investigations. We present our Beacon system for the exploration of such corpora at different levels of detail. A distributed processing pipeline combines text mining methods and social network analysis to augment the already semi-structured nature of emails. The user interface ties into the resulting cleaned and enriched dataset. For the interface design, we identify three objectives of expert users: gaining an initial overview of the data to identify leads to investigate, understanding the context of the information at hand, and having meaningful filters to iteratively focus on a subset of emails. To this end, we use interactive visualisations based on rearranged and aggregated extracted information to reveal salient patterns.
Y1  - 2018
SN  - 978-1-4503-6014-2
U6  - https://doi.org/10.1145/3269206.3269231
SP  - 1871
EP  - 1874
PB  - Association for Computing Machinery
CY  - New York
ER  -