Refine
Has Fulltext
- yes (13) (remove)
Year of publication
- 2021 (13) (remove)
Document Type
- Doctoral Thesis (13) (remove)
Language
- English (13)
Is part of the Bibliography
- yes (13)
Keywords
- 3D-Visualisierung (2)
- machine learning (2)
- maschinelles Lernen (2)
- 3D Visualization (1)
- 3D point cloud (1)
- 3D visualization (1)
- 3D-Punktwolke (1)
- Artificial Intelligence (1)
- BIM (1)
- Binary Classification (1)
Institute
- Hasso-Plattner-Institut für Digital Engineering GmbH (13) (remove)
3D point clouds are a universal and discrete digital representation of three-dimensional objects and environments. For geospatial applications, 3D point clouds have become a fundamental type of raw data acquired and generated using various methods and techniques. In particular, 3D point clouds serve as raw data for creating digital twins of the built environment.
This thesis concentrates on the research and development of concepts, methods, and techniques for preprocessing, semantically enriching, analyzing, and visualizing 3D point clouds for applications around transport infrastructure. It introduces a collection of preprocessing techniques that aim to harmonize raw 3D point cloud data, such as point density reduction and scan profile detection. Metrics such as, e.g., local density, verticality, and planarity are calculated for later use. One of the key contributions tackles the problem of analyzing and deriving semantic information in 3D point clouds. Three different approaches are investigated: a geometric analysis, a machine learning approach operating on synthetically generated 2D images, and a machine learning approach operating on 3D point clouds without intermediate representation.
In the first application case, 2D image classification is applied and evaluated for mobile mapping data focusing on road networks to derive road marking vector data. The second application case investigates how 3D point clouds can be merged with ground-penetrating radar data for a combined visualization and to automatically identify atypical areas in the data. For example, the approach detects pavement regions with developing potholes. The third application case explores the combination of a 3D environment based on 3D point clouds with panoramic imagery to improve visual representation and the detection of 3D objects such as traffic signs.
The presented methods were implemented and tested based on software frameworks for 3D point clouds and 3D visualization. In particular, modules for metric computation, classification procedures, and visualization techniques were integrated into a modular pipeline-based C++ research framework for geospatial data processing, extended by Python machine learning scripts. All visualization and analysis techniques scale to large real-world datasets such as road networks of entire cities or railroad networks.
The thesis shows that some use cases allow taking advantage of established image vision methods to analyze images rendered from mobile mapping data efficiently. The two presented semantic classification methods working directly on 3D point clouds are use case independent and show similar overall accuracy when compared to each other. While the geometry-based method requires less computation time, the machine learning-based method supports arbitrary semantic classes but requires training the network with ground truth data. Both methods can be used in combination to gradually build this ground truth with manual corrections via a respective annotation tool.
This thesis contributes results for IT system engineering of applications, systems, and services that require spatial digital twins of transport infrastructure such as road networks and railroad networks based on 3D point clouds as raw data. It demonstrates the feasibility of fully automated data flows that map captured 3D point clouds to semantically classified models. This provides a key component for seamlessly integrated spatial digital twins in IT solutions that require up-to-date, object-based, and semantically enriched information about the built environment.
Generative adversarial networks (GANs) have been broadly applied to a wide range of application domains since their proposal. In this thesis, we propose several methods that aim to tackle different existing problems in GANs. Particularly, even though GANs are generally able to generate high-quality samples, the diversity of the generated set is often sub-optimal. Moreover, the common increase of the number of models in the original GANs framework, as well as their architectural sizes, introduces additional costs. Additionally, even though challenging, the proper evaluation of a generated set is an important direction to ultimately improve the generation process in GANs. We start by introducing two diversification methods that extend the original GANs framework to multiple adversaries to stimulate sample diversity in a generated set. Then, we introduce a new post-training compression method based on Monte Carlo methods and importance sampling to quantize and prune the weights and activations of pre-trained neural networks without any additional training. The previous method may be used to reduce the memory and computational costs introduced by increasing the number of models in the original GANs framework. Moreover, we use a similar procedure to quantize and prune gradients during training, which also reduces the communication costs between different workers in a distributed training setting. We introduce several topology-based evaluation methods to assess data generation in different settings, namely image generation and language generation. Our methods retrieve both single-valued and double-valued metrics, which, given a real set, may be used to broadly assess a generated set or separately evaluate sample quality and sample diversity, respectively. Moreover, two of our metrics use locality-sensitive hashing to accurately assess the generated sets of highly compressed GANs. The analysis of the compression effects in GANs paves the way for their efficient employment in real-world applications. Given their general applicability, the methods proposed in this thesis may be extended beyond the context of GANs. Hence, they may be generally applied to enhance existing neural networks and, in particular, generative frameworks.
Massive Open Online Courses (MOOCs) open up new opportunities to learn a wide variety of skills online and are thus well suited for individual education, especially where proffcient teachers are not available locally. At the same time, modern society is undergoing a digital transformation, requiring the training of large numbers of current and future employees. Abstract thinking, logical reasoning, and the need to formulate instructions for computers are becoming increasingly relevant. A holistic way to train these skills is to learn how to program. Programming, in addition to being a mental discipline, is also considered a craft, and practical training is required to achieve mastery. In order to effectively convey programming skills in MOOCs, practical exercises are incorporated into the course curriculum to offer students the necessary hands-on experience to reach an in-depth understanding of the programming concepts presented. Our preliminary analysis showed that while being an integral and rewarding part of courses, practical exercises bear the risk of overburdening students who are struggling with conceptual misunderstandings and unknown syntax. In this thesis, we develop, implement, and evaluate different interventions with the aim to improve the learning experience, sustainability, and success of online programming courses. Data from four programming MOOCs, with a total of over 60,000 participants, are employed to determine criteria for practical programming exercises best suited for a given audience.
Based on over five million executions and scoring runs from students' task submissions, we deduce exercise difficulties, students' patterns in approaching the exercises, and potential flaws in exercise descriptions as well as preparatory videos. The primary issue in online learning is that students face a social gap caused by their isolated physical situation. Each individual student usually learns alone in front of a computer and suffers from the absence of a pre-determined time structure as provided in traditional school classes. Furthermore, online learning usually presses students into a one-size-fits-all curriculum, which presents the same content to all students, regardless of their individual needs and learning styles. Any means of a personalization of content or individual feedback regarding problems they encounter are mostly ruled out by the discrepancy between the number of learners and the number of instructors. This results in a high demand for self-motivation and determination of MOOC participants. Social distance exists between individual students as well as between students and course instructors. It decreases engagement and poses a threat to learning success. Within this research, we approach the identified issues within MOOCs and suggest scalable technical solutions, improving social interaction and balancing content difficulty.
Our contributions include situational interventions, approaches for personalizing educational content as well as concepts for fostering collaborative problem-solving. With these approaches, we reduce counterproductive struggles and create a universal improvement for future programming MOOCs. We evaluate our approaches and methods in detail to improve programming courses for students as well as instructors and to advance the state of knowledge in online education.
Data gathered from our experiments show that receiving peer feedback on one's programming problems improves overall course scores by up to 17%. Merely the act of phrasing a question about one's problem improved overall scores by about 14%. The rate of students reaching out for help was significantly improved by situational just-in-time interventions. Request for Comment interventions increased the share of students asking for help by up to 158%. Data from our four MOOCs further provide detailed insight into the learning behavior of students. We outline additional significant findings with regard to student behavior and demographic factors. Our approaches, the technical infrastructure, the numerous educational resources developed, and the data collected provide a solid foundation for future research.
One of the key challenges in modern Facility Management (FM) is to digitally reflect the current state of the built environment, referred to as-is or as-built versus as-designed representation. While the use of Building Information Modeling (BIM) can address the issue of digital representation, the generation and maintenance of BIM data requires a considerable amount of manual work and domain expertise. Another key challenge is being able to monitor the current state of the built environment, which is used to provide feedback and enhance decision making. The need for an integrated solution for all data associated with the operational life cycle of a building is becoming more pronounced as practices from Industry 4.0 are currently being evaluated and adopted for FM use. This research presents an approach for digital representation of indoor environments in their current state within the life cycle of a given building. Such an approach requires the fusion of various sources of digital data. The key to solving such a complex issue of digital data integration, processing and representation is with the use of a Digital Twin (DT). A DT is a digital duplicate of the physical environment, states, and processes. A DT fuses as-designed and as-built digital representations of built environment with as-is data, typically in the form of floorplans, point clouds and BIMs, with additional information layers pertaining to the current and predicted states of an indoor environment or a complete building (e.g., sensor data). The design, implementation and initial testing of prototypical DT software services for indoor environments is presented and described. These DT software services are implemented within a service-oriented paradigm, and their feasibility is presented through functioning and tested key software components within prototypical Service-Oriented System (SOS) implementations. The main outcome of this research shows that key data related to the built environment can be semantically enriched and combined to enable digital representations of indoor environments, based on the concept of a DT. Furthermore, the outcomes of this research show that digital data, related to FM and Architecture, Construction, Engineering, Owner and Occupant (AECOO) activity, can be combined, analyzed and visualized in real-time using a service-oriented approach. This has great potential to benefit decision making related to Operation and Maintenance (O&M) procedures within the scope of the post-construction life cycle stages of typical office buildings.
We investigate models for incremental binary classification, an example for supervised online learning. Our starting point is a model for human and machine learning suggested by E.M.Gold.
In the first part, we consider incremental learning algorithms that use all of the available binary labeled training data in order to compute the current hypothesis. For this model, we observe that the algorithm can be assumed to always terminate and that the distribution of the training data does not influence learnability. This is still true if we pose additional delayable requirements that remain valid despite a hypothesis output delayed in time. Additionally, we consider the non-delayable requirement of consistent learning. Our corresponding results underpin the claim for delayability being a suitable structural property to describe and collectively investigate a major part of learning success criteria. Our first theorem states the pairwise implications or incomparabilities between an established collection of delayable learning success criteria, the so-called complete map. Especially, the learning algorithm can be assumed to only change its last hypothesis in case it is inconsistent with the current training data. Such a learning behaviour is called conservative.
By referring to learning functions, we obtain a hierarchy of approximative learning success criteria. Hereby we allow an increasing finite number of errors of the hypothesized concept by the learning algorithm compared with the concept to be learned. Moreover, we observe a duality depending on whether vacillations between infinitely many different correct hypotheses are still considered a successful learning behaviour. This contrasts the vacillatory hierarchy for learning from solely positive information.
We also consider a hypothesis space located between the two most common hypothesis space types in the nearby relevant literature and provide the complete map.
In the second part, we model more efficient learning algorithms. These update their hypothesis referring to the current datum and without direct regress to past training data. We focus on iterative (hypothesis based) and BMS (state based) learning algorithms. Iterative learning algorithms use the last hypothesis and the current datum in order to infer the new hypothesis.
Past research analyzed, for example, the above mentioned pairwise relations between delayable learning success criteria when learning from purely positive training data. We compare delayable learning success criteria with respect to iterative learning algorithms, as well as learning from either exclusively positive or binary labeled data. The existence of concept classes that can be learned by an iterative learning algorithm but not in a conservative way had already been observed, showing that conservativeness is restrictive. An additional requirement arising from cognitive science research %and also observed when training neural networks is U-shapedness, stating that the learning algorithm does diverge from a correct hypothesis. We show that forbidding U-shapes also restricts iterative learners from binary labeled data.
In order to compute the next hypothesis, BMS learning algorithms refer to the currently observed datum and the actual state of the learning algorithm. For learning algorithms equipped with an infinite amount of states, we provide the complete map. A learning success criterion is semantic if it still holds, when the learning algorithm outputs other parameters standing for the same classifier. Syntactic (non-semantic) learning success criteria, for example conservativeness and syntactic non-U-shapedness, restrict BMS learning algorithms. For proving the equivalence of the syntactic requirements, we refer to witness-based learning processes. In these, every change of the hypothesis is justified by a later on correctly classified witness from the training data. Moreover, for every semantic delayable learning requirement, iterative and BMS learning algorithms are equivalent. In case the considered learning success criterion incorporates syntactic non-U-shapedness, BMS learning algorithms can learn more concept classes than iterative learning algorithms.
The proofs are combinatorial, inspired by investigating formal languages or employ results from computability theory, such as infinite recursion theorems (fixed point theorems).
Learning analytics at scale
(2021)
Digital technologies are paving the way for innovative educational approaches. The learning format of Massive Open Online Courses (MOOCs) provides a highly accessible path to lifelong learning while being more affordable and flexible than face-to-face courses. Thereby, thousands of learners can enroll in courses mostly without admission restrictions, but this also raises challenges. Individual supervision by teachers is barely feasible, and learning persistence and success depend on students' self-regulatory skills. Here, technology provides the means for support. The use of data for decision-making is already transforming many fields, whereas in education, it is still a young research discipline. Learning Analytics (LA) is defined as the measurement, collection, analysis, and reporting of data about learners and their learning contexts with the purpose of understanding and improving learning and learning environments. The vast amount of data that MOOCs produce on the learning behavior and success of thousands of students provides the opportunity to study human learning and develop approaches addressing the demands of learners and teachers.
The overall purpose of this dissertation is to investigate the implementation of LA at the scale of MOOCs and to explore how data-driven technology can support learning and teaching in this context. To this end, several research prototypes have been iteratively developed for the HPI MOOC Platform. Hence, they were tested and evaluated in an authentic real-world learning environment. Most of the results can be applied on a conceptual level to other MOOC platforms as well. The research contribution of this thesis thus provides practical insights beyond what is theoretically possible. In total, four system components were developed and extended:
(1) The Learning Analytics Architecture: A technical infrastructure to collect, process, and analyze event-driven learning data based on schema-agnostic pipelining in a service-oriented MOOC platform. (2) The Learning Analytics Dashboard for Learners: A tool for data-driven support of self-regulated learning, in particular to enable learners to evaluate and plan their learning activities, progress, and success by themselves. (3) Personalized Learning Objectives: A set of features to better connect learners' success to their personal intentions based on selected learning objectives to offer guidance and align the provided data-driven insights about their learning progress. (4) The Learning Analytics Dashboard for Teachers: A tool supporting teachers with data-driven insights to enable the monitoring of their courses with thousands of learners, identify potential issues, and take informed action.
For all aspects examined in this dissertation, related research is presented, development processes and implementation concepts are explained, and evaluations are conducted in case studies. Among other findings, the usage of the learner dashboard in combination with personalized learning objectives demonstrated improved certification rates of 11.62% to 12.63%. Furthermore, it was observed that the teacher dashboard is a key tool and an integral part for teaching in MOOCs. In addition to the results and contributions, general limitations of the work are discussed—which altogether provide a solid foundation for practical implications and future research.
Compound values are not universally supported in virtual machine (VM)-based programming systems and languages. However, providing data structures with value characteristics can be beneficial. On one hand, programming systems and languages can adequately represent physical quantities with compound values and avoid inconsistencies, for example, in representation of large numbers. On the other hand, just-in-time (JIT) compilers, which are often found in VMs, can rely on the fact that compound values are immutable, which is an important property in optimizing programs. Considering this, compound values have an optimization potential that can be put to use by implementing them in VMs in a way that is efficient in memory usage and execution time. Yet, optimized compound values in VMs face certain challenges: to maintain consistency, it should not be observable by the program whether compound values are represented in an optimized way by a VM; an optimization should take into account, that the usage of compound values can exhibit certain patterns at run-time; and that necessary value-incompatible properties due to implementation restrictions should be reduced.
We propose a technique to detect and compress common patterns of compound value usage at run-time to improve memory usage and execution speed. Our approach identifies patterns of frequent compound value references and introduces abbreviated forms for them. Thus, it is possible to store multiple inter-referenced compound values in an inlined memory representation, reducing the overhead of metadata and object references. We extend our approach by a notion of limited mutability, using cells that act as barriers for our approach and provide a location for shared, mutable access with the possibility of type specialization. We devise an extension to our approach that allows us to express automatic unboxing of boxed primitive data types in terms of our initial technique. We show that our approach is versatile enough to express another optimization technique that relies on values, such as Booleans, that are unique throughout a programming system. Furthermore, we demonstrate how to re-use learned usage patterns and optimizations across program runs, thus reducing the performance impact of pattern recognition.
We show in a best-case prototype that the implementation of our approach is feasible and can also be applied to general purpose programming systems, namely implementations of the Racket language and Squeak/Smalltalk. In several micro-benchmarks, we found that our approach can effectively reduce memory consumption and improve execution speed.
Virtualizing physical space
(2021)
The true cost for virtual reality is not the hardware, but the physical space it requires, as a one-to-one mapping of physical space to virtual space allows for the most immersive way of navigating in virtual reality. Such “real-walking” requires physical space to be of the same size and the same shape of the virtual world represented. This generally prevents real-walking applications from running on any space that they were not designed for.
To reduce virtual reality’s demand for physical space, creators of such applications let users navigate virtual space by means of a treadmill, altered mappings of physical to virtual space, hand-held controllers, or gesture-based techniques. While all of these solutions succeed at reducing virtual reality’s demand for physical space, none of them reach the same level of immersion that real-walking provides.
Our approach is to virtualize physical space: instead of accessing physical space directly, we allow applications to express their need for space in an abstract way, which our software systems then map to the physical space available. We allow real-walking applications to run in spaces of different size, different shape, and in spaces containing different physical objects. We also allow users immersed in different virtual environments to share the same space.
Our systems achieve this by using a tracking volume-independent representation of real-walking experiences — a graph structure that expresses the spatial and logical relationships between virtual locations, virtual elements contained within those locations, and user interactions with those elements. When run in a specific physical space, this graph representation is used to define a custom mapping of the elements of the virtual reality application and the physical space by parsing the graph using a constraint solver. To re-use space, our system splits virtual scenes and overlap virtual geometry. The system derives this split by means of hierarchically clustering of our virtual objects as nodes of our bi-partite directed graph that represents the logical ordering of events of the experience. We let applications express their demands for physical space and use pre-emptive scheduling between applications to have them share space. We present several application examples enabled by our system. They all enable real-walking, despite being mapped to physical spaces of different size and shape, containing different physical objects or other users.
We see substantial real-world impact in our systems. Today’s commercial virtual reality applications are generally designing to be navigated using less immersive solutions, as this allows them to be operated on any tracking volume. While this is a commercial necessity for the developers, it misses out on the higher immersion offered by real-walking. We let developers overcome this hurdle by allowing experiences to bring real-walking to any tracking volume, thus potentially bringing real-walking to consumers.
Die eigentlichen Kosten für Virtual Reality Anwendungen entstehen nicht primär durch die erforderliche Hardware, sondern durch die Nutzung von physischem Raum, da die eins-zu-eins Abbildung von physischem auf virtuellem Raum die immersivste Art von Navigation ermöglicht. Dieses als „Real-Walking“ bezeichnete Erlebnis erfordert hinsichtlich Größe und Form eine Entsprechung von physischem Raum und virtueller Welt. Resultierend daraus können Real-Walking-Anwendungen nicht an Orten angewandt werden, für die sie nicht entwickelt wurden.
Um den Bedarf an physischem Raum zu reduzieren, lassen Entwickler von Virtual Reality-Anwendungen ihre Nutzer auf verschiedene Arten navigieren, etwa mit Hilfe eines Laufbandes, verfälschten Abbildungen von physischem zu virtuellem Raum, Handheld-Controllern oder gestenbasierten Techniken. All diese Lösungen reduzieren zwar den Bedarf an physischem Raum, erreichen jedoch nicht denselben Grad an Immersion, den Real-Walking bietet.
Unser Ansatz zielt darauf, physischen Raum zu virtualisieren: Anstatt auf den physischen Raum direkt zuzugreifen, lassen wir Anwendungen ihren Raumbedarf auf abstrakte Weise formulieren, den unsere Softwaresysteme anschließend auf den verfügbaren physischen Raum abbilden. Dadurch ermöglichen wir Real-Walking-Anwendungen Räume mit unterschiedlichen Größen und Formen und Räume, die unterschiedliche physische Objekte enthalten, zu nutzen. Wir ermöglichen auch die zeitgleiche Nutzung desselben Raums durch mehrere Nutzer verschiedener Real-Walking-Anwendungen.
Unsere Systeme erreichen dieses Resultat durch eine Repräsentation von Real-Walking-Erfahrungen, die unabhängig sind vom gegebenen Trackingvolumen – eine Graphenstruktur, die die räumlichen und logischen Beziehungen zwischen virtuellen Orten, den virtuellen Elementen innerhalb dieser Orte, und Benutzerinteraktionen mit diesen Elementen, ausdrückt. Bei der Instanziierung der Anwendung in einem bestimmten physischen Raum wird diese Graphenstruktur und ein Constraint Solver verwendet, um eine individuelle Abbildung der virtuellen Elemente auf den physischen Raum zu erreichen. Zur mehrmaligen Verwendung des Raumes teilt unser System virtuelle Szenen und überlagert virtuelle Geometrie. Das System leitet diese Aufteilung anhand eines hierarchischen Clusterings unserer virtuellen Objekte ab, die als Knoten unseres bi-partiten, gerichteten Graphen die logische Reihenfolge aller Ereignisse repräsentieren. Wir verwenden präemptives Scheduling zwischen den Anwendungen für die zeitgleiche Nutzung von physischem Raum. Wir stellen mehrere Anwendungsbeispiele vor, die Real-Walking ermöglichen – in physischen Räumen mit unterschiedlicher Größe und Form, die verschiedene physische Objekte oder weitere Nutzer enthalten.
Wir sehen in unseren Systemen substantielles Potential. Heutige Virtual Reality-Anwendungen sind bisher zwar so konzipiert, dass sie auf einem beliebigen Trackingvolumen betrieben werden können, aber aus kommerzieller Notwendigkeit kein Real-Walking beinhalten. Damit entgeht Entwicklern die Gelegenheit eine höhere Immersion herzustellen. Indem wir es ermöglichen, Real-Walking auf jedes Trackingvolumen zu bringen, geben wir Entwicklern die Möglichkeit Real-Walking zu ihren Nutzern zu bringen.
Modern knowledge bases contain and organize knowledge from many different topic areas. Apart from specific entity information, they also store information about their relationships amongst each other. Combining this information results in a knowledge graph that can be particularly helpful in cases where relationships are of central importance. Among other applications, modern risk assessment in the financial sector can benefit from the inherent network structure of such knowledge graphs by assessing the consequences and risks of certain events, such as corporate insolvencies or fraudulent behavior, based on the underlying network structure. As public knowledge bases often do not contain the necessary information for the analysis of such scenarios, the need arises to create and maintain dedicated domain-specific knowledge bases.
This thesis investigates the process of creating domain-specific knowledge bases from structured and unstructured data sources. In particular, it addresses the topics of named entity recognition (NER), duplicate detection, and knowledge validation, which represent essential steps in the construction of knowledge bases.
As such, we present a novel method for duplicate detection based on a Siamese neural network that is able to learn a dataset-specific similarity measure which is used to identify duplicates. Using the specialized network architecture, we design and implement a knowledge transfer between two deduplication networks, which leads to significant performance improvements and a reduction of required training data.
Furthermore, we propose a named entity recognition approach that is able to identify company names by integrating external knowledge in the form of dictionaries into the training process of a conditional random field classifier. In this context, we study the effects of different dictionaries on the performance of the NER classifier. We show that both the inclusion of domain knowledge as well as the generation and use of alias names results in significant performance improvements.
For the validation of knowledge represented in a knowledge base, we introduce Colt, a framework for knowledge validation based on the interactive quality assessment of logical rules. In its most expressive implementation, we combine Gaussian processes with neural networks to create Colt-GP, an interactive algorithm for learning rule models. Unlike other approaches, Colt-GP uses knowledge graph embeddings and user feedback to cope with data quality issues of knowledge bases. The learned rule model can be used to conditionally apply a rule and assess its quality.
Finally, we present CurEx, a prototypical system for building domain-specific knowledge bases from structured and unstructured data sources. Its modular design is based on scalable technologies, which, in addition to processing large datasets, ensures that the modules can be easily exchanged or extended. CurEx offers multiple user interfaces, each tailored to the individual needs of a specific user group and is fully compatible with the Colt framework, which can be used as part of the system.
We conduct a wide range of experiments with different datasets to determine the strengths and weaknesses of the proposed methods. To ensure the validity of our results, we compare the proposed methods with competing approaches.
As part of our everyday life we consume breaking news and interpret it based on our own viewpoints and beliefs. We have easy access to online social networking platforms and news media websites, where we inform ourselves about current affairs and often post about our own views, such as in news comments or social media posts. The media ecosystem enables opinions and facts to travel from news sources to news readers, from news article commenters to other readers, from social network users to their followers, etc. The views of the world many of us have depend on the information we receive via online news and social media. Hence, it is essential to maintain accurate, reliable and objective online content to ensure democracy and verity on the Web. To this end, we contribute to a trustworthy media ecosystem by analyzing news and social media in the context of politics to ensure that media serves the public interest. In this thesis, we use text mining, natural language processing and machine learning techniques to reveal underlying patterns in political news articles and political discourse in social networks.
Mainstream news sources typically cover a great amount of the same news stories every day, but they often place them in a different context or report them from different perspectives. In this thesis, we are interested in how distinct and predictable newspaper journalists are, in the way they report the news, as a means to understand and identify their different political beliefs. To this end, we propose two models that classify text from news articles to their respective original news source, i.e., reported speech and also news comments. Our goal is to capture systematic quoting and commenting patterns by journalists and news commenters respectively, which can lead us to the newspaper where the quotes and comments are originally published. Predicting news sources can help us understand the potential subjective nature behind news storytelling and the magnitude of this phenomenon. Revealing this hidden knowledge can restore our trust in media by advancing transparency and diversity in the news.
Media bias can be expressed in various subtle ways in the text and it is often challenging to identify these bias manifestations correctly, even for humans. However, media experts, e.g., journalists, are a powerful resource that can help us overcome the vague definition of political media bias and they can also assist automatic learners to find the hidden bias in the text. Due to the enormous technological advances in artificial intelligence, we hypothesize that identifying political bias in the news could be achieved through the combination of sophisticated deep learning modelsxi and domain expertise. Therefore, our second contribution is a high-quality and reliable news dataset annotated by journalists for political bias and a state-of-the-art solution for this task based on curriculum learning. Our aim is to discover whether domain expertise is necessary for this task and to provide an automatic solution for this traditionally manually-solved problem. User generated content is fundamentally different from news articles, e.g., messages are shorter, they are often personal and opinionated, they refer to specific topics and persons, etc. Regarding political and socio-economic news, individuals in online communities make use of social networks to keep their peers up-to-date and to share their own views on ongoing affairs. We believe that social media is also an as powerful instrument for information flow as the news sources are, and we use its unique characteristic of rapid news coverage for two applications. We analyze Twitter messages and debate transcripts during live political presidential debates to automatically predict the topics that Twitter users discuss. Our goal is to discover the favoured topics in online communities on the dates of political events as a way to understand the political subjects of public interest. With the up-to-dateness of microblogs, an additional opportunity emerges, namely to use social media posts and leverage the real-time verity about discussed individuals to find their locations.
That is, given a person of interest that is mentioned in online discussions, we use the wisdom of the crowd to automatically track her physical locations over time. We evaluate our approach in the context of politics, i.e., we predict the locations of US politicians as a proof of concept for important use cases, such as to track people that
are national risks, e.g., warlords and wanted criminals.