Refine
Document Type
- Doctoral Thesis (17)
- Habilitation Thesis (1)
Is part of the Bibliography
- yes (18)
Keywords
- Design Thinking (2)
- computer vision (2)
- deep learning (2)
- tiefes Lernen (2)
- Access Control (1)
- Analyse (1)
- Angriffserkennung (1)
- Annahme (1)
- Archivanalyse (1)
- Artificial Intelligence (1)
Kohärenz und Kreativität
(2020)
The last years have shown an increasing sophistication of attacks against enterprises. Traditional security solutions like firewalls, anti-virus systems and generally Intrusion Detection Systems (IDSs) are no longer sufficient to protect an enterprise against these advanced attacks. One popular approach to tackle this issue is to collect and analyze events generated across the IT landscape of an enterprise. This task is achieved by the utilization of Security Information and Event Management (SIEM) systems. However, the majority of the currently existing SIEM solutions is not capable of handling the massive volume of data and the diversity of event representations. Even if these solutions can collect the data at a central place, they are neither able to extract all relevant information from the events nor correlate events across various sources. Hence, only rather simple attacks are detected, whereas complex attacks, consisting of multiple stages, remain undetected. Undoubtedly, security operators of large enterprises are faced with a typical Big Data problem.
In this thesis, we propose and implement a prototypical SIEM system named Real-Time Event Analysis and Monitoring System (REAMS) that addresses the Big Data challenges of event data with common paradigms, such as data normalization, multi-threading, in-memory storage, and distributed processing. In particular, a mostly stream-based event processing workflow is proposed that collects, normalizes, persists and analyzes events in near real-time. In this regard, we have made various contributions in the SIEM context. First, we propose a high-performance normalization algorithm that is highly parallelized across threads and distributed across nodes. Second, we are persisting into an in-memory database for fast querying and correlation in the context of attack detection. Third, we propose various analysis layers, such as anomaly- and signature-based detection, that run on top of the normalized and correlated events. As a result, we demonstrate our capabilities to detect previously known as well as unknown attack patterns. Lastly, we have investigated the integration of cyber threat intelligence (CTI) into the analytical process, for instance, for correlating monitored user accounts with previously collected public identity leaks to identify possible compromised user accounts.
In summary, we show that a SIEM system can indeed monitor a large enterprise environment with a massive load of incoming events. As a result, complex attacks spanning across the whole network can be uncovered and mitigated, which is an advancement in comparison to existing SIEM systems on the market.
The identification of vulnerabilities in IT infrastructures is a crucial problem in enhancing the security, because many incidents resulted from already known vulnerabilities, which could have been resolved. Thus, the initial identification of vulnerabilities has to be used to directly resolve the related weaknesses and mitigate attack possibilities. The nature of vulnerability information requires a collection and normalization of the information prior to any utilization, because the information is widely distributed in different sources with their unique formats. Therefore, the comprehensive vulnerability model was defined and different sources have been integrated into one database. Furthermore, different analytic approaches have been designed and implemented into the HPI-VDB, which directly benefit from the comprehensive vulnerability model and especially from the logical preconditions and postconditions.
Firstly, different approaches to detect vulnerabilities in both IT systems of average users and corporate networks of large companies are presented. Therefore, the approaches mainly focus on the identification of all installed applications, since it is a fundamental step in the detection. This detection is realized differently depending on the target use-case. Thus, the experience of the user, as well as the layout and possibilities of the target infrastructure are considered. Furthermore, a passive lightweight detection approach was invented that utilizes existing information on corporate networks to identify applications.
In addition, two different approaches to represent the results using attack graphs are illustrated in the comparison between traditional attack graphs and a simplistic graph version, which was integrated into the database as well. The implementation of those use-cases for vulnerability information especially considers the usability. Beside the analytic approaches, the high data quality of the vulnerability information had to be achieved and guaranteed. The different problems of receiving incomplete or unreliable information for the vulnerabilities are addressed with different correction mechanisms. The corrections can be carried out with correlation or lookup mechanisms in reliable sources or identifier dictionaries. Furthermore, a machine learning based verification procedure was presented that allows an automatic derivation of important characteristics from the textual description of the vulnerabilities.
Medical imaging plays an important role in disease diagnosis, treatment planning, and clinical monitoring. One of the major challenges in medical image analysis is imbalanced training data, in which the class of interest is much rarer than the other classes. Canonical machine learning algorithms suppose that the number of samples from different classes in the training dataset is roughly similar or balance. Training a machine learning model on an imbalanced dataset can introduce unique challenges to the learning problem.
A model learned from imbalanced training data is biased towards the high-frequency samples. The predicted results of such networks have low sensitivity and high precision. In medical applications, the cost of misclassification of the minority class could be more than the cost of misclassification of the majority class. For example, the risk of not detecting a tumor could be much higher than referring to a healthy subject to a doctor. The current Ph.D. thesis introduces several deep learning-based approaches for handling class imbalanced problems for learning multi-task such as disease classification and semantic segmentation.
At the data-level, the objective is to balance the data distribution through re-sampling the data space: we propose novel approaches to correct internal bias towards fewer frequency samples. These approaches include patient-wise batch sampling, complimentary labels, supervised and unsupervised minority oversampling using generative adversarial networks for all.
On the other hand, at algorithm-level, we modify the learning algorithm to alleviate the bias towards majority classes. In this regard, we propose different generative adversarial networks for cost-sensitive learning, ensemble learning, and mutual learning to deal with highly imbalanced imaging data.
We show evidence that the proposed approaches are applicable to different types of medical images of varied sizes on different applications of routine clinical tasks, such as disease classification and semantic segmentation. Our various implemented algorithms have shown outstanding results on different medical imaging challenges.
An ever-increasing number of prediction models is published every year in different medical specialties. Prognostic or diagnostic in nature, these models support medical decision making by utilizing one or more items of patient data to predict outcomes of interest, such as mortality or disease progression. While different computer tools exist that support clinical predictive modeling, I observed that the state of the art is lacking in the extent to which the needs of research clinicians are addressed. When it comes to model development, current support tools either 1) target specialist data engineers, requiring advanced coding skills, or 2) cater to a general-purpose audience, therefore not addressing the specific needs of clinical researchers. Furthermore, barriers to data access across institutional silos, cumbersome model reproducibility and extended experiment-to-result times significantly hampers validation of existing models. Similarly, without access to interpretable explanations, which allow a given model to be fully scrutinized, acceptance of machine learning approaches will remain limited. Adequate tool support, i.e., a software artifact more targeted at the needs of clinical modeling, can help mitigate the challenges identified with respect to model development, validation and interpretation. To this end, I conducted interviews with modeling practitioners in health care to better understand the modeling process itself and ascertain in what aspects adequate tool support could advance the state of the art. The functional and non-functional requirements identified served as the foundation for a software artifact that can be used for modeling outcome and risk prediction in health research. To establish the appropriateness of this approach, I implemented a use case study in the Nephrology domain for acute kidney injury, which was validated in two different hospitals. Furthermore, I conducted user evaluation to ascertain whether such an approach provides benefits compared to the state of the art and the extent to which clinical practitioners could benefit from it. Finally, when updating models for external validation, practitioners need to apply feature selection approaches to pinpoint the most relevant features, since electronic health records tend to contain several candidate predictors. Building upon interpretability methods, I developed an explanation-driven recursive feature elimination approach. This method was comprehensively evaluated against state-of-the art feature selection methods. Therefore, this thesis' main contributions are three-fold, namely, 1) designing and developing a software artifact tailored to the specific needs of the clinical modeling domain, 2) demonstrating its application in a concrete case in the Nephrology context and 3) development and evaluation of a new feature selection approach applicable in a validation context that builds upon interpretability methods. In conclusion, I argue that appropriate tooling, which relies on standardization and parametrization, can support rapid model prototyping and collaboration between clinicians and data scientists in clinical predictive modeling.
Distance Education or e-Learning platform should be able to provide a virtual laboratory to let the participants have hands-on exercise experiences in practicing their skill remotely. Especially in Cybersecurity e-Learning where the participants need to be able to attack or defend the IT System. To have a hands-on exercise, the virtual laboratory environment must be similar to the real operational environment, where an attack or a victim is represented by a node in a virtual laboratory environment. A node is usually represented by a Virtual Machine (VM). Scalability has become a primary issue in the virtual laboratory for cybersecurity e-Learning because a VM needs a significant and fix allocation of resources. Available resources limit the number of simultaneous users. Scalability can be increased by increasing the efficiency of using available resources and by providing more resources. Increasing scalability means increasing the number of simultaneous users.
In this thesis, we propose two approaches to increase the efficiency of using the available resources. The first approach in increasing efficiency is by replacing virtual machines (VMs) with containers whenever it is possible. The second approach is sharing the load with the user-on-premise machine, where the user-on-premise machine represents one of the nodes in a virtual laboratory scenario. We also propose two approaches in providing more resources. One way to provide more resources is by using public cloud services. Another way to provide more resources is by gathering resources from the crowd, which is referred to as Crowdresourcing Virtual Laboratory (CRVL).
In CRVL, the crowd can contribute their unused resources in the form of a VM, a bare metal system, an account in a public cloud, a private cloud and an isolated group of VMs, but in this thesis, we focus on a VM. The contributor must give the credential of the VM admin or root user to the CRVL system. We propose an architecture and methods to integrate or dis-integrate VMs from the CRVL system automatically. A Team placement algorithm must also be investigated to optimize the usage of resources and at the same time giving the best service to the user. Because the CRVL system does not manage the contributor host machine, the CRVL system must be able to make sure that the VM integration will not harm their system and that the training material will be stored securely in the contributor sides, so that no one is able to take the training material away without permission. We are investigating ways to handle this kind of threats.
We propose three approaches to strengthen the VM from a malicious host admin. To verify the integrity of a VM before integration to the CRVL system, we propose a remote verification method without using any additional hardware such as the Trusted Platform Module chip. As the owner of the host machine, the host admins could have access to the VM's data via Random Access Memory (RAM) by doing live memory dumping, Spectre and Meltdown attacks. To make it harder for the malicious host admin in getting the sensitive data from RAM, we propose a method that continually moves sensitive data in RAM. We also propose a method to monitor the host machine by installing an agent on it. The agent monitors the hypervisor configurations and the host admin activities.
To evaluate our approaches, we conduct extensive experiments with different settings. The use case in our approach is Tele-Lab, a Virtual Laboratory platform for Cyber Security e-Learning. We use this platform as a basis for designing and developing our approaches. The results show that our approaches are practical and provides enhanced security.
With the fast rise of cloud computing adoption in the past few years, more companies are migrating their confidential files from their private data center to the cloud to help enterprise's digital transformation process. Enterprise file synchronization and share (EFSS) is one of the solutions offered for enterprises to store their files in the cloud with secure and easy file sharing and collaboration between its employees. However, the rapidly increasing number of cyberattacks on the cloud might target company's files on the cloud to be stolen or leaked to the public. It is then the responsibility of the EFSS system to ensure the company's confidential files to only be accessible by authorized employees.
CloudRAID is a secure personal cloud storage research collaboration project that provides data availability and confidentiality in the cloud. It combines erasure and cryptographic techniques to securely store files as multiple encrypted file chunks in various cloud service providers (CSPs). However, several aspects of CloudRAID's concept are unsuitable for secure and scalable enterprise cloud storage solutions, particularly key management system, location-based access control, multi-cloud storage management, and cloud file access monitoring.
This Ph.D. thesis focuses on CloudRAID for Business (CfB) as it resolves four main challenges of CloudRAID's concept for a secure and scalable EFSS system. First, the key management system is implemented using the attribute-based encryption scheme to provide secure and scalable intra-company and inter-company file-sharing functionalities. Second, an Internet-based location file access control functionality is introduced to ensure files could only be accessed at pre-determined trusted locations. Third, a unified multi-cloud storage resource management framework is utilized to securely manage cloud storage resources available in various CSPs for authorized CfB stakeholders. Lastly, a multi-cloud storage monitoring system is introduced to monitor the activities of files in the cloud using the generated cloud storage log files from multiple CSPs.
In summary, this thesis helps CfB system to provide holistic security for company's confidential files on the cloud-level, system-level, and file-level to ensure only authorized company and its employees could access the files.
The demand for learning Design Thinking (DT) as a path towards acquiring 21st-century skills has increased globally in the last decade. Because DT education originated in the Silicon Valley context of the d.school at Stanford, it is important to evaluate how the teaching of the methodology adapts to different cultural contexts.The thesis explores the impact of the socio-cultural context on DT education.
DT institutes in Cape Town, South Africa and Kuala Lumpur, Malaysia, were visited to observe their programs and conduct 22 semistructured interviews with local educators regarding their adaption strategies. Grounded theory methodology was used to develop a model of Socio-Cultural Adaptation of Design Thinking Education that maps these strategies onto five dimensions: Planning, Process, People, Place, and Presentation. Based on this model, a list of recommendations is provided to help DT educators and practitioners in designing and delivering culturally inclusive DT education.
Generative adversarial networks (GANs) have been broadly applied to a wide range of application domains since their proposal. In this thesis, we propose several methods that aim to tackle different existing problems in GANs. Particularly, even though GANs are generally able to generate high-quality samples, the diversity of the generated set is often sub-optimal. Moreover, the common increase of the number of models in the original GANs framework, as well as their architectural sizes, introduces additional costs. Additionally, even though challenging, the proper evaluation of a generated set is an important direction to ultimately improve the generation process in GANs. We start by introducing two diversification methods that extend the original GANs framework to multiple adversaries to stimulate sample diversity in a generated set. Then, we introduce a new post-training compression method based on Monte Carlo methods and importance sampling to quantize and prune the weights and activations of pre-trained neural networks without any additional training. The previous method may be used to reduce the memory and computational costs introduced by increasing the number of models in the original GANs framework. Moreover, we use a similar procedure to quantize and prune gradients during training, which also reduces the communication costs between different workers in a distributed training setting. We introduce several topology-based evaluation methods to assess data generation in different settings, namely image generation and language generation. Our methods retrieve both single-valued and double-valued metrics, which, given a real set, may be used to broadly assess a generated set or separately evaluate sample quality and sample diversity, respectively. Moreover, two of our metrics use locality-sensitive hashing to accurately assess the generated sets of highly compressed GANs. The analysis of the compression effects in GANs paves the way for their efficient employment in real-world applications. Given their general applicability, the methods proposed in this thesis may be extended beyond the context of GANs. Hence, they may be generally applied to enhance existing neural networks and, in particular, generative frameworks.
Individuals have an intrinsic need to express themselves to other humans within a given community by sharing their experiences, thoughts, actions, and opinions. As a means, they mostly prefer to use modern online social media platforms such as Twitter, Facebook, personal blogs, and Reddit. Users of these social networks interact by drafting their own statuses updates, publishing photos, and giving likes leaving a considerable amount of data behind them to be analyzed. Researchers recently started exploring the shared social media data to understand online users better and predict their Big five personality traits: agreeableness, conscientiousness, extraversion, neuroticism, and openness to experience. This thesis intends to investigate the possible relationship between users’ Big five personality traits and the published information on their social media profiles. Facebook public data such as linguistic status updates, meta-data of likes objects, profile pictures, emotions, or reactions records were adopted to address the proposed research questions. Several machine learning predictions models were constructed with various experiments to utilize the engineered features correlated with the Big 5 Personality traits. The final predictive performances improved the prediction accuracy compared to state-of-the-art approaches, and the models were evaluated based on established benchmarks in the domain. The research experiments were implemented while ethical and privacy points were concerned. Furthermore, the research aims to raise awareness about privacy between social media users and show what third parties can reveal about users’ private traits from what they share and act on different social networking platforms.
In the second part of the thesis, the variation in personality development is studied within a cross-platform environment such as Facebook and Twitter platforms. The constructed personality profiles in these social platforms are compared to evaluate the effect of the used platforms on one user’s personality development. Likewise, personality continuity and stability analysis are performed using two social media platforms samples. The implemented experiments are based on ten-year longitudinal samples aiming to understand users’ long-term personality development and further unlock the potential of cooperation between psychologists and data scientists.