Refine
Has Fulltext
- yes (3)
Year of publication
- 2018 (3) (remove)
Document Type
- Doctoral Thesis (2)
- Postprint (1)
Language
- English (3)
Is part of the Bibliography
- yes (3)
Keywords
- machine learning (3) (remove)
The rapid development and integration of Information Technologies over the last decades influenced all areas of our life, including the business world. Yet not only the modern enterprises become digitalised, but also security and criminal threats move into the digital sphere. To withstand these threats, modern companies must be aware of all activities within their computer networks.
The keystone for such continuous security monitoring is a Security Information and Event Management (SIEM) system that collects and processes all security-related log messages from the entire enterprise network. However, digital transformations and technologies, such as network virtualisation and widespread usage of mobile communications, lead to a constantly increasing number of monitored devices and systems. As a result, the amount of data that has to be processed by a SIEM system is increasing rapidly. Besides that, in-depth security analysis of the captured data requires the application of rather sophisticated outlier detection algorithms that have a high computational complexity. Existing outlier detection methods often suffer from performance issues and are not directly applicable for high-speed and high-volume analysis of heterogeneous security-related events, which becomes a major challenge for modern SIEM systems nowadays.
This thesis provides a number of solutions for the mentioned challenges. First, it proposes a new SIEM system architecture for high-speed processing of security events, implementing parallel, in-memory and in-database processing principles. The proposed architecture also utilises the most efficient log format for high-speed data normalisation. Next, the thesis offers several novel high-speed outlier detection methods, including generic Hybrid Outlier Detection that can efficiently be used for Big Data analysis. Finally, the special User Behaviour Outlier Detection is proposed for better threat detection and analysis of particular user behaviour cases.
The proposed architecture and methods were evaluated in terms of both performance and accuracy, as well as compared with classical architecture and existing algorithms. These evaluations were performed on multiple data sets, including simulated data, well-known public intrusion detection data set, and real data from the large multinational enterprise. The evaluation results have proved the high performance and efficacy of the developed methods.
All concepts proposed in this thesis were integrated into the prototype of the SIEM system, capable of high-speed analysis of Big Security Data, which makes this integrated SIEM platform highly relevant for modern enterprise security applications.
Potato (Solanum tuberosum L.) is one of the most important food crops worldwide. Current potato varieties are highly susceptible to drought stress. In view of global climate change, selection of cultivars with improved drought tolerance and high yield potential is of paramount importance. Drought tolerance breeding of potato is currently based on direct selection according to yield and phenotypic traits and requires multiple trials under drought conditions. Marker‐assisted selection (MAS) is cheaper, faster and reduces classification errors caused by noncontrolled environmental effects. We analysed 31 potato cultivars grown under optimal and reduced water supply in six independent field trials. Drought tolerance was determined as tuber starch yield. Leaf samples from young plants were screened for preselected transcript and nontargeted metabolite abundance using qRT‐PCR and GC‐MS profiling, respectively. Transcript marker candidates were selected from a published RNA‐Seq data set. A Random Forest machine learning approach extracted metabolite and transcript markers for drought tolerance prediction with low error rates of 6% and 9%, respectively. Moreover, by combining transcript and metabolite markers, the prediction error was reduced to 4.3%. Feature selection from Random Forest models allowed model minimization, yielding a minimal combination of only 20 metabolite and transcript markers that were successfully tested for their reproducibility in 16 independent agronomic field trials. We demonstrate that a minimum combination of transcript and metabolite markers sampled at early cultivation stages predicts potato yield stability under drought largely independent of seasonal and regional agronomic conditions.
The purpose of Probabilistic Seismic Hazard Assessment (PSHA) at a construction site is to provide the engineers with a probabilistic estimate of ground-motion level that could be equaled or exceeded at least once in the structure’s design lifetime. A certainty on the predicted ground-motion allows the engineers to confidently optimize structural design and mitigate the risk of extensive damage, or in worst case, a collapse. It is therefore in interest of engineering, insurance, disaster mitigation, and security of society at large, to reduce uncertainties in prediction of design ground-motion levels.
In this study, I am concerned with quantifying and reducing the prediction uncertainty of regression-based Ground-Motion Prediction Equations (GMPEs). Essentially, GMPEs are regressed best-fit formulae relating event, path, and site parameters (predictor variables) to observed ground-motion values at the site (prediction variable). GMPEs are characterized by a parametric median (μ) and a non-parametric variance (σ) of prediction. μ captures the known ground-motion physics i.e., scaling with earthquake rupture properties (event), attenuation with distance from source (region/path), and amplification due to local soil conditions (site); while σ quantifies the natural variability of data that eludes μ. In a broad sense, the GMPE prediction uncertainty is cumulative of 1) uncertainty on estimated regression coefficients (uncertainty on μ,σ_μ), and 2) the inherent natural randomness of data (σ). The extent of μ parametrization, the quantity, and quality of ground-motion data used in a regression, govern the size of its prediction uncertainty: σ_μ and σ.
In the first step, I present the impact of μ parametrization on the size of σ_μ and σ. Over-parametrization appears to increase the σ_μ, because of the large number of regression coefficients (in μ) to be estimated with insufficient data. Under-parametrization mitigates σ_μ, but the reduced explanatory strength of μ is reflected in inflated σ. For an optimally parametrized GMPE, a ~10% reduction in σ is attained by discarding the low-quality data from pan-European events with incorrect parametric values (of predictor variables).
In case of regions with scarce ground-motion recordings, without under-parametrization, the only way to mitigate σ_μ is to substitute long-term earthquake data at a location with short-term samples of data across several locations – the Ergodic Assumption. However, the price of ergodic assumption is an increased σ, due to the region-to-region and site-to-site differences in ground-motion physics. σ of an ergodic GMPE developed from generic ergodic dataset is much larger than that of non-ergodic GMPEs developed from region- and site-specific non-ergodic subsets - which were too sparse to produce their specific GMPEs. Fortunately, with the dramatic increase in recorded ground-motion data at several sites across Europe and Middle-East, I could quantify the region- and site-specific differences in ground-motion scaling and upgrade the GMPEs with 1) substantially more accurate region- and site-specific μ for sites in Italy and Turkey, and 2) significantly smaller prediction variance σ. The benefit of such enhancements to GMPEs is quite evident in my comparison of PSHA estimates from ergodic versus region- and site-specific GMPEs; where the differences in predicted design ground-motion levels, at several sites in Europe and Middle-Eastern regions, are as large as ~50%.
Resolving the ergodic assumption with mixed-effects regressions is feasible when the quantified region- and site-specific effects are physically meaningful, and the non-ergodic subsets (regions and sites) are defined a priori through expert knowledge. In absence of expert definitions, I demonstrate the potential of machine learning techniques in identifying efficient clusters of site-specific non-ergodic subsets, based on latent similarities in their ground-motion data. Clustered site-specific GMPEs bridge the gap between site-specific and fully ergodic GMPEs, with their partially non-ergodic μ and, σ ~15% smaller than the ergodic variance.
The methodological refinements to GMPE development produced in this study are applicable to new ground-motion datasets, to further enhance certainty of ground-motion prediction and thereby, seismic hazard assessment. Advanced statistical tools show great potential in improving the predictive capabilities of GMPEs, but the fundamental requirement remains: large quantity of high-quality ground-motion data from several sites for an extended time-period.