Refine
Has Fulltext
- yes (2)
Document Type
- Doctoral Thesis (2) (remove)
Language
- English (2)
Is part of the Bibliography
- yes (2)
Keywords
- pattern recognition (2) (remove)
Computer Security deals with the detection and mitigation of threats to computer networks, data, and computing hardware. This
thesis addresses the following two computer security problems: email spam campaign and malware detection.
Email spam campaigns can easily be generated using popular dissemination tools by specifying simple grammars that serve as message templates. A grammar is disseminated to nodes of a bot net, the nodes create messages by instantiating the grammar at random. Email spam campaigns can encompass huge data volumes and therefore pose a threat to the stability of the infrastructure of email service providers that have to store them. Malware -software that serves a malicious purpose- is affecting web servers, client computers via active content, and client computers through executable files. Without the help of malware detection systems it would be easy for malware creators to collect sensitive information or to infiltrate computers.
The detection of threats -such as email-spam messages, phishing messages, or malware- is an adversarial and therefore intrinsically
difficult problem. Threats vary greatly and evolve over time. The detection of threats based on manually-designed rules is therefore
difficult and requires a constant engineering effort. Machine-learning is a research area that revolves around the analysis of data and the discovery of patterns that describe aspects of the data. Discriminative learning methods extract prediction models from data that are optimized to predict a target attribute as accurately as possible. Machine-learning methods hold the promise of automatically identifying patterns that robustly and accurately detect threats. This thesis focuses on the design and analysis of discriminative learning methods for the two computer-security problems under investigation: email-campaign and malware detection.
The first part of this thesis addresses email-campaign detection. We focus on regular expressions as a syntactic framework, because regular expressions are intuitively comprehensible by security engineers and administrators, and they can be applied as a detection mechanism in an extremely efficient manner. In this setting, a prediction model is provided with exemplary messages from an email-spam campaign. The prediction model has to generate a regular expression that reveals the syntactic pattern that underlies the entire campaign, and that a security engineers finds comprehensible and feels confident enough to use the expression to blacklist further messages at the email server. We model this problem as two-stage learning problem with structured input and output spaces which can be solved using standard cutting plane methods. Therefore we develop an appropriate loss function, and derive a decoder for the resulting optimization problem.
The second part of this thesis deals with the problem of predicting whether a given JavaScript or PHP file is malicious or benign. Recent malware analysis techniques use static or dynamic features, or both. In fully dynamic analysis, the software or script is executed and observed for malicious behavior in a sandbox environment. By contrast, static analysis is based on features that can be extracted directly from the program file. In order to bypass static detection mechanisms, code obfuscation techniques are used to spread a malicious program file in many different syntactic variants. Deobfuscating the code before applying a static classifier can be subjected to mostly static code analysis and can overcome the problem of obfuscated malicious code, but on the other hand increases the computational costs of malware detection by an order of magnitude. In this thesis we present a cascaded architecture in which a classifier first performs a static analysis of the original code and -based on the outcome of this first classification step- the code may be deobfuscated and classified again. We explore several types of features including token $n$-grams, orthogonal sparse bigrams, subroutine-hashings, and syntax-tree features and study the robustness of detection methods and feature types against the evolution of malware over time. The developed tool scans very large file collections quickly and accurately.
Each model is evaluated on real-world data and compared to reference methods. Our approach of inferring regular expressions to filter emails belonging to an email spam campaigns leads to models with a high true-positive rate at a very low false-positive rate that is an order of magnitude lower than that of a commercial content-based filter. Our presented system -REx-SVMshort- is being used by a commercial email service provider and complements content-based and IP-address based filtering.
Our cascaded malware detection system is evaluated on a high-quality data set of almost 400,000 conspicuous PHP files and a collection of more than 1,00,000 JavaScript files. From our case study we can conclude that our system can quickly and accurately process large data collections at a low false-positive rate.
Merapi volcano is one of the most active and dangerous volcanoes of the earth. Located in central part of Java island (Indonesia), even a moderate eruption of Merapi poses a high risk to the highly populated area. Due to the close relationship between the volcanic unrest and the occurrence of seismic events at Mt. Merapi, the monitoring of Merapi's seismicity plays an important role for recognizing major changes in the volcanic activity. An automatic seismic event detection and classification system, which is capable to characterize the actual seismic activity in near real-time, is an important tool which allows the scientists in charge to take immediate decisions during a volcanic crisis. In order to accomplish the task of detecting and classifying volcano-seismic signals automatically in the continuous data streams, a pattern recognition approach has been used. It is based on the method of hidden Markov models (HMM), a technique, which has proven to provide high recognition rates at high confidence levels in classification tasks of similar complexity (e.g. speech recognition). Any pattern recognition system relies on the appropriate representation of the input data in order to allow a reasonable class-decision by means of a mathematical test function. Based on the experiences from seismological observatory practice, a parametrization scheme of the seismic waveform data is derived using robust seismological analysis techniques. The wavefield parameters are summarized into a real-valued feature vector per time step. The time series of this feature vector build the basis for the HMM-based classification system. In order to make use of discrete hidden Markov (DHMM) techniques, the feature vectors are further processed by applying a de-correlating and prewhitening transformation and additional vector quantization. The seismic wavefield is finally represented as a discrete symbol sequence with a finite alphabet. This sequence is subject to a maximum likelihood test against the discrete hidden Markov models, learned from a representative set of training sequences for each seismic event type of interest. A time period from July, 1st to July, 5th, 1998 of rapidly increasing seismic activity prior to the eruptive cycle between July, 10th and July, 19th, 1998 at Merapi volcano is selected for evaluating the performance of this classification approach. Three distinct types of seismic events according to the established classification scheme of the Volcanological Survey of Indonesia (VSI) have been observed during this time period. Shallow volcano-tectonic events VTB (h < 2.5 km), very shallow dome-growth related seismic events MP (h < 1 km) and seismic signals connected to rockfall activity originating from the active lava dome, termed Guguran. The special configuration of the digital seismic station network at Merapi volcano, a combination of small-aperture array deployments surrounding Merapi's summit region, allows the use of array methods to parametrize the continuously recorded seismic wavefield. The individual signal parameters are analyzed to determine their relevance for the discrimination of seismic event classes. For each of the three observed event types a set of DHMMs has been trained using a selected set of seismic events with varying signal to noise ratios and signal durations. Additionally, two sets of discrete hidden Markov models have been derived for the seismic noise, incorporating the fact, that the wavefield properties of the ambient vibrations differ considerably during working hours and night time. A total recognition accuracy of 67% is obtained. The mean false alarm (FA) rate can be given by 41 FA/class/day. However, variations in the recognition capabilities for the individual seismic event classes are significant. Shallow volcano-tectonic signals (VTB) show very distinct wavefield properties and (at least in the selected time period) a stable time pattern of wavefield attributes. The DHMM-based classification performs therefore best for VTB-type events, with almost 89% recognition accuracy and 2 FA/day. Seismic signals of the MP- and Guguran-classes are more difficult to detect and classify. Around 64% of MP-events and 74% of Guguran signals are recognized correctly. The average false alarm rate for MP-events is 87 FA/day, whereas for Guguran signals 33 FA/day are obtained. However, the majority of missed events and false alarms for both MP and Guguran events are due to confusion errors between these two event classes in the recognition process. The confusion of MP and Guguran events is interpreted as being a consequence of the selected parametrization approach for the continuous seismic data streams. The observed patterns of the analyzed wavefield attributes for MP and Guguran events show a significant amount of similarity, thus providing not sufficient discriminative information for the numerical classification. The similarity of wavefield parameters obtained for seismic events of MP and Guguran type reflect the commonly observed dominance of path effects on the seismic wave propagation in volcanic environments. The recognition rates obtained for the five-day period of increasing seismicity show, that the presented DHMM-based automatic classification system is a promising approach for the difficult task of classifying volcano-seismic signals. Compared to standard signal detection algorithms, the most significant advantage of the discussed technique is, that the entire seismogram is detected and classified in a single step.