@phdthesis{Meier2017, author = {Meier, Sebastian}, title = {Personal Big Data}, url = {http://nbn-resolving.de/urn:nbn:de:kobv:517-opus4-406696}, school = {Universit{\"a}t Potsdam}, pages = {xxiv, 133}, year = {2017}, abstract = {Many users of cloud-based services are concerned about questions of data privacy. At the same time, they want to benefit from smart data-driven services, which require insight into a person's individual behaviour. The modus operandi of user modelling is that data is sent to a remote server where the model is constructed and merged with other users' data. This thesis proposes selective cloud computing, an alternative approach, in which the user model is constructed on the client-side and only an abstracted generalised version of the model is shared with the remote services. In order to demonstrate the applicability of this approach, the thesis builds an exemplary client-side user modelling technique. As this thesis is carried out in the area of Geoinformatics and spatio-temporal data is particularly sensitive, the application domain for this experiment is the analysis and prediction of a user's spatio-temporal behaviour. The user modelling technique is grounded in an innovative conceptual model, which builds upon spatial network theory combined with time-geography. The spatio-temporal constraints of time-geography are applied to the network structure in order to create individual spatio-temporal action spaces. This concept is translated into a novel algorithmic user modelling approach which is solely driven by the user's own spatio-temporal trajectory data that is generated by the user's smartphone. While modern smartphones offer a rich variety of sensory data, this thesis only makes use of spatio-temporal trajectory data, enriched by activity classification, as the input and foundation for the algorithmic model. The algorithmic model consists of three basal components: locations (vertices), trips (edges), and clusters (neighbourhoods). After preprocessing the incoming trajectory data in order to identify locations, user feedback is used to train an artificial neural network to learn temporal patterns for certain location types (e.g. work, home, bus stop, etc.). This Artificial Neural Network (ANN) is used to automatically detect future location types by their spatio-temporal patterns. The same is done in order to predict the duration of stay at a certain location. Experiments revealed that neural nets were the most successful statistical and machine learning tool to detect those patterns. The location type identification algorithm reached an accuracy of 87.69\%, the duration prediction on binned data was less successful and deviated by an average of 0.69 bins. A challenge for the location type classification, as well as for the subsequent components, was the imbalance of trips and connections as well as the low accuracy of the trajectory data. The imbalance is grounded in the fact that most users exhibit strong habitual patterns (e.g. home > work), while other patterns are rather rare by comparison. The accuracy problem derives from the energy-saving location sampling mode, which creates less accurate results. Those locations are then used to build a network that represents the user's spatio-temporal behaviour. An initial untrained ANN to predict movement on the network only reached 46\% average accuracy. Only lowering the number of included edges, focusing on more common trips, increased the performance. In order to further improve the algorithm, the spatial trajectories were introduced into the predictions. To overcome the accuracy problem, trips between locations were clustered into so-called spatial corridors, which were intersected with the user's current trajectory. The resulting intersected trips were ranked through a k-nearest-neighbour algorithm. This increased the performance to 56\%. In a final step, a combination of a network and spatial clustering algorithm was built in order to create clusters, therein reducing the variety of possible trips. By only predicting the destination cluster instead of the exact location, it is possible to increase the performance to 75\% including all classes. A final set of components shows in two exemplary ways how to deduce additional inferences from the underlying spatio-temporal data. The first example presents a novel concept for predicting the 'potential memorisation index' for a certain location. The index is based on a cognitive model which derives the index from the user's activity data in that area. The second example embeds each location in its urban fabric and thereby enriches its cluster's metadata by further describing the temporal-semantic activity in an area (e.g. going to restaurants at noon). The success of the client-side classification and prediction approach, despite the challenges of inaccurate and imbalanced data, supports the claimed benefits of the client-side modelling concept. Since modern data-driven services at some point do need to receive user data, the thesis' computational model concludes with a concept for applying generalisation to semantic, temporal, and spatial data before sharing it with the remote service in order to comply with the overall goal to improve data privacy. In this context, the potentials of ensemble training (in regards to ANNs) are discussed in order to highlight the potential of only sharing the trained ANN instead of the raw input data. While the results of our evaluation support the assets of the proposed framework, there are two important downsides of our approach compared to server-side modelling. First, both of these server-side advantages are rooted in the server's access to multiple users' data. This allows a remote service to predict spatio-in the user-specific data, which represents the second downside. While minor classes will likely be minor classes in a bigger dataset as well, for each class, there will still be more variety than in the user-specific dataset. The author emphasises that the approach presented in this work holds the potential to change the privacy paradigm in modern data-driven services. Finding combinations of client- and server-side modelling could prove a promising new path for data-driven innovation. Beyond the technological perspective, throughout the thesis the author also offers a critical view on the data- and technology-driven development of this work. By introducing the client-side modelling with user-specific artificial neural networks, users generate their own algorithm. Those user-specific algorithms are influenced less by generalised biases or developers' prejudices. Therefore, the user develops a more diverse and individual perspective through his or her user model. This concept picks up the idea of critical cartography, which questions the status quo of how space is perceived and represented.}, language = {en} } @article{SchudomaLarhlimiWalther2011, author = {Schudoma, Christian and Larhlimi, Abdelhalim and Walther, Dirk}, title = {The influence of the local sequence environment on RNA loop structures}, series = {RNA : a publication of the RNA Society}, volume = {17}, journal = {RNA : a publication of the RNA Society}, number = {7}, publisher = {Cold Spring Harbor Laboratory Press}, address = {Cold Spring Harbor, NY}, issn = {1355-8382}, doi = {10.1261/rna.2550211}, pages = {1247 -- 1257}, year = {2011}, abstract = {RNA folding is assumed to be a hierarchical process. The secondary structure of an RNA molecule, signified by base-pairing and stacking interactions between the paired bases, is formed first. Subsequently, the RNA molecule adopts an energetically favorable three-dimensional conformation in the structural space determined mainly by the rotational degrees of freedom associated with the backbone of regions of unpaired nucleotides (loops). To what extent the backbone conformation of RNA loops also results from interactions within the local sequence context or rather follows global optimization constraints alone has not been addressed yet. Because the majority of base stacking interactions are exerted locally, a critical influence of local sequence on local structure appears plausible. Thus, local loop structure ought to be predictable, at least in part, from the local sequence context alone. To test this hypothesis, we used Random Forests on a nonredundant data set of unpaired nucleotides extracted from 97 X-ray structures from the Protein Data Bank (PDB) to predict discrete backbone angle conformations given by the discretized eta/theta-pseudo-torsional space. Predictions on balanced sets with four to six conformational classes using local sequence information yielded average accuracies of up to 55\%, thus significantly better than expected by chance (17\%-25\%). Bases close to the central nucleotide appear to be most tightly linked to its conformation. Our results suggest that RNA loop structure does not only depend on long-range base-pairing interactions; instead, it appears that local sequence context exerts a significant influence on the formation of the local loop structure.}, language = {en} } @unpublished{PrasseGrubenMachlikaetal.2016, author = {Prasse, Paul and Gruben, Gerrit and Machlika, Lukas and Pevny, Tomas and Sofka, Michal and Scheffer, Tobias}, title = {Malware Detection by HTTPS Traffic Analysis}, url = {http://nbn-resolving.de/urn:nbn:de:kobv:517-opus4-100942}, pages = {10}, year = {2016}, abstract = {In order to evade detection by network-traffic analysis, a growing proportion of malware uses the encrypted HTTPS protocol. We explore the problem of detecting malware on client computers based on HTTPS traffic analysis. In this setting, malware has to be detected based on the host IP address, ports, timestamp, and data volume information of TCP/IP packets that are sent and received by all the applications on the client. We develop a scalable protocol that allows us to collect network flows of known malicious and benign applications as training data and derive a malware-detection method based on a neural networks and sequence classification. We study the method's ability to detect known and new, unknown malware in a large-scale empirical study.}, language = {en} } @article{KibrikKhudyakovaDobrovetal.2016, author = {Kibrik, Andrej A. and Khudyakova, Mariya V. and Dobrov, Grigory B. and Linnik, Anastasia and Zalmanov, Dmitrij A.}, title = {Referential Choice}, series = {Frontiers in psychology}, volume = {7}, journal = {Frontiers in psychology}, publisher = {Frontiers Research Foundation}, address = {Lausanne}, issn = {1664-1078}, doi = {10.3389/fpsyg.2016.01429}, year = {2016}, abstract = {We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, corpus analysis, machine learning methods and experimentation with human participants. Machine learning algorithms make use of 25 factors, including referent's properties (such as animacy and protagonism), the distance between a referential expression and its antecedent, the antecedent's syntactic role, and so on. Having found the predictions of our algorithm to coincide with the original almost 90\% of the time, we hypothesized that fully accurate prediction is not possible because, in many situations, more than one referential option is available. This hypothesis was supported by an experimental study, in which participants answered questions about either the original text in the corpus, or about a text modified in accordance with the algorithm's prediction. Proportions of correct answers to these questions, as well as participants' rating of the questions' difficulty, suggested that divergences between the algorithm's prediction and the original referential device in the corpus occur overwhelmingly in situations where the referential choice is not categorical.}, language = {en} } @misc{KibrikKhudyakovaDobrovetal.2016, author = {Kibrik, Andrej A. and Khudyakova, Mariya V. and Dobrov, Grigory B. and Linnik, Anastasia and Zalmanov, Dmitrij A.}, title = {Referential Choice}, url = {http://nbn-resolving.de/urn:nbn:de:kobv:517-opus4-100313}, pages = {21}, year = {2016}, abstract = {We report a study of referential choice in discourse production, understood as the choice between various types of referential devices, such as pronouns and full noun phrases. Our goal is to predict referential choice, and to explore to what extent such prediction is possible. Our approach to referential choice includes a cognitively informed theoretical component, corpus analysis, machine learning methods and experimentation with human participants. Machine learning algorithms make use of 25 factors, including referent's properties (such as animacy and protagonism), the distance between a referential expression and its antecedent, the antecedent's syntactic role, and so on. Having found the predictions of our algorithm to coincide with the original almost 90\% of the time, we hypothesized that fully accurate prediction is not possible because, in many situations, more than one referential option is available. This hypothesis was supported by an experimental study, in which participants answered questions about either the original text in the corpus, or about a text modified in accordance with the algorithm's prediction. Proportions of correct answers to these questions, as well as participants' rating of the questions' difficulty, suggested that divergences between the algorithm's prediction and the original referential device in the corpus occur overwhelmingly in situations where the referential choice is not categorical.}, language = {en} } @phdthesis{Haider2013, author = {Haider, Peter}, title = {Prediction with Mixture Models}, url = {http://nbn-resolving.de/urn:nbn:de:kobv:517-opus-69617}, school = {Universit{\"a}t Potsdam}, year = {2013}, abstract = {Learning a model for the relationship between the attributes and the annotated labels of data examples serves two purposes. Firstly, it enables the prediction of the label for examples without annotation. Secondly, the parameters of the model can provide useful insights into the structure of the data. If the data has an inherent partitioned structure, it is natural to mirror this structure in the model. Such mixture models predict by combining the individual predictions generated by the mixture components which correspond to the partitions in the data. Often the partitioned structure is latent, and has to be inferred when learning the mixture model. Directly evaluating the accuracy of the inferred partition structure is, in many cases, impossible because the ground truth cannot be obtained for comparison. However it can be assessed indirectly by measuring the prediction accuracy of the mixture model that arises from it. This thesis addresses the interplay between the improvement of predictive accuracy by uncovering latent cluster structure in data, and further addresses the validation of the estimated structure by measuring the accuracy of the resulting predictive model. In the application of filtering unsolicited emails, the emails in the training set are latently clustered into advertisement campaigns. Uncovering this latent structure allows filtering of future emails with very low false positive rates. In order to model the cluster structure, a Bayesian clustering model for dependent binary features is developed in this thesis. Knowing the clustering of emails into campaigns can also aid in uncovering which emails have been sent on behalf of the same network of captured hosts, so-called botnets. This association of emails to networks is another layer of latent clustering. Uncovering this latent structure allows service providers to further increase the accuracy of email filtering and to effectively defend against distributed denial-of-service attacks. To this end, a discriminative clustering model is derived in this thesis that is based on the graph of observed emails. The partitionings inferred using this model are evaluated through their capacity to predict the campaigns of new emails. Furthermore, when classifying the content of emails, statistical information about the sending server can be valuable. Learning a model that is able to make use of it requires training data that includes server statistics. In order to also use training data where the server statistics are missing, a model that is a mixture over potentially all substitutions thereof is developed. Another application is to predict the navigation behavior of the users of a website. Here, there is no a priori partitioning of the users into clusters, but to understand different usage scenarios and design different layouts for them, imposing a partitioning is necessary. The presented approach simultaneously optimizes the discriminative as well as the predictive power of the clusters. Each model is evaluated on real-world data and compared to baseline methods. The results show that explicitly modeling the assumptions about the latent cluster structure leads to improved predictions compared to the baselines. It is beneficial to incorporate a small number of hyperparameters that can be tuned to yield the best predictions in cases where the prediction accuracy can not be optimized directly.}, language = {en} } @phdthesis{Floeter2005, author = {Fl{\"o}ter, Andr{\´e}}, title = {Analyzing biological expression data based on decision tree induction}, url = {http://nbn-resolving.de/urn:nbn:de:kobv:517-opus-6416}, school = {Universit{\"a}t Potsdam}, year = {2005}, abstract = {Modern biological analysis techniques supply scientists with various forms of data. One category of such data are the so called "expression data". These data indicate the quantities of biochemical compounds present in tissue samples. Recently, expression data can be generated at a high speed. This leads in turn to amounts of data no longer analysable by classical statistical techniques. Systems biology is the new field that focuses on the modelling of this information. At present, various methods are used for this purpose. One superordinate class of these meth­ods is machine learning. Methods of this kind had, until recently, predominantly been used for classification and prediction tasks. This neglected a powerful secondary benefit: the ability to induce interpretable models. Obtaining such models from data has become a key issue within Systems biology. Numerous approaches have been proposed and intensively discussed. This thesis focuses on the examination and exploitation of one basic technique: decision trees. The concept of comparing sets of decision trees is developed. This method offers the pos­sibility of identifying significant thresholds in continuous or discrete valued attributes through their corresponding set of decision trees. Finding significant thresholds in attributes is a means of identifying states in living organisms. Knowing about states is an invaluable clue to the un­derstanding of dynamic processes in organisms. Applied to metabolite concentration data, the proposed method was able to identify states which were not found with conventional techniques for threshold extraction. A second approach exploits the structure of sets of decision trees for the discovery of com­binatorial dependencies between attributes. Previous work on this issue has focused either on expensive computational methods or the interpretation of single decision trees ­ a very limited exploitation of the data. This has led to incomplete or unstable results. That is why a new method is developed that uses sets of decision trees to overcome these limitations. Both the introduced methods are available as software tools. They can be applied consecu­tively or separately. That way they make up a package of analytical tools that usefully supplement existing methods. By means of these tools, the newly introduced methods were able to confirm existing knowl­edge and to suggest interesting and new relationships between metabolites.}, subject = {Molekulare Bioinformatik}, language = {en} }