Refine
Document Type
- Article (4)
- Doctoral Thesis (2)
Language
- English (6) (remove)
Is part of the Bibliography
- yes (6)
Keywords
- data (6) (remove)
Cloud model inversions of strong chromospheric absorption lines using principal component analysis
(2020)
High-resolution spectroscopy of strong chromospheric absorption lines delivers nowadays several millions of spectra per observing day, when using fast scanning devices to cover large regions on the solar surface. Therefore, fast and robust inversion schemes are needed to explore the large data volume. Cloud model (CM) inversions of the chromospheric H alpha line are commonly employed to investigate various solar features including filaments, prominences, surges, jets, mottles, and (macro-) spicules. The choice of the CM was governed by its intuitive description of complex chromospheric structures as clouds suspended above the solar surface by magnetic fields. This study is based on observations of active region NOAA 11126 in H alpha, which were obtained November 18-23, 2010 with the echelle spectrograph of the vacuum tower telescope at the Observatorio del Teide, Spain. Principal component analysis reduces the dimensionality of spectra and conditions noise-stripped spectra for CM inversions. Modeled H alpha intensity and contrast profiles as well as CM parameters are collected in a database, which facilitates efficient processing of the observed spectra. Physical maps are computed representing the line-core and continuum intensity, absolute contrast, equivalent width, and Doppler velocities, among others. Noise-free spectra expedite the analysis of bisectors. The data processing is evaluated in the context of "big data," in particular with respect to automatic classification of spectra.
The data quality of real-world datasets need to be constantly monitored and maintained to allow organizations and individuals to reliably use their data. Especially, data integration projects suffer from poor initial data quality and as a consequence consume more effort and money. Commercial products and research prototypes for data cleansing and integration help users to improve the quality of individual and combined datasets. They can be divided into either standalone systems or database management system (DBMS) extensions. On the one hand, standalone systems do not interact well with DBMS and require time-consuming data imports and exports. On the other hand, DBMS extensions are often limited by the underlying system and do not cover the full set of data cleansing and integration tasks.
We overcome both limitations by implementing a concise set of five data cleansing and integration operators on the parallel data analytics platform Stratosphere. We define the semantics of the operators, present their parallel implementation, and devise optimization techniques for individual operators and combinations thereof. Users specify declarative queries in our query language METEOR with our new operators to improve the data quality of individual datasets or integrate them to larger datasets. By integrating the data cleansing operators into the higher level language layer of Stratosphere, users can easily combine cleansing operators with operators from other domains, such as information extraction, to complex data flows. Through a generic description of the operators, the Stratosphere optimizer reorders operators even from different domains to find better query plans.
As a case study, we reimplemented a part of the large Open Government Data integration project GovWILD with our new operators and show that our queries run significantly faster than the original GovWILD queries, which rely on relational operators. Evaluation reveals that our operators exhibit good scalability on up to 100 cores, so that even larger inputs can be efficiently processed by scaling out to more machines. Finally, our scripts are considerably shorter than the original GovWILD scripts, which results in better maintainability of the scripts.
Stars are uniform spheres, but only to first order. The way in which stellar rotation and magnetism break this symmetry places important observational constraints on stellar magnetic fields, and factors in the assessment of the impact of stellar activity on exoplanet atmospheres. The spatial distribution of flares on the solar surface is well known to be nonuniform, but elusive on other stars. We briefly review the techniques available to recover the loci of stellar flares, and highlight a new method that enables systematic flare localization directly from optical light curves. We provide an estimate of the number of flares we may be able to localize with the Transiting Exoplanet Survey Satellite, and show that it is consistent with the results obtained from the first full sky scan of the mission. We suggest that nonuniform flare latitude distributions need to be taken into account in accurate assessments of exoplanet habitability.
The progress of science is tied to the standardization of measurements, instruments, and data. This is especially true in the Big Data age, where analyzing large data volumes critically hinges on the data being standardized. Accordingly, the lack of community-sanctioned data standards in paleoclimatology has largely precluded the benefits of Big Data advances in the field. Building upon recent efforts to standardize the format and terminology of paleoclimate data, this article describes the Paleoclimate Community reporTing Standard (PaCTS), a crowdsourced reporting standard for such data. PaCTS captures which information should be included when reporting paleoclimate data, with the goal of maximizing the reuse value of paleoclimate data sets, particularly for synthesis work and comparison to climate model simulations. Initiated by the LinkedEarth project, the process to elicit a reporting standard involved an international workshop in 2016, various forms of digital community engagement over the next few years, and grassroots working groups. Participants in this process identified important properties across paleoclimate archives, in addition to the reporting of uncertainties and chronologies; they also identified archive-specific properties and distinguished reporting standards for new versus legacy data sets. This work shows that at least 135 respondents overwhelmingly support a drastic increase in the amount of metadata accompanying paleoclimate data sets. Since such goals are at odds with present practices, we discuss a transparent path toward implementing or revising these recommendations in the near future, using both bottom-up and top-down approaches.
Business process management (BPM) is a systematic and structured approach to model, analyze, control, and execute business operations also referred to as business processes that get carried out to achieve business goals. Central to BPM are conceptual models. Most prominently, process models describe which tasks are to be executed by whom utilizing which information to reach a business goal. Process models generally cover the perspectives of control flow, resource, data flow, and information systems.
Execution of business processes leads to the work actually being carried out. Automating them increases the efficiency and is usually supported by process engines. This, though, requires the coverage of control flow, resource assignments, and process data. While the first two perspectives are well supported in current process engines, data handling needs to be implemented and maintained manually. However, model-driven data handling promises to ease implementation, reduces the error-proneness through graphical visualization, and reduces development efforts through code generation.
This thesis addresses the modeling, analysis, and execution of data in business processes and presents a novel approach to execute data-annotated process models entirely model-driven. As a first step and formal grounding for the process execution, a conceptual framework for the integration of processes and data is introduced. This framework is complemented by operational semantics through a Petri net mapping extended with data considerations. Model-driven data execution comprises the handling of complex data dependencies, process data, and data exchange in case of communication between multiple process participants. This thesis introduces concepts from the database domain into BPM to enable the distinction of data operations, to specify relations between data objects of the same as well as of different types, to correlate modeled data nodes as well as received messages to the correct run-time process instances, and to generate messages for inter-process communication. The underlying approach, which is not limited to a particular process description language, has been implemented as proof-of-concept.
Automation of data handling in business processes requires data-annotated and correct process models. Targeting the former, algorithms are introduced to extract information about data nodes, their states, and data dependencies from control information and to annotate the process model accordingly. Usually, not all required information can be extracted from control flow information, since some data manipulations are not specified. This requires further refinement of the process model. Given a set of object life cycles specifying allowed data manipulations, automated refinement of the process model towards containment of all data manipulations is enabled. Process models are an abstraction focusing on specific aspects in detail, e.g., the control flow and the data flow views are often represented through activity-centric and object-centric process models. This thesis introduces algorithms for roundtrip transformations enabling the stakeholder to add information to the process model in the view being most appropriate.
Targeting process model correctness, this thesis introduces the notion of weak conformance that checks for consistency between given object life cycles and the process model such that the process model may only utilize data manipulations specified directly or indirectly in an object life cycle. The notion is computed via soundness checking of a hybrid representation integrating control flow and data flow correctness checking. Making a process model executable, identified violations must be corrected. Therefore, an approach is proposed that identifies for each violation multiple, alternative changes to the process model or the object life cycles.
Utilizing the results of this thesis, business processes can be executed entirely model-driven from the data perspective in addition to the control flow and resource perspectives already supported before. Thereby, the model creation is supported by algorithms partly automating the creation process while model consistency is ensured by data correctness checks.
While supporting the execution of business processes, information systems record event logs. Conformance checking relies on these logs to analyze whether the recorded behavior of a process conforms to the behavior of a normative specification. A key assumption of existing conformance checking techniques, however, is that all events are associated with timestamps that allow to infer a total order of events per process instance. Unfortunately, this assumption is often violated in practice. Due to synchronization issues, manual event recordings, or data corruption, events are only partially ordered. In this paper, we put forward the problem of partial order resolution of event logs to close this gap. It refers to the construction of a probability distribution over all possible total orders of events of an instance. To cope with the order uncertainty in real-world data, we present several estimators for this task, incorporating different notions of behavioral abstraction. Moreover, to reduce the runtime of conformance checking based on partial order resolution, we introduce an approximation method that comes with a bounded error in terms of accuracy. Our experiments with real-world and synthetic data reveal that our approach improves accuracy over the state-of-the-art considerably.