The 10 most recently published documents
Data preparation is a cornerstone of data science workflows and consumes a significant portion, approximately 80%, of a data scientist's time. Much of this time goes into devising tailored solutions for downstream tasks. The difficulty is compounded by the inadequate availability of metadata, the often ad-hoc nature of preparation tasks, and the need for data scientists to work with a diverse range of sophisticated tools, each with its own intricacies and demands for proficiency.
Previous research in data management has traditionally concentrated on preparing the content within the columns and rows of a relational table, addressing tasks such as string disambiguation, date standardization, or numeric value normalization, commonly referred to as data cleaning. This focus assumes a perfectly structured input table. Consequently, these data cleaning tasks can be applied effectively only after the table has been successfully loaded into the respective data cleaning environment, typically in the later stages of the data processing pipeline.
While current data cleaning tools are well-suited for relational tables, extensive data repositories frequently contain data stored in plain text files, such as CSV files, due to their adaptable standard. Consequently, these files often exhibit tables with a flexible layout of rows and columns, lacking a relational structure. This flexibility often results in data being distributed across cells in arbitrary positions, typically guided by user-specified formatting guidelines.
Effectively extracting and leveraging these tables in subsequent processing stages necessitates accurate parsing. This thesis emphasizes what we define as the “structure” of a data file—the fundamental characters within a file essential for parsing and comprehending its content. Concentrating on the initial stages of the data preprocessing pipeline, this thesis addresses two crucial aspects: comprehending the structural layout of a table within a raw data file and automatically identifying and rectifying any structural issues that might hinder its parsing. Although these issues may not directly impact the table's content, they pose significant challenges in parsing the table within the file.
Our initial contribution comprises an extensive survey of commercially available data preparation tools. This survey thoroughly examines their distinct features, the features they lack, and the need for preliminary data processing that remains despite these tools. The primary goal is to elucidate the current state of the art in data preparation systems while identifying areas for enhancement. Furthermore, the survey explores the challenges encountered in data preprocessing, emphasizing opportunities for future research and improvement.
Next, we propose a novel data preparation pipeline designed for detecting and correcting structural errors. The aim of this pipeline is to assist users at the initial preprocessing stage by ensuring the correct loading of their data into their preferred systems. Our approach begins by introducing SURAGH, an unsupervised system that utilizes a pattern-based method to identify dominant patterns within a file, independent of external information, such as data types, row structures, or schemata. By identifying deviations from the dominant pattern, it detects ill-formed rows. Subsequently, our structure correction system, TASHEEH, gathers the identified ill-formed rows along with dominant patterns and employs a novel pattern transformation algebra to automatically rectify errors. Our pipeline serves as an end-to-end solution, transforming a structurally broken CSV file into a well-formatted one, usually suitable for seamless loading.
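The idea of flagging rows that deviate from a dominant pattern can be illustrated with a small sketch. This is not the SURAGH algorithm, whose pattern language is not described in the abstract; the cell classes (`D`, `A`, `E`, `O`) and the helper names `row_pattern` and `find_ill_formed` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter


def row_pattern(line: str, delimiter: str = ",") -> str:
    """Abstract a raw line into a coarse pattern, one class letter per cell.

    Hypothetical classes: D = number, A = letters, E = empty, O = other.
    Note: a plain split ignores quoted fields; this is a simplification.
    """
    def cell_class(cell: str) -> str:
        cell = cell.strip()
        if not cell:
            return "E"
        if re.fullmatch(r"[+-]?\d+(\.\d+)?", cell):
            return "D"
        if re.fullmatch(r"[A-Za-z ]+", cell):
            return "A"
        return "O"

    return delimiter.join(cell_class(c) for c in line.split(delimiter))


def find_ill_formed(lines: list[str], delimiter: str = ",") -> list[int]:
    """Return indices of lines whose pattern differs from the most frequent one."""
    if not lines:
        return []
    patterns = [row_pattern(line, delimiter) for line in lines]
    dominant, _count = Counter(patterns).most_common(1)[0]
    return [i for i, p in enumerate(patterns) if p != dominant]


sample = [
    "1,Alice,30",
    "2,Bob,25",
    "Total rows: 2",   # stray footer line inside the data
    "3,Carol,41",
    "4,Dave",          # missing a column
]
print(find_ill_formed(sample))  # [2, 4]
```

Because the dominant pattern is derived from the file itself, the sketch needs no schema or data-type information, which is the property the paragraph above emphasizes.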
Finally, we introduce MORPHER, a user-friendly GUI integrating the functionalities of both SURAGH and TASHEEH. This interface empowers users to access the pipeline's features through visual elements. Our extensive experiments demonstrate the effectiveness of our data preparation systems, which require no user involvement. Both SURAGH and TASHEEH significantly outperform existing state-of-the-art methods in both precision and recall.
Adults' ratings of children's personality have been found to be more closely associated with academic performance than children's self-reports. However, less is known about the relevance of the unique perspectives held by specific adult observers such as teachers and parents for explaining variance in academic performance. In this study, we applied bifactor (S-1) models for 1411 elementary school children to investigate the relative merits of teacher and parent ratings of children's personalities for academic performance above and beyond the children's self-reports. We examined these associations using standardized achievement test scores in addition to grades. We found that teachers' unique views on children's openness and conscientiousness had the strongest associations with academic performance. Parents' unique views on children's neuroticism showed incremental associations above teacher ratings or self-reports. For extraversion and agreeableness, however, children's self-reports were more strongly associated with academic performance than teacher or parent ratings. These results highlight the differential value of using multiple informants when explaining academic performance with personality traits.
The aim of this thesis is to convey the ancient relationship between humans and their natural environment in Latin classes and to compare it with the present-day situation. This relationship is explored using the example of ancient mining, a particularly vivid field of environmental history, because it is highly topical and offers great potential for gaining insights relevant to the present.
The thesis presents a teaching concept that also analyzes the human perception of nature. First, it shows the heterogeneity of this perception in antiquity and relates it to the criticism of mining voiced at the time. It then addresses the following aspects: 1. ancient mining technology and practice, 2. the working conditions prevailing at the time, 3. the raw materials extracted and their use, and 4. the consequences of mining for humans and the environment. The didactic part consists of a draft for three double lessons. It contains the teaching materials, the accompanying explanations, and the expected student outcomes.
Rising childhood obesity with its detrimental health consequences poses a challenge to the health care system. Community-based, multi-setting interventions with the participatory involvement of relevant stakeholders are emerging as promising. To gain insights into the structural and processual characteristics of stakeholder networks, conducting a network analysis (NA) is advisable. Within the program "Family+-Healthy Living Together in Families and Schools", a network analysis was conducted in two rural model regions and one urban model region. Relevant stakeholders were identified in 2020-2021 through expert interviews and interviewed by telephone to elicit key variables such as frequency of contact and intensity of collaboration. In the NA, characteristics such as density, centrality, and connectedness were analyzed and are presented graphically. Due to the differences in the number of inhabitants and the rural or urban structure of the model regions, the three networks (network#1, network#2, and network#3) included 20, 14, and 12 stakeholders, respectively. All networks had similar densities (network#1, 48%; network#2, 52%; network#3, 42%), whereas the degree centrality of network#1 (0.57) and network#3 (0.58) was one-third higher compared with network#2 (0.39). All three networks differed in the distribution of stakeholders in terms of field of expertise and structural orientation. On average, stakeholders exchanged information quarterly and were connected on an informal level. Based on the results of the NA, it appears useful to establish a community health facilitator who involves relevant stakeholders from the education, sports, and health systems in projects and strives for the goal of sustainable health promotion, regardless of the rural or urban structure of the region. Participatory involvement of relevant stakeholders can have a positive influence on the effective dissemination of information and networking with other stakeholders.
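As a reminder of what the reported network measures express, the following minimal sketch computes density and a network-level degree measure for a toy undirected graph. The abstract does not state which centrality formula was used; treating the network-level value as Freeman's degree centralization is an assumption, and the edge list below is invented for illustration, not taken from the study.

```python
def density(n_nodes: int, edges: list[tuple[int, int]]) -> float:
    """Share of realized ties among all possible ties in an undirected graph.

    Assumes each undirected edge appears exactly once in `edges`.
    """
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible


def degree_centralization(n_nodes: int, edges: list[tuple[int, int]]) -> float:
    """Freeman degree centralization: 0 for a fully even network, 1 for a star.

    Assumption: this is only one of several measures the abstract could mean.
    Requires at least three nodes.
    """
    degree = [0] * n_nodes
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    max_degree = max(degree)
    return sum(max_degree - d for d in degree) / ((n_nodes - 1) * (n_nodes - 2))


# Toy star network: stakeholder 0 is connected to everyone else.
star_edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(density(5, star_edges))                # 0.4
print(degree_centralization(5, star_edges))  # 1.0
```

A density of 48% therefore means that roughly half of all possible stakeholder pairs reported a tie, while a higher centralization indicates that contacts concentrate on a few stakeholders.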
The Alpine mountains in central Europe are characterized by a heterogeneous crust accumulating different tectonic units and blocks in close proximity to sedimentary foreland basins. Centroid moment tensor inversion provides insight into the faulting mechanisms of earthquakes and related tectonic processes but is significantly more difficult in such an environment. Thanks to the dense AlpArray seismic network and our flexible bootstrap-based inversion tool Grond, we are able to test different setups with respect to the uncertainties of the obtained moment tensors and centroid locations. We evaluate the influence of frequency bands, azimuthal gaps, input data types, and distance ranges and study the occurrence and reliability of non-double-couple (DC) components. We infer that for most earthquakes (Mw ≥ 3.3) a combination of time domain full waveforms and frequency domain amplitude spectra in a frequency band of 0.02-0.07 Hz is suitable. Relying on the results of our methodological tests, we perform deviatoric moment tensor (MT) inversions for events with Mw > 3.0. Here, we present 75 solutions for earthquakes between January 2016 and December 2019 and analyze our results in the seismotectonic context of historical earthquakes, seismic activity of the last 3 decades, and GNSS deformation data. We study regions of comparably high seismic activity during the last decades, namely the Western Alps, the region around Lake Garda, and the eastern Southern Alps, as well as clusters further from the study region, i.e., in the northern Dinarides and the Apennines. Seismicity is particularly low in the Eastern Alps and in parts of the Central Alps. We apply a clustering algorithm to focal mechanisms, considering additional mechanisms from existing catalogs. Related to the N-S compressional regime, E-W-to-ENE-WSW-striking thrust faulting is mainly observed in the Friuli area in the eastern Southern Alps.
Strike-slip faulting with a similarly oriented pressure axis is observed along the northern margin of the Central Alps and in the northern Dinarides. NW-SE-striking normal faulting is observed in the NW Alps, showing a similar strike direction to normal faulting earthquakes in the Apennines. Both our centroid depths and hypocentral depths in existing catalogs indicate that Alpine seismicity is predominantly very shallow; about 80% of the studied events have depths shallower than 10 km.
The magnitude of earthquakes on continental normal faults rarely exceeds Mw 7.0. However, because of their proximity to large population centers, they can be highly destructive.
Long recurrence times, relatively small deformations, and limited observations hinder our understanding of the deformation patterns and mechanisms controlling the magnitude of events.
Here, this problem is addressed with 2D thermomechanical modeling of normal fault seismic cycles.
The 2020 Mw 7.0 Samos, Greece, earthquake is used as an example, as it is one of the largest and most studied continental normal fault earthquakes. The modeling approach employs visco-elasto-plastic rheology, compressibility, a free surface, and a rate-and-state friction law for the fault.
Modeling of the Samos earthquake suggests the pore fluid pressure ratio on the fault ranges from 0 to 0.7. The model demonstrates that most of the deformation during interseismic and coseismic periods, besides on the fault, occurs in the hanging wall and footwall below the seismogenic part of the fault. The largest vertical surface displacement during the earthquake is the subsidence of the hanging wall in the vicinity of the fault, while the uplift of the footwall and remote part of the hanging wall is significantly smaller.
Modeling of the seismic cycles on normal faults with different setups shows that the magnitude depends on the thermal profile and dip angle of the fault: low heat flow and a low dip angle are favorable conditions for the largest events, while steep normal faults in areas of high heat flow tend to have the smallest magnitudes.
In a warming Arctic, permafrost-related disturbances, such as retrogressive thaw slumps (RTS), are becoming more abundant and dynamic, with serious implications for permafrost stability and biogeochemical cycles on local to regional scales. Despite recent advances in the field of earth observation, many RTS have remained undetected because they are highly dynamic, small, and scattered across the remote permafrost region. Here, we assessed the potential strengths and limitations of using deep learning for the automatic segmentation of RTS using PlanetScope satellite imagery, ArcticDEM, and auxiliary datasets. We analyzed the transferability and potential for pan-Arctic upscaling and regional cross-validation, with independent training and validation regions, in six different thaw slump-affected regions in Canada and Russia. We further tested state-of-the-art model architectures (UNet, UNet++, DeepLabv3) and encoder networks to find optimal model configurations for potential upscaling to continental scales. The best deep learning models achieved mixed results from good to very good agreement in four of the six regions (maxIoU: 0.39 to 0.58; Lena River, Horton Delta, Herschel Island, Kolguev Island), while they failed in two regions (Banks Island, Tuktoyaktuk). Of the tested architectures, UNet++ performed the best. The large variance in regional performance highlights the need for sufficient quantity, quality, and spatial variability in the training data used for segmenting RTS across diverse permafrost landscapes, in varying environmental conditions. With our highly automated and configurable workflow, we see great potential for the transfer to active RTS clusters (e.g., Peel Plateau) and upscaling to much larger regions.
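The agreement scores above are reported as intersection over union (IoU). A minimal sketch of the plain IoU between a predicted and a reference binary mask follows; how the abstract's "maxIoU" is derived from it is not specified there, and the function name `iou` and the toy masks are illustrative assumptions.

```python
def iou(pred: list[list[int]], truth: list[list[int]]) -> float:
    """Intersection over union of two equally sized binary masks.

    Returns 1.0 when both masks are empty; this convention is an assumption,
    since other evaluations treat the empty case differently.
    """
    intersection = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    return intersection / union if union else 1.0


# Toy 2x3 masks: 1 marks pixels labeled as thaw slump.
predicted = [[1, 1, 0],
             [0, 1, 0]]
reference = [[1, 0, 0],
             [0, 1, 1]]
print(iou(predicted, reference))  # 0.5
```

Here two pixels are shared and four pixels are marked in at least one mask, so the score is 0.5, which falls inside the 0.39 to 0.58 range reported for the better-performing regions.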
Background: Mass gatherings (MGs) such as music festivals and sports events have been associated with a high risk of SARS-CoV-2 transmission. On-site research can foster knowledge of risk factors for infections and improve risk assessments and precautionary measures at future MGs. We tested a web-based participatory disease surveillance tool to detect COVID-19 infections at and after an outdoor MG by collecting self-reported COVID-19 symptoms and tests. Methods: We conducted a digital prospective observational cohort study among fully immunized attendees of a sports festival that took place from September 2 to 5, 2021 in Saxony-Anhalt, Germany. Participants used our study app to report demographic data, COVID-19 tests, symptoms, and their contact behavior. These self-reported data were used to define probable and confirmed COVID-19 cases for the full "study period" (08/12/2021 - 10/31/2021) and within the 14-day "surveillance period" during and after the MG, with the highest likelihood of an MG-related COVID-19 outbreak (09/04/2021 - 09/17/2021). Results: A total of 2,808 of 9,242 (30.4%) event attendees participated in the study. Within the study period, 776 individual symptoms and 5,255 COVID-19 tests were reported. During the 14-day surveillance period around and after the MG, seven probable and seven PCR-confirmed COVID-19 cases were detected. The confirmed cases translated to an estimated seven-day incidence of 125 per 100,000 participants (95% CI [67.7/100,000, 223/100,000]), which was comparable to the average age-matched incidence in Germany during this time. Overall, weekly numbers of COVID-19 cases fluctuated over the study period, with another increase at the end of the study period. Conclusion: COVID-19 cases attributable to the mass gathering were comparable to the Germany-wide age-matched incidence, indicating that our active participatory disease surveillance tool was able to detect MG-related infections.
Further studies are needed to evaluate and apply our participatory disease surveillance tool in other mass gathering settings.
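The reported seven-day incidence can be reproduced approximately from the figures in the abstract. The sketch below assumes the estimate simply scales the seven PCR-confirmed cases among 2,808 participants over the 14-day surveillance period to a 7-day rate per 100,000; the authors' exact method, including how the confidence interval was obtained, is not given in the abstract, and the function name is hypothetical.

```python
def seven_day_incidence_per_100k(cases: int, population: int, period_days: int) -> float:
    """Scale a case count observed over `period_days` to a 7-day rate per 100,000 people."""
    return cases / population * 100_000 * 7 / period_days


# Figures from the abstract: 7 PCR-confirmed cases, 2,808 participants,
# 14-day surveillance period.
rate = seven_day_incidence_per_100k(cases=7, population=2808, period_days=14)
print(f"{rate:.1f} per 100,000 per 7 days")  # 124.6 per 100,000 per 7 days
```

Under this assumption the result rounds to the 125 per 100,000 stated in the Results.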
The formation of the Central Andes dates back to ~50 Ma, but its most pronounced episode, including the growth of the Altiplano-Puna Plateau and pulsatile tectonic shortening phases, occurred within the last 25 Ma.
The reason for this evolution remains unexplained. Using geodynamic numerical modeling we infer that the primary cause of the pulses of tectonic shortening and growth of the Central Andes is the changing geometry of the subducted Nazca plate, and particularly the steepening of the mid-mantle slab segment which results in a slowing down of the trench retreat and subsequent increase in shortening of the advancing South America plate.
This steepening first happens after the end of the flat slab episode at ~25 Ma, and later during the buckling and stagnation of the slab in the mantle transition zone. Processes that mechanically weaken the lithosphere of the South America plate, as suggested in previous studies, enhance the intensity of the shortening events.
These processes include delamination of the mantle lithosphere and weakening of foreland sediments.
Our new modeling results are consistent with the timing and amplitude of the deformation from geological data in the Central Andes at the Altiplano latitude.
Plain Language Summary
The Central Andes is a subduction-type orogen that formed as a result of the interaction between the Nazca oceanic plate and the South American continental plate over the last 50 million years. Growth of the Andes is primarily the result of crustal shortening. Geological data compiled from previous studies have shown, however, that phases of drastic pulsatile shortening occurred at 15 and 5 Ma.
In this study, we used high-resolution 2D numerical geodynamic simulations to investigate the link between oceanic and continental plate dynamics and their interaction. We find that when the oceanic plate steepens in the mantle transition zone, the trench retreat is hindered. Coupled with the weakening of the continental plate through the slab flattening and subsequent delamination of the lithospheric mantle, this leads to pulsatile shortening phases of a magnitude equivalent to that suggested by the data.
Magma-filled dikes may feed erupting fissures that lead to alignments of craters developing at the surface, yet the details of activity and migrating eruptions at the crater row are difficult to monitor and are hardly understood.
The 2021 Tajogaite eruption at the Cumbre Vieja, La Palma (Spain), lasted 85 days and developed a pronounced alignment of craters that may be related to changes within the volcano edifice.
Here, we use COSMO-SkyMed satellite radar data and ground-based time-lapse photographs, offering a high-resolution dataset to explore the locations and characteristics of evolving craters.
Our results show that the craters evolve both gradually and suddenly and can be divided into three main phases. Phase 1, lasting the first 6 weeks of the eruption, was characterized by a NW-SE linear evolution of up to seven craters emerging on the growing cone.
Following two partial collapses of the cone to the northwest and a seismicity increase at depth, Phase 2 started and caused a propagation of the main activity toward the southeastern side, together with the presence of up to 11 craters along this main NW-SE trend. Associated with strong deep and shallow earthquakes, Phase 3 was initiated and continued for the final 2 weeks of the eruption, expressed by the development of up to 18 craters, which became dominant and clustered in the southeastern sector in early December 2021. In Phase 3, a second and oblique alignment and surface fracture was identified.
Our finding that crater and eruption changes coincide with an increase in seismic activity at depth points to a deep driver leading to crater and morphology changes at the surface.
These results also suggest that crater distributions might allow for improved monitoring of changes occurring at depth, and vice versa, such that strong seismicity changes at depth may herald the migration and new formation of craters, which has major implications for the assessment of tephra and lava flow hazards on volcanoes.