Measures for interoperability of phenotypic data: minimum information requirements and formatting
(2016)
Background: Plant phenotypic data shrouds a wealth of information which, when accurately analysed and linked to other data types, brings to light the knowledge about the mechanisms of life. As phenotyping is a field of research comprising manifold, diverse and time-consuming experiments, the findings can be fostered by reusing and combining existing datasets. Their correct interpretation, and thus replicability, comparability and interoperability, is possible provided that the collected observations are equipped with an adequate set of metadata. So far there have been no common standards governing phenotypic data description, which has hampered data exchange and reuse. Results: In this paper we propose the guidelines for proper handling of the information about plant phenotyping experiments, in terms of both the recommended content of the description and its formatting. We provide a document called "Minimum Information About a Plant Phenotyping Experiment", which specifies what information about each experiment should be given, and a Phenotyping Configuration for the ISA-Tab format, which allows this information to be organised practically within a dataset. We provide examples of ISA-Tab-formatted phenotypic data, and a general description of a few systems where the recommendations have been implemented. Conclusions: Acceptance of the rules described in this paper by the plant phenotyping community will help to achieve findable, accessible, interoperable and reusable data.
A new large set of reciprocal recombinant inbred lines (RILs) was created between the Arabidopsis accessions Col-0 and C24 for quantitative trait mapping approaches, consisting of 209 Col-0 x C24 and 214 C24 x Col-0 F-7 RI lines. Genotyping was performed using 110 evenly distributed framework single nucleotide polymorphism markers, yielding a genetic map of 425.70 cM, with an average interval of 3.87 cM. Segregation distortion (SD) was observed in several genomic regions during the construction of the genetic map. Linkage disequilibrium analysis revealed an association between a distorted region at the bottom of chromosome V and a non-distorted region on chromosome IV. A detailed analysis of the RILs for these two regions showed that an SD occurred when homozygous Col-0 alleles on chromosome IV coincided with homozygous C24 alleles at the bottom of chromosome V. Using nearly isogenic lines segregating for the distorted region we confirmed that this genotypic composition leads to reduced fertility and fitness.
Rising demand for food and bioenergy makes it imperative to breed for increased crop yield. Vegetative plant growth could be driven by resource acquisition or developmental programs. Metabolite profiling in 94 Arabidopsis accessions revealed that biomass correlates negatively with many metabolites, especially starch. Starch accumulates in the light and is degraded at night to provide a sustained supply of carbon for growth. Multivariate analysis revealed that starch is an integrator of the overall metabolic response. We hypothesized that this reflects variation in a regulatory network that balances growth with the carbon supply. Transcript profiling in 21 accessions revealed coordinated changes of transcripts of more than 70 carbon-regulated genes and identified 2 genes (myo-inositol-1-phosphate synthase, a Kelch-domain protein) whose transcripts correlate with biomass. The impact of allelic variation at these 2 loci was shown by association mapping, identifying them as candidate lead genes with the potential to increase biomass production.
Prediction of hybrid biomass in Arabidopsis thaliana by selected parental SNP and metabolic markers
(2009)
A recombinant inbred line (RIL) population, derived from two Arabidopsis thaliana accessions, and the corresponding testcrosses with these two original accessions were used for the development and validation of machine learning models to predict the biomass of hybrids. Genetic and metabolic information of the RILs served as predictors. Feature selection reduced the number of variables (genetic and metabolic markers) in the models by more than 80% without impairing the predictive power. Thus, potential biomarkers have been revealed. Metabolites were shown to bear information on inherited macroscopic phenotypes. This proof of concept could be interesting for breeders. The example population exhibits substantial mid-parent biomass heterosis. The results of feature selection could therefore be used to shed light on the origin of heterosis. In this respect, mainly dominance effects were detected.
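The abstract refers to mid-parent biomass heterosis without defining it. As a minimal sketch, assuming the commonly used relative definition (hybrid value minus the mean of the two parents, divided by that mean), it could be computed as below. The function name and the example values are hypothetical and are not taken from the study.

```python
def mid_parent_heterosis(parent_a, parent_b, hybrid):
    """Relative mid-parent heterosis: (hybrid - mid-parent) / mid-parent.

    Assumption: the standard relative definition; the abstract itself
    does not state which formula the authors used.
    """
    mid_parent = (parent_a + parent_b) / 2.0
    if mid_parent == 0:
        raise ValueError("mid-parent value is zero; relative heterosis is undefined")
    return (hybrid - mid_parent) / mid_parent


# Hypothetical biomass values (arbitrary units), for illustration only.
example = mid_parent_heterosis(parent_a=10.0, parent_b=14.0, hybrid=15.0)
print(example)  # 0.25, i.e. the hybrid exceeds the mid-parent value by 25%
```

A positive value means the hybrid outperforms the average of its parents, which is the sense in which the abstract describes the population as showing substantial heterosis.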
Population-based methods for the genetic mapping of adaptive traits and the analysis of natural selection require that the population structure and demographic history of a species are taken into account. We characterized geographic patterns of genetic variation in the model plant Arabidopsis thaliana by genotyping 115 genome-wide single nucleotide polymorphism (SNP) markers in 351 accessions from the whole species range using a matrix-assisted laser desorption/ionization time-of-flight assay, and by sequencing of nine unlinked short genomic regions in a subset of 64 accessions. The observed frequency distribution of SNPs is not consistent with a constant-size neutral model of sequence polymorphism due to an excess of rare polymorphisms. There is evidence for a significant population structure as indicated by differences in genetic diversity between geographic regions. Accessions from Central Asia have a low level of polymorphism and an increased level of genome-wide linkage disequilibrium (LD) relative to accessions from the Iberian Peninsula and Central Europe. Cluster analysis with the structure program grouped Eurasian accessions into K=6 clusters. Accessions from the Iberian Peninsula and from Central Asia constitute distinct populations, whereas Central and Eastern European accessions represent admixed populations in which genomes were reshuffled by historical recombination events. These patterns likely result from a rapid postglacial recolonization of Eurasia from glacial refugial populations. Our analyses suggest that mapping populations for association or LD mapping should be chosen from regional rather than a species-wide sample or identified genetically as sets of individuals with similar average genetic distances.
The gene family of subtilisin-like serine proteases (subtilases) in Arabidopsis thaliana comprises 56 members, divided into six distinct subfamilies. Whereas the members of five subfamilies are similar to pyrolysins, two genes share stronger similarity to animal kexins. Mutant screens confirmed 144 T-DNA insertion lines with knockouts for 55 out of the 56 subtilases. Apart from SDD1, none of the confirmed homozygous mutants revealed any obvious visible phenotypic alteration during growth under standard conditions. Apart from this specific case, forward genetics gave us no hints about the function of the individual 54 non-characterized subtilase genes. Therefore, the main objective of our work was to overcome the shortcomings of the forward genetic approach and to infer alternative experimental approaches by using an integrative bioinformatics and biological approach. Computational analyses based on transcriptional co-expression and co-response patterns revealed at least two expression networks, suggesting that functional redundancy may exist among subtilases with limited similarity. Furthermore, two hubs were identified, which may be involved in signalling or may represent higher-order regulatory factors involved in responses to environmental cues. A particular enrichment of co-regulated genes with metabolic functions was observed for four subtilases possibly representing late responsive elements of environmental stress. The kexin homologs show stronger associations with genes of transcriptional regulation context. Based on the analyses presented here and in accordance with previously characterized subtilases, we propose three main functions of subtilases: involvement in (i) control of development, (ii) protein turnover, and (iii) action as downstream components of signalling cascades.
Home range estimation is routine practice in ecological research. While advances in animal tracking technology have increased our capacity to collect data to support home range analysis, these same advances have also resulted in increasingly autocorrelated data. Consequently, the question of which home range estimator to use on modern, highly autocorrelated tracking data remains open. This question is particularly relevant given that most estimators assume independently sampled data. Here, we provide a comprehensive evaluation of the effects of autocorrelation on home range estimation. We base our study on an extensive data set of GPS locations from 369 individuals representing 27 species distributed across five continents. We first assemble a broad array of home range estimators, including Kernel Density Estimation (KDE) with four bandwidth optimizers (Gaussian reference function, autocorrelated‐Gaussian reference function [AKDE], Silverman's rule of thumb, and least squares cross‐validation), Minimum Convex Polygon, and Local Convex Hull methods. Notably, all of these estimators except AKDE assume independent and identically distributed (IID) data. We then employ half‐sample cross‐validation to objectively quantify estimator performance, and the recently introduced effective sample size for home range area estimation (N̂_area) to quantify the information content of each data set. We found that AKDE 95% area estimates were larger than conventional IID‐based estimates by a mean factor of 2. The median number of cross‐validated locations included in the hold‐out sets by AKDE 95% (or 50%) estimates was 95.3% (or 50.1%), confirming the larger AKDE ranges were appropriately selective at the specified quantile. Conversely, conventional estimates exhibited negative bias that increased with decreasing N̂_area. To contextualize our empirical results, we performed a detailed simulation study to tease apart how sampling frequency, sampling duration, and the focal animal's movement conspire to affect range estimates. Paralleling our empirical results, the simulation study demonstrated that AKDE was generally more accurate than conventional methods, particularly for small N̂_area. While 72% of the 369 empirical data sets had >1,000 total observations, only 4% had an N̂_area >1,000, whereas 30% had an N̂_area <30. In this frequently encountered scenario of small N̂_area, AKDE was the only estimator capable of producing an accurate home range estimate on autocorrelated data.
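The half-sample cross-validation idea mentioned above can be illustrated with a minimal sketch. It is not the authors' procedure and it does not implement AKDE: it assumes a chronological split of the track, fits a 100% Minimum Convex Polygon (one of the conventional estimators listed) to the first half, and reports the fraction of held-out locations that fall inside it. All function names and the example coordinates are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # Drop the last point of each chain; it repeats the start of the other.
    return lower[:-1] + upper[:-1]


def inside_hull(hull, p):
    """True if p lies inside or on the boundary of a counter-clockwise convex polygon."""
    if len(hull) < 3:
        return False
    return all(
        cross(hull[i], hull[(i + 1) % len(hull)], p) >= 0
        for i in range(len(hull))
    )


def half_sample_coverage(track):
    """Fit an MCP to the first half of the track (assumed chronological split)
    and return the fraction of second-half locations inside it."""
    half = len(track) // 2
    hull = convex_hull(track[:half])
    held_out = track[half:]
    return sum(inside_hull(hull, p) for p in held_out) / len(held_out)


# Hypothetical track: the first four fixes span a square, the last four are held out.
track = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2), (5, 5), (1, 1), (3, 6)]
print(half_sample_coverage(track))  # 0.5: two of the four held-out fixes fall inside
```

A coverage well below the nominal level on the held-out half is the kind of negative bias the abstract reports for conventional estimators when the effective sample size is small.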