Deciphering drought‐induced metabolic responses and regulation in developing maize kernels

Summary Drought stress conditions decrease maize growth and yield, and aggravate preharvest aflatoxin contamination. While several studies have been performed on mature kernels responding to drought stress, the metabolic profiles of developing kernels are not as well characterized, particularly in germplasm with contrasting resistance to both drought and mycotoxin contamination. Here, following screening for drought tolerance, a drought-sensitive line, B73, and a drought-tolerant line, Lo964, were selected and stressed beginning at 14 days after pollination. Developing kernels were sampled 7 and 14 days after drought induction (DAI) from both stressed and irrigated plants. Comparative biochemical and metabolomic analyses profiled 409 differentially accumulated metabolites. Multivariate statistics and pathway analyses showed that drought stress induced an accumulation of simple sugars and polyunsaturated fatty acids and a decrease in amines, polyamines and dipeptides in B73. Conversely, sphingolipid, sterol, phenylpropanoid and dipeptide metabolites accumulated in Lo964 under drought stress. Drought stress also resulted in greater accumulation of reactive oxygen species (ROS) and aflatoxin in kernels of B73 in comparison with Lo964, implying a correlation in their production. Overall, field drought treatments disrupted the normal cascade of metabolic programming during maize kernel development and subsequently caused oxidative stress. The glutathione and urea cycles, along with the metabolism of carbohydrates and lipids for osmoprotection, membrane maintenance and antioxidant protection, were central among the drought stress responses observed in developing kernels. These results also provide novel targets for enhancing host drought tolerance and disease resistance through biotechnologies such as transgenics and genome editing.

Quality-control samples are run alongside the experimental samples to assure that all aspects of the Metabolon process are operating within specifications:

CMTRX: Pool created by taking a small aliquot from every customer sample. Used to assess the effect of a non-plasma matrix on the Metabolon process and to distinguish biological variability from process variability.

PRCS (Process Blank): Aliquot of ultra-pure water. Used to assess the contribution to compound signals from the process.

SOLV (Solvent Blank): Aliquot of the solvents used in extraction. Used to segregate contamination sources in the extraction.

RS (Recovery Standard): Used to assess variability and verify performance of extraction and instrumentation.

IS (Internal Standard): Used to assess variability and performance of the instrument.

Figure 1. Preparation of client-specific technical replicates.
A small aliquot of each client sample (colored cylinders) is pooled to create a CMTRX technical replicate sample (multi-colored cylinder), which is then injected periodically throughout the platform run. Variability among consistently detected biochemicals can be used to calculate an estimate of overall process and platform variability. Study samples are randomized and balanced; the CMTRX technical replicates are created from an aliquot of all client study samples.
Compound Identification: The reference library, built from authentic standards, contains the retention time/index (RI), mass-to-charge ratio (m/z), and chromatographic data (including MS/MS spectral data) on all molecules present in the library. Furthermore, biochemical identifications are based on three criteria: retention index within a narrow RI window of the proposed identification, accurate mass match to the library +/- 10 ppm, and MS/MS forward and reverse scores between the experimental data and authentic standards. The MS/MS scores are based on a comparison of the ions present in the experimental spectrum to the ions present in the library spectrum. While there may be similarities between molecules on any one of these factors, the use of all three data points distinguishes and differentiates biochemicals. More than 3,300 commercially available purified standard compounds have been acquired and registered into LIMS for analysis on all platforms for determination of their analytical characteristics. Additional mass spectral entries have been created for structurally unnamed biochemicals, which have been identified by virtue of their recurrent nature (both chromatographic and mass spectral). These compounds have the potential to be identified by future acquisition of a matching purified standard or by classical structural analysis.
Curation: A variety of curation procedures were carried out to ensure that a high quality data set was made available for statistical analysis and data interpretation. The QC and curation processes were designed to ensure accurate and consistent identification of true chemical entities, and to remove those representing system artifacts, mis-assignments, and background noise. Metabolon data analysts use proprietary visualization and interpretation software to confirm the consistency of peak identification among the various samples. Library matches for each compound were checked for each sample and corrected if necessary.

Metabolite Quantification and Data Normalization:
Peaks were quantified using area under the curve. For studies spanning multiple days, a data normalization step was performed to correct variation resulting from instrument inter-day tuning differences. Essentially, each compound was corrected in run-day blocks by registering the medians to equal one (1.00) and normalizing each data point proportionately (termed the "block correction"; Figure 2). For studies that did not require more than one day of analysis, no normalization was necessary, other than for purposes of data visualization. In certain instances, biochemical data may have been normalized to an additional factor (e.g., cell counts, total protein as determined by Bradford assay, osmolality, etc.) to account for differences in metabolite levels due to differences in the amount of material present in each sample.
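As a minimal sketch of the block correction described above (the peak areas and run-day labels are invented, and block_correct is an illustrative helper, not Metabolon software), each value is divided by the median of its own run-day block so that every block's median becomes 1.00:

```python
import numpy as np

def block_correct(areas, run_days):
    """Divide each compound value by the median of its run-day block,
    registering every block's median to 1.00 (the "block correction")."""
    areas = np.asarray(areas, dtype=float)
    run_days = np.asarray(run_days)
    out = np.empty_like(areas)
    for day in np.unique(run_days):
        mask = run_days == day
        out[mask] = areas[mask] / np.median(areas[mask])
    return out

# two run days whose medians differ four-fold are brought onto a common scale
corrected = block_correct([2.0, 4.0, 6.0, 10.0, 20.0, 30.0],
                          ["day1", "day1", "day1", "day2", "day2", "day2"])
```

After correction, both run-day blocks have a median of exactly 1.00, so inter-day tuning differences no longer shift the compound's scale.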

Statistical Methods and Terminology
Statistical Calculations: For many studies, two types of statistical analysis are usually performed: (1) significance tests and (2) classification analysis. Standard statistical analyses are performed in ArrayStudio on log-transformed data. For those analyses not standard in ArrayStudio, the programs R (http://cran.r-project.org/) or JMP are used. Below are examples of frequently employed significance tests and classification methods, followed by a discussion of p- and q-value significance thresholds.

Welch's two-sample t-test
Welch's two-sample t-test is used to test whether the unknown means of two independent populations are different.
This version of the two-sample t-test allows for unequal variances (variance is the square of the standard deviation) and has an approximate t-distribution with degrees of freedom estimated using Satterthwaite's approximation. The test statistic is given by t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2), and the degrees of freedom are given by df = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)], where x̄1, x̄2 are the sample means, s1, s2 are the sample standard deviations, and n1, n2 are the sample sizes from groups 1 and 2, respectively. We typically use a two-sided test (tests whether the means are different) as opposed to a one-sided test (tests whether one mean is greater than the other).
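As an illustration with simulated data (all group parameters are invented), Welch's test is available in SciPy via equal_var=False, and the statistic can be checked against the formula above:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(10.0, 1.0, 20)   # group 1
g2 = rng.normal(11.0, 3.0, 15)   # group 2: different mean, larger variance

# Welch's two-sample t-test (equal_var=False allows unequal variances)
t_scipy, p = stats.ttest_ind(g1, g2, equal_var=False)

# the same statistic computed directly from the formula in the text
se = np.sqrt(g1.var(ddof=1) / len(g1) + g2.var(ddof=1) / len(g2))
t_manual = (g1.mean() - g2.mean()) / se
```

The two computations agree, and SciPy applies Satterthwaite's approximation for the degrees of freedom internally.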

Matched pairs t-test
The matched pairs t-test is used to test whether two unknown means differ, based on paired observations taken on the same subjects.
The matched pairs t-test is equivalent to a one-sample t-test performed on the differences of the observations taken on each subject (i.e., calculate d = x1 − x2 for each subject and test whether the mean difference is zero or not). The test statistic is given by t = d̄ / (s_d/√n), with n − 1 degrees of freedom, where d̄ is the mean of the per-subject differences (equivalently, x̄1 − x̄2), s_d is the standard deviation of the differences, and n is the number of subjects (so there are 2n observations).
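A short sketch with made-up paired measurements shows the equivalence stated above: SciPy's paired test matches a one-sample t computed on the per-subject differences.

```python
import numpy as np
from scipy import stats

# hypothetical paired measurements on the same six subjects
before = np.array([5.1, 4.8, 6.0, 5.5, 5.9, 4.7])
after_tx = np.array([5.6, 5.0, 6.4, 5.4, 6.3, 5.1])

# matched pairs t-test via SciPy
t_scipy, p = stats.ttest_rel(before, after_tx)

# equivalent one-sample t-test on the per-subject differences, n - 1 = 5 df
d = before - after_tx
t_manual = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
```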

One-way ANOVA
ANOVA stands for analysis of variance. For ANOVA, it is assumed that all populations have the same variances. One-way ANOVA is used to test whether two or more unknown means are all equal or whether at least one pair of means differs. For the case of two means, ANOVA gives the same result as a two-sided t-test with a pooled estimate of the variance.
An ANOVA uses an F-test, which has two parameters: the numerator degrees of freedom and the denominator degrees of freedom. The degrees of freedom in the numerator are equal to g − 1, where g is the number of groups. If n is the total number of observations (n1 + n2 + … + ng), then the denominator degrees of freedom are equal to n − g. The F-statistic is the ratio of the between-groups variance to the within-groups variance, hence the higher the F-statistic the more evidence we have that the means are different.
Often within ANOVA, one performs linear contrasts for specific comparisons of interest. For example, with three groups A, B, and C, possible contrasts include A vs. B, or the average of A and B vs. C. For single-degree-of-freedom contrasts, these give the same result as a two-sided t-test with the pooled estimate of the variance from the ANOVA and degrees of freedom n − g. For A vs. B in the three-group design, the numerator of the test statistic (the difference in means) is the same whichever test is used, but the denominator differs by the estimate of the variance (unequal variances, the variance pooled from A and B alone, or the variance pooled across all groups in the ANOVA), and the degrees of freedom differ for each; if the equal-variance assumption holds, the ANOVA contrast has the most power, as it has the largest degrees of freedom.
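A sketch with simulated groups (all distribution parameters invented) runs one-way ANOVA in SciPy and also verifies the two-group equivalence noted above: with two groups, F equals the square of the pooled-variance t-statistic, and the p-values match.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
a = rng.normal(0.0, 1.0, 10)
b = rng.normal(0.0, 1.0, 12)
c = rng.normal(1.0, 1.0, 11)

# three-group one-way ANOVA: g - 1 = 2 numerator df, n - g = 30 denominator df
f_stat, p = stats.f_oneway(a, b, c)

# with only two groups, ANOVA reproduces the pooled-variance two-sided t-test
f2, p2 = stats.f_oneway(a, b)
t, pt = stats.ttest_ind(a, b, equal_var=True)
```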

Two-way ANOVA
ANOVA stands for analysis of variance. For ANOVA, it is assumed that all populations have the same variances. For a two-way ANOVA, three statistical tests are typically performed: the main effect of each factor and the interaction. Suppose we have two factors A and B, where A represents the genotype and B represents the diet in a mouse study. Suppose each of these factors has two levels (A: wild type, knockout; B: standard diet, high-fat diet). For this example, there are 4 combinations ("treatments"): A1B1, A1B2, A2B1, A2B2. The overall ANOVA F-test gives the p-value for testing whether all four of these means are equal or whether at least one pair is different. However, we are also interested in the effects of genotype and diet. A main effect is a contrast that tests one factor averaged across the levels of the other factor. Hence the A main effect compares (A1B1 + A1B2)/2 vs. (A2B1 + A2B2)/2, and the B main effect compares (A1B1 + A2B1)/2 vs. (A1B2 + A2B2)/2. The interaction is a contrast that tests whether the mean difference for one factor depends on the level of the other factor, which is (A1B2 + A2B1)/2 vs. (A1B1 + A2B2)/2.
Consider three illustrative interaction plots. In the first, there is a B main effect, but no A main effect and no interaction, as the effect of B does not depend on the level of A. In the second, there are both A and B main effects but still no interaction: the mean difference for B is the same at each level of A, and the difference in A is the same at each level of B. The third plot also has main effects for A and B, but here there is an interaction as well: the effect of B depends on the level of A (0 for A1 but 2 for A2), i.e., the effect of the diet depends on the genotype. This also shows that the interpretation of the main effects depends on whether there is an interaction or not.
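The main-effect and interaction contrasts can be computed directly from cell means. The sketch below uses hypothetical values chosen to mimic the case where the effect of B is 0 at A1 but 2 at A2, so the interaction contrast is nonzero:

```python
import numpy as np

# balanced 2x2 design: cell means indexed as [A level, B level]
cell = np.array([[10.0, 10.0],    # A1B1, A1B2  (B has no effect at A1)
                 [12.0, 14.0]])   # A2B1, A2B2  (B shifts the mean by 2 at A2)

# A main effect: (A1B1 + A1B2)/2 vs. (A2B1 + A2B2)/2
a_main = cell[0].mean() - cell[1].mean()
# B main effect: (A1B1 + A2B1)/2 vs. (A1B2 + A2B2)/2
b_main = cell[:, 0].mean() - cell[:, 1].mean()
# interaction: (A1B2 + A2B1)/2 vs. (A1B1 + A2B2)/2
interaction = (cell[0, 1] + cell[1, 0]) / 2 - (cell[0, 0] + cell[1, 1]) / 2
```

Here both main effects and the interaction contrast are nonzero, matching the third plot described above; in a full analysis each contrast would be tested against the within-cell variance.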

Two-way Repeated Measures ANOVA
This is typically an ANOVA where one factor is applied to each subject and the second factor is a time point. See two-way ANOVA, as many of the details are similar, except that the model takes into account the repeated measures, i.e., the treatments are given to the same subjects over time. The two main effects and the interaction are assessed, with particular interest in the interaction, as it shows whether the time profiles are parallel or not across the treatments (parallel profiles mean no interaction).
One additional note: the standard analysis assumes a condition referred to as compound symmetry, which assumes the correlation between each pair of levels of the repeated-measures factor is the same. Thus, for the case of time, it assumes the correlation is the same between time points 1 and 2, 1 and 3, and 2 and 3.

Correlation
Correlation measures the strength and direction of a linear association between two variables. The statistical test for correlation tests whether the true correlation is zero or not.
The square of the correlation is the proportion of the total variation explained by a linear relationship between the two variables. Thus, with large sample sizes, a sample correlation of 0.1 may be statistically significant. This means we have high confidence that the true correlation is not zero; however, only 100*(0.1*0.1)% = 1% of the variation of one variable is explained by a linear relationship with the other variable, so while there is an association, it has little predictive ability.
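A simulated example (parameters invented) makes this concrete: with a large sample and a true correlation near 0.1, the test is significant even though r² shows only about 1% of the variance is explained.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 5000                                   # large sample size
x = rng.normal(size=n)
# construct y so the true correlation with x is about 0.1
y = 0.1 * x + np.sqrt(1 - 0.1**2) * rng.normal(size=n)

r, p = stats.pearsonr(x, y)
variance_explained = r**2   # roughly 0.01, i.e., about 1%
```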

Hotelling's T² test
Hotelling's T² test is a multivariate generalization of the t-test, but here we are testing whether the mean vectors are different or not (the vector consists of multiple metabolites).
The Hotelling statistic is T² = (n_x n_y / (n_x + n_y)) (x̄ − ȳ)ᵀ S⁻¹ (x̄ − ȳ), where n_x and n_y are the numbers of samples in each group, x̄ is the mean vector of the variables from group 1, ȳ is the mean vector of the variables from group 2, and S is the pooled estimate of the variance-covariance matrix of the variables. This analysis assumes the underlying variance-covariance matrix is the same for each group. Notice that in the case of uncorrelated variables, this is simply a weighted average of the squared mean differences, with weights inversely proportional to the sample variances (i.e., the metabolites less variable within a group are given higher weights).
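A minimal NumPy implementation of this statistic (hotelling_t2 is an illustrative helper, with simulated groups) also checks the univariate special case: with a single metabolite, T² reduces to the square of the pooled-variance t-statistic.

```python
import numpy as np
from scipy import stats

def hotelling_t2(x, y):
    """Two-sample Hotelling's T^2 with a pooled variance-covariance estimate.
    x, y: (n_samples, n_metabolites) arrays for the two groups."""
    nx, ny = len(x), len(y)
    dbar = x.mean(axis=0) - y.mean(axis=0)
    # pooled variance-covariance matrix S (assumed common to both groups)
    s = np.atleast_2d(((nx - 1) * np.cov(x, rowvar=False)
                       + (ny - 1) * np.cov(y, rowvar=False)) / (nx + ny - 2))
    return (nx * ny) / (nx + ny) * dbar @ np.linalg.solve(s, dbar)

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=(15, 3))   # group 1: 15 samples, 3 metabolites
y = rng.normal(0.5, 1.0, size=(12, 3))   # group 2: shifted mean vector
t2 = hotelling_t2(x, y)

# univariate check: T^2 equals the squared pooled-variance t-statistic
t, _ = stats.ttest_ind(x[:, 0], y[:, 0], equal_var=True)
```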

p-values
For statistical significance testing, p-values are given. The lower the p-value, the more evidence we have that the null hypothesis (typically that two population means are equal) is not true. If "statistical significance" is declared for p-values less than 0.05, then, when the means are actually equal, we will incorrectly conclude they are different 5% of the time.
The p-value is the probability that the test statistic is at least as extreme as observed in this experiment given that the null hypothesis is true. Hence, the more extreme the statistic, the lower the p-value and the more evidence the data gives against the null hypothesis.

q-values
The level of 0.05 is the false positive rate when there is one test. However, for a large number of tests we need to account for false positives. There are different methods to correct for multiple testing. The oldest methods are family-wise error rate adjustments (Bonferroni, Tukey, etc.), but these tend to be extremely conservative for a very large number of tests. With gene arrays, using the False Discovery Rate (FDR) is more common. The family-wise error rate adjustments give one a high degree of confidence that there are zero false discoveries. However, with FDR methods, one can allow for a small number of false discoveries. The FDR for a given set of compounds can be estimated using the q-value (see Storey and Tibshirani). To interpret the q-value, the data must first be sorted by p-value; then choose the cutoff for significance (typically p < 0.05). The q-value gives the false discovery rate for the selected list (i.e., an estimate of the proportion of false discoveries for the list of compounds whose p-values are below the cutoff for significance). For Table 1 below, if the whole list is declared significant, then the false discovery rate is approximately 10%. If everything from Compound 079 and above is declared significant, then the false discovery rate is approximately 2.5%.

Random Forest
Random forest is a supervised classification technique based on an ensemble of decision trees. For a given decision tree, a random subset of the data with identical class sizes is selected to build the tree ("bootstrap sample" or "training set"), and then the remaining data, the "out-of-bag" (OOB) samples, are passed down the tree to obtain a class prediction for each sample. This process is repeated thousands of times to produce the forest. The final classification of each sample is determined by computing the class prediction frequency ("votes") for the OOB samples over the whole forest. For example, suppose the random forest consists of 50,000 trees and that 25,000 trees had a prediction for sample 1. Of these 25,000, suppose 15,000 trees classified the sample as belonging to Group A and the remaining 10,000 classified it as belonging to Group B.
Then the votes are 0.6 for Group A and 0.4 for Group B, and hence the final classification is Group A. This method is unbiased since the prediction for each sample is based on trees built from a subset of samples that does not include that sample. When the full forest is grown, the class predictions are compared to the true classes, generating the "OOB error rate" as a measure of prediction accuracy. Thus, the prediction accuracy is an unbiased estimate of how well one can predict sample class in a new data set. Random forest has several advantages: it makes no parametric assumptions, variable selection is not needed, it does not overfit, it is invariant to monotone transformations of the variables, and it is fairly easy to implement in R.
To determine which variables (biochemicals) make the largest contribution to the classification, a "variable importance" measure is computed. We use the "Mean Decrease Accuracy" (MDA) as this metric. The MDA is determined by randomly permuting a variable, running the permuted values through the trees, and then reassessing the prediction accuracy. If a variable is not important, this procedure changes the accuracy of the class prediction very little (permuting random noise gives random noise). By contrast, if a variable is important to the classification, the prediction accuracy will drop after such a permutation, and this drop is recorded as the MDA. Thus, the random forest analysis provides an "importance" rank ordering of biochemicals; we typically output the top 30 biochemicals in the list as potentially worthy of further investigation.
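A sketch with scikit-learn on simulated data (the toy "metabolite" matrix is invented) shows the OOB accuracy and an importance ranking. Note that scikit-learn's feature_importances_ is the impurity-based importance, not the permutation-based MDA described above; sklearn.inspection.permutation_importance is the closer analogue when held-out data are available.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# toy two-group "metabolite" matrix: 40 samples x 20 biochemicals,
# of which only 3 carry class information
X, y = make_classification(n_samples=40, n_features=20, n_informative=3,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X, y)

oob_accuracy = rf.oob_score_   # unbiased estimate of prediction accuracy
# rank biochemicals from most to least important
ranking = np.argsort(rf.feature_importances_)[::-1]
```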

Hierarchical Clustering
Hierarchical clustering is an unsupervised method for clustering the data, and can show large-scale differences. There are several types of hierarchical clustering and many distance metrics that can be used. A common choice is complete-linkage clustering using the Euclidean distance, where each sample is a vector containing all of its metabolite values. Note that the differences seen in the clustering may be unrelated to the treatment groups or study design.
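A short SciPy sketch (with two simulated, well-separated sample groups; all values invented) illustrates complete-linkage clustering on Euclidean distances:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(4)
# two well-separated groups of samples, each a vector of 10 metabolite values
samples = np.vstack([rng.normal(0.0, 0.5, size=(6, 10)),
                     rng.normal(3.0, 0.5, size=(6, 10))])

# complete-linkage hierarchical clustering on Euclidean distances
z = linkage(pdist(samples, metric="euclidean"), method="complete")
labels = fcluster(z, t=2, criterion="maxclust")   # cut the tree into 2 clusters
```

Here the cut recovers the two simulated groups; with real data, the recovered clusters may or may not align with the treatment groups, as noted above.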

Z-scores
An intensity measurement for a metabolite tells little by itself. If, for example, a patient has a blood glucose level of 300, this could be unremarkable if most people have blood glucose levels around 300, but a cause for concern if most people have levels around 100. In other words, a measurement is meaningful only relative to the mean of the sample or the population. This can be achieved by transforming the measurements into Z-scores, which are expressed as standard deviations from the mean.
The Z-score, also called the standard score or normal score, is a dimensionless quantity derived by subtracting the control population mean from an individual raw score and then dividing the difference by the control population standard deviation. The Z-score indicates how many standard deviations an observation is above or below the mean of the control group. The Z-score is negative when the raw score is below the mean, positive when above. Since knowing the true mean and standard deviation of a control population is often unrealistic, the mean and standard deviation of the control population may be estimated using a random control sample.
Z-score: Z = (x − μ) / σ, where x is a raw score to be standardized, μ is the mean of the control population, and σ is the standard deviation of the control population.
Subtracting the mean centers the distribution, and dividing by the standard deviation standardizes it. The useful properties of Z-scores are that they have a mean of zero (the effect of "centering") and a variance and standard deviation of 1 (the effect of "standardizing"). Because all distributions expressed in Z-scores have the same mean (0) and the same variance (1), we can use Z-scores to compare observations coming from different distributions. When a distribution is normal, most of the Z-scores (more than 99%) lie between −3 and +3.
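A minimal sketch (the control glucose values are invented) estimates μ and σ from a control sample, as suggested above, and standardizes both the controls and a new observation:

```python
import numpy as np

# hypothetical control-group blood glucose levels
control = np.array([95.0, 102.0, 88.0, 110.0, 99.0, 105.0])
mu, sigma = control.mean(), control.std(ddof=1)   # estimated from the sample

def z_score(x):
    """Standard deviations above (+) or below (-) the control mean."""
    return (x - mu) / sigma

z_controls = z_score(control)   # centered (mean ~0) and standardized (sd ~1)
z_patient = z_score(300.0)      # far above +3: extreme relative to controls
```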