Abstract
Genetic selection efficiency is measured by accuracy. Model selection relies on hypothesis testing, with effectiveness given by statistical significance (p-value). Estimates of selection accuracy are based on variance parameters and precision. Model selection considers the amount of genetic variability and the significance of effects. Questions arise as to which one to use: accuracy or p-value? We show that there is a link between the two and that both may be used. We derive equations for accuracy in multi-environment trials and determine the numbers of repetitions and environments needed to reach a desired accuracy. We propose a new methodology for accuracy classification based on p-values. This enables a better understanding of the level of accuracy being accepted when a certain p-value is used. An accuracy of 90% is associated with a p-value of 2%. The use of p-values up to 20% (accuracies above 50%) is acceptable to verify the significance of genetic effects. Sample sizes for desired p-values are found via accuracy values.
Keywords: Enhancing breeding efficacy; experimental statistics; mixed models; number of repetitions; number of trials
INTRODUCTION
Statistical significance, selection accuracy, and experimental precision are concepts used to assess experimental efficiency and the effectiveness of genetic selection in plant breeding (Resende and Alves 2020). The most important parameters and concepts in quantitative genetics and plant breeding are: genetic gain with selection ($G_s$); accuracy ($r_{\hat{g}g}$); heritability ($h^2$); genetic variance ($\sigma_g^2$); and genetic value ($g$). Genetic gain with selection ($G_s$) measures the genetic and practical gains obtained with genetic improvement. Accuracy (the correlation between predicted and true genetic values) and individual or plot level heritability (which is itself a component of accuracy) enable a predictive estimate of genetic gains with selection. The predicted genetic value and accuracy are essential in genetic evaluation catalogues on which selection decisions are based.
Genetic selection involves the prediction and ranking of genetic materials and is central to genetic improvement. Its efficiency is measured by selection accuracy. Tangential to genetic improvement is model selection, which is based on inference and hypothesis testing. Its effectiveness is inferred from statistical significance (p-value).
Accuracy is useful for making inferences about the quality of experiments, the reliability of predictions of genotypic values, and the statistical validity of predictive and inferential results. In practical terms, accuracy is also used to compare alternative selection methods, to calculate genetic gains with selection and to plan experiments. Thus, it is one of the building blocks of statistical and genetic analyses.
In single-environment trials, the accuracy values are obtained considering the heritability ($h^2$) of the trait and the number of repetitions ($n$) of each genotype. In multi-environment trials, the accuracy is estimated considering the heritability of the trait ($h^2$), the genotypic correlation across environments ($r_{gloc}$), the number of repetitions, and the number of environments.
Conversely, a desired accuracy can be used to determine the size of experiments, inferred by choosing the number of repetitions and environments (total sample size). In this case, an optimized sample size can be obtained from the expected accuracy, heritability, and genotypic correlation across environments. In this study, we extend the work of Resende and Duarte (2007) and derive equations for the accuracy in multi-environment trials, modeling the effects of the genotype x environment (GxE) interaction using estimates of genetic parameters and Snedecor's F distribution. These equations are used to define the optimal sample sizes.
Until recently, identifying an adequate number of repetitions was mainly based on minimizing or reducing the residual variance in experimental statistics and quantitative genetics. However, this method is inefficient given the limited capacity of the coefficient of experimental variation ($CV_e$) to provide information about accuracy, as demonstrated by Resende and Duarte (2007). Another approach is to minimize the phenotypic variation of treatment means. This is also not entirely adequate, as a fraction of the phenotypic variance is genetic in nature. A further approach assumes the effects of genotypes as fixed and is based on maximizing the probability of detecting significant differences between treatments.
Recently, efforts have been made to determine an adequate number of repetitions ($n$) and environments ($l$) (Xu et al. 2016, Baxevanos et al. 2017a, Baxevanos et al. 2017b, George and Lundy 2019, Zhang et al. 2020, Woyann et al. 2020). Two important contributions were provided by Yan et al. (2015) and Yan (2021), who used an approach similar to the one herein but through different equations. Nevertheless, these previous studies did not provide a statistical way to express the equations in terms of the F test and the p-value. Storck et al. (2011), following Resende and Duarte (2007) for single-environment trials, extended the approach to determine plot size in agronomic crops. The two articles by Yan (Yan et al. 2015, Yan 2021) were based on the reliability of the prediction (which is the square of the accuracy and, for balanced data, is equivalent to the heritability at the means level), called H and fixed at 0.75 as a generally suitable value. In addition, they did not express the results of the equations in terms of the individual heritability.
Considering statistical significance, selection accuracy, and experimental precision in plant breeding, this study aims: i) to obtain accuracy estimators for multi-environment trials; ii) to obtain estimators for the number of replications and environments to maximize the selection accuracy in multi-environment trials; iii) to propose a new methodology for classifying accuracy based on statistical significance via p-value. Our study extends the work of Resende and Duarte (2007) to maximize accuracy and optimize the definition of the number of replications and environments. An original approach was applied to multi-environment trials, which included deriving accuracy estimators and expressing them in terms of the F-test of the joint analysis of variance of multi-environment trials. This was then related to statistical significance via p-value. Here, quantitative genetics intersects with experimental statistics, advancing work in both areas.
ACCURACY AND ITS RELATIONSHIP WITH OTHER MEASURES OF EXPERIMENTAL QUALITY
The quality of genotypic evaluation should be inferred based on accuracy ($r_{\hat{g}g}$). In balanced experiments, Snedecor's F distribution can also be used, as $r_{\hat{g}g} = \sqrt{1 - 1/F}$ (Resende and Duarte 2007). The mathematical expression that relates the appropriate values of F to the required accuracy is given as: $F = 1/(1 - r_{\hat{g}g}^2)$. The F statistic is the ratio between the treatment mean square and the residual mean square from an analysis of variance. To achieve an accuracy of 90%, an F value equal to 5.26 must be obtained. This value is independent of the species and trait evaluated and can be considered a standard value for any species and a reference value in tests of value for cultivation and use (VCU).
This statistic simultaneously accounts for the coefficient of experimental variation ($CV_e$), the number of replications ($n$), and the coefficient of genotypic variation ($CV_g$), as can be seen through the expression $F = 1 + n\,CV_g^2/CV_e^2$. Although traditionally used to evaluate experimental quality, the coefficient of experimental variation alone is inadequate. All three parameters are necessary because accuracy depends on them simultaneously, as shown through an alternative expression: $r_{\hat{g}g} = \sqrt{1 - 1/(1 + n\,CV_g^2/CV_e^2)}$.
For the selection process in breeding programs, the aim should be to achieve accuracy values above 70% (Resende and Duarte 2007). This is equivalent to F values greater than 2. Therefore, F values less than 2 provide low accuracy (Resende and Alves 2020). Another statistic commonly calculated in the context of genetic evaluation, proposed by Vencovsky (1987), is the relative coefficient of variation ( ). By fixing the number (n) of repetitions or individuals per treatment, the magnitude of the relative coefficient of variation ( ) can be used to infer the accuracy and precision of the genetic evaluation. With , a provides high accuracy.
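The relations in this section can be illustrated with a minimal Python sketch. It assumes that the Resende and Duarte (2007) relation takes the form r = sqrt(1 - 1/F), which reproduces the F of 5.26 quoted for 90% accuracy, and that F = 1 + n(CVg/CVe)^2 in balanced trials. The function names are illustrative and do not come from the original work.

```python
import math


def f_from_accuracy(r):
    """F value required for accuracy r, assuming F = 1 / (1 - r^2)."""
    if not 0.0 <= r < 1.0:
        raise ValueError("accuracy must lie in [0, 1)")
    return 1.0 / (1.0 - r * r)


def accuracy_from_f(f):
    """Accuracy implied by F, assuming r = sqrt(1 - 1/F); 0 when F <= 1."""
    return math.sqrt(1.0 - 1.0 / f) if f > 1.0 else 0.0


def f_from_cv(n, cv_g, cv_e):
    """F from n repetitions and the genotypic and experimental CVs."""
    return 1.0 + n * (cv_g / cv_e) ** 2


if __name__ == "__main__":
    for r in (0.70, 0.90):
        print(f"accuracy {r:.2f} -> F = {f_from_accuracy(r):.2f}")
```

Under these assumptions, an accuracy of 0.70 corresponds to an F close to 2 and an accuracy of 0.90 to an F of about 5.26, in line with the reference values given above.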
PROOF OF THE RELATIONSHIP BETWEEN ACCURACY AND F TEST
From an analysis of variance, the components of the accuracy can be expressed in terms of variance components (as used by Fisher, Kempthorne, Henderson and Robertson) or intraclass correlation coefficients (determination coefficients or proportions between variance components; as used by Lush and Wright) (Table 1).
At the individual (common in perennial plants) or plot (common in annual plants) levels, F is given as $F = 1 + n h^2/(1 - h^2)$ or $F = 1 + n/\lambda$, where $\lambda = (1 - h^2)/h^2$ is the shrinkage factor in the mixed model equations. F will be greater than 1 only if $h^2$ is greater than zero. Since $F = 1 + n h^2/(1 - h^2)$, the number of repetitions is given as: $n = (F - 1)(1 - h^2)/h^2$. The significance of F indicates that $h^2$ is non-zero.
Increasing the number of repetitions ($n$) increases the value and power of the F test in detecting significance. It also increases the reliability or heritability at the treatment mean level, given as $h_m^2 = n h^2/[1 + (n - 1) h^2]$, and the accuracy, given as $r_{\hat{g}g} = \sqrt{h_m^2}$ (Resende and Duarte 2007). The variance components enable us to estimate heritability or coefficients of determination at the individual plot and treatment mean levels, given as: $h^2 = \sigma_g^2/(\sigma_g^2 + \sigma_e^2)$ and $h_m^2 = \sigma_g^2/(\sigma_g^2 + \sigma_e^2/n)$, respectively. The $h_m^2$ can also be estimated as $1 - 1/F$ and $h^2$ by $(F - 1)/(F - 1 + n)$, as a function of F and $n$.
High reliability and accuracy can be achieved using an adequate number of repetitions or individuals ( ) per treatment. An is reached, for example, with , for . It can be inferred that and provides high accuracy ( ). From the desired reliability ( ), and according to the heritability of the trait ( ), is given as:
Yan et al. (2015) used the same approach to obtain the optimal number of repetitions ($n$) but fixed the reliability (H) at 0.75, which led to more restricted results. Furthermore, they did not express the results of the equations in terms of individual heritability.
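As a hedged illustration of how repetitions drive reliability, accuracy, and F, the sketch below assumes the usual balanced-data expressions: mean-level heritability n h^2 / (1 + (n - 1) h^2), accuracy as its square root, and F = 1 + n h^2 / (1 - h^2). These forms are assumptions consistent with the relation F = 1/(1 - r^2); the function names are hypothetical.

```python
import math


def mean_heritability(h2, n):
    """Reliability (heritability at the treatment mean level) for n repetitions."""
    return n * h2 / (1.0 + (n - 1.0) * h2)


def accuracy(h2, n):
    """Selection accuracy: square root of the mean-level heritability."""
    return math.sqrt(mean_heritability(h2, n))


def f_value(h2, n):
    """Expected F for individual heritability h2 (< 1) and n repetitions."""
    return 1.0 + n * h2 / (1.0 - h2)


if __name__ == "__main__":
    h2 = 0.20
    for n in (2, 4, 8, 17):
        print(n, round(accuracy(h2, n), 3), round(f_value(h2, n), 2))
```

The two routes agree by construction: 1 - 1/f_value(h2, n) equals mean_heritability(h2, n), so either the F statistic or the pair (h2, n) can be used to report the same accuracy.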
NEW ACCURACY CLASSIFICATION BASED ON STATISTICAL SIGNIFICANCE
The quality of genetic evaluation in the context of plant breeding and experimentation is generally based on the statistical significance (p-value) of the genetic effects of the statistical model and on the accuracy of the genetic values. Initially, significance levels of 1% and 5% were considered sufficient to statistically validate the comparison between genetic treatments (genotypes, varieties, cultivars, clones) (Fisher 1925). These cut-off points have also been used in the comparison and selection of statistical models with a hierarchical or nested structure, for example, the likelihood ratio test (LRT) or deviance analysis (Resende 2007). Measures of significance associated with genetic variance ($\sigma_g^2$) and individual heritability ($h^2$) are also used, for which values must be statistically different from zero for acceptance and validity of the experiment, considering the possibility of sufficient genetic variability for genotype selection.
Geneticists also rely on the magnitude of accuracy (the correlation between predicted and parametric values) to make inferences about the effectiveness of selection and the consequent genetic improvement. Reference values were suggested by Resende and Duarte (2007), with an accuracy of ≥90% necessary for recommending cultivars, and a desirable accuracy of ≥70% for improvement in the context of recurrent selection.
In technical work and in breeding practice, it is often unclear in which situations (in terms of the magnitudes of genetic variance ($\sigma_g^2$) and individual heritability ($h^2$), the significance of genetic effects, and/or statistical differences of fit between prediction models) the selected and validated models, and the levels of genetic variability and heritability present in the breeding populations, are adequate to deliver genetic gains. Which magnitudes are acceptable? For example, is a heritability value at the mean level of 70% favorable for selection? Is an individual heritability equal to 5% valid for selection? To answer these questions, information on the magnitude of the accuracy associated with these situations is necessary. Thus, the key questions are: what is the relationship between accuracy and significance (p-value) in an experiment, and which criterion should be given preference? Discussion of these questions is absent from the scientific literature. This study aims to address these issues.
Since each p-value corresponds to a value of the test statistic under its distribution, associations between the test statistic and the p-value can be established in experimental evaluation. The genetic values estimated from the statistical analysis can be tested against zero using Student's t test with infinite degrees of freedom (Van Vleck et al. 1987). To perform this test, a significance level (or its complement, the degree of confidence) must be chosen, which is usually 5% (95% confidence) and associated with a Student's t value equal to 1.96. Snedecor's F distribution is also used in the analysis of experiments and is asymptotically (as the residual degrees of freedom tend to infinity) equivalent to the square of the Student's t distribution, i.e., $F = t^2$. Asymptotic equivalence also exists between the Chi-square ($\chi^2$) distribution with one degree of freedom and Snedecor's F distribution with one degree of freedom for the numerator and infinite degrees of freedom for the residual. Resende and Duarte (2007) showed the following relationships between F and the square of accuracy: $r_{\hat{g}g}^2 = 1 - 1/F$ and $F = 1/(1 - r_{\hat{g}g}^2)$. Based on these relationships, knowing the value of F allows us to estimate the accuracy via $r_{\hat{g}g} = \sqrt{1 - 1/F}$. Also, $r_{\hat{g}g} = \sqrt{1 - 1/t^2}$ and $r_{\hat{g}g} = \sqrt{1 - 1/\chi^2}$. Thus, the p-value can be inferred from tables of Snedecor's F, Student's t, and Chi-square statistics, with a large (tending to infinite) number of degrees of freedom for the residual, thus establishing a bridge between p-value and accuracy. A relationship also exists between F and the non-centrality parameter (NCP) via $F = 1 + NCP$, that is, $NCP = F - 1$ (Resende and Alves 2020), discussed further below.
In Table 2, we present accuracy values in the first column and associated p-values in the second column. These two columns show which p-value is being accepted when a certain accuracy is practiced. For example, typical accuracy values of 0.70, 0.80 and 0.90 are associated with p-values of 16% (0.16), 10% (0.10) and 2% (0.02), respectively. An accuracy of 0.50 is associated with a p-value of 25% (0.25). Values less than 0.50 for accuracy are unacceptable as they lead to a selective coincidence of less than 50%. In this case, selection would result in more mistakes than successes. Thus, the highest acceptable p-value is less than 25%. Traditionally, p-values greater than 5% are not allowed in selection. The results presented here suggest that p-values between 5% and 20% are also adequate in some situations. Accuracy values should be interpreted as: useless, leading to more wrong than correct selections (below 0.5); useless, leading to selection at random (equal to 0.5); useful, leading to more correct than wrong selections (above 0.5).
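The accuracy-to-p-value direction can be sketched as follows, assuming F = 1/(1 - r^2) with one numerator degree of freedom and infinite residual degrees of freedom, so that F follows a Chi-square distribution with one degree of freedom. The upper-tail probability of that distribution is erfc(sqrt(F/2)), which needs only the standard library. The function name is illustrative.

```python
import math


def p_value_from_accuracy(r):
    """p-value associated with accuracy r (0 < r < 1).

    Assumes F = 1 / (1 - r^2) is referred to a Chi-square distribution
    with 1 degree of freedom (F with 1 and infinite degrees of freedom).
    """
    if not 0.0 < r < 1.0:
        raise ValueError("accuracy must lie strictly between 0 and 1")
    f = 1.0 / (1.0 - r * r)
    # P(chi2_1 > f) = P(|Z| > sqrt(f)) = erfc(sqrt(f / 2))
    return math.erfc(math.sqrt(f / 2.0))


if __name__ == "__main__":
    for r in (0.50, 0.70, 0.80, 0.90):
        print(f"accuracy {r:.2f} -> p-value {p_value_from_accuracy(r):.3f}")
```

Under these assumptions, the computed p-values round to the pairs quoted above: about 0.25, 0.16, 0.10 and 0.02 for accuracies of 0.50, 0.70, 0.80 and 0.90.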
In Table 3, p-values are presented in the first column and associated accuracy values in the second column. These two columns show the selective accuracy being accepted when a certain p-value is used. For example, typical p-values of 0.10, 0.05, and 0.01 are associated with accuracies of 79%, 86%, and 92%, respectively. These traditional 10%, 5% and 1% cut-off points for significance were recently revised, and a p-value of 0.5% (0.005) is currently widely accepted (Benjamin et al. 2018). With this p-value, the associated accuracy is 93%. Using this approach seems appropriate and can be recommended with confidence. Thus, in the final stages of breeding programs, the pair p-value = 0.005 / $r_{\hat{g}g}$ = 93% is strongly recommended. The criteria presented can revive the use of the F distribution and its p-value in the current analytical context of genetic improvement. However, this does not mean that a traditional analysis of variance (ANOVA) is necessary, as an analysis using mixed models provides $F = 1/(1 - r_{\hat{g}g}^2) = \sigma_g^2/PEV$ (Resende and Alves 2020), where PEV is the prediction error variance associated with the empirical best linear unbiased prediction (E-BLUP) of a genetic effect. It is an element extracted from the diagonal of the generalized inverse of the coefficient matrix of the mixed model equations (Fisher information matrix).
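The reverse direction, from a chosen p-value to the accuracy being accepted, can be sketched in the same spirit. The sketch assumes the asymptotic equivalence F = t^2 with a two-sided normal quantile and r = sqrt(1 - 1/F); the function name is hypothetical and the normal quantile comes from Python's standard statistics module.

```python
import math
from statistics import NormalDist


def accuracy_from_p_value(p):
    """Accuracy associated with a p-value p (0 < p < 1).

    Assumes F (1 and infinite degrees of freedom) equals the squared
    two-sided standard normal quantile, and r = sqrt(1 - 1/F).
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p-value must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    f = z * z
    # F <= 1 carries no evidence of genetic variation: accuracy set to 0.
    return math.sqrt(1.0 - 1.0 / f) if f > 1.0 else 0.0


if __name__ == "__main__":
    for p in (0.10, 0.05, 0.01, 0.005):
        print(f"p-value {p:.3f} -> accuracy {accuracy_from_p_value(p):.2f}")
```

With these assumptions the function returns values that round to 0.79, 0.86, 0.92 and 0.93 for p-values of 0.10, 0.05, 0.01 and 0.005, matching the pairs quoted from Table 3.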
The p-value and accuracy ($r_{\hat{g}g}$) columns of Table 3 are plotted in Figure 1. The curve equation describing $r_{\hat{g}g}$ as a function of the p-value is given by
, where -1.521 is the regression coefficient. The fit was good, with R2 = 0.99 and a mean absolute error of 0.01. From this function, we created the estimated equivalences between p-value and accuracy (Table 4).
From a Bayesian perspective, a more direct measure of the strength of evidence for $H_1$ relative to $H_0$ is their odds ratio (ratio of their probabilities). By Bayes' rule, this ratio can be written as: BF x (Pr (H1) / Pr (H0)) = BF x (prior odds), where BF is the Bayes Factor (also related to the Bayesian information criterion (BIC), according to Resende and Alves 2020) that represents the evidence of the data, and the prior probabilities can be informed by the researchers' beliefs, scientific consensus, and validated evidence of similar issues in the same research field. Multiple hypothesis testing, P-hacking, and bias affect the credibility of the evidence, with some of these practices reducing the prior odds of $H_1$ over $H_0$. Analyses of results from reproducibility studies suggest that, for experiments in some fields, the prior odds of $H_1$ with respect to $H_0$ may only be about 1:10. Therefore, for fields where the threshold for defining statistical significance for new discoveries is a p-value < 0.05, Benjamin et al. (2018) proposed a change to a p-value < 0.005, i.e., p-value = 0.05 x 0.10 = 0.005. This simple step would immediately improve the reproducibility of scientific studies in many fields.
PAIRWISE COMPARISONS AND THE MULTIPLICITY PROBLEM IN THE VALIDITY OF THE NEW CLASSIFICATION METHOD
The relations (shown in Tables 2 and 3) between accuracy (via F values associated with 1 degree of freedom for genotypes) and p-values hold, as we show below. Even with more than two treatments in the trials, the precision, reliability and selection accuracy rely on the precision of pairwise comparisons as the basic quantities to be averaged in order to obtain the accuracy or its squared value (also called reliability or broad-sense total heritability at the genotype mean level).
We seek a relation between the reliability (squared accuracy, $r_{\hat{g}g}^2$) of the predictions and the statistical significance (probability of a type I error) of the difference between genotypic treatments. Such reliability can be viewed as a generalized coefficient of determination of treatments and also as a proportional reduction of errors. Piepho (2019) addressed the subject in statistics and gives the generalized coefficient of determination as $R^2 = 1 - \sigma_{full}^2/\sigma_{null}^2$. From this, we can arrive at $r_{\hat{g}g}^2$ as given by the BLUP method. The F-statistic for comparing the full and reduced models can be written as a function of $R^2$ and vice versa (Edwards et al. 2008).
In statistics, $R^2 = 1 - \sigma_{full}^2/\sigma_{null}^2$ [1], where $\sigma_{full}^2$ and $\sigma_{null}^2$ are the error variances of the full and of the null model, respectively (Piepho 2019). From this and in the genetics context, $r_{\hat{g}g}^2 = 1 - PEV/\sigma_g^2$ [2], where PEV is the genetic prediction error variance and $\sigma_g^2$ is the true genotypic variance (Searle et al. 1992). According to Resende and Duarte (2007), $r_{\hat{g}g}^2 = 1 - 1/F$ [3], where F is the Snedecor F statistic calculated from experimental data.
From [2] and [3] it can be perceived that $PEV/\sigma_g^2$ = $1/F$ [4]. From [4], we have $PEV = \sigma_g^2/F$ [5] (Resende and Alves 2020). According to Cullis et al. (2006), the reliability is $1 - \bar{v}_{BLUP}/(2\sigma_g^2)$ [6], where $\bar{v}_{BLUP}$ is the mean (across all pairs of genotypes) variance of the difference between two treatment BLUPs.
From [6] and [3], we have [7], which leads to = [8], and = [9]. From [8] or [9], we get = [10]. Rearranging [9], we arrive at = [11], which gives = [12]. From [12] and [5], we have PEV= [13].
For n replication number of each genotype and according to Resende (2007), [14], and we have PEV= [15], as we intended to prove. The equation in [15] holds for uncorrelated fixed genetic effects (which produces BLUE of the genotypes effects) (Resende 2002, 2007).
Defining the quantity $\bar{v}_{BLUE}$ as the mean (across all pairs of genotypes) variance of the difference between two treatment BLUEs, and after noticing the similarity between $\bar{v}_{BLUE}$ and its BLUP counterpart, another reliability estimation method arises, according to Piepho and Möhring (2007) and also used by Dias et al. (2020), in which the reliability is given by $\sigma_g^2/(\sigma_g^2 + \bar{v}_{BLUE}/2)$ [16], which is similar to [6].
The equivalence between the three approaches (the squared accuracy as given by the BLUP method, [6], and [16]) was demonstrated above. Schmidt et al. (2019) also found empirical evidence of this approximate equivalence. In this way, the overall squared accuracy comes from the average over all pairwise comparisons based on the difference between each two genotype means. The two approaches ([6] and [16]) were shown to be equivalent to obtaining the squared accuracy from the mixed model analysis using the PEV provided by the inversion of the Fisher information matrix.
The results from pairwise comparisons are exact only for a single comparison between two treatments. For a higher number of treatments, there is the multiplicity problem of all pairwise comparisons. To circumvent this, a global protection of the significance level to test the null hypothesis concerning treatment effects should be considered. We therefore adopted the Bonferroni correction and protection (Bonferroni 1936). For T treatments, this approach changes the v distribution of the p-value to that given by the v* = v/T distribution. The significance based on v* should then be attained in order to keep the overall v-based significance. For example, for v equal to 10% and T = 10, v* should be 1% and the F on v* is 6.63 (and rgg = 0.92), which is necessary and sufficient to keep the original v at 10%. So, according to the laws of probability (Papoulis and Pilla 1965, Mood et al. 1974) and the rules of mathematical logic (Lightstone 1978), this leads to the combined accuracy of rggc = rggv x rggv* = 0.79 x 0.92 = 0.73. From these considerations, the accuracies vary with the number of treatments (and so with the number of degrees of freedom for treatments), as expected in F tests obtained in practical experimentation involving diverse numbers of treatments. Residual degrees of freedom were taken as infinite throughout the paper, as the distribution quickly approaches its asymptotic form (according to the asymptotic theory of convergence in distribution and in probability) with relatively small numbers (120, for example, as shown by Steel and Torrie (1980)). In plant breeding, given the high numbers of treatments and of replications, this assumption of residual degrees of freedom approaching infinity is easily met.
From Table 5 and from rggc = rggv x rggv* = 0.79 x 0.92 = 0.73 as above, it can be seen that the Bonferroni correction and protection is conservative, reducing the original (unprotected) accuracy from 0.79 to the combined accuracy of 0.73, in the example of a p-value of 10% and T = 10. The Bonferroni correction is also more rigorous in declaring significance than the unprotected t test. Table 5 must be read in triples (p-value; T number; rgg on v|T). The quantity rgg on v|T stands for rgg on the conditional v|T, which is v given T. We also have rggc = rgg on v|T. Then, in the example, the triple reads (p-value = 10%; T number = 10; rgg on v|T = 0.73).
Other relevant triples are extracted from and highlighted in Table 5: (10%; 200; 0.76); (5%; 100; 0.83); (2.5%; 50; 0.85); (1%; 20; 0.88); (0.5%; 10; 0.89). It can be noticed that, if the T values are sufficiently high, the corrected rgg values are not very different from those obtained in Table 3. On this basis, it can be concluded that p-values equal to or lower than 10% can display high and very high accuracies (Tables 3 and 5). It can also be learnt from Table 5 that the v p-values of 15% and 20% (with T > 4) can display moderate accuracies and so can be considered in selection. On the other hand, a p-value of 25% provided low accuracies regardless of T. So, under these circumstances, a p-value of 25% is unsuitable for selection.
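The worked example of this section (v = 10%, T = 10) can be reproduced with the short sketch below. It follows the text in taking the protected level as v* = v/T and the combined accuracy as the product of the accuracies associated with v and v*. The mapping from p-value to accuracy assumes r = sqrt(1 - 1/F) with F equal to the squared two-sided normal quantile; function names are illustrative.

```python
import math
from statistics import NormalDist


def accuracy_from_p_value(p):
    """Accuracy for a p-value, assuming F(1, inf) = z^2 and r = sqrt(1 - 1/F)."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    f = z * z
    return math.sqrt(1.0 - 1.0 / f) if f > 1.0 else 0.0


def combined_accuracy(v, t):
    """Bonferroni-protected (combined) accuracy for p-value v and T treatments.

    Uses v* = v / T, as in the text, and multiplies the accuracy on v
    by the accuracy on v*.
    """
    if t < 1:
        raise ValueError("number of treatments must be at least 1")
    v_star = v / t
    return accuracy_from_p_value(v) * accuracy_from_p_value(v_star)


if __name__ == "__main__":
    print(round(combined_accuracy(0.10, 10), 2))
```

Because the sketch multiplies unrounded accuracies, other entries may differ in the second decimal place from values obtained by multiplying rounded accuracies.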
SAMPLE SIZES FOR DETECTING SIGNIFICANCE OF TREATMENT EFFECTS
Statistical reference books (Snedecor and Cochran 1967, Steel and Torrie 1980) provide the general expression to calculate the sample size (n) needed to detect significance of treatment effects, as: $n = (Z_\alpha + Z_\beta)^2 \sigma_d^2/\delta^2$, where $Z_\alpha$ and $Z_\beta$ are values of the cumulative distribution functions of Type I (α) and Type II (β) errors, under one-sided hypothesis tests; $\sigma_d^2$ is the variance of the difference between the means of two treatments; and 𝛿 is the size of the actual difference between two means that is intended to be declared significant. The quantity (1-β) is the probability (power) that the experiment presents a significant difference between the means of the treatments. In practice, powers of 80% and 90% are common and suitable. The $\sigma_d^2$ is a function of the residual variance (given as a function of $1 - h^2$) and $\delta^2$ can be taken as the squared difference between an effect and the mass zero point (given as a function of $h^2$).
We then have $n = (Z_\alpha + Z_\beta)^2 (1 - h^2)/h^2$. Considering $n = (F - 1)(1 - h^2)/h^2$ (discussed in the previous topic), we thus have $(Z_\alpha + Z_\beta)^2 = F - 1$, which is the non-centrality parameter. The values of $(Z_\alpha + Z_\beta)^2$ were determined by Snedecor and Cochran (1967) as presented in Table 6. Therefore, we also have $r_{\hat{g}g} = \sqrt{1 - 1/[1 + (Z_\alpha + Z_\beta)^2]}$.
Table 6. Values of $(Z_\alpha + Z_\beta)^2$ in one-sided tests for significance levels 𝛼, determined by Snedecor and Cochran (1967)
From Table 6, an accuracy of 90% is associated with 𝛼 equal to 10% and a power (1-𝛽) equal to 80%, among other combinations of 𝛼 and 𝛽. A summary of these results is presented in Table 7.
Table 7. Significance level and power of the F test associated with the required accuracy levels of 0.90, 0.93, and 0.95
We can see that, to perform an experiment with the desired power (1-β) of 0.90 of the F test and significance of 0.05, an accuracy of 0.95 is necessary. In this case, the probability of detecting a true difference between the genotypes is 0.90 when the significance level is set at 0.05. As expected, there is a proximity between the accuracy ($r_{\hat{g}g}$) and the degree of confidence (1-α). Lee and Bjornstad (2013) demonstrated that hypothesis testing is equivalent to predicting discrete random effects, while Pawitan and Lee (2020) showed that confidence is likelihood and confidence density is, in fact, an extended likelihood.
Furthermore, a relationship between power and the coefficient of determination or squared correlation ($r_{\hat{g}g}^2$) seems to exist for these high accuracy values. The coefficient of determination is also called the proportional error reduction (Linder 1951, Ceapoiu 1968) and is a measure of the proportion of coincidence, hits, correctness, or effectiveness.
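The link between significance level, power, and accuracy described in this section can be sketched numerically. The sketch assumes that the non-centrality parameter equals the squared sum of the one-sided normal quantiles for α and for the power (1-β), that F = 1 + NCP, and that r = sqrt(1 - 1/F). These assumed forms reproduce the two pairs quoted in the text; the function name is hypothetical.

```python
import math
from statistics import NormalDist


def accuracy_from_alpha_power(alpha, power):
    """Accuracy needed for a one-sided test at level alpha with given power.

    Assumes NCP = (z_alpha + z_beta)^2 = F - 1 and r = sqrt(1 - 1/F),
    with z_beta taken as the normal quantile of the power (1 - beta).
    """
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1.0 - alpha)  # one-sided critical value
    z_beta = nd.inv_cdf(power)         # quantile for power = 1 - beta
    ncp = (z_alpha + z_beta) ** 2
    f = 1.0 + ncp
    return math.sqrt(1.0 - 1.0 / f)


if __name__ == "__main__":
    for alpha, power in ((0.10, 0.80), (0.05, 0.90)):
        r = accuracy_from_alpha_power(alpha, power)
        print(f"alpha={alpha:.2f}, power={power:.2f} -> accuracy {r:.2f}")
```

Under these assumptions, α = 0.10 with 80% power gives an accuracy of about 0.90, and α = 0.05 with 90% power gives about 0.95.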
NUMBER OF REPLICATIONS PER GENETIC TREATMENT
The results of the numerical evaluations to determine the number of repetitions in experiments in a single-environment trial are presented in Table 8. Individual heritabilities ($h^2$) ranging from 0.05 to 0.95 were considered to achieve accuracies ($r_{\hat{g}g}$) ranging from 0.50 to 0.99. To determine the number of repetitions in experiments in a single-environment trial, the equation $n = r_{\hat{g}g}^2 (1 - h^2)/[(1 - r_{\hat{g}g}^2) h^2]$ was used (Resende et al. 2014, Resende 2015).
Table 8. Number of repetitions (n) in a single-environment trial for traits with individual heritability ($h^2$) ranging from 0.05 to 0.95 to reach accuracies ($r_{\hat{g}g}$) ranging from 0.50 to 0.99
For traits with an individual heritability ($h^2$) of 0.20, n equal to 17, 37, and 197 repetitions of single-tree plots is required to achieve accuracies ($r_{\hat{g}g}$) equal to 0.90, 0.95, and 0.99, respectively (Table 8). As 90% is a very high accuracy (Resende and Duarte 2007, Resende and Alves 2020), 17 repetitions can be recommended. For traits with high heritability, for example 0.50, the recommended number of repetitions to achieve 90% accuracy is four (Table 8). Another way to apply these results is to use the estimated heritability of the breeding program itself in the environment in which it is conducted.
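The single-environment calculation can be sketched as below, assuming the number of repetitions follows n = r^2 (1 - h^2) / ((1 - r^2) h^2), the form that reproduces the values of 17, 37 and 197 quoted above for a heritability of 0.20. The function name is illustrative.

```python
def repetitions_single_env(h2, r):
    """Repetitions needed to reach accuracy r with individual heritability h2.

    Assumes n = r^2 (1 - h2) / ((1 - r^2) h2), for 0 < h2 < 1 and 0 < r < 1.
    """
    if not (0.0 < h2 < 1.0 and 0.0 < r < 1.0):
        raise ValueError("h2 and r must lie strictly between 0 and 1")
    r2 = r * r
    return r2 * (1.0 - h2) / ((1.0 - r2) * h2)


if __name__ == "__main__":
    for r in (0.90, 0.95, 0.99):
        print(f"h2=0.20, accuracy {r:.2f} -> n = {repetitions_single_env(0.20, r):.1f}")
```

In practice the result would be rounded to a whole number of repetitions, and the heritability supplied could be the estimate from the breeding program itself.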
The results of the numerical evaluations to determine the number of repetitions in multi-environment trials are presented in Table 9. Individual heritabilities ($h^2$) ranging from 0.20 to 0.50 and genetic correlations across environments ($r_{gloc}$) ranging from 0.60 to 1.00 were considered to achieve accuracies ($r_{\hat{g}g}$) of 0.70, 0.80, and 0.90. The number of repetitions in multi-environment trials is given by the expression $n = r_{\hat{g}g}^2 (r_{gloc} - h^2)/\{h^2 [l\,r_{gloc}(1 - r_{\hat{g}g}^2) - r_{\hat{g}g}^2 (1 - r_{gloc})]\}$, which is a function of individual heritability ($h^2$), genetic correlation across environments ($r_{gloc}$), accuracy ($r_{\hat{g}g}$), and number of environments (l).
To achieve accuracies ($r_{\hat{g}g}$) equal to 0.90, 0.80, and 0.70 for traits with an individual heritability ($h^2$) of 0.20 and a genetic correlation across sites ($r_{gloc}$) of 0.80, in three environments (l), an n equal to 8.3, 2.6, and 1.3, respectively, is required per environment. Thus, for an accuracy of 0.90, 8.3 * 3 = 24.9 repetitions of each genetic material are required across the entire experimental network (Table 9). For traits with high heritability, for example 0.50, the recommended number of repetitions is 1.7 * 3 = 5.1 to achieve 90% accuracy (Table 9).
Table 9. Number of repetitions (n) in multi-environment trials (l environments), for traits with individual heritability ($h^2$) ranging from 0.20 to 0.50, and genetic correlation across environments ($r_{gloc}$) ranging from 0.60 to 1.00, to achieve accuracies ($r_{\hat{g}g}$) of 0.70, 0.80, and 0.90
The values found for multiple environments (24.9 and 5.1) (Table 9) differ from the values found for single environments (17.0 and 4.0) (Table 8) because the genetic correlation across environments ($r_{gloc}$) in Table 9 is taken as 0.80, while in Table 8 it is implicitly equal to 1.00. Table 9 shows that with $r_{gloc}$ = 1, the values 17.1 and 4.3 are obtained, suggesting coherence between the two alternative approaches to determine the number of repetitions (n).
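A possible numerical form of the multi-environment calculation is sketched below. It is an assumption, not necessarily the authors' exact expression: it takes the genotype x environment fraction as h^2 (1 - r_gloc) / r_gloc and the residual fraction as 1 - h^2 / r_gloc, and solves the squared accuracy of a genotype mean over l environments with n repetitions each for n. This form reproduces the per-environment values of 8.3, 2.6, 1.3 and 1.7 quoted in the text; the names are hypothetical.

```python
def repetitions_per_environment(h2, r_gloc, r, l):
    """Repetitions per environment to reach accuracy r over l environments.

    Assumed form: n = r^2 (r_gloc - h2) / (h2 [l r_gloc (1 - r^2) - r^2 (1 - r_gloc)]).
    Requires h2 < r_gloc so that the residual fraction stays positive.
    """
    if not h2 < r_gloc <= 1.0:
        raise ValueError("this parametrisation requires h2 < r_gloc <= 1")
    r2 = r * r
    denom = h2 * (l * r_gloc * (1.0 - r2) - r2 * (1.0 - r_gloc))
    if denom <= 0.0:
        # The interaction variance alone caps the accuracy below r for this l.
        raise ValueError("target accuracy is unreachable with this number of environments")
    return r2 * (r_gloc - h2) / denom


if __name__ == "__main__":
    for r in (0.90, 0.80, 0.70):
        n = repetitions_per_environment(0.20, 0.80, r, 3)
        print(f"accuracy {r:.2f}: n = {n:.1f} per environment, {3 * n:.1f} in total")
```

With r_gloc set to 1 the expression collapses to the single-environment formula divided by l, which is why the totals for a perfect correlation agree with the single-environment results.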
NUMBER OF TRIALS AS A FUNCTION OF GENOTYPE X ENVIRONMENT CORRELATION
The results of the simulations to determine the number of sites (l) in multi-environment trials are presented in Table 10. We consider individual heritabilities ($h^2$) ranging from 0.20 to 0.50 and genetic correlations across environments ($r_{gloc}$) ranging from 0.60 to 1.00 to reach accuracies ($r_{\hat{g}g}$) of 0.70, 0.80, and 0.90. The expression $l = r_{\hat{g}g}^2 [n h^2 (1 - r_{gloc}) + r_{gloc} - h^2]/[n\,r_{gloc} h^2 (1 - r_{\hat{g}g}^2)]$ was used.
Table 10. Number of sites (l), conditional on the number of repetitions taken as n=2, n=3, n=4 and n=5, for traits with individual heritability ($h^2$) ranging from 0.20 to 0.50 and genetic correlation across environments ($r_{gloc}$) ranging from 0.60 to 1.00, to achieve accuracies ($r_{\hat{g}g}$) of 0.70, 0.80, and 0.90
For traits with $h^2$ of 0.20, $r_{gloc}$ equal to 0.80, and n equal to five per trial, the number of trials required (l) to achieve accuracies ($r_{\hat{g}g}$) equal to 0.90, 0.80, and 0.70 is 4.3, 1.8, and 1.0, respectively. Thus, choosing an accuracy of 0.90 requires 4.3 * 5 = 21.3 repetitions of each genetic material in the entire experimental network (Table 10). For traits with high heritability, for example 0.50, the recommended number of repetitions is 1.7 * 5 = 8.5 to achieve 90% accuracy (Table 10).
The resulting values of 21.3 and 8.5 differ from the values 17.0 and 4.0 in Table 8 because here the genetic correlation across environments ($r_{gloc}$) is taken as 0.80, while in Table 8 it is implicitly equal to 1. Table 10 shows that with $r_{gloc}$ = 1, values of 17.1 and 4.3 are obtained, demonstrating the coherence between the two alternative approaches to determine the number of repetitions (n).
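The number of sites can be sketched from variance fractions, which makes the underlying model explicit. The sketch assumes, as an illustration rather than the authors' stated expression, that the interaction fraction is h^2 (1 - r_gloc) / r_gloc, that the residual fraction is what remains of the phenotypic variance, and that the squared accuracy of a genotype mean is h^2 / (h^2 + (c2_ge + e2/n) / l). This form reproduces the values of 4.3, 1.8, 1.0 and 1.7 quoted for n = 5; names are hypothetical.

```python
def number_of_sites(h2, r_gloc, r, n):
    """Sites needed to reach accuracy r with n repetitions per site.

    Variance fractions (relative to the phenotypic variance) are assumed as:
      genotype:                 h2
      genotype x environment:   c2_ge = h2 (1 - r_gloc) / r_gloc
      residual:                 e2 = 1 - h2 - c2_ge
    """
    c2_ge = h2 * (1.0 - r_gloc) / r_gloc
    e2 = 1.0 - h2 - c2_ge
    if e2 <= 0.0:
        raise ValueError("h2 must be smaller than r_gloc in this parametrisation")
    r2 = r * r
    # Solve r2 = h2 / (h2 + (c2_ge + e2 / n) / l) for l.
    return r2 * (c2_ge + e2 / n) / (h2 * (1.0 - r2))


if __name__ == "__main__":
    for r in (0.90, 0.80, 0.70):
        sites = number_of_sites(0.20, 0.80, r, 5)
        print(f"accuracy {r:.2f}: l = {sites:.1f} sites, {5 * sites:.1f} repetitions in total")
```

One design consequence visible in this form is that adding repetitions only shrinks the residual term, whereas adding sites shrinks both the residual and the interaction terms.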
USE OF ACCURACY IN DETERMINING THE OPTIMAL PLOT SIZE
An appropriate approach to determining the optimal plot size to evaluate p progenies is to set the total area (p * n * k) of the experiment and condition the number of plants per plot (k) on the number of blocks (n) necessary to obtain an optimal accuracy, typically 0.90. This can be done following Storck et al. (2011), using the maximum curvature method conditioned on the desired accuracy value according to Resende and Duarte (2007). This accuracy depends on $n$ and $CV_e$, which provide a link between the maximum curvature (CV) and the accuracy based on the mean of the evaluated genotypes. Accuracy depends on the magnitude of the coefficient of experimental variation ($CV_e$), the number of repetitions (n), and the coefficient of genetic variation ($CV_g$), according to the alternative formula $r_{\hat{g}g} = \sqrt{1 - 1/(1 + n\,CV_g^2/CV_e^2)}$.
Alternative methods for estimating the optimal plot size are based on the nonlinear relationship $CV_{(x)} = A / X^{B}$, where $CV_{(x)}$ is the coefficient of variation for planned plots of different sizes (X), expressed as a number of base units. The maximum curvature point (Xo) of this function is considered the optimal plot size (Meier and Lessman 1971). For values of X greater than Xo, the drop in $CV_{(x)}$ is minimal, so larger plots do little to reduce the experimental error. Because accuracy is a function of the coefficient of experimental variation and n (Resende and Duarte 2007), this function can be rewritten to incorporate predefined values of n and accuracy (Storck et al. 2011). By fixing the magnitude of the selective accuracy and the number of repetitions in the design of an experiment, and knowing the environmental variability (A and B) of the chosen area, we can prepare a suitable experimental plan that combines the number of repetitions and the plot size. This approach therefore estimates the optimal plot size by relating the variability of the experimental area to the predetermined accuracy (Storck et al. 2011). Another option is to consider the desired accuracy as a function of individual heritability and the coefficient of determination of plot effects, which measures the degree of environmental variation among plots, indicating the appropriate values of k and n. The corresponding expression for accuracy is given by Resende et al. (2001).
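The maximum curvature idea can be sketched numerically. The code below assumes the power-law form CV(x) = A / x^B, uses the closed-form maximizer of its curvature, and then, as a separate and purely illustrative step, solves for the plot size whose CV meets a target accuracy under the mean-based accuracy expression. The parameter values (A = 20, B = 0.5, CVg = 10%) are hypothetical, and the conditioning step is a derivation made here, not necessarily the exact expression of Storck et al. (2011).

```python
import math


def cv_of_plot_size(x, a, b):
    """Coefficient of variation for a plot of x base units: CV(x) = a / x**b."""
    return a / x**b


def curvature(x, a, b):
    """Curvature of CV(x) = a / x**b at plot size x."""
    first = -a * b * x ** (-b - 1)             # first derivative
    second = a * b * (b + 1) * x ** (-b - 2)   # second derivative
    return abs(second) / (1.0 + first**2) ** 1.5


def max_curvature_plot_size(a, b):
    """Plot size that maximizes the curvature of CV(x) (closed form)."""
    return (a**2 * b**2 * (2 * b + 1) / (b + 2)) ** (1.0 / (2 * b + 2))


def plot_size_for_accuracy(a, b, cv_g, n, accuracy):
    """Plot size whose CV gives the target accuracy with n repetitions.

    Assumes accuracy**2 = 1 / (1 + (cv_e / cv_g)**2 / n) and cv_e = CV(x).
    """
    cv_target = cv_g * math.sqrt(n * (1.0 - accuracy**2) / accuracy**2)
    return (a / cv_target) ** (1.0 / b)


if __name__ == "__main__":
    a, b = 20.0, 0.5  # hypothetical soil-heterogeneity parameters
    print(round(max_curvature_plot_size(a, b), 2))
    print(round(plot_size_for_accuracy(a, b, cv_g=10.0, n=4, accuracy=0.90), 2))
```

Comparing the two outputs shows whether the maximum curvature plot size is also large enough to deliver the accuracy fixed in advance for a given number of repetitions.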
DISCUSSION
Genetic selection is the result of prediction and ranking and is central to genetic improvement programs. The efficiency of such improvement is measured by selection accuracy. Model selection, in turn, is related to inference and hypothesis testing and is tangential to genetic improvement. Its effectiveness can be measured by the p-value, among other criteria. To estimate accuracy, models are fitted via the estimation/prediction of their effects, their variance parameters, and the precision of these estimates. Model selection is associated with inferences about the presence of sufficient genetic variability and the significance of the effects of other factors in the model, using hypothesis tests associated with p-values or significance levels. However, questions often arise as to which one to use: accuracy or p-value? The present study shows that there is a link between the two and that both can be used simultaneously. High accuracy and effective model selection enhance the efficacy of the whole breeding program (Resende and Alves 2020).
Accuracy is one of the most important parameters in quantitative genetics and plant breeding. It is used to assess the quality of experiments and to infer the reliability of predicted genotypic values and the statistical validity of the predictive and inferential results. In practical terms, accuracy is also used to compare alternative selection methods, to compute genetic gains with selection, and to plan the size of experiments. Thus, it is a building block of statistical and genetic analyses (Resende 2002).
In a single-environment trial, accuracy is obtained from the heritability and the number of repetitions (n) of each genotype. In multi-environment trials, accuracy is estimated from the heritability, the genotypic correlation across environments, the number of repetitions (n), and the number of experimental environments (l). Conversely, a target accuracy can be used to plan the experimental size, that is, to choose the number of replications (n) and experimental environments (l), which together give the total sample size of a genotype. Selection should be based on several traits. In such cases, the most economically important trait with the lowest heritability is the most suitable basis for determining the numbers of replications and locations.
This position paper aimed to situate and reflect on statistical significance, selection accuracy, and experimental precision in connection with the efficiency of experimentation as applied to genetic selection in plants. We derive equations for accuracy in multi-environment trials, extending the work of Resende and Duarte (2007), and develop a model with GxE interaction effects using genetic parameters and Snedecor's F statistic. Also, we consider estimators for n and l in single- and multi-environment trials. Furthermore, we propose a new methodology to classify accuracy based on statistical significance via the p-value.
CONCLUSIONS
The results for the number of repetitions (n) and environments (l) were given as functions of the individual heritability and the genetic correlation across environments. For traits with an individual heritability of 0.20 and a genetic correlation across environments of 0.80, evaluated in three environments (l = 3), an accuracy of 0.90 requires n = 8.3 repetitions per environment. Thus, across all environments, n * l = 8.3 * 3 = 24.9 repetitions of each genetic material are required.
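The calculation in the previous paragraph can be sketched as follows. The code assumes phenotypic variance scaled to 1, genotypic variance equal to the individual heritability, and genotype x environment variance derived from the genetic correlation across environments; this decomposition is an assumption adopted here, and all function names are illustrative.

```python
import math


def variance_components(h2, r_env):
    """Assumed components with phenotypic variance scaled to 1."""
    vg = h2                             # genotypic variance free of interaction
    vge = h2 * (1.0 - r_env) / r_env    # genotype x environment variance
    ve = 1.0 - h2 / r_env               # residual variance
    return vg, vge, ve


def accuracy_met(h2, r_env, n, l):
    """Accuracy of genotype means over l environments with n repetitions each."""
    vg, vge, ve = variance_components(h2, r_env)
    return math.sqrt(vg / (vg + vge / l + ve / (n * l)))


def reps_per_environment(h2, r_env, l, accuracy):
    """Repetitions per environment needed for a target accuracy, given l."""
    vg, vge, ve = variance_components(h2, r_env)
    slack = l * vg * (1.0 - accuracy**2) / accuracy**2 - vge
    if slack <= 0.0:
        # The interaction variance alone already caps accuracy below the target.
        raise ValueError("target accuracy unattainable with this number of environments")
    return ve / slack


if __name__ == "__main__":
    n = reps_per_environment(h2=0.20, r_env=0.80, l=3, accuracy=0.90)
    print(round(n, 1), round(accuracy_met(0.20, 0.80, n, 3), 2))
```

Under these assumptions the sketch gives about 8.3 repetitions per environment. The unrounded total is close to 24.8, and the 24.9 quoted above follows from multiplying the rounded 8.3 by three. The sketch also makes explicit that, with a genetic correlation below 1, a minimum number of environments is needed before any number of repetitions can reach the target.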
The p-value can be inferred from tables of the Snedecor/Fisher F, Student's t, and Bartlett/Pearson chi-square test statistics, with a large (tending to infinity) number of degrees of freedom for the residual. Therefore, a bridge between the p-value and accuracy can be established by expressing accuracy as a function of one of these three statistics. This link tells statisticians what accuracy is being accepted when a certain p-value is adopted. For example, typical p-values of 0.10, 0.05, and 0.01 are associated with accuracies of 79%, 86%, and 92%, respectively. These traditional 10%, 5%, and 1% cut-off points for significance were recently revisited, and a threshold of 0.5% (0.005) has been proposed (Benjamin et al. 2018). With this p-value, the associated accuracy is 93%. This approach seems appropriate and can be recommended with confidence. Thus, for the final stages of breeding programs, the pair p-value = 0.005 and accuracy = 93% is strongly suggested. Conversely, typical accuracy values of 50%, 70%, 80%, 90%, 93%, and 95% are associated with p-values of 25%, 16%, 10%, 2%, 0.5%, and 0.1%, respectively. With the Bonferroni protection, p-values of up to 20% are acceptable to attest to the significance of genetic effects in models and to proceed with selection between models and between genotypes. P-values below 20% correspond to accuracies above 50%, which are sufficient to enable genetic gain.
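The correspondence between p-values and accuracies quoted above can be reproduced with a minimal sketch. It assumes the relation accuracy = (1 - 1/F)^(1/2) attributed to Resende and Duarte (2007), and further assumes one numerator degree of freedom with infinite residual degrees of freedom, so that F equals the square of a standard normal quantile. Both the single-degree-of-freedom choice and the function names are assumptions made here for illustration.

```python
import math
from statistics import NormalDist


def accuracy_from_p(p_value):
    """Accuracy associated with a two-sided p-value.

    Assumes F = z**2 (1 numerator df, infinite residual df)
    and accuracy = sqrt(1 - 1 / F).
    """
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    f_stat = z * z
    if f_stat <= 1.0:
        return 0.0  # F at or below 1 gives no positive accuracy
    return math.sqrt(1.0 - 1.0 / f_stat)


def p_from_accuracy(accuracy):
    """Two-sided p-value associated with an accuracy (same assumptions)."""
    f_stat = 1.0 / (1.0 - accuracy**2)
    # P(|Z| > z) for a standard normal Z, with z = sqrt(F).
    return math.erfc(math.sqrt(f_stat) / math.sqrt(2.0))


if __name__ == "__main__":
    for p in (0.10, 0.05, 0.01, 0.005):
        print(p, round(accuracy_from_p(p), 2))
    for acc in (0.50, 0.70, 0.80, 0.90):
        print(acc, round(p_from_accuracy(acc), 2))
```

Under these assumptions, p-values of 0.10, 0.05, 0.01, and 0.005 map to accuracies of about 0.79, 0.86, 0.92, and 0.93, and accuracies of 0.50, 0.70, 0.80, and 0.90 map back to p-values of about 0.25, 0.16, 0.10, and 0.02.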
ACKNOWLEDGMENTS
We acknowledge financial support from the Fundação de Amparo à Pesquisa do Estado de Minas Gerais (FAPEMIG).
REFERENCES
- Baxevanos D, Korpetis E, Irakli M, Tsialtas IT (2017a) Evaluation of a durum wheat selection scheme under Mediterranean conditions: adjusting trial locations and replications. Euphytica 213:1-14
- Baxevanos D, Tsialtas J, Vlachostergios D, Goulas C (2017b) Optimum replications and locations for cotton cultivar trials under Mediterranean conditions. Journal of Agricultural Science 155:1553-1564
- Benjamin DJ, Berger JO, Johannesson M, Nosek BA, Wagenmakers EJ, Berk R, Bollen KA, Brembs B, Brown L, Camerer C, Cesarini D, Chambers CD, Clyde M, Cook TD, De Boeck P, Dienes Z, Dreber A, Easwaran K, Efferson C, Fehr E, Fidler F, Field AP, Forster M, George EI, Gonzalez R, Goodman S, Green E, Green DP, Greenwald AG, Hadfield JD, Hedges LV, Held L, Ho TH, Hoijtink H, Hruschka DJ, Imai K, Imbes G, Ioannidis JPA, Jeon M, Jones JH, Kirchler M, Laibson D, List J, Little R, Lupia A, Machery E, Maxwell SE, McCarthy M, Moore DA, Morgan SL, Munafó M, Nakagawa S, Nyhan B, Parker TH, Pericchi L, Perugini M, Rouder J, Rousseau J, Savalei V, Schonbrodt FD, Sellke T, Sinclair B, Tingley D, Van Zandt T, Vazire S, Watts DJ, Winship C, Wolpert RL, Xie Y, Young C, Zinman J, Johnson VE (2018) Redefine statistical significance. Nature Human Behaviour 2:6-10
- Bonferroni CE (1936) Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del R. Istituto Superiore di Scienze Economiche e Commerciali di Firenze 8:3-62
- Ceapoiu N (1968) Metode statistice aplicate in experientele agricole si biologice. Agro-Silvica Bucuresti, 550p
- Cullis BR, Smith AB, Coombes NE (2006) On the design of early generation variety trials with correlated data. Journal of Agricultural, Biological and Environmental Statistics 11:381-393
- Dias KOG, Piepho HP, Guimaraes LJM, Guimaraes PEO, Parentoni SN, Pinto MO, Noda RW, Magalhaes JV, Guimarães CT, Garcia AAF, Pastina MM (2020) Novel strategies for genomic prediction of untested single-cross maize hybrids using unbalanced historical data. Theoretical and Applied Genetics 133:443-445
- Edwards LJ, Muller KE, Wolfinger RD, Qaqish BF, Schabenberger O (2008) An R2 statistic for fixed effects in the linear mixed model. Statistics in Medicine 27:6137-6157
- Fisher RA (1925) Statistical methods for research workers. Oliver and Boyd, Edinburgh and London, 239p
- George N, Lundy M (2019) Quantifying genotype x environment effects in long-term common wheat yield trials from an agroecologically diverse production region. Crop Science 59:1960-1972
- Lee Y, Bjornstad JF (2013) Extended likelihood approach to large-scale multiple testing. Journal of the Royal Statistical Society, Series B (Statistical Methodology) 75:553-575
- Lightstone AH (1978) Mathematical logic: an introduction to model theory. Springer, New York, 338p
- Linder A (1951) Statistische methoden für naturwissenschaftler, mediziner und ingenieure. Birkhäuser Verlag, Basel, 200p
- Meier VD, Lessman KJ (1971) Estimation of optimum field plot shape and size for testing yield in Crambe abyssinica Hochst. Crop Science 11:648-650
- Mood AM, Graybill FA, Boes DC (1974) Introduction to the theory of statistics. McGraw-Hill, Tokyo, 564p
- Papoulis A, Pilla SU (1965) Probability, random variables, and stochastic processes. McGraw Hill, New York, 583p
- Pawitan Y, Lee Y (2020) Confidence as likelihood. Statistical Science 36:509-517
- Piepho HP (2019) A coefficient of determination (R2) for generalized linear mixed model. Biometrical Journal 62:860-872
- Piepho HP, Mohring J (2007) Computing heritability and selection response from unbalanced plant breeding trials. Genetics 177:1881-1888
- Resende MDV (2002) Genética biométrica e estatística no melhoramento de plantas perenes. Embrapa Informação Tecnológica, Brasília, 975p
- Resende MDV (2007) Matemática e estatística na análise de experimentos e no melhoramento genético. Embrapa Florestas, Colombo, 362p
- Resende MDV (2015) Genética quantitativa e de populações. Suprema, Visconde do Rio Branco, 463p
- Resende MDV, Alves RS (2020) Linear, generalized, hierarchical, bayesian and random regression mixed models in genetics/genomics in plant breeding. Functional Plant Breeding Journal 2:1-31
- Resende MDV, Duarte JB (2007) Precisão e controle de qualidade em experimentos de avaliação de cultivares. Pesquisa Agropecuária Tropical 37:182-194
- Resende MDV, Furlani-Junior E, Moraes MLT, Fazuoli LC (2001) Estimativas de parâmetros genéticos e predição de valores genotípicos no melhoramento do cafeeiro pelo procedimento REML/BLUP. Bragantia 60:185-193
- Resende MDV, Silva FF, Azevedo CF (2014) Estatística matemática, biométrica e computacional: modelos mistos, categóricos e generalizados (REML/BLUP), inferência Bayesiana, regressão aleatória, seleção genômica, QTL-GWAS, estatística espacial e temporal, competição, sobrevivência. Suprema, Visconde do Rio Branco, 881p
- Schmidt P, Hartung J, Rath J, Piepho HP (2019) Estimating broad-sense heritability with unbalanced data from agricultural cultivar trials. Crop Science 59:525-536
- Searle SR, Casella G, McCulloch CE (1992) Variance components. Wiley, New York, 536p
- Snedecor GW, Cochran WR (1967) Statistical methods. Iowa State University Press, Ames, 274p
- Steel RGD, Torrie JH (1980) Principles and procedures of statistics: a biometrical approach. McGraw-Hill, New York, 666p
- Storck L, Lopes SJ, Lúcio ADC, Cargnelutti Filho A (2011) Optimum plot size and number of replications related to selective precision. Ciência Rural 41:390-396
- Van Vleck LD, Pollak EJ, Pollak EJ, Branford EA (1987) Genetics for animal sciences. W.H. Freeman, San Francisco, 391p
- Vencovsky R (1987) Herança quantitativa. In Paterniani E and Viégas GP (Org) Melhoramento e produção do milho no Brasil. Fundação Cargil, Campinas, p. 122-199
- Woyann LG, Zdziarski AD, Zanella R, Rosa AC, Conte J, Meira D, Storck L, Benin G (2020) Optimal number of replications and test locations for soybean yield trials in Brazil. Euphytica 216:1-9
- Xu N, Jin SQ, Li J (2016) Designing the national cotton variety trials regarding the number of replicates and number of test locations in China. Acta Agronomica Sinica 42:43-50
- Yan W (2021) Estimation of the optimal number of replicates in crop variety trials. Frontiers in Plant Science 11:2231
- Yan W, Frégeau-Reid J, Martin R, Pageau D, Mitchell-Fetch J (2015) How many test locations and replications are needed in crop variety trials for a target region? Euphytica 202:361-372
- Zhang Y, Xu NY, Guo LL, Yang ZG, Zhang XQ, Yang XN (2020) Optimization of test locations number and replication number in regional winter wheat variety trials in northern China. Acta Agronomica Sinica 46:1166-1173
Publication Dates
- Publication in this collection: 28 Oct 2022
- Date of issue: 2022
History
- Received: 06 June 2022
- Accepted: 31 Aug 2022
- Published: 22 Sept 2022