ABSTRACT:
In recent decades, refined analytical procedures such as artificial neural networks (ANN) have come to be regarded as accurate, efficient, and widely applicable tools for data mining and prediction in different contexts, including plant breeding. This study was developed to establish a new classification proposal for targeting genotypes in breeding programs that brings together classical models, such as the complete diallel, and modern prediction techniques. The proposal was based on the standard deviation values of an interpopulation diallel, and the study also verified the possibility of training a neural network with the genetic parameters standardized to a discrete scale. We used 12 maize populations intercrossed in a complete diallel scheme (66 hybrids), evaluated during the 2005/2006 crop season in three different environments in southern Brazil. The implemented multilayer perceptron (MLP) architecture and associated parameters allowed the development of a generalist model of genotype classification. The MLP neural network model was efficient in predicting parental and interpopulation hybrid classifications from the average genetic components of a complete diallel, regardless of the evaluation environment.
Keywords: genetic selection; neural networks; genetic parameters; maize breeding; combining ability
Introduction
In plant breeding projects, techniques that reduce the number of crosses or efficiently predict the performance of missing hybrids help optimize breeding programs by focusing on promising crosses. According to Peixoto et al. (2015), a new trend toward enhanced analytical procedures, such as artificial neural networks (ANN), has emerged; these techniques are considered more accurate, efficient, and widely applied for data mining and prediction (Rad, 2018).
Multilayer perceptron (MLP) neural networks are considered a derivation of ANNs, since they involve more than one hidden layer in the modeling process. According to Cunningham (1995), both lie at the intersection of Computer Science and Statistics known as Machine Learning (ML). One of the advantages of ML techniques is the ability to capture characteristics of interest even when the probability distributions are unknown (Duda et al., 2012).
In an agronomic context, the use of single or multilayer neural networks has gained relevance in recent years and has shown to be efficient in analyzing complex systems to predict yield (Leal et al., 2015; Soares et al., 2015), determine physiological activities of plants (Feng et al., 2017; Abrishami et al., 2019), and identify diseases through images (Zhang et al., 2018), among other applications.
In plant breeding, studies have focused on plant identification (Pandolfi et al., 2009), classifying genotypes for stability and adaptability (Nascimento et al., 2013), evaluating genetic diversity (Sant’Anna et al., 2015), estimating genetic values (Peixoto et al., 2015; Silva et al., 2014), and genomic selection (González-Camacho et al., 2018; Montesinos-López et al., 2018). However, the literature does not report studies that use MLP techniques to classify genotypes from the average genetic components of diallel crosses, an approach that could be an advantage in automating genotype classification.
To bring together classical models, such as the complete diallel, and modern prediction techniques, this study was developed to establish a new classification proposal to target genotypes in breeding programs, based on standard deviation values of an interpopulation diallel, and to verify the possibility of training a neural network with the genetic parameters standardized to a discrete scale.
Materials and Methods
Genetic material and data used in the analyses
Twelve maize populations (PC9703, PC9702, PC9502, PC9901, PC9902, PC9903, PC0201, PC0202, PC0203, PMI 8701, PMI 0301, and GI045) developed by the Agronomic Institute of Paraná (IAPAR) were intercrossed in the 2004/2005 crop season, following a complete diallel model, to obtain 66 interpopulation hybrids without reciprocals. The interpopulation hybrids and their parents were evaluated during the 2005/2006 crop season in three different environments in southern Brazil (Table 1).
Evaluation environments of 66 interpopulation hybrids and 12 parental populations in southern Brazil. 2005/2006 crop season.
The experimental design comprised randomized blocks with two replications per site. Each experimental plot consisted of one 5-m long row, with 0.80 m between rows and five plants per linear meter after thinning. We evaluated grain yield (GY), expressed in kg ha–1 and corrected to a standard moisture of 13.5 %, by weighing the grains of all corn ears harvested in the experimental plots.
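As an illustration of the yield adjustment described above, the following Python sketch converts a plot grain weight to kg ha–1 at 13.5 % moisture. The dry-matter correction formula, the function name, and the example plot weight and moisture are assumptions made for illustration and are not given in the text; only the plot area (one 5-m row at 0.80 m spacing, that is, 4 m²) and the 13.5 % standard come from the description.

```python
def grain_yield_kg_ha(plot_weight_kg, moisture_pct,
                      plot_area_m2=5.0 * 0.80,
                      standard_moisture_pct=13.5):
    """Convert a plot grain weight to kg/ha at a standard moisture content.

    Assumed (commonly used) dry-matter correction:
        corrected = weight * (100 - moisture) / (100 - standard moisture)
    """
    corrected_kg = (plot_weight_kg * (100.0 - moisture_pct)
                    / (100.0 - standard_moisture_pct))
    # Scale from the plot area to one hectare (10,000 m2).
    return corrected_kg * 10_000.0 / plot_area_m2


# Hypothetical plot: 3.2 kg of grain harvested at 18 % moisture.
example_yield = grain_yield_kg_ha(plot_weight_kg=3.2, moisture_pct=18.0)
print(f"{example_yield:.0f} kg/ha")
```

Grain harvested above the standard moisture is scaled down, and grain already at 13.5 % is only scaled by area.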
Statistical analysis
We performed the individual analysis of variance, Hartley's F maximum test, and the joint analysis of the three sites, considering p ≤ 0.05, using the ExpDes package in R software (R Core Team, version 3.6.1). The complete diallel analysis was performed in the computer program Genes (Cruz, 2013), using the model proposed by Gardner and Eberhart (1966) for parents and F1, adapted by Morais et al. (1991) for evaluation in different environments.
The estimate of specific heterosis is given by model 4 of Gardner and Eberhart (1966) through the following formula: $\hat{s}_{ij} = h_{ij} - (\bar{h} + \hat{h}_i + \hat{h}_j)$, where: $\hat{s}_{ij}$ is the estimate of specific heterosis; $h_{ij}$, the heterosis of hybrid ij; $\bar{h}$, the mean heterosis of all populations; and $\hat{h}_i$ and $\hat{h}_j$, the estimates of the variety heterosis effects of populations i and j. The estimate of general combining ability (GCA) for each population is obtained by dividing the variety effect (υi) by two and adding this result to the hi effect, that is, $\hat{g}_i = \upsilon_i/2 + \hat{h}_i$.
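The two relations in this paragraph can be written as a short Python sketch. It assumes the usual Gardner and Eberhart decomposition of heterosis, h_ij = h_bar + h_i + h_j + s_ij, so that specific heterosis is the remainder once the mean and variety heterosis effects are subtracted. The function names and the numeric values are hypothetical and are not estimates from this study.

```python
def specific_heterosis(h_ij, h_bar, h_i, h_j):
    """s_ij: heterosis of hybrid ij not explained by the mean heterosis
    (h_bar) and the variety heterosis effects of both parents (h_i, h_j)."""
    return h_ij - (h_bar + h_i + h_j)


def general_combining_ability(v_i, h_i):
    """g_i: half of the variety effect plus the variety heterosis effect."""
    return v_i / 2.0 + h_i


# Hypothetical values in kg/ha, for illustration only.
s_example = specific_heterosis(h_ij=500.0, h_bar=300.0, h_i=100.0, h_j=-50.0)
g_example = general_combining_ability(v_i=200.0, h_i=-30.0)
print(s_example, g_example)
```

A population with a positive variety effect can still have a low GCA if its variety heterosis effect is strongly negative, which is the pattern discussed later for PC 9902.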
Multilayer perceptron network prediction models
To predict classes via MLP, considering supervised machine-learning models, the standard deviations of the GCA effects of the populations (gi and gj) and of the specific heterosis (sij) of hybrids ij were first calculated. Standards were then assigned to each value, as follows: standard 1, for positive GCA or sij values greater than or equal to the standard deviation; standard 2, for positive values below the standard deviation, though greater than or equal to zero; standard 3, for values below zero. Based on this information, four classes (A, B, C and D) were established according to the intended purpose of selection (Table 2).
Proposed classification of populations and interpopulation hybrids of maize based on the standards (1 to 3) of the General Combining Ability (GCA; gi and gj) of the parents and of specific heterosis (sij).
The 198 data points from the evaluation of 66 hybrids in three environments were divided into training and testing sets. Different sample sizes were used for the training set, representing 20-80 % of the total set, and the remaining percentage was assigned to the test set. The MLP network used in the current study (Figure 1) was composed of one input layer with four neurons, corresponding to the coding of environments (1 to 3), the GCA effects (gi and gj), and sij; three hidden layers with 5, 10, and 5 neurons, respectively; and one output layer containing the four classifications (A, B, C and D) described in Table 2.
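The discretization into standards 1 to 3 can be sketched in Python as follows. The function names are hypothetical, the use of the sample standard deviation is an assumption (the text does not say which estimator was used), and the example values are invented. The mapping from standards to classes A to D depends on Table 2, which is not reproduced here, so the sketch stops at the standards.

```python
from statistics import stdev


def standard(value, sd):
    """Return the standard (1, 2 or 3) for one GCA or s_ij estimate."""
    if value >= sd:      # positive and at least one standard deviation
        return 1
    if value >= 0:       # zero or positive, but below the standard deviation
        return 2
    return 3             # below zero


def code_values(values):
    """Discretize a set of estimates using their own standard deviation."""
    sd = stdev(values)   # assumption: sample standard deviation
    return [standard(v, sd) for v in values]


# Hypothetical GCA estimates (kg/ha) for four populations.
codes = code_values([100.0, -100.0, 10.0, -10.0])
print(codes)
```

Each hybrid is then described by three discrete codes (for gi, gj, and sij), which is the information Table 2 combines into the four selection classes.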
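A minimal sketch of the partition into training and test sets is given below, in Python. It assumes simple random sampling without replacement and rounding of the training-set size to the nearest integer; the text does not describe the sampling scheme, and the function name and seed are hypothetical.

```python
import random

N_POINTS = 66 * 3  # 66 hybrids evaluated in three environments


def split_indices(n_total, train_fraction, seed):
    """Randomly split row indices into a training and a test set."""
    rng = random.Random(seed)
    indices = list(range(n_total))
    rng.shuffle(indices)
    n_train = round(n_total * train_fraction)
    return indices[:n_train], indices[n_train:]


# Training sets from 20 % to 80 % of the data; the remainder is the test set.
for fraction in (0.2, 0.5, 0.8):
    train, test = split_indices(N_POINTS, fraction, seed=1)
    print(fraction, len(train), len(test))
```

Repeating the split with different seeds gives the repeated simulations over which the coincidence percentages are summarized.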
Multilayer perceptron network architecture consisting of one input layer (I.L.), three hidden layers (H.L.), and one output layer (O.L.).
We considered 5000 iterations (epochs) in each of the 100 simulations, using the Rectified Linear Unit (ReLU) activation function on the hidden layers and the Softmax function on the output layer. To avoid overfitting, we used the early stopping method as a form of regularization. The analyses were performed with the H2O package (Ledell et al., 2020) in R software (R Core Team, 2019).
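To make the architecture concrete, the following Python sketch runs a single forward pass through a 4-5-10-5-4 network with ReLU on the hidden layers and Softmax on the output. It is not the H2O model used in the study: the weights are random and untrained, no early stopping is implemented, and the input scaling and all names are assumptions made for illustration.

```python
import math
import random

LAYER_SIZES = [4, 5, 10, 5, 4]  # input, three hidden layers, output (A-D)


def init_layers(sizes, seed=0):
    """Create random (untrained) weights and zero biases for each layer."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(layers, inputs):
    """ReLU on the hidden layers, Softmax on the output layer."""
    activations = list(inputs)
    for k, (weights, biases) in enumerate(layers):
        z = [sum(w * a for w, a in zip(row, activations)) + b
             for row, b in zip(weights, biases)]
        activations = softmax(z) if k == len(layers) - 1 else relu(z)
    return activations


layers = init_layers(LAYER_SIZES)
# Hypothetical scaled inputs: environment code, g_i, g_j, s_ij.
probabilities = forward(layers, [1.0, 0.4, -0.2, 0.9])
predicted_class = "ABCD"[probabilities.index(max(probabilities))]
print(probabilities, predicted_class)
```

The Softmax output sums to one across the four classes, and the predicted class is the one with the highest probability.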
Results and Discussion
Diallel analysis
Based on Hartley's F maximum test, no environment needed to be excluded from the analysis. The joint diallel analysis (Table 3) showed low coefficients of variation (CV%) for GY, indicating high experimental precision. The parents did not form a homogeneous group, since the source of variation “varieties” was significant. According to Hallauer et al. (2010), the effect of varieties is related to additive components, in contrast to heterosis, which is related to dominance components.
Mean squares (MS) obtained through the joint diallel analysis, according to the Gardner and Eberhart (1966) model and adapted by Morais et al. (1991), for grain yield in kg ha–1 (GY). The 2005/2006 crop season.
In the decomposition of the heterosis source, the average heterosis (h̄) was significant, demonstrating that, in terms of overall averages, the interpopulation hybrids outperformed the parents. The significance of variety heterosis (hi) allowed the analysis of the heterotic response of each population used as a parent; that is, the significance of this source of variation (SV) indicates that the populations differ in their respective average gene frequencies (Vencovsky, 1970). In turn, the non-significance of specific heterosis (sij) indicates that the composition of the final average of hybrids (in all three environments) is not affected, partly because of the effects of allelic complementarity. However, environmental effects should be analyzed together with their interactions with the main effects (h̄, hi, and sij).
The significance of Environments (E) shows that the evaluation environments are distinct in terms of overall averages, in part due to edaphoclimatic differences. When decomposing the environmental effects, a significant interaction was observed for υi × E, indicating that the parental populations responded differently in yield across environments. For the hij × E effect and its decompositions, hi × E was non-significant; however, as the GCA effect is directly associated with both hi and υi, and υi × E presented a sum of squares 2.41 times higher than that of hi × E, the discussion of GCA was conducted within each evaluation environment.
Populations PC 0202, PC 0203, and PMI 0301 showed positive GCA values in the three evaluated environments (Table 4). Of these three populations, only PC 0202 and PC 0203 had positive variety effects (υi) in at least two environments, indicating that such populations have potential for improving grain yield through intrapopulation breeding.
Estimates of variety effect (υi), variety heterosis (hi), and general combining ability (gi) for grain yield (kg ha–1) at three sites in Paraná State. The 2005/2006 crop season.
Estimates for the PMI 0301 population, despite showing positive GCA values in all three environments, were mostly based on the effect of hi, as the υi effects were negative in the three environments, as was also the case for the PMI 8701 and GI 045 populations. This was expected, given that the genetic bases of the three populations were formed by commercial materials or old populations with agronomic patterns different from those of the others evaluated.
In the three environments, positive and high υi values were observed for population PC 9902; however, the GCA values showed an increase of 167.9 kg ha–1 in LD and 173.5 kg ha–1 in PG and a negative value in GUA (–50.5 kg ha–1) for the set of crosses involving this parent, which may be associated with the significantly negative hi values, restricting the use of this material to intrapopulation breeding.
The results of sij were used to detect the response of the best crosses between populations (Table 5). The values obtained ranged from –3856 kg ha–1 (PC9502 × PC0201) to 1872 kg ha–1 (PC9502 × PC0203) considering the three environments and these extreme values were detected in the same environment (GUA). Similar studies for hybrid selection per se have been based on SCA and GCA for different variables in order to better understand the demand in the corn grain trade (Amiruzzaman et al., 2013; Gralak et al., 2015; Nardino et al., 2016).
General Combining Ability (gi and gj) for each parent population i and j, estimates for positive specific heterosis in the three environments for eight interpopulation hybrids (sij) and classification based on the average genetic components.
From the selection based on sij within each environment, a set of eight interpopulation hybrids kept positive values of this average component in all three environments, even with different magnitudes. Due to the effects of hij × E interaction, and its decomposition in the source of sij × E, the same interpopulation hybrid was classified differently between the environments.
The PC 9902 × PC 0203 and PC 9902 × PMI 0301 hybrids were classified as “A” in PG; that is, both parental populations presented considerable performance per se, and the hybrids indicate a parental cross with allelic complementation of interest.
However, in the LD environment, the hybrids were classified as “B”, since the three average components, although positive, remained within the range between zero and their respective standard deviations. In the GUA environment, the same hybrids were classified as “C”, mainly due to the negative gj effects presented by PC 9902.
In addition to the interaction and dominance effects (directly related to sij), the epistasis effect, also expressed in the differences in allelic frequencies from parents to loci (Hallauer et al., 2010), should be considered for the control of a particular feature. Another point to consider for specific heterosis is presented by Sprague and Tatum (1942) regarding the specific combining ability of line crosses that represent the deviation of the hybrid from what is expected in the general combining ability of parents.
In this case, as environmental variations are expected to contribute to a differential expression of additive and non-additive effects, the breeding method used may be distinct for the same parent set and resulting hybrids, where the focus may be on intrapopulation improvement for a given environment, while in another environment, the reciprocal recurrent selection may be more advantageous in the long term.
MLP network classification
In building an automated classifier from an MLP network, the smallest training-set percentages showed the widest variation (Figure 2). Using only 20 % of the data as the model training set, the coincidence for the prediction of test data ranged from 64.6 to 98.1 %, with an average of 87.6 %. Training sets above 70 % of the total data narrowed the range to 95-100 %, with an overall average of 99.5 %, reaching an average coincidence of 99.7 % when 80 % of the data were used for training.
Percentage of coincidences in relation to the size of the training set used in the selection class prediction process.
Other classification studies in breeding using neural network techniques, especially with MLP modeling, have shown the potential of the technique, with a high associated accuracy rate. Sant’Anna et al. (2015) used neural networks to evaluate genetic diversity in simulated populations, seeking to classify and form divergent groups, and noted that procedures based on multivariate discriminant functions (Fisher-Anderson) presented unsatisfactory results in discriminating populations derived from controlled crosses. The authors also concluded that classification analyses through neural networks were superior to the conventional multivariate discriminant methods.
Oda et al. (2007) mention that selecting superior genotypes requires methods capable of efficiently exploiting the available genetic material and maximizing the genetic gain for the different characteristics of interest. In this sense, biases should be minimal in any selection process and, in the case of an automated classification process, such as the use of ANNs, the adopted model must be highly reliable in view of the subsequent directions adopted in the breeding program.
The average bias rates found for the 20 % least coincident simulations ranged from 20.9 to 1.6 % for training sets of 20 to 80 % of the data, respectively (Table 6). When the 50 % least coincident simulations were used, the biases varied from 17.1 % to 0.6 % for training sets of 20 % to 80 % of the data, respectively, demonstrating the high efficacy of the model adopted to classify the populations and their respective hybrids in relation to the breeding strategy.
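The summary over the least coincident simulations can be sketched in Python as below. The sketch assumes that the bias rate is 100 minus the mean percentage of coincidence of the selected simulations, which is an interpretation of the text rather than a stated definition. The function name and the coincidence values are hypothetical.

```python
def mean_bias_of_worst(coincidences_pct, worst_fraction):
    """Mean bias (%) over the least coincident share of the simulations.

    Assumption: bias = 100 - percentage of coincidence.
    """
    ordered = sorted(coincidences_pct)                  # worst first
    n_worst = max(1, round(len(ordered) * worst_fraction))
    worst = ordered[:n_worst]
    return 100.0 - sum(worst) / len(worst)


# Hypothetical coincidence percentages from ten simulations.
simulations = [100.0, 98.0, 96.0, 94.0, 92.0, 90.0, 88.0, 86.0, 84.0, 82.0]
bias_20 = mean_bias_of_worst(simulations, 0.20)
bias_50 = mean_bias_of_worst(simulations, 0.50)
print(bias_20, bias_50)
```

Looking only at the worst simulations gives a conservative view of the classifier, since the mean over all simulations can hide occasional poor splits.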
Average percentage of coincidences (predicted value equal to observed value for training sets) associated to 20 % and 50 % less coincident simulations for different sample sizes for training sets.
The model used is of the supervised type, that is, the classifications are known a priori by the researcher; however, this column of information is omitted so that the model makes the prediction based only on gi and sij. In the end, the original classifications are retrieved and compared with the model predictions in order to construct the confusion matrix between the predicted and observed classes.
According to Peixoto et al. (2015), ANNs are a promising tool for predicting genetic values in balanced experiments. The authors evaluated the efficiency of neural networks in predicting genetic values for different heritability estimates and coefficients of variation in 16 randomized block experiments and found that ANNs were efficient in predicting genetic values, with gains of 0.64-10.3 % compared to phenotypic values.
Montesinos-López et al. (2018) also examined a densely connected network architecture, referred to as deep learning (DL), and compared it with the genomic best linear unbiased prediction (GBLUP) model. The GBLUP method presented better performance than the DL network, based on reports in the literature. As noted by the authors, in addition to the scarcity of data in terms of the number of observations, a major challenge in training a DL network is the risk of overfitting, which occurs when the errors on the training set are low but the errors on the test set are high. In this case, the method is unable to learn how to generalize properly from the information in the data.
Thus, this study aimed to show that, for a network less complex than DL, such as the MLP, the algorithm and the activation functions implemented allowed the development of a generalist classification model from different training sets. We observed that the larger the training set, the greater the degree of coincidence, as expected.
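The comparison between retrieved and predicted classes can be illustrated with a small Python sketch that builds a confusion matrix and the corresponding percentage of coincidence. The observed and predicted labels are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter

CLASSES = ("A", "B", "C", "D")


def confusion_matrix(observed, predicted, classes=CLASSES):
    """Rows are observed classes, columns are predicted classes."""
    counts = Counter(zip(observed, predicted))
    return [[counts[(o, p)] for p in classes] for o in classes]


def coincidence_pct(matrix):
    """Percentage of cases on the diagonal (predicted equals observed)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


# Hypothetical test-set labels.
observed = ["A", "B", "B", "C", "D", "C", "A", "B"]
predicted = ["A", "B", "C", "C", "D", "C", "A", "A"]

matrix = confusion_matrix(observed, predicted)
for label, row in zip(CLASSES, matrix):
    print(label, row)
print(coincidence_pct(matrix))
```

Off-diagonal cells show which classes are confused with each other, which matters here because each class implies a different breeding strategy.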
Thus, the MLP neural network model was efficient in predicting parental and interpopulation hybrid selection classifications by automating selection from average genetic components obtained from a complete diallel, regardless of the evaluation environment.
References
- Abrishami, N.; Sepaskhah, A.R.; Shahrokhnia, M.H. 2019. Estimating wheat and maize daily evapotranspiration using artificial neural network. Theoretical and Applied Climatology 135: 945-958.
- Amiruzzaman, M.; Islam, M.A.; Hasan, L.; Kadir, M.; Rohman, M.M. 2013. Heterosis and combining ability in a diallel among elite inbred lines of maize (Zea mays L.). Emirates Journal of Food and Agriculture 25: 132-137.
- Cunningham, S.J. 1995. Machine Learning and Statistics: A Matter of Perspective. University of Waikato, Hamilton, New Zealand.
- Cruz, C.D. 2013. Genes: a software package for analysis in experimental statistics and quantitative genetics. Acta Scientiarum. Agronomy 35: 271-276.
- Duda, R.O.; Hart, P.E.; Stork, D.G. 2012. Pattern Classification. John Wiley, Hoboken, NJ, USA.
- Feng, Y.; Peng, Y.; Cui, N.; Gong, D.; Zhang, K. 2017. Modeling reference evapotranspiration using extreme learning machine and generalized regression neural network only with temperature data. Computers and Electronics in Agriculture 136: 71-78.
- Ferreira, E.B.; Cavalcanti, P.P.; Nogueira, D.A. 2018. ExpDes: Experimental Designs package. Universidade Federal de Alfenas, MG, Brazil.
- Gardner, C.O.; Eberhart, S.A. 1966. Analysis and interpretation of the variety cross diallel and related populations. Biometrics 22: 439-452.
- González-Camacho, J.M.; Ornella, L.; Pérez-Rodríguez, P.; Gianola, D.; Dreisigacker, S.; Crossa, J. 2018. Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. The Plant Genome 11: 12-15.
- Gralak, E.; Faria, M.V.; Rossi, E.S.; Possatto Junior, O.; Gabriel, A.; Mendes, M.C.; Scapim, C.A.; Neumann, M. 2015. Combining ability of maize hybrids for grain yield and severity of leaf diseases in circulant diallel. Brazilian Journal of Maize and Sorghum 14: 116-129.
- Hallauer, A.R.; Carena, M.J.; Miranda Filho, J.B. 2010. Quantitative Genetics in Maize Breeding. Springer Science, New York, NY, USA.
- Leal, A.J.F.; Miguel, E.P.; Baio, F.H.R.; Neves, D.D.C.; Leal, U.A.S. 2015. Artificial neural networks for corn yield prediction and definition of site-specific crop management through soil properties. Bragantia 74: 436-444 (in Portuguese, with abstract in English).
- Ledell, E.; Gill, N.; Aiello, S.; Fu, A.; Candel, A.; Click, C.; Kraljevic, T.; Nykodym, T.; Aboyoun, P.; Kurka, M.; Malohlava, M. 2020. R interface for the ‘H2O’ scalable machine learning platform. Available at: https://cran.r-project.org/web/packages/h2o/h2o.pdf [Accessed Feb 16, 2020]
- Montesinos-López, A.; Montesinos-López, O.A.; Gianola, D.; Crossa, J.; Hernández-Suárez, C.M. 2018. Multi-environment genomic prediction of plant traits using deep learners with dense architecture. G3 - Genes, Genomes, Genetics 8: 3813-3828.
- Morais, A.R.; Oliveira, A.C.; Gama, E.; Souza Junior, C.L. 1991. A method for combined analysis of the diallel crosses repeated in several environments. Pesquisa Agropecuária Brasileira 26: 371-381.
- Nardino, M.; Souza, V.Q.; Baretta, D.; Konflanz, V.A. 2016. Partial diallel analysis among maize lines for characteristics related to the tassel and the productivity. African Journal of Agricultural Research 11: 974-982.
- Nascimento, M.; Peternelli, L.A.; Cruz, C.D.; Nascimento, A.C.C.; Ferreira, R.D.P.; Bhering, L.L.; Salgado, C.C. 2013. Artificial neural networks for adaptability and stability evaluation in alfalfa genotypes. Crop Breeding and Applied Biotechnology 13: 152-156.
- Oda, S.; Mello, E.J.; Silva, J.F.; Souza, I.C.G. 2007. Forest improvement. p. 51-71. In: Borém, A, ed. Forest biotechnology. Universidade Federal de Viçosa, Viçosa, MG, Brazil.
- Pandolfi, C.; Mugnai, S.; Azzarello, E.; Bergamasco, S.; Masi, E.; Mancuso, S. 2009. Artificial neural networks as a tool for plant identification: a case study on Vietnamese tea accessions. Euphytica 166: 411-421.
- Peixoto, L.A.; Bhering, L.L.; Cruz, C.D. 2015. Artificial neural networks reveal efficiency in genetic value prediction. Genetics and Molecular Research 14: 6796-6807.
- Rad, M.R.N. 2018. Artificial neural networks and its role in plant breeding under drought stress. Current Investigations in Agriculture and Current Research 1: 33-35.
- Sant’Anna, I.C.; Tomaz, R.S.; Silva, G.N.; Nascimento, M.; Bhering, L.L.; Cruz, C.D. 2015. Superiority of artificial neural networks for a genetic classification procedure. Genetics and Molecular Research 14: 9898-9906.
- Silva, G.N.; Tomaz, R.S.; Sant’Anna, I.D.C.; Nascimento, M.; Bhering, L.L.; Cruz, C.D. 2014. Neural networks for predicting breeding values and genetic gains. Scientia Agricola 71: 494-498.
- Soares, F.C.; Robaina, A.D.; Peiter, M.X.; Russi, J.L. 2015. Corn crop production prediction using artificial neural network. Ciência Rural 45: 1987-1993 (in Portuguese, with abstract in English).
- Sprague, G.F.; Tatum, L.A. 1942. General vs specific combining ability in single crosses of corn. Journal of the American Society of Agronomy 34: 923-932.
- Zhang, X.; Qiao, Y.; Meng, F.; Fan, C.; Zhang, M. 2018. Identification of maize leaf diseases using improved deep convolutional neural networks. IEEE Access Journal 6: 30370-30377.
Edited by: Leonardo Oliveira Medici
Publication Dates
Publication in this collection: 14 June 2021
Date of issue: 2022
History
Received: 12 Nov 2020
Accepted: 15 Jan 2021