Multiple Linear Regression versus Automatic Linear Modelling

ABSTRACT

In this study, the performances of Multiple Linear Regression and Automatic Linear Modelling are compared for different sample sizes and numbers of predictors. A comprehensive Monte Carlo simulation study was carried out for this purpose. The material of the study consisted of random numbers generated from the multivariate normal distribution with the RNMVN function of the IMSL library of Microsoft FORTRAN Developer Studio. The results of the simulation showed that the sample size and the number of predictors are the main factors that lead the two methods to produce different results. Although both methods gave very similar results, especially with large samples (n≥100), Automatic Linear Modelling is preferred because of its simplicity in analyzing data and interpreting results, its ability to present results visually, and the more detailed information it provides, especially for large and complex data sets. Automatic Linear Modelling will be especially beneficial in analyzing massive and complex data sets in order to investigate the relationships between one continuous dependent variable and 10 or more predictors and to determine the factors that affect the response (target) variable. It also makes it possible to evaluate the effect of each predictor on the response in more detail.

Keywords:
multiple regression; automatic linear modelling; simulation; R²

INTRODUCTION

Investigating relationships among variables is of great interest to practitioners (Mendeş, 2019; Temizhan et al., 2022). Multiple Linear Regression (MLR) is the most widely used technique for investigating relations between one dependent and several independent variables. Although it is widely used and a valuable tool, some disadvantages limit its use. These disadvantages become more obvious when the number of variables is greater than the number of observations, when the predictors are highly correlated (the multicollinearity problem), and when the data set contains outliers. Moreover, although the MLR is very useful for investigating the relationships between one dependent and several independent variables, it is not recommended in many practical cases because it over-simplifies real-world problems by assuming a linear relationship among the variables (Johnson, 1991; Yan and Su, 2009; Mendeş, 2009). Therefore, the MLR should not be used in such cases; otherwise it may lead to overfitting. When such problems arise, either they are addressed by using different methods and applying transformations, or alternative methods that are not affected by them are applied. Because of its ease of application, its ability to present results visually, and its ability to automatically determine the best subsets and the important independent variables, Automatic Linear Modelling (ALM) can be used efficiently when such problems exist (IBM..., 2012; Field, 2013; Yang, 2013; Rahnama and Rajabpour, 2016; Yakubu et al., 2018; Genç and Mendeş, 2021a). This study aimed to compare the performance of the MLR and the ALM under different experimental conditions via a comprehensive simulation study.

MATERIAL AND METHODS

The material of this study consisted of random numbers generated from the multivariate normal distribution with the RNMVN function of the IMSL library of Microsoft FORTRAN Developer Studio. Three criteria (R², the accuracy level or R² adj., and the rank order of importance of the independent variables) were used to evaluate the appropriateness of the models. To determine reference values for the actual R² and R² adj., 1,000,000 random numbers were generated from the multivariate normal distribution for different variance-covariance matrices. The generated numbers were then transferred into the SPSS package, the MLR and ALM were performed, and the R² and R² adj. values were computed. These values were accepted as the reference (actual) values of R² and R² adj. Later, for p=4, 10, 15, and 20, samples of size 20, 30, 50, 100, and 500 were drawn from the 1,000,000 random numbers. The MLR and ALM procedures were applied to those samples and the R² and R² adj. values were estimated. This process was repeated 100 times, so each estimate is based on 100 trials. Then, the numbers of trials listed below were determined (Genç and Mendeş, 2021b).

  • a) The number of trials where both methods were found to have the same variables as important.

  • b) The number of trials where both the importance of variables and the order of importance were found to be the same.

  • c) The number of trials where the same variables were found to be important, but the order of importance was not the same.

  • d) The number of trials where both methods produced different results.

These numbers were then converted to percentages.
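The bookkeeping behind categories a) to d) can be sketched in a few lines of Python. This is only an illustration: the function names, the representation of each method's output as an ordered list of predictor names, and the example trials are all hypothetical, and treating a) as the sum of b) and c) is an assumption read from the wording of the list.

```python
from collections import Counter


def classify_trial(mlr_ranked, alm_ranked):
    """Compare the important predictors reported by two methods in one trial.

    Each argument lists the predictors a method flagged as important,
    ordered from most to least important (hypothetical representation).
    """
    if set(mlr_ranked) != set(alm_ranked):
        return "d"   # the methods disagree on which predictors matter
    if list(mlr_ranked) == list(alm_ranked):
        return "b"   # same predictors, same order of importance
    return "c"       # same predictors, different order of importance


def summarize(trials):
    """Convert per-trial labels into percentages for categories a to d."""
    counts = Counter(classify_trial(m, a) for m, a in trials)
    n = len(trials)
    pct = {k: 100.0 * counts.get(k, 0) / n for k in ("b", "c", "d")}
    pct["a"] = pct["b"] + pct["c"]   # assumption: a = same predictors, any order
    return pct


# Made-up example trials, not results from the study.
example_trials = [
    (["X1", "X2", "X3"], ["X1", "X2", "X3"]),
    (["X1", "X2", "X3"], ["X2", "X1", "X3"]),
    (["X1", "X2"], ["X1", "X4"]),
    (["X3", "X1"], ["X3", "X1"]),
]
result = summarize(example_trials)
print(result)
```

In the study itself each condition was summarized over 100 trials; the four trials here only exercise each branch of the classification.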

Correlations between the predictors ranged from -0.20 to 0.90. Detailed information about the simulated experimental conditions is given in Table 1.
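The sampling scheme described above can be approximated with a short NumPy sketch. It is not the authors' FORTRAN/SPSS workflow: the equicorrelated covariance structure, the common correlation of 0.5, the reduced population size, and the seed are assumptions chosen only to keep the example small, and ordinary least squares stands in for both fitting procedures.

```python
import numpy as np

rng = np.random.default_rng(12345)   # arbitrary seed, for reproducibility only

p = 4              # number of predictors (the study used 4, 10, 15, and 20)
N_POP = 100_000    # smaller than the study's 1,000,000 to keep the demo fast
RHO = 0.5          # hypothetical common correlation, not taken from Table 1

# Equicorrelated covariance matrix for (Y, X1, ..., Xp) with unit variances.
sigma = np.full((p + 1, p + 1), RHO)
np.fill_diagonal(sigma, 1.0)
population = rng.multivariate_normal(np.zeros(p + 1), sigma, size=N_POP)


def r_squared(data):
    """R2 and adjusted R2 of an OLS fit of column 0 on the remaining columns."""
    y, X = data[:, 0], data[:, 1:]
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return r2, 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Reference ("actual") value computed from the whole generated population.
reference_r2, _ = r_squared(population)

mean_r2 = {}
for n in (20, 30, 50, 100, 500):
    estimates = []
    for _ in range(100):                       # 100 trials per condition
        rows = rng.choice(N_POP, size=n, replace=False)
        estimates.append(r_squared(population[rows])[0])
    mean_r2[n] = float(np.mean(estimates))

print(f"reference R2 = {reference_r2:.3f}")
for n, value in mean_r2.items():
    print(f"n = {n:3d}: mean R2 over 100 trials = {value:.3f}")
```

Under this assumed structure the population R² is ρ²p/(1+(p−1)ρ) = 0.4, so the reference value should land close to that, while the small-sample means are expected to sit above it because R² is biased upward when n is small relative to p.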

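Let p denote the number of variables, n the number of observations, and $X_{ij}$ the ith observation of the jth variable. Then the mean vector and the variance-covariance matrix are as follows:

$$\boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix}$$

where $\mu_i = E(X_i) = \int x_i f(x_i)\,dx_i$ is the mean of the ith component of X.

The covariance between $X_i$ and $X_j$ is $\sigma_{ij} = E[(X_i-\mu_i)(X_j-\mu_j)] = E(X_iX_j) - \mu_i\mu_j$, and the variance of each $X_i$ is $\sigma_{ii} = E[(X_i-\mu_i)^2] = E(X_i^2) - \mu_i^2$.

In this case, the variance-covariance matrix is as follows:

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1p} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{p1} & \sigma_{p2} & \cdots & \sigma_{pp} \end{bmatrix}$$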
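As a numerical check of these definitions, the following sketch draws a sample from a multivariate normal distribution and confirms that the two forms of the covariance, E[(Xi−μi)(Xj−μj)] and E(XiXj)−μiμj, agree. The mean vector, the covariance matrix, the sample size, and the seed are hypothetical values chosen for illustration; they are not the matrices used in the study.

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed

mu = np.array([1.0, -2.0, 0.5])                # hypothetical mean vector
sigma = np.array([[1.0, 0.6, -0.2],
                  [0.6, 1.0, 0.3],
                  [-0.2, 0.3, 1.0]])           # hypothetical covariance matrix

X = rng.multivariate_normal(mu, sigma, size=200_000)
n = X.shape[0]

mean_hat = X.mean(axis=0)                      # sample estimate of the mean vector

# Definition form: average of (Xi - mu_i)(Xj - mu_j)
centered = X - mean_hat
cov_definition = centered.T @ centered / n

# Shortcut form: average of Xi*Xj minus mu_i*mu_j
cov_shortcut = X.T @ X / n - np.outer(mean_hat, mean_hat)

print(np.round(cov_definition, 3))
print("two forms agree:", np.allclose(cov_definition, cov_shortcut))
```

With a sample this large, the estimated matrix should also be close to the matrix used to generate the data, which is the property the simulation relies on when it treats the large generated population as the reference.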

Table 1
Characteristics of Simulation Study

Multiple linear regression (MLR) is one of the most widely used statistical techniques for explaining the relationship between one continuous dependent variable and two or more independent variables. If Y is the dependent (response) variable and X1, X2, …, Xp are the independent (predictor) variables, the multiple regression model is as follows, and it provides a prediction of the Y values from the X values.

$$Y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \varepsilon_i$$

where $Y_i$ is the response variable, $X_1, X_2, \ldots, X_p$ are the independent variables, $\beta_0$ is the constant term or intercept, $\beta_1, \beta_2, \ldots, \beta_p$ are the regression coefficients, and $\varepsilon_i$ is the random error.
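For reference, the two numerical criteria used in this study can be written in terms of the model above. The text does not state these formulas, so the expressions below are the standard textbook definitions, with $\hat{Y}_i$ denoting the fitted value and $\bar{Y}$ the sample mean of the response (both symbols are introduced here).

```latex
\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_1 + \hat{\beta}_2 X_2 + \cdots + \hat{\beta}_p X_p

R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}

R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
```

The adjustment factor (n−1)/(n−p−1) grows as p approaches n, which is why R² and R² adj. can differ noticeably for the small samples and large numbers of predictors considered here, and why R² adj. is undefined once n ≤ p+1.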

Automatic Linear Modeling (ALM) was introduced in version 19 of IBM SPSS and enables researchers to select the best subset of predictors automatically. In ALM, the predictors are transformed automatically to improve the fit to the data. In performing ALM analysis, SPSS uses rescaling of time and other measurements, outlier trimming, category merging, and other methods. Although the ALM can be used on small and medium-sized data sets, it is most useful with large and complex data sets, and its advantages become more obvious when there are many predictors. On the other hand, because the ALM is simple to use, includes automatic data preparation steps, and identifies important variables automatically, its potential for misuse should not be overlooked. Therefore, after the final candidate models are determined, it is of great benefit to evaluate these models carefully by considering various criteria and asking some important questions (IBM..., 2012; Yang, 2013; Genç and Mendeş, 2021b; Mendeş, 2021).
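To give a feel for what automatic subset selection involves, the sketch below implements a generic forward selection driven by a corrected Akaike criterion (AICc). It is a stand-in for illustration only and does not reproduce the SPSS procedure: the data preparation steps, the exact criterion, and the stopping rules of ALM are not modelled, the AICc expression is one common convention, and the function names and simulated data are hypothetical.

```python
import numpy as np


def aicc(y, columns):
    """Corrected AIC of an OLS fit with intercept (one common convention)."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + columns)
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    sse = np.sum((y - design @ beta) ** 2)
    k = design.shape[1] + 1            # coefficients plus the error variance
    return n * np.log(sse / n) + 2 * k + 2 * k * (k + 1) / (n - k - 1)


def forward_select(X, y):
    """Greedy forward selection: add the predictor that lowers AICc the most."""
    remaining = list(range(X.shape[1]))
    chosen = []
    best = aicc(y, [])                 # start from the intercept-only model
    while remaining:
        score, candidate = min(
            (aicc(y, [X[:, j] for j in chosen + [c]]), c) for c in remaining
        )
        if score >= best:              # no candidate improves the criterion
            break
        best = score
        chosen.append(candidate)
        remaining.remove(candidate)
    return chosen


# Simulated data: only predictors 0 and 2 truly affect the response.
rng = np.random.default_rng(7)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.standard_normal(200)

selected = forward_select(X, y)
print("selected predictors, in order of entry:", selected)
```

The order of entry here is only a rough proxy for importance; it is not the predictor importance measure reported by SPSS, and a greedy search can retain a noise predictor by chance, which is one reason the final candidate models still need careful evaluation.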

RESULTS

Descriptive statistics for p=4, 10, 15, and 20 are given in Table 2 and the results for performances of the MLR and ALM are given in Table 3.

Table 2
Descriptive statistics for the R² estimates of MLR and ALM for sample sizes of 20, 30, 50, 100, and 500 when p=4, 10, 15, and 20

Table 2 shows that the mean R² estimates of the MLR and ALM are generally similar, and that this similarity becomes more evident as the sample size increases. The estimates tend to increase as the number of predictors increases, and this increase is most pronounced when the number of predictors is between 4 and 10.

Table 3
Simulation results for evaluating performances of MLR and ALM

Table 3 shows that the probabilities of both techniques finding the same variables to be important are quite similar and generally range from 92% to 100%. As the sample size increases, the probability that both techniques give the same result becomes very high, reaching 100% for sample sizes of 100 or more regardless of the number of predictors.

When the percentage of trials in which the same predictors were found to be important in the same order of importance is examined, it can be seen that this percentage decreases as the number of predictors increases. The decrease is even more pronounced when p≥15.

The percentage of trials in which the same variables were found to be important but the order of importance differed increases as the number of predictors and the sample size increase. This increase becomes more prominent when p≥15 and n≥100.

On the other hand, as expected, the percentages of trials in which the two methods produced different results are very close to one another. As can be seen from the last column of Table 3, the MLR and ALM gave different results especially when the sample size was small. For large sample sizes (n≥100), no difference was observed between the two methods regardless of the number of predictors. When all results are evaluated together, it is possible to conclude that differences in the sample size and the number of predictors may lead the methods to produce different results. In general, however, the ALM provides more detailed information and is simpler to apply and interpret, especially with large and complex data sets. Therefore, it will be beneficial to use the ALM especially in analyzing massive and complex data sets in order to investigate the relationships between one continuous dependent variable and 10 or more predictors and to determine the factors that affect the response (target) variable. It will also be possible to evaluate the effect of each predictor on the response in more detail.

DISCUSSION

Although the MLR is widely used in practice and is a valuable tool for investigating the relationships between dependent and independent variables, it is not recommended in many cases because it over-simplifies real-world problems by assuming a linear relationship among the variables. It also requires the data set to satisfy certain assumptions, and these assumptions are generally not fulfilled in practice. Several situations therefore limit the use of multiple linear regression analysis. The first limitation is the assumption of linearity between the dependent variable and the independent variables. In the real world this relationship is not linear in many cases, and accuracy decreases as the data depart from linearity. The second limitation appears when the number of observations is smaller than the number of predictors (n<p). Because this may cause overfitting or overestimation, the MLR should not be used in such cases. The third limitation is that the MLR assumes there is no multicollinearity among the predictors. If multicollinearity is present in the data set, it should be handled. The fourth limitation is that the MLR is very sensitive to outliers, so the data should be checked for outliers before the MLR analysis is performed. Even if all of the above assumptions are fulfilled, the MLR is also not appropriate when the number of variables exceeds the number of observations (Johnson, 1991; Mendeş, 2009; Yakubu et al., 2018). This becomes more obvious with complex data sets. The ALM, which is a member of the linear modelling family, might be used efficiently instead of the MLR. The ALM has three main features: a) predictors can be both continuous and categorical; b) it automatically finds the most important predictors and eliminates those of little or no importance in predicting the dependent variable; and c) it automatically determines whether the data set contains outliers. Another important feature of the ALM is that it presents the results graphically, which makes them very easy to interpret.
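The second limitation can be seen in a small numerical experiment. In the sketch below, the response and the predictors are independent random noise, yet a least-squares fit with fewer observations than predictors reproduces the response exactly. The dimensions and the seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)        # arbitrary seed

n, p = 10, 15                         # fewer observations than predictors
X = rng.standard_normal((n, p))       # predictors: pure noise
y = rng.standard_normal(n)            # response: unrelated to X by construction

design = np.column_stack([np.ones(n), X])
# lstsq returns the minimum-norm solution when the system is underdetermined.
beta, _, rank, _ = np.linalg.lstsq(design, y, rcond=None)

fitted = design @ beta
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

print(f"rank of design matrix = {rank}, n = {n}")
print(f"R2 on pure noise = {r2:.6f}")
```

Because the design matrix has as many independent columns as there are observations, the fit passes through every data point and R² is essentially 1 even though no real relationship exists, which is the overfitting the text warns about.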

The results of this study showed that the MLR and ALM gave different results especially when the sample size was small, whereas no difference was observed between the two methods for large sample sizes (n≥100), regardless of the number of predictors. In general, however, the ALM provides more detailed information and is simpler to apply and interpret, especially with large and complex data sets. A few previous studies that used the ALM to investigate relationships between dependent and independent variables also reported that it can be considered a valuable tool, especially for large and complex data sets. For example, Oshima and Dell-Ross (2016) reported that Automatic Linear Modeling can be an indispensable screening tool, especially when there are many predictors. However, once a handful of final candidates are chosen, it is the researcher's responsibility to evaluate those models carefully against various criteria along with substantive questions. Likewise, Yakubu et al. (2018) used the ALM to predict the heat stress index in Sasso hens and reported that it can be used efficiently for that purpose. Genç and Mendeş (2021a) used the ALM to model the factors affecting the 305-day milk yield of dairy cows and reported that it can be used efficiently to investigate the relationships between one continuous response and several predictors measured on different scales. Mendeş (2021) used the ALM to evaluate the results of Monte Carlo simulation studies and reported that it can be used efficiently to determine the factors that affect the response variable in a large and complex data set.

When the results of this study and previous studies are evaluated together, it is possible to conclude the following: a) using the ALM to analyze large and complex data sets makes the results easier to interpret; b) the ALM makes it possible to evaluate higher-order interactions among the independent variables; and c) the ALM can be used efficiently to analyze data sets from all kinds of research as long as there are many predictors. However, the potential for misuse of the ALM due to its simplicity should not be ignored.

CONCLUSION

In conclusion, the ALM can be used efficiently to investigate relationships between one dependent variable and several predictors, especially with large and complex data sets.

REFERENCES

  • FIELD, A. Discovering statistics using IBM SPSS statistics. 4.ed. Los Angeles: SAGE, 2013. 952p.
  • GENÇ, S.; MENDEŞ, M. Evaluating performance and determining optimum sample size for regression tree and automatic linear modeling. Arq. Bras. Med. Vet. Zootec., v.73, p.1391-1402, 2021b.
  • GENÇ, S.; MENDEŞ, M. Linear modeling analysis using for determining the factors affecting 305-day milk yield. Arq. Bras. Med. Vet. Zootec., v.73, p.949-954, 2021a.
  • IBM SPSS statistics 21 algorithms. Chicago: IBM SPSS Inc., 2012.
  • JOHNSON, J.D. Applied multivariate data analysis. New York: Springer-Verlag, 1991.
  • MENDEŞ, M. Determination of minimum sample size for testing effect of independent variables in multiple linear regression analysis: a Monte Carlo simulation study. Türkiye Klinikleri Biyoistatistik, v.1, p.38-44, 2009.
  • MENDEŞ, M. Re-evaluating the Monte Carlo simulation results by using graphical techniques. Türkiye Klinikleri J. Biostatistics, v.13, p.28-38, 2021.
  • MENDEŞ, M. Statistical methods and experimental design. İstanbul: Kriter Yayınevi, 2019.
  • OSHIMA, T.C.; DELL-ROSS, T. All possible regressions using IBM SPSS: a practitioner’s guide to automatic linear modeling. 2016. In: GEORGIA EDUCATIONAL RESEARCH ASSOCIATION CONFERENCE. Proceeding... Georgia: GERA, 2016.
  • RAHNAMA, H.; RAJABPOUR, S. Identifying effective factors on consumers’ choice behavior toward green products: the case of Tehran, the capital of Iran. Environ. Sci. Pollut. Res., v.24, p.911-925, 2016.
  • TEMİZHAN, E.; MİRTAGİOĞLU, H.; MENDEŞ, M. Which correlation coefficient should be used for investigating relations between quantitative variables? Am. Acad. Sci. Res. J. Eng. Technol. Sci., v.85, p.265-277, 2022.
  • YAKUBU, A.; OLUREMI, O.I.A.; EKPO, E.I. Predicting heat stress index in Sasso hens using automatic linear modeling and artificial neural network. Int. J. Biometeorol., v.62, p.1181-1186, 2018.
  • YAN, X.; SU, X.G. Linear regression analysis: theory and computing. Singapore: World Scientific Publishing Co. Pte., 2009. 315p.
  • YANG, H. The case for being automatic: introducing the automatic linear modeling (LINEAR) procedure in SPSS Statistics. Multiple Linear Regression Viewpoints, v.39, p.27-37, 2013.

Publication Dates

  • Publication in this collection
    09 Feb 2024
  • Date of issue
    Jan-Feb 2024

History

  • Received
    26 June 2023
  • Accepted
    20 Sept 2023
Universidade Federal de Minas Gerais, Escola de Veterinária Caixa Postal 567, 30123-970 Belo Horizonte MG - Brazil, Tel.: (55 31) 3409-2041, Tel.: (55 31) 3409-2042 - Belo Horizonte - MG - Brazil
E-mail: abmvz.artigo@gmail.com