Acessibilidade / Reportar erro

Investigation of using missing data imputation methodologies effect on the SARIMA model performance: application to average monthly flows

Investigação do efeito do uso de metodologias de imputação de dados faltantes no desempenho do modelo SARIMA: aplicação para vazões médias mensais

ABSTRACT

Accuracy in river flows forecasts is crucial for Hydrology, but is challenged by fluviometric data quality. This study investigates the impact of different missing data imputation methods on the Seasonal Autoregressive Integrated Moving Average (SARIMA) model performance. SARIMA (1,1,1)(0,1,1)12 was selected using semi-automated criteria, such as lowest AIC, significant parameters (p-value < 0.05) and residuals adequacy. This model was then compared with reconstructed series using different imputation methods such as Mean (AM), Median (M), Spline and Stinemann Interpolations, Regional Weighting (RW), Multiple Linear Regression (MLR), Multiple Imputation (MI) and Maximum Likelihood (ML). The data were analyzed considering scenarios of 5, 20 and 40% missing data, following random and block patterns, using data from the Doce River, in Southeast Brazil. Results obtained by the performance indicators and, their respective relative differences, indicated that, univariate (AM and M) and multivariate (PW and RLM) methods limited the model's performance, while univariate Spline and Stine and multivariate IM and ML methods didn't present significant limitations, except Spline for the block pattern. It is concluded that, future predictions accuracy depends, not only on a well-trained and validated model, but also on the appropriate use of missing data imputation methods.

Keywords:
Missing data imputation methodologies; Forecast; SARIMA

RESUMO

A precisão nas previsões de vazão dos rios é crucial para a Hidrologia, mas é desafiada pela qualidade dos dados fluviométricos. Este estudo investiga o impacto de diferentes métodos de imputação de dados faltantes no desempenho do modelo Autoregressivo Integrado de Médias Móveis Sazonal (SARIMA). O modelo SARIMA (1,1,1)(0,1,1)12 foi selecionado usando critérios semi-automatizados, como menor AIC, parâmetros significativos (p-valor < 0,05) e adequação dos resíduos. Este modelo foi então comparado com séries reconstruídas usando diferentes métodos de imputação, como Média (AM), Mediana (M), Interpolações Spline e Stinemann, Ponderação Regional (RW), Regressão Linear Múltipla (MLR), Imputação Múltipla (MI) e Máxima Verossimilhança (ML). Os dados foram analisados considerando cenários de 5, 20 e 40% de dados faltantes, seguindo padrões aleatórios e de blocos, utilizando dados do Rio Doce, no Sudeste do Brasil. Os resultados obtidos pelos indicadores de desempenho e suas respectivas diferenças relativas, indicaram que, métodos univariados (AM e M) e multivariados (PW e RLM) limitaram o desempenho do modelo, enquanto os métodos univariados Spline e Stine e multivariados IM e ML não apresentaram limitações significativas, exceto Spline para o padrão de blocos. Conclui-se que a precisão das previsões futuras depende, não apenas de um modelo bem treinado e validado, mas também, do uso adequado de métodos de imputação de dados faltantes.

Palavras-chave:
Metodologias de imputação de dados faltantes; Previsão; SARIMA

INTRODUCTION

Although flow forecasting constitutes a relevant focus of interest for Hydrology, its conduct presents substantial challenges, mainly the increase in its quality. Successful forecasts not only make it possible to contribute to important sectors of water planning, such as human supply (Liu et al., 2021Liu, X., Sang, X., Chang, J., & Zheng, Y. (2021). Multi-model coupling water demand prediction optimization method for megacities based on time series decomposition. Water Resources Management, 35(12), 4021-4041. http://doi.org/10.1007/s11269-021-02927-y.
http://doi.org/10.1007/s11269-021-02927-...
), control of floods and droughts (Ahmad et al., 2022Ahmad, I., Waseem, M., & Zhang, J. (2022). Developing monthly hydrometeorological timeseries forecasts to reservoir operation in a transboundary river catchment. Theoretical and Applied Climatology, 147(3-4), 1663-1674. http://doi.org/10.1007/s00704-021-03901-9.
http://doi.org/10.1007/s00704-021-03901-...
), reservoirs operation for water storage and/or hydroelectric power generation, irrigation (Aghelpour et al., 2021Aghelpour, P., Bahrami-Pichaghchi, H., & Varshavian, V. (2021). Hydrological drought forecasting using multi-scalar streamflow drought index, stochastic models and machine learning approaches, in northern Iran. Stochastic Environmental Research and Risk Assessment, 35(8), 1615-1635. http://doi.org/10.1007/s00477-020-01949-z.
http://doi.org/10.1007/s00477-020-01949-...
) and industry (Schäfer et al., 2016Schäfer, M. P., Dietrich, O., & Mbilinyi, B. (2016). Streamflow and lake water level changes and their attributed causes in Eastern and Southern Africa: state of the art review. International Journal of Water Resources Development, 32(6), 853-880. http://doi.org/10.1080/07900627.2015.1091289.
http://doi.org/10.1080/07900627.2015.109...
), but they also anticipate water use conflicts and are fundamental to mitigating the impacts of climate change (Musa, 2013Musa, J. J. (2013). Stochastic modeling of Shiroro River stream flow process. American Journal of Engineering Research, 2(6), 49-54.).

Among applied forecasting methodologies, physical, conceptual and data-based methods stand out (Apaydin et al., 2021Apaydin, H., Sattari, M. T., Falsafian, K., & Prasad, R. (2021). Artificial intelligence modelling integrated with Singular Spectral analysis and Seasonal-Trend decomposition using Loess approaches for streamflow predictions. Journal of Hydrology, 600, 126506. http://doi.org/10.1016/j.jhydrol.2021.126506.
http://doi.org/10.1016/j.jhydrol.2021.12...
). While the first two require complex variables that are difficult to appropriate, data-based methods offer direct analyzes of the variable of interest, being advantageous in this sense (Khodakhah et al., 2022Khodakhah, H., Aghelpour, P., & Hamedi, Z. (2022). Comparing linear and non-linear data-driven approaches in monthly river flow prediction, based on the models SARIMA, LSSVM, ANFIS, and GMDH. Environmental Science and Pollution Research International, 29(15), 21935-21954. http://doi.org/10.1007/s11356-021-17443-0.
http://doi.org/10.1007/s11356-021-17443-...
).

The attributes highlighted by Fu et al. (2019)Fu, J., Zhong, P. A., Chen, J., Xu, B., Zhu, F., & Zhang, Y. (2019). Water resources allocation in transboundary river basins based on a game model considering inflow forecasting errors. Water Resources Management, 33(8), 2809-2825. http://doi.org/10.1007/s11269-019-02259-y.
http://doi.org/10.1007/s11269-019-02259-...
such as the ability to establish a quantitative interaction between inputs and outputs, fast modeling speed and high prediction accuracy, are essential factors that drive the choice of data-driven methods. For this reason, in recent years, techniques such as artificial intelligence, deep learning, machine learning and time series have been widely adopted in forecasting hydrological variables.

Time series methodology, developed in the 1970s by statisticians Box and Jenkins, which consolidated the class of ARMA (Autoregressive and Moving Average) models, allowed the establishment of several generic models. Seasonal Autoregressive Integrated Moving Average (SARIMA) model stands out in this class, as it considers seasonality, an intrinsic characteristic of flow time series (Abudu et al., 2010Abudu, S., Cui, C. L., King, J. P., & Abudukadeer, K. (2010). Comparison of performance of statistical models in forecasting monthly streamflow of Kizil River, China. Water Science and Engineering, 3(3), 269-281. http://doi.org/10.3882/j.issn.1674-2370.2010.03.003.
http://doi.org/10.3882/j.issn.1674-2370....
; Bayer et al., 2012Bayer, D. M., Castro, N. M. D. R., & Bayer, F. M. (2012). Modelagem e previsão de vazões médias mensais do rio Potiribu utilizando modelos de séries temporais. RBRH, 17(2), 229-239. http://doi.org/10.21168/rbrh.v17n2.p229-239.
http://doi.org/10.21168/rbrh.v17n2.p229-...
).

Essentially, time series methodologies efficiency depends on the serial correlation structure. In this way, data quality plays a crucial role in the successful application of these methodologies. In contrast, the incidence of missing data in flow time series imposes limitations on the autocorrelation function, leading to questions about viability and performance of forecast models, generating uncertainty in results and decision making (Giustarini et al., 2016Giustarini, L., Parisot, O., Ghoniem, M., Hostache, R., Trebs, I., & Otjacques, B. (2016). A user-driven case-based reasoning tool for infilling missing values in daily mean river flow records. Environmental Modelling & Software, 82, 308-320. http://doi.org/10.1016/j.envsoft.2016.04.013.
http://doi.org/10.1016/j.envsoft.2016.04...
; Dembélé et al., 2019Dembélé, M., Oriani, F., Tumbulto, J., Mariéthoz, G., & Schaefli, B. (2019). Gap-filling of daily streamflow time series using Direct Sampling in various hydroclimatic settings. Journal of Hydrology, 569, 573-586. http://doi.org/10.1016/j.jhydrol.2018.11.076.
http://doi.org/10.1016/j.jhydrol.2018.11...
).

The main causes of missing data in flow time series are the hydrometrist absence, failures in the instruments of data collection, loss of annotations, and monitoring termination or interruptions (Gao et al., 2018Gao, Y., Merz, C., Lischeid, G., & Schneider, M. (2018). A review on missing hydrological data processing. Environmental Earth Sciences, 77(2), 47. http://doi.org/10.1007/s12665-018-7228-6.
http://doi.org/10.1007/s12665-018-7228-6...
; Bleidorn et al., 2022Bleidorn, M. T., Pinto, W. P., Schmidt, I. M., Mendonça, A. S. F., & Reis, J. A. T. D. (2022). Methodological approaches for imputing missing data into monthly flows series. Revista Ambiente & Água, 17(2), e2795. http://doi.org/10.4136/ambi-agua.2795.
http://doi.org/10.4136/ambi-agua.2795...
). Therefore, there has been a growing interest in recent years in the missing data treatment, including through different imputations approaches (Hamzah et al., 2020Hamzah, F. B., Mohd Hamzah, F., Mohd Razali, S. F., Jaafar, O., & Abdul Jamil, N. (2020). Imputation methods for recovering streamflow observation: a methodological review. Cogent Environmental Science, 6(1), 1745133. http://doi.org/10.1080/23311843.2020.1745133.
http://doi.org/10.1080/23311843.2020.174...
). Examples include the works of Tencaliec et al. (2015)Tencaliec, P., Favre, A. C., Prieur, C., & Mathevet, T. (2015). Reconstruction of missing daily streamflow data using dynamic regression models. Water Resources Research, 51(12), 9447-9463. http://doi.org/10.1002/2015WR017399.
http://doi.org/10.1002/2015WR017399...
, that used dynamic regression to impute missing data from the Durance River, France; Dembélé et al. (2019)Dembélé, M., Oriani, F., Tumbulto, J., Mariéthoz, G., & Schaefli, B. (2019). Gap-filling of daily streamflow time series using Direct Sampling in various hydroclimatic settings. Journal of Hydrology, 569, 573-586. http://doi.org/10.1016/j.jhydrol.2018.11.076.
http://doi.org/10.1016/j.jhydrol.2018.11...
, that applied direct sampling with different hydroclimatic settings to fill gaps in flow data from the Volta River, West Africa; Semiromi & Koch (2019)Semiromi, M. T., & Koch, M. (2019). Reconstruction of groundwater levels to impute missing values using singular and multichannel spectrum analysis: application to the Ardabil Plain, Iran. Hydrological Sciences Journal, 64(14), 1711-1726. http://doi.org/10.1080/02626667.2019.1669793.
http://doi.org/10.1080/02626667.2019.166...
, that used singular spectrum analysis and multichannel to reconstruct groundwater levels in the Ardabil Plain, Iran; Arriagada et al. (2021)Arriagada, P., Karelovic, B., & Link, O. (2021). Automatic gap-filling of daily streamflow time series in data-scarce regions using a machine learning algorithm. Journal of Hydrology, 598, 126454. http://doi.org/10.1016/j.jhydrol.2021.126454.
http://doi.org/10.1016/j.jhydrol.2021.12...
, that utilized a machine learning algorithm to fill missing data on daily flows in several Chile watersheds; Hamzah et al. (2022)Hamzah, F. B., Mohamad Hamzah, F., Mohd Razali, S. F., & El-Shafie, A. (2022). Multiple imputations by chained equations for recovering missing daily streamflow observations: a case study of Langat River basin in Malaysia. Hydrological Sciences Journal, 67(1), 137-149. http://doi.org/10.1080/02626667.2021.2001471.
http://doi.org/10.1080/02626667.2021.200...
, that used equations of multiple chained imputations (IM) to fill missing data of Langat River, Malaysia.

In order to preserve the analysis quality in studies with time series, different authors previously used methodologies to impute missing data. For example, in the study by Pinto et al. (2015)Pinto, W. P., Lima, G. B., & Zanetti, J. B. (2015). Previsão de regimes de vazões médias mensais do rio Doce, Colatina-Espírito Santo. Ciência e Natura, 37(3), 1-11. http://doi.org/10.5902/2179460X17143.
http://doi.org/10.5902/2179460X17143...
, was used the maximum likelihood methodology through the EM (Expectation-Maximization) algorithm to reconstruct the monthly mean flows series from Doce River, Brazil, before adjust the SARIMA (1,1,1)(1,1,2)12 model. Bleidorn et al. (2019)Bleidorn, M. T., Pinto, W. P., Braum, E. S., Lima, G. B., & Montebeller, C. A. (2019). Modelagem e previsão de vazões médias mensais do rio Jucu, ES, utilizando o modelo SARIMA. Irriga, 24(2), 320-335. http://doi.org/10.15809/irriga.2019v24n2p320-335.
http://doi.org/10.15809/irriga.2019v24n2...
utilized the average of the closest neighbors to perform the missing data imputation in the time series from Jucu River, Brazil, and, subsequently, the SARIMA (1,0,0)(5,1,0)12 model was fitted. Duarte et al. (2019)Duarte, V. B. R., Silva, F. D. C. S., Souza, I. V., Silva, M. V. C., Almeida Sousa, H. G., Giongo, M., & Viola, M. R. (2019). Previsão de vazão na bacia hidrográfica do rio Manuel Alves da Natividade utilizando o modelo de séries temporais SARIMA. Journal of Biotechnology and Biodiversity, 7(4), 457-468. http://doi.org/10.20873/jbb.uft.cemaf.v7n4.duarte.
http://doi.org/10.20873/jbb.uft.cemaf.v7...
imputed missing data of the monthly mean flows series of Manuel Alves da Natividade River, Brazil, using the Kalman filter, and after model adjustment analysis, the SARIMA (1,0,1)(1,1,4)12 was the one that presented the best performance. Salame et al. (2019)Salame, C. W., Queiroz, J. C. B., Souza, E. B., Farias, V. J. C., Rocha, E. J. P., & Moura, H. P. (2019). Um estudo comparativo dos modelos Box-Jenkins e Redes Neurais Artificiais na previsão de vazões e precipitações pluviométricas da Bacia Araguaia, Tocantins, Brasil. Brazilian Journal of Environmental Sciences, (52), 28-43. http://doi.org/10.5327//Z2176-947820190444.
http://doi.org/10.5327//Z2176-9478201904...
utilized the IM to fill the missing values of flows and precipitation in the Araguaia Watershed, Brazil and, afterwards, compare the performance of Box and Jenkins approach and artificial neural networks for forecasts of the studied variables. Phan & Nguyen (2020)Phan, T.-T.-H., & Nguyen, X. H. (2020). Combining statistical machine learning models with ARIMA for water level forecasting: the case of the Red river. Advances in Water Resources, 142, 103656. http://doi.org/10.1016/j.advwatres.2020.103656.
http://doi.org/10.1016/j.advwatres.2020....
used linear imputation and/or moving average to fill the missing data of a monitoring station of the Red River, China, with posterior adjustment of ARIMA models. Retike et al. (2022)Retike, I., Bikše, J., Kalvāns, A., Dēliņa, A., Avotniece, Z., Zaadnoordijk, W. J., Jemeljanova, M., Popovs, K., Babre, A., Zelenkevičs, A., & Baikovs, A. (2022). Rescue of groundwater level time series: how to visually identify and treat errors. Journal of Hydrology, 605, 127294. http://doi.org/10.1016/j.jhydrol.2021.127294.
http://doi.org/10.1016/j.jhydrol.2021.12...
performed data reconstruction from the Latvian national groundwater level database. After that, the authors adjusted ARMA and ARIMA models to observed and reconstructed data and, evaluated, how the AIC values behaved, since their lowest values indicate the best fit of the models to the data. Using 605 series, the adjusted models evaluation showed, for most series, that the AIC values were significantly improved for the reconstructed series.

Other studies used methodologies to impute missing data and, after that, used the approach of interest in their analyses. For example, Gill et al. (2007)Gill, M. K., Asefa, T., Kaheil, Y., & McKee, M. (2007). Effect of missing data on performance of learning algorithms for hydrologic predictions: implications to an imputation technique. Water Resources Research, 43(7), 2006WR005298. http://doi.org/10.1029/2006WR005298.
http://doi.org/10.1029/2006WR005298...
observed that groundwater level predictions using Neural Networks and Support Vector Machines after missing’s imputation with the least squares method were close to the observed data performance. Chen et al. (2018)Chen, L., Xu, J., Wang, G., Liu, H., Zhai, L., Li, S., Sun, C., & Shen, Z. (2018). Influence of rainfall data scarcity on non-point source pollution prediction: implications for physically based models. Journal of Hydrology, 562, 1-16. http://doi.org/10.1016/j.jhydrol.2018.04.044.
http://doi.org/10.1016/j.jhydrol.2018.04...
evaluated the influence of imputation missing precipitation data on the performance of hydrological nonpoint pollution (H/NPS) forecasting using the SWAT model in a case study for Daning River watershed, Three Gorges Reservoir Region, China. The authors concluded that the EMB imputation methodology (a combination of the EM and Bootstrap algorithm) was better to the performances of the other two methods analyzed (data augmentation algorithm and meteorological generator).

Although imputation methods application before using interest hydrological models has been recurrently observed, a significant gap in the literature lies in the assessment of their impact on the SARIMA model specific performance. Therefore, this study aims to investigate how different missing data imputation approaches influence both the fit and forecasting ability of the SARIMA time series model. This analysis will be carried out, using average monthly flow data from the Doce river, Brazil, as a case study.

Results obtained in this study have the potential to contribute to improving flow forecasts quality. By offering specific insights into how different imputation techniques affect SARIMA model performance, it will be possible to make more consistent decisions in water resources management. Furthermore, this research will contribute to reducing the uncertainty associated with hydrological forecasts, providing a more solid basis for strategic and operational decision-making.

METHODOLOGY

Study area

Doce River watershed (Figure 1) is located in the Southeast Region of Brazil, situated in the states of Minas Gerais and Espírito Santo, between the parallels 17°45’ and 21°15’ of south latitude and the medians 39°55’ e 43°45’ of west longitude. The Doce River has total extension of 853 km and drainage area of 83,465 km2 (Coelho, 2007Coelho, A. L. N. (2007). Alterações hidrogeomorfológicas no médio-baixo Rio Doce/ES (Tese de doutorado). Departamento de Geografia, Instituto de Geociências, Universidade Federal Fluminense, Rio de Janeiro. Retrieved in 2022, October 20, from http://www.dominiopublico.gov.br/pesquisa/DetalheObraForm.do?select_action=&co_obra=157909
http://www.dominiopublico.gov.br/pesquis...
), of which 86% belongs to the Minas Gerais State and the remainder (14%) to the Espírito Santo State, being, therefore, an interstate watershed.

Figure 1
Location of the Doce River Watershed.

The Doce River Watershed population is estimated in 3.5 million inhabitants, distributed in 228 municipalities, being 200 of Minas Gerais and 28 of Espírito Santo. The economic activity in the watershed is very diversified: (i) in farming, stands out traditional crops, coffee culture, sugar cane, beef and dairy cattle breeding, pig farming, among others; (ii) in the agroindustry, it's highlighted the production of sugar and alcohol; (iii) has the largest steelmaking complex of Latin America, to which are associated mining and reforestation companies; (iv) additionally, stand out cellulose and dairy industries, commerce and services related to industrial complexes, as well as electric power generation, with great potential for exploitation (Comitê da Bacia Hidrográfica do Rio Doce, 2022Comitê da Bacia Hidrográfica do Rio Doce – CBH-Doce. (2022). A bacia do Rio Doce: caracterização da Bacia. Governador Valadares. Retrieved in 2023, November 01, from https://www.cbhdoce.org.br/institucional/a-bacia
https://www.cbhdoce.org.br/institucional...
).

Data

The monthly average flow time series data cover the period from 1987 to 2011, totaling 25 years of observations (or 300 months), obtained through the Hydrological Information System (Brasil, 2022Brasil. Agência Nacional de Águas – ANA. (2022). HIDROWEB: sistema de informações hidrológicas. Brasília. Retrieved in 2023, November 1, from http://hidroweb.ana.gov.br/default.asp
http://hidroweb.ana.gov.br/default.asp...
), of the National Water Agency and Basic Sanitation (ANA). The Colatina station was used to carry out the imputations, and in cases of multivariate imputation methodologies, support stations were utilized. The support stations were standardized following the same base period (1987 to 2011), and as a prerequisite, the correlation between stations was evaluated, mainly for the central study object station (Colatina). The selected stations are identified in this study as: Fazenda Cachoeira D’Antas (E1), Cachoeira dos Óculos Montante (E2), Belo Oriente (E3), Governados Valadares (E4), Tumiritinga (E5) and Colatina (E6). Figure 1 shows the station spatialization along the watercourse. Information regarding the global positioning characteristics and the stations' drainage area and the drainage area are shown in Table 1.

Table 1
Stations selected for the study.

As a precept for multivariate imputation methods, it is necessary that the data present homogeneity between them. The Pearson correlation coefficient (ρ) between stations can be seen in Table 2. The lowest correlation value (0.9373) was found between stations E1 and E6, which was expected because they are further away from each other. The high values enable the multivariate imputation methodologies use and that one station can provide reliable information when imputing missing data from another.

Table 2
Pearson correlations for monthly mean flow variable data between stations.

Imputation of missing data

Imputation methodologies can be classified in two ways: by the imputation amount and the need to use auxiliary series or not. When it comes to the imputation type, single imputation is that which occurs when missing data are imputed only once). In turn, multiple imputation occurs when missings undergoes numerous imputations and, followed by inference analysis, the appropriate value for imputation is defined. When the imputation method only requires the series of interest in generating information to conduct the imputation, the technique is called univariate. In cases where support series are necessary to carry out the imputation in the series of interest, it is called a multivariate procedure.

The imputation methods utilized in this study were grouped as follows: (i) single univariate imputation, represented by the arithmetic mean (AM), median (M) and interpolations (Spline and Stineman) methodologies; (ii) single multivariate imputation, which refers to Regional Weighting (RW) and Multiple Linear Regression (MLR); and finally, (iii) multiple multivariate imputation, represented by the multiple imputation (IM) and maximum likelihood (ML) methodologies.

Methods of single univariate imputation

Understood as basic approaches for imputing failures, missing data are replaced by mean or median, attributed using the data present (or remaining) in the series of interest. Equations 1 and 2 represent, respectively, AM and M.

x ¯ = 1 n i = 1 n x i (1)
m x = x n + 1 2 i f n i s a n o d d n u m b e r x n 2 + x n 2 + 1 2 i f n i s a n e v e n n u m b e r (2)

In Equation 1, x¯ represents the mean and in Equation 2, mx represents the median. In both equations, n is the quantity of data.

Methods of single multivariate imputation

The equations for the MLR and RW methodologies, are represented, respectively, by Equations 3 and 4.

X = β 0 + i = 1 n β i Y i + ε (3)

In Equation 3, X represents the series dependent of the linear equation; β0 is the linear coefficient vector; βi represents the angular coefficients; Yi denotes the independent station; and ε are the residuals of the model.

X = 1 n i = 1 n N X N i D i (4)

In Equation 4, NX and Ni denote, respectively, the monthly mean flow data for the station with missing data to be imputed and the monthly mean flow of order “i” of the neighbor station (m3.s-1); Di denotes the observed values of order “i” in the neighbor stations during the month of occurrence in the station with the data to be imputed (m3.s-1); and n is the number of neighbor stations considered.

Imputation methods by interpolation

In this study two advanced approaches of nonlinear interpolation were applied. The first one, called Spline interpolation, is applied over a Spline function defined in Equation 5.

S x = P 0 x , x , τ l ; P j x , x τ j , τ j + 1 , j = 1, r 1 ; P 0 x , x τ r , ; (5)

where S:, P0x, P2x,Prx is a sequence of cubic polynomials, and τl<τ2<<τr is a sequence of real numbers called knots of the Spline space.

The second, called Stineman interpolation, utilizes the rational interpolation that is intended to provide better results than Spline interpolation in the case of sudden changes in slope. The equating of Stineman interpolation is described in the following according to the work of Demirhan & Renwick (2018)Demirhan, H., & Renwick, Z. (2018). Missing value imputation for short to mid-term horizontal solar irradiance data. Applied Energy, 225, 998-1012. http://doi.org/10.1016/j.apenergy.2018.05.054.
http://doi.org/10.1016/j.apenergy.2018.0...
. Consider xj and yj to be rectangular coordinates of the j-th point on a curve and let y´j be the slope of the curve at the j-th point for j=1,, n and xj<xj+1 for j = 1, ,n1. Thereafter, the Stineman interpolation is applied to calculate the interpolated value of y via Algorithm 1.

Algorithm 1:

  1. Given x that satisfies the condition xjxxj+1, calculate the inclination of the line segment that joining two points by sj=yj+1yjxj+1xj.

  2. Calculate the ordinate corresponding to x by y0 = yj + sj xxj.

  3. Calculate the vertical distance of the point xy0 to a line through xjyj with slope y´j by Δyj= yj+y´jxxjy0 for j and j+1.

  4. Calculate the interpolated value by y=y0ΔyjΔyj+1xxj+1+xxj/ΔyjΔyj+1xj+1xjifΔyjΔyj+1<0.

To implement Algorithm 1, one needs to know the values of the slopes y´j. If these are not initially known, it is necessary to apply Algorithm 2 to perform the calculation of theses slopes. For this, let I, J and K be any three consecutive points that satisfy IJ´ > y´j > JK´ or IJ´< y´j< JK´, where .´ denotes the slope of the internal segment of the curve.

Algorithm 2:

  1. For internal points, the slope is calculated by yj´=yjyi xkxj2+ykyj2+ykyj xjxi2+yjyi2xjxi xkxj2+ykyj2+xkxj xjxi2+yjyi2.

  2. For final points, calculate the slope for the final point m by ym´=2syj´, where s is the slope of the line segment that joins the points J and the final points.

Method of multiple multivariate imputation

The IM methodology, proposed by Rubin (1987)Rubin, D. B. (1987). Procedures with nonignorable nonresponse. In: D. B. Rubin (Ed.), Multiple imputation for nonresponse in surveys (pp. 202-240). New York: John Wiley & Sons. http://doi.org/10.1002/9780470316696.ch6.
http://doi.org/10.1002/9780470316696.ch6...
, appeared as a more flexible manner and alternative to the maximum likelihood methods when there exists a large quantity of missing data (Schafer & Graham, 2002Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147-177. http://doi.org/10.1037/1082-989X.7.2.147.
http://doi.org/10.1037/1082-989X.7.2.147...
). This technique enables the inclusion of uncertainty of imputation in the results, which is the major limitation associated to the single imputation techniques (Nunes et al., 2009Nunes, L. N., Klück, M. M., & Fachel, J. M. G. (2009). Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública, 25(2), 268-278. http://doi.org/10.1590/S0102-311X2009000200005.
http://doi.org/10.1590/S0102-311X2009000...
). The technique is based on three steps: (i) in the first, are generated m sets of imputed data; (ii) utilizing standardized procedures, m analyses are made in the set of imputed data; and, (iii) the results of the m analyses are combined to obtain the necessary inferences to choose the values considered in the final imputation.

In the first steps, the imputation techniques have to preserve the relation of the missing and present observations and take into account the missing data pattern and the mechanisms of absence. Having realized the m imputations, the step (ii) of the IM can be performed, i.e., the m databases are analyzed by traditional methods of analysis that in this study, is the mean predictive correspondence. As an outcome, it is possible to combine the results and utilize the rules proposed by Rubin (Rubin, 1987Rubin, D. B. (1987). Procedures with nonignorable nonresponse. In: D. B. Rubin (Ed.), Multiple imputation for nonresponse in surveys (pp. 202-240). New York: John Wiley & Sons. http://doi.org/10.1002/9780470316696.ch6.
http://doi.org/10.1002/9780470316696.ch6...
; Nunes et al., 2009Nunes, L. N., Klück, M. M., & Fachel, J. M. G. (2009). Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública, 25(2), 268-278. http://doi.org/10.1590/S0102-311X2009000200005.
http://doi.org/10.1590/S0102-311X2009000...
). In step (iii), from each analysis, the estimates for the interest parameter Q are obtained. Let, Q^j for j = 1, 2,...,m, for Q equal to any scalar measure, the combined estimate will be the mean of the individual estimates (Nunes et al., 2009Nunes, L. N., Klück, M. M., & Fachel, J. M. G. (2009). Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública, 25(2), 268-278. http://doi.org/10.1590/S0102-311X2009000200005.
http://doi.org/10.1590/S0102-311X2009000...
), according to Equation 6.

Q ¯ = 1 m j = 1 m i = 1 m Q ^ j (6)

For the combined variance, it is necessary to calculate the variance inside the imputations (Equation 7) and the variance between the imputed databases (Equation 8).

U ¯ = 1 m j = 1 m (7)
B = 1 m 1 j = 1 m i = 1 m Q ^ j Q ¯ 2 (8)

Lastly, the total variance, which is the combined variance, is given by Equation 9.

T = U ¯ + 1 + 1 m B (9)

The main idea of the ML methodology is estimating the parameters that would maximize the probability of the distribution of the observed data (with or without the presence of missing values), allowing then, estimate the missing data. The likelihood function is represented as in Equation 10.

L ( Θ | Y ) = Π f ( Y i | Θ ) (10)

where f(Y|Θ) is a density function that describes the model responsible for generating the data, Y are the data and Θ is a set of unknown parameters that rules the distribution of Y, from which it is known that ir belongs to Ωθ. That is, f(Y|Θ) is a function of the parameter vector Θ Ωθ given Y, proportional to the density function. Establishing the model and the parameter vector Θ, f(Y|Θ) can be used to sample missing values (Allison, 2002Allison, P. D. (2002). Quantitative applications in the social sciences: missing data. Thousand Oaks: SAGE Publications.).

Due to the impositions associated with the analytical process, numerical methods are useful in the parameter estimation stage. Consolidating itself as an efficient alternative, the EM algorithm is an iterative procedure that consists in repeating two steps: estimation (E) and maximization (M). Consider a dataset with observed and missing data, with density function given by p(yc|Θ) whereby, lΘ,yc represents the log-likelihood function of the complete and observed data. The algorithm suggests that initially one finds the expected value of the logarithm of the likelihood function (step E) and next its maximum (step M), according to Equation 11:

Q ( Θ | Θ k = E l c ( Θ , y c | y c , Θ k ) (11)

In the step M, is aimed to find Θk+1 that maximizes QΘ|Θk. The process is repeated until convergence is achieved, by means of a stop criterion, such as Θk+1Θk<ε, when the difference between the estimated values of the parameters in two consecutive iterations is lower than the pre-established.

Mechanisms of missing data

The mechanisms of missing data describe the relations between lost values and the probability of absence, informing the cause of missing in the data. Little & Rubin (2002)Little, R. J., & Rubin, D. B. (2002). Statistical analysis with missing data. Hoboken: John Wiley & Sons. http://doi.org/10.1002/9781119013563.
http://doi.org/10.1002/9781119013563...
define three general theoretical mechanisms extensively utilized in the literature, known as (i) Missing Completely at Random (MCAR), (ii) Missing at Random (MAR) and (iii) not Missing at Random (NMAR). According to Hamzah et al. (2022)Hamzah, F. B., Mohamad Hamzah, F., Mohd Razali, S. F., & El-Shafie, A. (2022). Multiple imputations by chained equations for recovering missing daily streamflow observations: a case study of Langat River basin in Malaysia. Hydrological Sciences Journal, 67(1), 137-149. http://doi.org/10.1080/02626667.2021.2001471.
http://doi.org/10.1080/02626667.2021.200...
, in flow data studies, missing data is described as MCAR due to the episode that results in failures not being due to influences from external variables. This mechanism is called ignorable, and there is, therefore, no need for its incorporation in the failure estimation process.

Missing data also can be characterized in patterns (McKnight et al., 2007McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing data: a gentle introduction. New York: Guilford Press.). The main patterns of missing data are discussed by Hamzah et al. (2022)Hamzah, F. B., Mohamad Hamzah, F., Mohd Razali, S. F., & El-Shafie, A. (2022). Multiple imputations by chained equations for recovering missing daily streamflow observations: a case study of Langat River basin in Malaysia. Hydrological Sciences Journal, 67(1), 137-149. http://doi.org/10.1080/02626667.2021.2001471.
http://doi.org/10.1080/02626667.2021.200...
known as (i) general or random pattern, (ii) unitary non-response pattern and (iii) monotonous pattern. Due to the characteristics of missing data in flow variables being random and, very frequently, with a block pattern, patterns i and ii are of interest in the present study. Once the mechanism and pattern of missing data occurrence were established, artificial failures were simulated following the proportions of 5, 20 and 40% failures. In the block pattern, the following gaps were created:

  • 5%: a block of 12 missing observations;

  • 20%: four blocks of 12 missing observations in each;

  • 40%: eight blocks of 12 missing observations in each.

Considering the perspective of generating consistent results, a total of one thousand (1000) simulation repetitions were replicated for each pattern and percentage of missing data.

SARIMA model

The Seasonal Autoregressive Integrated Moving Average (SARIMA) model is useful because it incorporates the seasonality component in its modeling process, a characteristic present in the Doce River flow regime (Pinto et al., 2015Pinto, W. P., Lima, G. B., & Zanetti, J. B. (2015). Previsão de regimes de vazões médias mensais do rio Doce, Colatina-Espírito Santo. Ciência e Natura, 37(3), 1-11. http://doi.org/10.5902/2179460X17143.
http://doi.org/10.5902/2179460X17143...
; Bleidorn et al., 2019Bleidorn, M. T., Pinto, W. P., Braum, E. S., Lima, G. B., & Montebeller, C. A. (2019). Modelagem e previsão de vazões médias mensais do rio Jucu, ES, utilizando o modelo SARIMA. Irriga, 24(2), 320-335. http://doi.org/10.15809/irriga.2019v24n2p320-335.
http://doi.org/10.15809/irriga.2019v24n2...
). Let Zt=Zt;t be a linear process represented by Equation 12:

Φ B S ϕ B d Z t = Θ B S θ B ε t (12)

where s is called seasonal period of the process and εt~WN0,σε2, in which εt~WN is white noise WN, defined as a sequence of uncorrelated random variables with zero mean and constant variance over time (Wei, 2006Wei, W. W. S. (2006). Time series analysis. In T. D. Little (Ed.), The Oxford handbook of quantitative methods in psychology: statistical analysis (Vol. 2). Oxford: Oxford University Press.). The operator d, with d = d,D and d,D non-negative integer numbers that represent, respectively, the number of simple and seasonal differences applied to the process Zt, defined according to Equation 13:

d = 1 B d 1 B S D . (13)

We have that ΦzS=1Σi=1PΦizis,ϕz= 1Σj=1PΦjzj,ΘzS=1Σk=1QΘkzks and θz=1Σl=1qΘlz1 are polynomials of order P,p,Q,q , respectively, withz , where, represents the set of complex numbers, denotes the set of natural numbers, and Φi, ϕj, Θk and θl are sequences of real numbers. The process Zt with representation given in (12) is denominated seasonal multiplicative ARIMA (SARIMA) of order p,d,q×P,D,Qs.

Modeling methodology

The SARIMA model is built on the modeling methodology proposed by Box and Jenkins, based on an iterative cycle that contains four steps: (i) identification; (ii) estimation; (iii) residual adequacy; and, (iv) forecasting. In the first step, resources are used to understand the behavior of the series, such as visual analysis of it, of the correlograms (Autocorrelation Function – ACF and Partial Autocorrelation Function – PACF), of the spectral decomposition and the use of tests for detecting trend and seasonality. In this step, adjustment indicators are used, mainly the Akaike Information Criterion (AIC) (Akaike, 1974Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723. http://doi.org/10.1109/TAC.1974.1100705.
http://doi.org/10.1109/TAC.1974.1100705...
). In the second step, the parameters of the candidate models are estimated by several approaches, being recurrent the use of the maximum likelihood. In the third step, it is verified that the model is adequate via residual analysis. Having satisfied the above conditions, the model is considered in the steps of adjustment and forecasting.

Semi-automatization of the model choice

Taking into consideration that the model choice is extremely important, the founds in Pinto et al. (2015)Pinto, W. P., Lima, G. B., & Zanetti, J. B. (2015). Previsão de regimes de vazões médias mensais do rio Doce, Colatina-Espírito Santo. Ciência e Natura, 37(3), 1-11. http://doi.org/10.5902/2179460X17143.
http://doi.org/10.5902/2179460X17143...
and Bleidorn et al. (2019)Bleidorn, M. T., Pinto, W. P., Braum, E. S., Lima, G. B., & Montebeller, C. A. (2019). Modelagem e previsão de vazões médias mensais do rio Jucu, ES, utilizando o modelo SARIMA. Irriga, 24(2), 320-335. http://doi.org/10.15809/irriga.2019v24n2p320-335.
http://doi.org/10.15809/irriga.2019v24n2...
studies allows to infer the presence of the seasonality component and the characteristic of non-stationarity in the flow series for the same time series studied in this research. Hence, to bypass the non-linear structures and make the series stationary (Box & Jenkins, 1976Box, G. E., & Jenkins, G. M. (1976). Time series analysis: forecasting and control. Hoboken: John Wiley & Sons.), it was defined the differencing d and D = 1, and the data transformation via natural logarithm. Pinto et al. (2015)Pinto, W. P., Lima, G. B., & Zanetti, J. B. (2015). Previsão de regimes de vazões médias mensais do rio Doce, Colatina-Espírito Santo. Ciência e Natura, 37(3), 1-11. http://doi.org/10.5902/2179460X17143.
http://doi.org/10.5902/2179460X17143...
study allows to verified, for the same studied series (E6), a behavior of the correlograms that indicated an autoregressive part (AR) of order 1 and of moving average (MA) of order 2. Given this information, it was possible to define the initial conditions for the semi-automation of model choice.

The semi-automatization was established to remove the subjectivity in the model choice, consequently avoiding the distortion of the effect of using reconstructed databases on modeling and forecasting performance. This formulation consisted of the following steps: (a) all possible combinations resulting from models with a maximum of three parameters in each part of the model (AR and MA of the ordinal a seasonal parts), and the fixed unit differentiation (d and D = 1), considering that this conditioning of at most three parameters of each model component, follows the parsimony principle. The combination of these configurations allowed to generate a total of 254 candidate models. In the next step (b) was to order in increasing magnitude the AIC values. In the third step (c), the analyzed model parameters necessarily should have significance in the level of 99,5% (p-valor < 0,05). Finally, the last step (d), consisted of the analysis of normality and non-autocorrelation. The chosen model was the one that combined the lowest AIC value, significant parameters and normal and non-autocorrelated residuals.

Fitting the model to observed data

Initially, the observed series was used to adjust the SARIMA model, allowing it to be used as a parameter of adjustment analysis to the reconstructed databases. The data were divided into two groups: the first, for adjustment, from January 1987 to December 2011, resulting in 300 months of observations, and the second, from January to December 2012 to evaluate the forecast accuracy one step ahead (12 months). Table 3 presents the AIC and Bayesian (BIC) values (Akaike, 1978Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. In E. Parzen, K. Tanabe & G. Kitagawa (Eds.), Selected papers of Hirotugu Akaike (pp. 275-280). New York: Springer.) and the normality and non-autocorrelation tests of residuals. Tables 4 and 5 present, respectively, the model's adjustment and prediction performance indicators values.

Table 3
Statistical tests of normality and correlation of residuals from the model adjusted to observed data.
Table 4
Quality-of-fit measurements of selected models.
Table 5
Forecast quality measures of selected models.

Performance indicators

The model performance in the adjustment and forecasting stages for the different imputed datasets, were evaluated using the performance indicators Absolute BIAS, Root Mean Squared Error (RMSE), Mean Absolut Percentual Error (MAPE), Nash-Sutcliffe (NSE) (Nash & Sutcliffe, 1970Nash, J. E., & Sutcliffe, J. V. (1970). River flow forecasting through conceptual models part I: a discussion of principles. Journal of Hydrology, 10(3), 282-290. http://doi.org/10.1016/0022-1694(70)90255-6.
http://doi.org/10.1016/0022-1694(70)9025...
), concordance index (d2) and Pearson correlation coefficient (ρ), presented by the Equations 14, 15, 16, 17 and 18, respectively.

The Absolute BIAS quantifies the estimates of underestimation and overestimation with respect to the mean observations.

1 n i = 1 N ( x i x ˜ i ) (14)

The RMSE is the quadratic difference between the forecasted or adjusted values and their respective true values. In general, lower values indicate better performance.

RMSE = 1 n i = 1 N ( x i x ˜ i ) 2 (15)

The MAPE is a measure of precision of adjustment and forecasting of a model. Lower values indicate good performance.

MAPE = 1 n i = 1 n x i x ˜ i x i (16)

ρ describes the relation between the variables and the closer it is to the extremes (-1 or 1) the stronger the correlation is.

ρ = i = 1 N ( x i x ¯ ) y i y ¯ i = 1 N ( x i x ¯ ) 2 i = 1 N ( y i y ¯ ) 2 (17)

NSE is used to evaluate the predictive capacity of the hydrological models. The values of NSE vary between -∞ and 1, with values closer to 1 representing good performance.

N S E = 1 i = 1 n ( x i y i ) 2 i 1 n ( x i x i ¯ ) 2 (18)

d2 reflects the concordance between adjustment/forecasting of the model with the observed data, with the desired value of 1 indicating perfect concordance.

d 2 = 1 i = 1 n x i x ˜ i 2 i = 1 n x i y i ¯ + x ˜ i y i ¯ 2 (19)

where: n represents the number of observations in the phase of adjustment or forecast; x indicates the observed values; y the adjusted or forecasted values; x¯ is the mean of the observations and y¯ the estimated values in the phase of adjustment or forecasting of the models.

Computational resources

Electronic spreadsheets were used for organizing the data and the software R (R Development Core Team, 2021R Development Core Team. (2021). R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.) was utilized to simulate missing data and its reconstructions, adjustment of the Box and Jenkins methodology using the SARIMA model and the analysis of its performance. The imputeTS package was used to carry out the imputations using the Stineman and Spline methodologies and the mice and mtsdi packages to carry out the imputations using the IM and ML methodologies, respectively.

RESULTS AND DISCUSSION

Descriptive analysis of data

For an initial understanding of the time series under study, some descriptive measures are presented in Table 6. The monthly mean flow was of 805.09 m3.s-1 with standard deviation of 578.48 m3.s-1 and coefficient of variation of 71.85%. The high values of the standard deviation and coefficient of variation indicate that the mean has little representativeness, associated to the seasonal characteristic of the series, as confirmed by the works of Pinto et al. (2015)Pinto, W. P., Lima, G. B., & Zanetti, J. B. (2015). Previsão de regimes de vazões médias mensais do rio Doce, Colatina-Espírito Santo. Ciência e Natura, 37(3), 1-11. http://doi.org/10.5902/2179460X17143.
http://doi.org/10.5902/2179460X17143...
and Bleidorn et al. (2019)Bleidorn, M. T., Pinto, W. P., Braum, E. S., Lima, G. B., & Montebeller, C. A. (2019). Modelagem e previsão de vazões médias mensais do rio Jucu, ES, utilizando o modelo SARIMA. Irriga, 24(2), 320-335. http://doi.org/10.15809/irriga.2019v24n2p320-335.
http://doi.org/10.15809/irriga.2019v24n2...
for the same series under study. The series has asymmetry of 1.82 and kurtosis of 3.59, indicating that the distribution is not normal and has heavier tails. The seasonality property of the studied series can be visualized in Figures 2 and 3, with a well-defined interannual variability pattern with periods of greater flows magnitude (November to April) followed by periods of smaller flows magnitude (May to October).

Table 6
Descriptive measures of Doce River flow.
Figure 2
Graphic of the mean flow time series of Doce River.
Figure 3
Graphical analysis of seasonality of monthly mean flows time series of Doce River.

Effect of the imputation methodologies in the quality of adjustment and forecasting

Tables 7 and 8 show the values of the performance indicators of adjustment of the model to the imputed series with random and in block pattern of missing data, respectively. In general, it is possible to verify that, while the percentages of imputed failures increase, the model presents loss of quality, with emphasis on the high values of the indicators BIAS, RMSE, MAPE to lows of the NSE, d2 and ρ for the adjustment of the reconstructed series by RW, MLR, M and AM methodologies in both patterns of missing data. For the block pattern, Spline and Stine techniques also limited the model performance. However, it is possible to verify that, even under a critical scenario of failures imputation (40%), in both patterns of missing data, the IM and ML methodologies are efficient to the point that there are no significant changes in the model quality indicators, with low values of BIAS, RMSE, MAPE and high of NSE, d2 and ρ.

Table 7
Quality measures of adjustment for the imputed data with random pattern.
Table 8
Quality measures of adjustment for the imputed data with in block pattern.

Tables 9 and 10 show the performance indicators in the forecasting step of the model, considering the missing data imputations with random and in block pattern, respectively. Due to the loss of quality in the adjust step, it was expected that the model’s performance would be compromised in the forecasting stage for the series reconstructed by RW, MLR, M and AM in both failure patterns and by Stine and Spline methodologies for the pattern in block. Just like in the adjustment step, it is possible to verify low values of the performance indicators BIAS, RMSE, MAPE and high NSE, d2 and ρ, for the forecast performance of the model fitted to the series reconstructed by the Stine and Spline methodologies for the random loss pattern and for IM and ML methodologies for both missing data patterns.

Table 9
Quality measures of forecasting for imputed data with random pattern.
Table 10
Quality measures of forecasting for imputed data with in block pattern.

Relative difference of the quality of adjustment and forecasting

To facilitate the understanding of the behavior of the quality measures, it is shown in Tables 11 and 12 the relative difference of fit performance measures and in Tables 13 and 14, the relative difference for forecasting, considering the performance indicators obtained by the reference model (adjusted to observed data, see Tables 5 and 6).

Table 11
Relative difference in the quality of adjustment of the model to the observed and imputed data with random pattern.
Table 12
Relative difference in the quality of adjustment of the model to the observed and imputed data with in block pattern.
Table 13
Relative difference in the quality of forecasting of the model to the observed and imputed data with random pattern.
Table 14
Relative difference in the quality of forecasting of the model to the observed and imputed data with block pattern.

It is generally inferred that, in the adjustment stage, the quality measures Bias, RMSE and MAPE increase as the proportions of imputed failures were increased. For the model adjusted to the series reconstructed by Spline and Stine methodologies with the random pattern of missing data was observed that the quality of the performance measures increase when the proportions of imputed failures were higher. Considerable quality losses in tuning performance can be attributed to the univariate (AM and M) and multivariate (PW and RLM) single imputation methods. To exemplify, the model adjustment to the series imputed by MLR and RW methodologies showed higher losses of quality, evidenced by the high values of the indicators Bias, RMSE, MAPE and by the decrease in quality of the indicators NSE, d2 e ρ.

In the most critical data reconstruction scenario (40%), relative differences in the model adjustment stage reached values of up to 1,233.44%, 106.28%, 171.40%, 205.59%, 65.48% and 0.94% of the respective indicators Bias, RMSE, MAPE, NSE, d2 and ρ, for the random pattern of the series reconstructed by the MLR methodology and up to 602.14%, 202.99%, 174.94%, 516.64%, 41.33 and 43.56% for the model adjusted to the series reconstructed by the Spline methodology under the block missing data pattern. Such findings allows to infer that these imputation methodologies do not preserve the series characteristics and compromise the SARIMA model adjustment performance and, consequently, the forecast performance (Table 12).

Studies found that RW, M and AM methodologies tend to underestimate the data variance. By their nature, AM and M are measures of central tendency of the series, being the median preferred when the data get farther of the normal distribution. The authors Ben Aissia et al. (2017)Ben Aissia, M. A. B., Chebana, F., & Ouarda, T. B. (2017). Multivariate missing data in hydrology–Review and applications. Advances in Water Resources, 110, 299-309. http://doi.org/10.1016/j.advwatres.2017.10.002.
http://doi.org/10.1016/j.advwatres.2017....
, Gao et al. (2018)Gao, Y., Merz, C., Lischeid, G., & Schneider, M. (2018). A review on missing hydrological data processing. Environmental Earth Sciences, 77(2), 47. http://doi.org/10.1007/s12665-018-7228-6.
http://doi.org/10.1007/s12665-018-7228-6...
and Kabir et al. (2020)Kabir, G., Tesfamariam, S., Hemsing, J., & Sadiq, R. (2020). Handling incomplete and missing data in water network database using imputation methods. Sustainable and Resilient Infrastructure, 5(6), 365-377. http://doi.org/10.1080/23789689.2019.1600960.
http://doi.org/10.1080/23789689.2019.160...
agree that, despite the simplicity of the methods, reconstruct the loss value using a constant does not reflect the variation that would probably occur if the data would be observed. It is considered important to evaluate the maintenance of the use of RW and MLR methodologies, for limiting the performance of the SARIMA model and, mainly because these are techniques usually applied to reconstruct missing data in hydrological series.

In turn, the reconstructed series by the single univariate (Spline and Stine) and multiple multivariate (IM and ML) imputation methodologies resulted in a good model fitting performance even under critical scenario of losses (40% - exception to Spline applied to the pattern of missing data in block). Small losses of quality of the model performance indicators adjusted to the series reconstructed by the methodologies Stine, IM and ML point up that these approaches preserves the series characteristics (Nunes et al., 2009Nunes, L. N., Klück, M. M., & Fachel, J. M. G. (2009). Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública, 25(2), 268-278. http://doi.org/10.1590/S0102-311X2009000200005.
http://doi.org/10.1590/S0102-311X2009000...
; Junger & Ponce de Leon, 2015Junger, W. L., & Ponce de Leon, A. (2015). Imputation of missing data in time series for air pollutants. Atmospheric Environment, 102, 96-104. http://doi.org/10.1016/j.atmosenv.2014.11.049.
http://doi.org/10.1016/j.atmosenv.2014.1...
; Bleidorn et al., 2022Bleidorn, M. T., Pinto, W. P., Schmidt, I. M., Mendonça, A. S. F., & Reis, J. A. T. D. (2022). Methodological approaches for imputing missing data into monthly flows series. Revista Ambiente & Água, 17(2), e2795. http://doi.org/10.4136/ambi-agua.2795.
http://doi.org/10.4136/ambi-agua.2795...
).

Just as in the adjustment step, the forecasting quality indicators indicate that the series reconstructed by the imputation methodologies AM, M, RW, MLR compromises the model performance, both for the scenario of random missing data and for the block pattern. The model adjusted to the reconstructed series by MLR in the 40% of random failures scenario resulted in relative difference of 1,606.32%, 62.71%, 78.90%, 173.70% e 67.78% in the performance indicators Bias, RMSE, MAPE, NSE and d2, respectively, and for the reconstructed series by Spline with the block pattern of failures resulted in relative difference values of 1,592.76%, 70.76%, 158.80%, 202.01% e 16.79% for the same indicators.

In cases of series reconstruction using the methodologies Spline (only to random missing data), Stine, IM and ML, as observed in the adjustment phase, minimal quality losses occurred. Even under the most critical data loss scenarios, the model obtained low relative differences, in both missing data patterns, when considering the series reconstructed by the IM and ML methodologies.

Figures 4 and 5 show the graphics of the simulations for the missing data proportion of 40% for the random and in block patterns, respectively. Figures 6 and 7 display a visual comparison of the adjusted model to the observed (black lines) and imputed data (red lines). This comparison shows a good performance of the adjusted model to the reconstructed series by the methodologies Spline, Stine, IM and ML considering the random missing data pattern and Stine, IM and ML for the in-block pattern. The other methodologies, just like suggested the performance indicators, resulted in underestimation of the adjustment to the observed series.

Figure 4
Monthly mean flow time series of station E6, considering 40% of random missing data.
Figure 5
Monthly mean flow time series of station E6, considering 40% of in block missing data.
Figure 6
Observed monthly mean flow time series (black line) and model fitted by the imputed series (red line) at station E6, for each imputation methodology, considering 40% of random missing data.
Figure 7
Observed monthly mean flow time series (black line) and model fitted by the imputed series (red line) at station E6, for each imputation methodology, considering 40% of in block missing data.

CONCLUSIONS

This study aimed to evaluate the effect of the use of different missing data imputation methodologies on the fitting and forecasting performance of the SARIMA time series model, considering as a case study the monthly mean flow data of the Doce River in Southeastern Brazil. Different failure proportions simulations under random and block pattern were considered.

A semi-automated approach was considered in the model choice to avoid any subjectivity in its choice. Therefore, the SARIMA (1,1,1)(0,1,1)12 model was adjusted to the observed data and served as a quality parameter in the adjustment and forecasting of the series reconstructed by the imputation methodologies. The results indicated that, overall, the model loses quality in adjustment and forecasting as the percentage of imputed missing data increases, especially in the use of the model in the series reconstructed by the mean, regional weighting and multiple linear regression. Therefore, despite being usually used to impute missing data on hydrological variables, it is considered that these methodologies considerably reduce the quality of SARIMA time series model. In contrast, it was verified that the model quality was maintained when applied to the reconstructed series by the Spline (only for the random missing data pattern), Stine, Multiple Imputation and Maximum Likelihood methodologies. These methods, even under extreme conditions of data loss (40%), allowed to preserve the series characteristics.

Whereas data quality is crucial for successful employment of the SARIMA model, the finding results highlight that the quality of flow forecasts can be improved when the missing data processing is carried out using appropriate processing methodologies. The search for increased reliability of the results is useful in engineering projects, risk management, allocation of multiple uses and water between watersheds and in hydroelectric power generation. Now, it is known that the stage of processing missing data with appropriate methodologies must be anticipated before the use of the SARIMA model.

REFERENCES

  • Abudu, S., Cui, C. L., King, J. P., & Abudukadeer, K. (2010). Comparison of performance of statistical models in forecasting monthly streamflow of Kizil River, China. Water Science and Engineering, 3(3), 269-281. http://doi.org/10.3882/j.issn.1674-2370.2010.03.003
    » http://doi.org/10.3882/j.issn.1674-2370.2010.03.003
  • Aghelpour, P., Bahrami-Pichaghchi, H., & Varshavian, V. (2021). Hydrological drought forecasting using multi-scalar streamflow drought index, stochastic models and machine learning approaches, in northern Iran. Stochastic Environmental Research and Risk Assessment, 35(8), 1615-1635. http://doi.org/10.1007/s00477-020-01949-z
    » http://doi.org/10.1007/s00477-020-01949-z
  • Ahmad, I., Waseem, M., & Zhang, J. (2022). Developing monthly hydrometeorological timeseries forecasts to reservoir operation in a transboundary river catchment. Theoretical and Applied Climatology, 147(3-4), 1663-1674. http://doi.org/10.1007/s00704-021-03901-9
    » http://doi.org/10.1007/s00704-021-03901-9
  • Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716-723. http://doi.org/10.1109/TAC.1974.1100705
    » http://doi.org/10.1109/TAC.1974.1100705
  • Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. In E. Parzen, K. Tanabe & G. Kitagawa (Eds.), Selected papers of Hirotugu Akaike (pp. 275-280). New York: Springer.
  • Allison, P. D. (2002). Quantitative applications in the social sciences: missing data Thousand Oaks: SAGE Publications.
  • Apaydin, H., Sattari, M. T., Falsafian, K., & Prasad, R. (2021). Artificial intelligence modelling integrated with Singular Spectral analysis and Seasonal-Trend decomposition using Loess approaches for streamflow predictions. Journal of Hydrology, 600, 126506. http://doi.org/10.1016/j.jhydrol.2021.126506
    » http://doi.org/10.1016/j.jhydrol.2021.126506
  • Arriagada, P., Karelovic, B., & Link, O. (2021). Automatic gap-filling of daily streamflow time series in data-scarce regions using a machine learning algorithm. Journal of Hydrology, 598, 126454. http://doi.org/10.1016/j.jhydrol.2021.126454
    » http://doi.org/10.1016/j.jhydrol.2021.126454
  • Bayer, D. M., Castro, N. M. D. R., & Bayer, F. M. (2012). Modelagem e previsão de vazões médias mensais do rio Potiribu utilizando modelos de séries temporais. RBRH, 17(2), 229-239. http://doi.org/10.21168/rbrh.v17n2.p229-239
    » http://doi.org/10.21168/rbrh.v17n2.p229-239
  • Ben Aissia, M. A. B., Chebana, F., & Ouarda, T. B. (2017). Multivariate missing data in hydrology–Review and applications. Advances in Water Resources, 110, 299-309. http://doi.org/10.1016/j.advwatres.2017.10.002
    » http://doi.org/10.1016/j.advwatres.2017.10.002
  • Bleidorn, M. T., Pinto, W. P., Braum, E. S., Lima, G. B., & Montebeller, C. A. (2019). Modelagem e previsão de vazões médias mensais do rio Jucu, ES, utilizando o modelo SARIMA. Irriga, 24(2), 320-335. http://doi.org/10.15809/irriga.2019v24n2p320-335
    » http://doi.org/10.15809/irriga.2019v24n2p320-335
  • Bleidorn, M. T., Pinto, W. P., Schmidt, I. M., Mendonça, A. S. F., & Reis, J. A. T. D. (2022). Methodological approaches for imputing missing data into monthly flows series. Revista Ambiente & Água, 17(2), e2795. http://doi.org/10.4136/ambi-agua.2795
    » http://doi.org/10.4136/ambi-agua.2795
  • Box, G. E., & Jenkins, G. M. (1976). Time series analysis: forecasting and control Hoboken: John Wiley & Sons.
  • Brasil. Agência Nacional de Águas – ANA. (2022). HIDROWEB: sistema de informações hidrológicas. Brasília. Retrieved in 2023, November 1, from http://hidroweb.ana.gov.br/default.asp
    » http://hidroweb.ana.gov.br/default.asp
  • Chen, L., Xu, J., Wang, G., Liu, H., Zhai, L., Li, S., Sun, C., & Shen, Z. (2018). Influence of rainfall data scarcity on non-point source pollution prediction: implications for physically based models. Journal of Hydrology, 562, 1-16. http://doi.org/10.1016/j.jhydrol.2018.04.044
    » http://doi.org/10.1016/j.jhydrol.2018.04.044
  • Coelho, A. L. N. (2007). Alterações hidrogeomorfológicas no médio-baixo Rio Doce/ES (Tese de doutorado). Departamento de Geografia, Instituto de Geociências, Universidade Federal Fluminense, Rio de Janeiro. Retrieved in 2022, October 20, from http://www.dominiopublico.gov.br/pesquisa/DetalheObraForm.do?select_action=&co_obra=157909
    » http://www.dominiopublico.gov.br/pesquisa/DetalheObraForm.do?select_action=&co_obra=157909
  • Comitê da Bacia Hidrográfica do Rio Doce – CBH-Doce. (2022). A bacia do Rio Doce: caracterização da Bacia. Governador Valadares. Retrieved in 2023, November 01, from https://www.cbhdoce.org.br/institucional/a-bacia
    » https://www.cbhdoce.org.br/institucional/a-bacia
  • Dembélé, M., Oriani, F., Tumbulto, J., Mariéthoz, G., & Schaefli, B. (2019). Gap-filling of daily streamflow time series using Direct Sampling in various hydroclimatic settings. Journal of Hydrology, 569, 573-586. http://doi.org/10.1016/j.jhydrol.2018.11.076
    » http://doi.org/10.1016/j.jhydrol.2018.11.076
  • Demirhan, H., & Renwick, Z. (2018). Missing value imputation for short to mid-term horizontal solar irradiance data. Applied Energy, 225, 998-1012. http://doi.org/10.1016/j.apenergy.2018.05.054
    » http://doi.org/10.1016/j.apenergy.2018.05.054
  • Duarte, V. B. R., Silva, F. D. C. S., Souza, I. V., Silva, M. V. C., Almeida Sousa, H. G., Giongo, M., & Viola, M. R. (2019). Previsão de vazão na bacia hidrográfica do rio Manuel Alves da Natividade utilizando o modelo de séries temporais SARIMA. Journal of Biotechnology and Biodiversity, 7(4), 457-468. http://doi.org/10.20873/jbb.uft.cemaf.v7n4.duarte
    » http://doi.org/10.20873/jbb.uft.cemaf.v7n4.duarte
  • Fu, J., Zhong, P. A., Chen, J., Xu, B., Zhu, F., & Zhang, Y. (2019). Water resources allocation in transboundary river basins based on a game model considering inflow forecasting errors. Water Resources Management, 33(8), 2809-2825. http://doi.org/10.1007/s11269-019-02259-y
    » http://doi.org/10.1007/s11269-019-02259-y
  • Gao, Y., Merz, C., Lischeid, G., & Schneider, M. (2018). A review on missing hydrological data processing. Environmental Earth Sciences, 77(2), 47. http://doi.org/10.1007/s12665-018-7228-6
    » http://doi.org/10.1007/s12665-018-7228-6
  • Gill, M. K., Asefa, T., Kaheil, Y., & McKee, M. (2007). Effect of missing data on performance of learning algorithms for hydrologic predictions: implications to an imputation technique. Water Resources Research, 43(7), 2006WR005298. http://doi.org/10.1029/2006WR005298
    » http://doi.org/10.1029/2006WR005298
  • Giustarini, L., Parisot, O., Ghoniem, M., Hostache, R., Trebs, I., & Otjacques, B. (2016). A user-driven case-based reasoning tool for infilling missing values in daily mean river flow records. Environmental Modelling & Software, 82, 308-320. http://doi.org/10.1016/j.envsoft.2016.04.013
    » http://doi.org/10.1016/j.envsoft.2016.04.013
  • Hamzah, F. B., Mohamad Hamzah, F., Mohd Razali, S. F., & El-Shafie, A. (2022). Multiple imputations by chained equations for recovering missing daily streamflow observations: a case study of Langat River basin in Malaysia. Hydrological Sciences Journal, 67(1), 137-149. http://doi.org/10.1080/02626667.2021.2001471
    » http://doi.org/10.1080/02626667.2021.2001471
  • Hamzah, F. B., Mohd Hamzah, F., Mohd Razali, S. F., Jaafar, O., & Abdul Jamil, N. (2020). Imputation methods for recovering streamflow observation: a methodological review. Cogent Environmental Science, 6(1), 1745133. http://doi.org/10.1080/23311843.2020.1745133
    » http://doi.org/10.1080/23311843.2020.1745133
  • Junger, W. L., & Ponce de Leon, A. (2015). Imputation of missing data in time series for air pollutants. Atmospheric Environment, 102, 96-104. http://doi.org/10.1016/j.atmosenv.2014.11.049
    » http://doi.org/10.1016/j.atmosenv.2014.11.049
  • Kabir, G., Tesfamariam, S., Hemsing, J., & Sadiq, R. (2020). Handling incomplete and missing data in water network database using imputation methods. Sustainable and Resilient Infrastructure, 5(6), 365-377. http://doi.org/10.1080/23789689.2019.1600960
    » http://doi.org/10.1080/23789689.2019.1600960
  • Khodakhah, H., Aghelpour, P., & Hamedi, Z. (2022). Comparing linear and non-linear data-driven approaches in monthly river flow prediction, based on the models SARIMA, LSSVM, ANFIS, and GMDH. Environmental Science and Pollution Research International, 29(15), 21935-21954. http://doi.org/10.1007/s11356-021-17443-0
    » http://doi.org/10.1007/s11356-021-17443-0
  • Little, R. J., & Rubin, D. B. (2002). Statistical analysis with missing data Hoboken: John Wiley & Sons. http://doi.org/10.1002/9781119013563
    » http://doi.org/10.1002/9781119013563
  • Liu, X., Sang, X., Chang, J., & Zheng, Y. (2021). Multi-model coupling water demand prediction optimization method for megacities based on time series decomposition. Water Resources Management, 35(12), 4021-4041. http://doi.org/10.1007/s11269-021-02927-y
    » http://doi.org/10.1007/s11269-021-02927-y
  • McKnight, P. E., McKnight, K. M., Sidani, S., & Figueredo, A. J. (2007). Missing data: a gentle introduction New York: Guilford Press.
  • Musa, J. J. (2013). Stochastic modeling of Shiroro River stream flow process. American Journal of Engineering Research, 2(6), 49-54.
  • Nash, J. E., & Sutcliffe, J. V. (1970). River flow forecasting through conceptual models part I: a discussion of principles. Journal of Hydrology, 10(3), 282-290. http://doi.org/10.1016/0022-1694(70)90255-6
    » http://doi.org/10.1016/0022-1694(70)90255-6
  • Nunes, L. N., Klück, M. M., & Fachel, J. M. G. (2009). Uso da imputação múltipla de dados faltantes: uma simulação utilizando dados epidemiológicos. Cadernos de Saúde Pública, 25(2), 268-278. http://doi.org/10.1590/S0102-311X2009000200005
    » http://doi.org/10.1590/S0102-311X2009000200005
  • Phan, T.-T.-H., & Nguyen, X. H. (2020). Combining statistical machine learning models with ARIMA for water level forecasting: the case of the Red river. Advances in Water Resources, 142, 103656. http://doi.org/10.1016/j.advwatres.2020.103656
    » http://doi.org/10.1016/j.advwatres.2020.103656
  • Pinto, W. P., Lima, G. B., & Zanetti, J. B. (2015). Previsão de regimes de vazões médias mensais do rio Doce, Colatina-Espírito Santo. Ciência e Natura, 37(3), 1-11. http://doi.org/10.5902/2179460X17143
    » http://doi.org/10.5902/2179460X17143
  • R Development Core Team. (2021). R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing.
  • Retike, I., Bikše, J., Kalvāns, A., Dēliņa, A., Avotniece, Z., Zaadnoordijk, W. J., Jemeljanova, M., Popovs, K., Babre, A., Zelenkevičs, A., & Baikovs, A. (2022). Rescue of groundwater level time series: how to visually identify and treat errors. Journal of Hydrology, 605, 127294. http://doi.org/10.1016/j.jhydrol.2021.127294
    » http://doi.org/10.1016/j.jhydrol.2021.127294
  • Rubin, D. B. (1987). Procedures with nonignorable nonresponse. In: D. B. Rubin (Ed.), Multiple imputation for nonresponse in surveys (pp. 202-240). New York: John Wiley & Sons. http://doi.org/10.1002/9780470316696.ch6
    » http://doi.org/10.1002/9780470316696.ch6
  • Salame, C. W., Queiroz, J. C. B., Souza, E. B., Farias, V. J. C., Rocha, E. J. P., & Moura, H. P. (2019). Um estudo comparativo dos modelos Box-Jenkins e Redes Neurais Artificiais na previsão de vazões e precipitações pluviométricas da Bacia Araguaia, Tocantins, Brasil. Brazilian Journal of Environmental Sciences, (52), 28-43. http://doi.org/10.5327//Z2176-947820190444
    » http://doi.org/10.5327//Z2176-947820190444
  • Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological Methods, 7(2), 147-177. http://doi.org/10.1037/1082-989X.7.2.147
    » http://doi.org/10.1037/1082-989X.7.2.147
  • Schäfer, M. P., Dietrich, O., & Mbilinyi, B. (2016). Streamflow and lake water level changes and their attributed causes in Eastern and Southern Africa: state of the art review. International Journal of Water Resources Development, 32(6), 853-880. http://doi.org/10.1080/07900627.2015.1091289
    » http://doi.org/10.1080/07900627.2015.1091289
  • Semiromi, M. T., & Koch, M. (2019). Reconstruction of groundwater levels to impute missing values using singular and multichannel spectrum analysis: application to the Ardabil Plain, Iran. Hydrological Sciences Journal, 64(14), 1711-1726. http://doi.org/10.1080/02626667.2019.1669793
    » http://doi.org/10.1080/02626667.2019.1669793
  • Tencaliec, P., Favre, A. C., Prieur, C., & Mathevet, T. (2015). Reconstruction of missing daily streamflow data using dynamic regression models. Water Resources Research, 51(12), 9447-9463. http://doi.org/10.1002/2015WR017399
    » http://doi.org/10.1002/2015WR017399
  • Wei, W. W. S. (2006). Time series analysis. In T. D. Little (Ed.), The Oxford handbook of quantitative methods in psychology: statistical analysis (Vol. 2). Oxford: Oxford University Press.

Edited by

Editor-in-Chief: Adilson Pinheiro
Associated Editor: Carlos Henrique Ribeiro Lima

Publication Dates

  • Publication in this collection
    12 Aug 2024
  • Date of issue
    2024

History

  • Received
    03 Nov 2023
  • Reviewed
    12 Apr 2024
  • Accepted
    29 Apr 2024
Associação Brasileira de Recursos Hídricos Av. Bento Gonçalves, 9500, CEP: 91501-970, Tel: (51) 3493 2233, Fax: (51) 3308 6652 - Porto Alegre - RS - Brazil
E-mail: rbrh@abrh.org.br