Abstract
In this paper, we propose the power Student-t regression model for censored (limited) observations, which extends the Student-t censored regression model. The extension is based on the asymmetric and heavy-tailed power Student-t distribution. We give the score functions and the expected information matrix, and discuss parameter estimation by the likelihood approach. Two simulation studies are conducted to evaluate parameter recovery and the properties of the model, and two applications to real data sets are reported to demonstrate the usefulness of the new methodology.
Key words
Censored regression model; Fisher information matrix; maximum likelihood estimation; power Student-$t$ distribution
INTRODUCTION
Regression models in which the response variable is censored or limited are common in many fields: clinical trials, econometric analysis, social phenomena and engineering studies, among others. In clinical trials, for example, in the first phases of development of new vaccines, measured antibody concentrations are often left-censored because the assay lacks sensitivity near zero and concentrations fall below its detection limit; see Moulton & Halsey 1995. In the study of social phenomena, the number of extramarital affairs in the previous year can be treated as a left-censored variable (Fair 1978). In econometric analysis, the ordinary Tobit model (Tobin 1958) is commonly used to study the labor force participation of married women. In this case, the observed response is the wage rate, which is typically considered left-censored at zero, i.e., positive wage rates are registered for working women, whereas the observed wage rate is zero for non-working women; see Mroz 1987.
In situations such as those discussed above, where censored regression (CR) models are proposed, it is common to assume a normal distribution for the error term. This assumption may be unsuitable and unrealistic in the presence of atypical observations, or when the response variable shows a degree of skewness or kurtosis that the normal model is unable to capture. Considerable interest has therefore centered on relaxing the assumption of normal errors in CR models, and several authors have proposed alternatives to the normal censored regression (NCR) model, widely known in the literature as the Tobit model. Arellano-Valle et al. 2012, for example, extend the classical Tobit model by introducing the Student-$t$ censored regression (TCR) model, which is suitable when the response variable has heavy tails and kurtosis greater than that of the normal model. Another extension of the Tobit model was proposed by Martínez-Flórez et al. 2013, who assume that the random errors follow a power-normal (PN) distribution (Gupta & Gupta 2008). The novelty of this proposal is a shape parameter that relaxes the assumption of symmetric (normal) errors and accommodates left- and right-skewed error distributions in CR models. More recently, Garay et al. 2017 proposed a family of censored regression models based on the symmetric family known as scale mixtures of normal (SMN) distributions, which includes the TCR model of Arellano-Valle et al. 2012. This family also includes the Pearson type VII (PVII), slash (SL), power exponential (PE), contaminated normal (CN) and normal (N) distributions. In addition to being robust, these models have proved useful for detecting atypical observations in CR models.
Although several proposals address the problem of atypical observations in censored regression models, most of them assume symmetric errors, and few studies capture departures from symmetry in the error distribution. One exception is Martínez-Flórez et al. 2013, who rely on the ability of alpha-power models to fit data whose distribution presents high or low asymmetry and/or kurtosis.
Within this class of alpha-power models, Zhao & Kim 2016 proposed an extension of the Student-$t$ model by defining the power Student-$t$ (PT) distribution as an alternative to the skew-$t$ model of Azzalini & Capitanio 2003 for fitting skewed and heavy-tailed data. The PT model, which extends the power-normal model of Gupta & Gupta 2008, appears to be useful when, in the presence of atypical observations, the data present a higher degree of skewness and kurtosis than the PN model can accommodate.
In this paper, we propose a censored regression model under the assumption that the errors follow a PT distribution (hereafter, the PTCR model). The PT assumption gives flexibility to accommodate left- and right-skewed forms, as well as kurtosis greater or smaller than that of the Student-$t$ distribution; hence, the PTCR model extends the TCR model. Inference is conducted by the maximum likelihood (ML) approach and its large-sample properties. Applications to real data sets show that the proposed model can be very useful in practice.
The rest of this paper is organized as follows. Section “The Power Student-t Distribution” presents a brief review of the main properties of the PT distribution. In Section “Power-Student-t Model for Censored and Truncated Data”, we introduce the censored and truncated PT models. Section “Censored Power-Student-t Regression Model” introduces the PTCR model, together with the ML equations and the observed and expected information matrices. Section “Simulation Study” presents the results of a simulation study which reveals the good performance of the estimation approach. The PTCR model is fitted to a data set of housewives' wages in Section “Real Data Application”, revealing that the data set in question can be fitted by the PTCR model as well as by a CR model whose observational errors have an SMN distribution (SMNCR model).
THE POWER STUDENT-T DISTRIBUTION
In this section, we present the PT distribution and review some of its main characteristics and properties. The PT model was introduced by Zhao & Kim 2016 as an alternative to the skew-$t$ model for fitting data with high asymmetry and kurtosis in addition to heavy tails.
Definition 1. The random variable $X$ is said to have a PT distribution with parameter $\alpha$ and $\nu$ degrees of freedom if $X$ has probability density function (PDF) given by
$$f(x;\alpha,\nu) = \alpha\, f_T(x;\nu)\,\{F_T(x;\nu)\}^{\alpha-1},$$
for $x \in \mathbb{R}$ and $\alpha, \nu > 0$. The functions $f_T(\cdot\,;\nu)$ and $F_T(\cdot\,;\nu)$ are the PDF and cumulative distribution function (CDF) of the standard Student-$t$ distribution with $\nu$ degrees of freedom.
A random variable $X$ having this distribution is denoted by $X \sim PT(\alpha,\nu)$. Figure 1 displays some forms of the PDF of the PT distribution for selected parameter values. Note from the figure that the parameter $\alpha$ controls the skewness and kurtosis of the distribution. The CDF of the PT model is given by
$$F(x;\alpha,\nu) = \{F_T(x;\nu)\}^{\alpha},$$
for $x \in \mathbb{R}$. Some properties of the PT distribution follow from Definition 1.
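As a numerical companion to Definition 1, the following minimal Python sketch evaluates the PT density and CDF, assuming the standard power-distribution form $\alpha f_T(x;\nu)\{F_T(x;\nu)\}^{\alpha-1}$ with CDF $\{F_T(x;\nu)\}^{\alpha}$. The function names (`t_pdf`, `t_cdf`, `pt_pdf`, `pt_cdf`) are illustrative and not taken from the paper, and the Student-$t$ CDF is approximated by Simpson's rule only to keep the example free of external libraries.

```python
import math

def t_pdf(x, nu):
    """Density of the standard Student-t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(x, nu, n=2000):
    """Student-t CDF: composite Simpson's rule on [0, |x|] plus symmetry.

    Adequate for moderate |x|; it is not meant for extreme-tail accuracy.
    """
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / n  # n must be even for Simpson's rule
    s = t_pdf(0.0, nu) + t_pdf(a, nu)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    half = s * h / 3
    p = 0.5 + half if x > 0 else 0.5 - half
    # Guard against tiny negative values caused by rounding in the far left tail.
    return min(max(p, 1e-300), 1.0)

def pt_pdf(x, alpha, nu):
    """PT density: alpha * f_T(x) * F_T(x)^(alpha - 1)."""
    return alpha * t_pdf(x, nu) * t_cdf(x, nu) ** (alpha - 1)

def pt_cdf(x, alpha, nu):
    """PT distribution function: F_T(x)^alpha."""
    return t_cdf(x, nu) ** alpha

# With alpha = 1 the PT density coincides with the Student-t density.
print(pt_pdf(0.7, 1.0, 4.0) == t_pdf(0.7, 4.0))
# At x = 0 the PT CDF equals 0.5 ** alpha for any nu.
print(pt_cdf(0.0, 2.0, 5.0))
```

Setting `alpha` above or below 1 moves probability mass to the right or to the left of the symmetric Student-$t$ case, which is how the shape parameter induces asymmetry.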
Proposition 1. Let $X \sim PT(\alpha,\nu)$, then
- (i) if $\alpha = 1$, $X$ follows the Student-$t$ distribution and we write $X \sim t_\nu$,
- (ii) if $\nu \to \infty$, $X$ converges to the power-normal (PN) model with parameter $\alpha$, whose PDF is given by
$$f(x;\alpha) = \alpha\,\phi(x)\,\{\Phi(x)\}^{\alpha-1}, \quad x \in \mathbb{R},$$
where $\phi$ and $\Phi$ are the PDF and CDF of the standard normal distribution. More details on the PN distribution can be found in Gupta & Gupta 2008 and Pewsey et al. 2012,
- (iii) if $\alpha = 1$ and $\nu \to \infty$, $X$ converges to the standard normal distribution.
Proof. Parts (i)-(iii) follow directly from the definition of the PT distribution. ◻
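Proposition 2. Let $X \sim PT(\alpha,\nu)$, then for each positive integer $n$ for which the moment exists,
$$E(X^n) = E\left[\{F_T^{-1}(U;\nu)\}^n\right],$$
where $U$ has a beta distribution with parameters $\alpha$ and 1, and $F_T^{-1}(\cdot\,;\nu)$ denotes the inverse of the function $F_T(\cdot\,;\nu)$.
Proof. We have by definition that
$$E(X^n) = \alpha\int_{-\infty}^{\infty} x^n f_T(x;\nu)\,\{F_T(x;\nu)\}^{\alpha-1}\,dx,$$
thus, letting $u = F_T(x;\nu)$, then $du = f_T(x;\nu)\,dx$, it follows that
$$E(X^n) = \alpha\int_0^1 \{F_T^{-1}(u;\nu)\}^n\, u^{\alpha-1}\,du,$$
which is the expected value of the function $\{F_T^{-1}(U;\nu)\}^n$, where $U$ follows a beta distribution with parameters $\alpha$ and 1. ◻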
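As a worked check of Proposition 2, consider the illustrative case $\alpha = 2$, $\nu = 2$ (values chosen here for convenience, not taken from the paper), for which the Student-$t$ quantile function has the closed form $F_T^{-1}(u;2) = (2u-1)/\sqrt{2u(1-u)}$. Writing $B(\cdot,\cdot)$ for the beta function, the first moment follows directly from the beta representation:

```latex
\begin{aligned}
E(X) &= 2\int_0^1 \frac{2u-1}{\sqrt{2u(1-u)}}\,u\,du
      = \sqrt{2}\int_0^1 (2u-1)\,u^{1/2}(1-u)^{-1/2}\,du \\
     &= \sqrt{2}\left[2B\!\left(\tfrac{5}{2},\tfrac{1}{2}\right)
        - B\!\left(\tfrac{3}{2},\tfrac{1}{2}\right)\right]
      = \sqrt{2}\left[\tfrac{3\pi}{4} - \tfrac{\pi}{2}\right]
      = \frac{\pi}{2\sqrt{2}} \approx 1.1107 .
\end{aligned}
```

The positive mean reflects that $\alpha > 1$ shifts probability mass to the right of the symmetric Student-$t$ case. With $\nu = 2$ only the first moment is finite, so the variance and the asymmetry and kurtosis indices are not defined in this particular example.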
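The expected value, variance, and indices of asymmetry and kurtosis of the PT model can be found by using the expressions
- (i) $E(X) = \mu_1$,
- (ii) $\mathrm{Var}(X) = \mu_2 - \mu_1^2$,
- (iii) $\sqrt{\beta_1} = \dfrac{\mu_3 - 3\mu_1\mu_2 + 2\mu_1^3}{(\mu_2 - \mu_1^2)^{3/2}}$,
- (iv) $\beta_2 = \dfrac{\mu_4 - 4\mu_1\mu_3 + 6\mu_1^2\mu_2 - 3\mu_1^4}{(\mu_2 - \mu_1^2)^{2}}$,
where $\mu_n = E(X^n)$, $n = 1,\ldots,4$, are the moments given in Proposition 2. Table I presents the values of the asymmetry coefficient of the PT model for some values of the parameter $\nu$ and for values of $\alpha$ in the range of 0.1 to 100000.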
Definition 2. Let $Z \sim PT(\alpha,\nu)$. The location-scale PT distribution is defined as the distribution of $X = \xi + \eta Z$, for $\xi \in \mathbb{R}$ and $\eta > 0$. The corresponding PDF is given by
$$f(x;\xi,\eta,\alpha,\nu) = \frac{\alpha}{\eta}\, f_T\!\left(\frac{x-\xi}{\eta};\nu\right)\left\{F_T\!\left(\frac{x-\xi}{\eta};\nu\right)\right\}^{\alpha-1},$$
for $x \in \mathbb{R}$. We denote this extension as $X \sim PT(\xi,\eta,\alpha,\nu)$, and we have that $PT(\alpha,\nu) = PT(0,1,\alpha,\nu)$.
The $r$th moment of the random variable $X$ is given by
$$E(X^r) = \sum_{k=0}^{r}\binom{r}{k}\,\xi^{\,r-k}\,\eta^{k}\,\mu_k,$$
where $\mu_k$ is the $k$th moment of a random variable $Z \sim PT(\alpha,\nu)$. Zhao & Kim 2016 derived the information matrix for the location-scale version and showed that it is non-singular at $\alpha = 1$ for small values of the parameter $\nu$. When $\nu$ tends to $\infty$, the PT distribution converges to the PN model, and we recall that Pewsey et al. 2012 showed that the PN model has a non-singular information matrix. This result guarantees that the regularity conditions for the likelihood approach are satisfied; hence, with the PT model, symmetry can be tested by using the ordinary large-sample properties of the likelihood ratio statistic.
POWER-STUDENT-T MODEL FOR CENSORED AND TRUNCATED DATA
Motivated by the ability of the PT distribution to fit data with high asymmetry and kurtosis, in this section we introduce the censored PT and the truncated PT models, which will be denoted by CPT and TPT, respectively.
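Definition 3 (Censored PT Model). Suppose that the random variable $Y$ follows a $PT(\xi,\eta,\alpha,\nu)$ distribution. Let $Y_1,\ldots,Y_n$ be a random sample of size $n$ of $Y$, where only those values of $Y_i$ greater than a constant $c$ are recorded, and for values $Y_i \le c$ only the value $c$ is registered. The observed values $Y_i^{o}$ can be written as
$$Y_i^{o} = \begin{cases} c, & \text{if } Y_i \le c,\\ Y_i, & \text{if } Y_i > c,\end{cases}$$
for $i = 1,\ldots,n$. The resulting sample is said to be a censored power-Student-$t$ (CPT) sample.
From Definition 3 it follows that $P(Y_i^{o} = c) = P(Y_i \le c) = \{F_T((c-\xi)/\eta;\nu)\}^{\alpha}$, and for the observations $Y_i^{o} > c$ the density of $Y_i^{o}$ coincides with that of $Y_i$, i.e., it is the $PT(\xi,\eta,\alpha,\nu)$ density. For convenience, we work with the case of left-censored data; however, the following results can be extended to other types of censoring.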
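The censoring mechanism of Definition 3 can be sketched in a few lines of Python. To stay within the standard library, the sketch fixes the degrees of freedom at 2, where the Student-$t$ quantile function has the closed form $(2u-1)/\sqrt{2u(1-u)}$, and draws PT values by inverse transform using the beta representation of $F_T(Z)$. The function names, the seed and the parameter values are hypothetical choices made only for this illustration.

```python
import math
import random

def t2_quantile(u):
    """Quantile function of the Student-t distribution with 2 degrees of freedom."""
    return (2.0 * u - 1.0) / math.sqrt(2.0 * u * (1.0 - u))

def sample_pt2(n, xi, eta, alpha, rng):
    """Draw n values from a location-scale PT law with 2 degrees of freedom."""
    eps = 1e-12  # keeps u strictly inside (0, 1)
    draws = []
    for _ in range(n):
        # If V ~ U(0, 1), then V**(1/alpha) ~ Beta(alpha, 1), the law of F_T(Z).
        u = min(max(rng.random() ** (1.0 / alpha), eps), 1.0 - eps)
        draws.append(xi + eta * t2_quantile(u))
    return draws

def left_censor(y, c):
    """Return the observed values max(y_i, c) and indicators (1 = uncensored)."""
    observed = [max(v, c) for v in y]
    delta = [1 if v > c else 0 for v in y]
    return observed, delta

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
latent = sample_pt2(20000, xi=0.0, eta=1.0, alpha=2.0, rng=rng)
observed, delta = left_censor(latent, c=0.0)
censored_share = 1.0 - sum(delta) / len(delta)
# The share should be close to F_T(0)^alpha = 0.5^2 = 0.25.
print(round(censored_share, 3))
```

The empirical share of censored values approximates the censoring probability $\{F_T((c-\xi)/\eta;\nu)\}^{\alpha}$, which is a quick sanity check on any simulated censored sample.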
Maximum Likelihood Estimation for CPT Model
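Let $y_1,\ldots,y_n$ be a random sample from the censored $PT(\xi,\eta,\alpha,\nu)$ distribution (censored at $c$). To perform statistical inference for the parameter vector $\theta = (\xi,\eta,\alpha,\nu)^\top$ by the ML method, we use the reparameterization of Olsen 1978. Thus, letting $\gamma = \xi/\eta$ and $\tau = 1/\eta$, the log-likelihood function for the new vector $\psi = (\gamma,\tau,\alpha,\nu)^\top$, given the observed sample, can be written as
$$\ell(\psi) = \sum_{i=1}^{n}(1-\delta_i)\,\alpha\log F_T(z_c;\nu) + \sum_{i=1}^{n}\delta_i\left[\log\alpha + \log\tau + \log f_T(z_i;\nu) + (\alpha-1)\log F_T(z_i;\nu)\right], \quad (7)$$
where $z_i = \tau y_i - \gamma$, $z_c = \tau c - \gamma$, and $\delta_i$ is an indicator variable defined as $\delta_i = 0$ if $y_i \le c$, and $\delta_i = 1$ if $y_i > c$. The components of the score function are obtained by partially differentiating the log-likelihood function in (7) with respect to the components $\gamma$, $\tau$, $\alpha$ and $\nu$, which gives the following equations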
where , is the digamma function and is the truncated moment defined as
with and . The moments in (8) are obtained by numerical integration, for example, by using the integrate function of R (R Development Core Team 2018). The ML estimates of the parameters $\gamma$, $\tau$, $\alpha$ and $\nu$ in the censored PT model are obtained using iterative algorithms based on the score functions, and the estimates of the original location and scale are recovered by applying the inverse of Olsen's transformation, $\xi = \gamma/\tau$ and $\eta = 1/\tau$. To obtain the standard errors of the ML estimates, one should compute the information matrix $I(\psi)$, with $\psi = (\gamma,\tau,\alpha,\nu)^\top$. It is well known that the elements of $I(\psi)$ are given by
$$I_{jk}(\psi) = -E\left[\frac{\partial^2\ell(\psi)}{\partial\psi_j\,\partial\psi_k}\right],$$
where $j,k = 1,\ldots,4$. Since the expectations over the PT distribution and the second-order derivatives are not straightforward, numerical methods are required to obtain the information matrix $I(\psi)$. Thus, we use the observed information matrix for calculating the standard errors in the rest of the paper. To recover the information matrix of the original parameterization $\theta = (\xi,\eta,\alpha,\nu)^\top$, we use
$$I(\theta) = J^\top I(\psi)\,J,$$
where the Jacobian matrix $J = \partial\psi/\partial\theta^\top$ is
$$J = \begin{pmatrix} 1/\eta & -\xi/\eta^2 & 0 & 0\\ 0 & -1/\eta^2 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\end{pmatrix}. \quad (9)$$
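The censored log-likelihood of this subsection can be sketched in Python as follows. The sketch assumes an Olsen-type reparameterization $\gamma = \xi/\eta$, $\tau = 1/\eta$ with standardized values $z = \tau y - \gamma$, a censored contribution $\alpha\log F_T(z_c;\nu)$ and an uncensored contribution $\log\alpha + \log\tau + \log f_T(z;\nu) + (\alpha-1)\log F_T(z;\nu)$. The function names are hypothetical, and the Student-$t$ CDF is again approximated by Simpson's rule rather than taken from a statistical library.

```python
import math

def t_pdf(x, nu):
    """Density of the standard Student-t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))

def t_cdf(x, nu, n=2000):
    """Student-t CDF by composite Simpson's rule on [0, |x|] plus symmetry."""
    if x == 0:
        return 0.5
    a = abs(x)
    h = a / n
    s = t_pdf(0.0, nu) + t_pdf(a, nu)
    for k in range(1, n):
        s += (4 if k % 2 else 2) * t_pdf(k * h, nu)
    half = s * h / 3
    p = 0.5 + half if x > 0 else 0.5 - half
    return min(max(p, 1e-300), 1.0)

def cpt_loglik(psi, y, c):
    """Log-likelihood of a left-censored PT sample, psi = (gamma, tau, alpha, nu)."""
    gamma, tau, alpha, nu = psi
    log_F_c = math.log(t_cdf(tau * c - gamma, nu))
    total = 0.0
    for yi in y:
        if yi <= c:
            # Censored observation: probability mass F_T(z_c)^alpha at c.
            total += alpha * log_F_c
        else:
            # Uncensored observation: log of the location-scale PT density.
            z = tau * yi - gamma
            total += (math.log(alpha) + math.log(tau) + math.log(t_pdf(z, nu))
                      + (alpha - 1.0) * math.log(t_cdf(z, nu)))
    return total

def to_original(psi):
    """Map (gamma, tau, alpha, nu) back to (xi, eta, alpha, nu)."""
    gamma, tau, alpha, nu = psi
    return gamma / tau, 1.0 / tau, alpha, nu

# Toy check with alpha = 1 and nu = 1 (Cauchy): one censored and one uncensored value.
print(cpt_loglik((0.0, 1.0, 1.0, 1.0), [0.0, 1.0], 0.0))
print(to_original((1.0, 2.0, 1.5, 4.0)))
```

In practice the negative of such a function would be handed to a general-purpose optimizer, and the estimates mapped back with the inverse transformation. The paper itself reports using R for this step.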
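Definition 4 (Truncated PT Model). Let $X$ be a random variable with distribution $PT(\xi,\eta,\alpha,\nu)$. Let $a, b$ with $a < b$, such that $P(a < X < b) > 0$. It is said that the random variable $Y$ has a truncated power-Student-$t$ (TPT) distribution in the interval $(a,b)$ if $Y$ has the same distribution as $X \mid a < X < b$. In this case, we write $Y \sim TPT(\xi,\eta,\alpha,\nu;(a,b))$.
As a consequence of Definition 4, the PDF of the TPT distribution can be obtained as
$$f_Y(y) = \frac{\dfrac{\alpha}{\eta}\, f_T(z;\nu)\,\{F_T(z;\nu)\}^{\alpha-1}}{\{F_T(z_b;\nu)\}^{\alpha} - \{F_T(z_a;\nu)\}^{\alpha}}, \quad z = \frac{y-\xi}{\eta},\ z_a = \frac{a-\xi}{\eta},\ z_b = \frac{b-\xi}{\eta},$$
if $a < y < b$, and $f_Y(y) = 0$ otherwise. Now, suppose that, before the sample is selected, the distribution of $Y$ is truncated at the value $c$, so that we can only choose observations such that $y > c$. Then, the random variable $Y$ has PDF given by
$$f_Y(y) = \frac{\dfrac{\alpha}{\eta}\, f_T\!\left(\frac{y-\xi}{\eta};\nu\right)\left\{F_T\!\left(\frac{y-\xi}{\eta};\nu\right)\right\}^{\alpha-1}}{1 - \left\{F_T\!\left(\frac{c-\xi}{\eta};\nu\right)\right\}^{\alpha}}, \quad (10)$$
for $y > c$.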
Maximum Likelihood Estimation for TPT Model
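Given a sample $y_1,\ldots,y_n$ from the TPT distribution truncated at the value $c$, the log-likelihood function for the vector $\psi = (\gamma,\tau,\alpha,\nu)^\top$, where $\gamma = \xi/\eta$ and $\tau = 1/\eta$, is given by
$$\ell(\psi) = n(\log\alpha + \log\tau) + \sum_{i=1}^{n}\log f_T(z_i;\nu) + (\alpha-1)\sum_{i=1}^{n}\log F_T(z_i;\nu) - n\log\left[1-\{F_T(z_c;\nu)\}^{\alpha}\right], \quad (11)$$
where $z_i = \tau y_i - \gamma$, $z_c = \tau c - \gamma$, and $n$ is the number of observations in the sample such that $y_i > c$. Partially differentiating the log-likelihood function (11) with respect to the components of the vector $\psi$, the following elements of the score function are obtained
where , is the digamma function and is given by (8). To obtain the ML estimates of the parameters $\gamma$, $\tau$, $\alpha$ and $\nu$ in the TPT model, we proceed as for the CPT model, using iterative methods based on the Newton-Raphson algorithm with the score functions. We use the observed information matrix to calculate the standard errors and, to recover the information matrix of the original parameterization $\theta = (\xi,\eta,\alpha,\nu)^\top$, we use $I(\theta) = J^\top I(\psi)\,J$, where $J$ is given in (9).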
CENSORED POWER-STUDENT-T REGRESSION MODEL
In this section, we introduce the censored power-Student- regression model, which is denoted by PTCR. This model results from the consideration of the observed random variable , where and , with , for ; i.e.,
where is a vector of dimension of unknown parameters, , for , are vectors of known covariates, and , for , are independent random variables with PT distribution with location parameter 0, scale parameter , shape parameter , and degrees of freedom . This assumption is equivalent to considering that unobserved random variables are independent with , that is, with PDF given by
for , where . The contribution to likelihood function for observations is given by
and for observations , we have that . Therefore, the likelihood function of the PTCR model based on the observed sample is given by
where
Model (13) can be extended to the situation where the value of the censorship associated with the observation is replaced by the value (a known value), i.e.,
for . Note that, by making , and , we have the previous model in (12), hence, the results of the inference based on the ML method can be used to fit the more general model in (14).
Proposition 3. Consider the model (12) with PT-distributed errors $\varepsilon_i$, for $i = 1,\ldots,n$, then
- (i) if the shape parameter $\alpha = 1$, model (12) reduces to the Student-$t$ censored regression (TCR) model,
- (ii) if the degrees of freedom $\nu \to \infty$, model (12) converges to the power-normal censored regression (PNCR) model; see Martínez-Flórez et al. 2013,
- (iii) if $\alpha = 1$ and $\nu \to \infty$, model (12) converges to the usual normal censored regression (NCR) model, that is, the Tobit model.
Proof. Parts (i)-(iii) follow directly from the definition of the PTCR model. ◻
Moments
Proposition 4. The mean and variance for the th observed response in PTCR model are given by
and
respectively, where with and
Note that is the moment of a random variable with distribution , truncated in the interval .
Proof. The mean and variance for the th observed response in PTCR model can be obtained by noting that , where and , with and , . We have
By Lemma 1 in Appendix A, we have
To obtain the variance of , note that
Using Lemma 1 in Appendix, it follows that
thus, by calculating and after some algebraic manipulations, we obtain
◻
It is important to note that, if the degrees of freedom tend to infinity, then (15) and (16) converge to the mean and variance of the PNCR model (Martínez-Flórez et al. 2013), i.e., when $\nu \to \infty$
where
with the CDF of the standard normal distribution, and the inverse function of . Also, worth noting that, if and , we have and , thus
which are the mean and variance, respectively, of the Tobit model (Tobin 1958).
Maximum Likelihood Estimation
The ML method is applied using the reparameterization of Olsen 1978. Let , and
the log-likelihood function for obtained from (13) under the new parameterization is given by
where , with . The components of the score function are obtained by deriving partially in relation to the components , , and . After some algebraic manipulations the following components of the score function are obtained
where , , is the digamma function and is the truncated moment defined by (8). Note that, if $\alpha = 1$, equations (18)-(21) reduce to the score functions of the TCR model (Arellano-Valle et al. 2012), while, if $\alpha = 1$ and $\nu \to \infty$, then , , and , therefore, equations (18) and (19) reduce to the score functions of the Tobit model.
The elements of the observed information matrix for PTCR model, which are denoted by , can be obtained by calculating the second partial derivative of the log-likelihood function (17), i.e., , while the expected information matrix is obtained as , which involves the calculation of truncated expected values that have no closed form and must be obtained numerically. The Appendix B presents the expressions for the elements of the matrices and . The expected information matrix of the original parameterization can be recovered by using , where
Finally, the ML estimates for can be obtained using iterative methods based on the Newton-Raphson algorithm from the score functions (18)-(21) and applying the inverse transformation and . Estimates of the variances of the estimator can be obtained by evaluating the inverse of the observed information matrix at the ML estimators and by using the previous result.
Model Selection and Residual Analysis
In this section some criteria for the selection of the best-fitted model and a methodology for residual analysis are proposed.
Model Selection
Several model selection tools are in general use, such as the Akaike information criterion (AIC) (Akaike 1974), the Bayesian information criterion (BIC) (Schwarz 1978), and the corrected AIC (AICc) (Sugiura 1978), which are defined by
$$-2\,\ell(\hat{\theta}) + \kappa(p,n),$$
where the term $\kappa(p,n)$ is a quantity that depends on the number of free parameters estimated in the model, $p$, and the number of observations in the sample, $n$. For the AIC one has $\kappa(p,n) = 2p$, for the BIC $\kappa(p,n) = p\log n$, and for the AICc $\kappa(p,n) = 2np/(n-p-1)$. The model with the smallest value of each criterion is taken as the best-fitting one.
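The three criteria differ only in their penalty term, which the following minimal Python sketch makes explicit. It assumes the common form $-2\ell(\hat\theta)$ plus a penalty of $2p$ (AIC), $p\log n$ (BIC) or $2np/(n-p-1)$ (AICc). The function name and the numerical inputs are hypothetical and are not results from the paper.

```python
import math

def information_criteria(loglik, p, n):
    """AIC, BIC and AICc from a maximized log-likelihood, p parameters, n observations."""
    if n - p - 1 <= 0:
        raise ValueError("AICc requires n > p + 1")
    base = -2.0 * loglik
    return {
        "AIC": base + 2.0 * p,
        "BIC": base + p * math.log(n),
        "AICc": base + 2.0 * n * p / (n - p - 1),
    }

# Hypothetical inputs: log-likelihood -100 with 3 parameters and 50 observations.
crit = information_criteria(loglik=-100.0, p=3, n=50)
for name, value in crit.items():
    print(name, round(value, 3))
```

Because the AICc penalty $2np/(n-p-1)$ equals $2p + 2p(p+1)/(n-p-1)$, it approaches the AIC penalty as $n$ grows, so the two criteria mainly disagree in small samples.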
Residual Analysis
The purpose of residual analysis is to detect atypical observations and to evaluate the assumptions of the model. It can include formal tests to detect departures from the assumptions of the considered model, as well as informal graphs that present general characteristics of the residuals.
Following Garay et al. 2017 and Arellano-Valle et al. 2012, in this work we consider the transformed martingale residuals proposed by Barros et al. 2010 as a diagnostic tool to evaluate deviations from the postulated model for the response variable, as well as to detect the presence of atypical observations. The residuals are defined as
$$r_{MT_i} = \mathrm{sign}(r_{M_i})\sqrt{-2\left[r_{M_i} + \delta_i\log(\delta_i - r_{M_i})\right]}, \quad i = 1,\ldots,n,$$
where $r_{M_i} = \delta_i + \log S(y_i;\hat{\theta})$ is the martingale residual proposed by Ortega et al. 2003, $\delta_i = 0, 1$ indicates whether the $i$th observation is censored or not, respectively, $\mathrm{sign}(r_{M_i})$ denotes the sign of $r_{M_i}$, and $S(y_i;\hat{\theta})$ represents the survival function evaluated at $y_i$, where $\hat{\theta}$ is the MLE of $\theta$.
As suggested by Garay et al. 2017, this type of standardized residual is used because it is symmetrically distributed around zero, which facilitates the construction of simulated envelopes with little computational effort and is useful for detecting an incorrect specification of the model, as well as the presence of atypical observations.
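The residuals described in this subsection can be computed from two ingredients per observation: the censoring indicator and the fitted survival probability. The Python sketch below assumes the usual form of the transformed martingale residual, $\mathrm{sign}(r_M)\sqrt{-2[r_M + \delta\log(\delta - r_M)]}$ with $r_M = \delta + \log S$, and takes the survival value as an input rather than computing it from a fitted PTCR model. The function names are hypothetical.

```python
import math

def martingale_residual(delta, surv):
    """r_M = delta + log S, with delta = 1 for uncensored and 0 for censored."""
    return delta + math.log(surv)

def transformed_martingale_residual(delta, surv):
    """Transformed martingale residual; expects 0 < surv < 1 for uncensored cases."""
    r_m = martingale_residual(delta, surv)
    inner = r_m
    if delta == 1:
        # For uncensored cases delta - r_M equals -log S, which is positive.
        inner += math.log(delta - r_m)
    # max(...) guards against tiny negative values produced by rounding.
    value = math.sqrt(max(-2.0 * inner, 0.0))
    return math.copysign(value, r_m) if r_m != 0 else 0.0

# Hypothetical survival values, for illustration only.
print(transformed_martingale_residual(1, math.exp(-2.0)))
print(transformed_martingale_residual(0, math.exp(-2.0)))
```

For a censored observation the expression reduces to $-\sqrt{-2\log S}$, so censored cases always receive non-positive residuals, while uncensored cases can fall on either side of zero.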
SIMULATION STUDIES
Simulation Study 1: Robustness of the Maximum Likelihood Estimates
In this section, we assess the performance of the ML estimates for the PTCR model in the presence of outliers in the response variable. Following Garay et al. 2017 and Mattos et al. 2018, we performed a simulation study based on the NCR model. Specifically, we considered (12) with and for . As in Garay et al. 2017, we generated 1000 artificial samples of size , considering , and fixing the left-censoring level at $p = 10\%$, 20% and 30% (that is, 10, 20 and 30% of the observations in each data set were left-censored, respectively). We independently generated the values , for , from a uniform distribution on the interval (2, 20). These values were kept fixed throughout the simulations.
To assess how much the ML estimates are influenced by the presence of outliers, we replaced the observation by , with . Let and be the ML estimates of with and without contamination, respectively, . We are particularly interested in the relative changes
We define the relative changes for analogously. For each replication we obtained the parameter estimates with and without outliers, under the PTCR model. Table II and Figure 2 depict the average values of the relative changes across all samples and different censoring levels.
Table II. Average relative changes in the estimates for different contaminations d and censoring levels p.
Figure 2. Simulation study 1. Average relative changes in the estimates for different contaminations d and censoring levels.
We observe that the influence increases dramatically as $d$ increases for $p = 10\%$, especially for the parameter. However, for $p = 20\%$ and 30%, these measures vary little, which indicates that the PTCR model is more robust to discrepant observations in these cases.
Simulation Study 2: Asymptotic properties
To study the performance of the ML estimator, a Monte Carlo simulation study with sample sizes $n = 150$, 300, 750 and 1000 is presented. We considered the PTCR model defined in Section “Censored Power-Student-t Regression Model” with for . The true values of the parameters were taken as (2,1.5), 1.5, 0.4, 2.5 and 3.0. We also consider censoring levels of 0%, 10%, 25% and 45%. The covariate was generated from a uniform distribution on (0.1, 20), as in Garay et al. 2017. For each combination of parameters, sample sizes and censoring levels, 2000 samples of the PTCR model were generated with errors . To evaluate the performance of the estimators, the absolute value of the relative bias (RB) and the mean square error (MSE) were considered, which are given by
$$\mathrm{RB}(\hat\theta) = \left|\frac{1}{2000}\sum_{j=1}^{2000}\frac{\hat\theta^{(j)}-\theta}{\theta}\right|, \qquad \mathrm{MSE}(\hat\theta) = \frac{1}{2000}\sum_{j=1}^{2000}\left(\hat\theta^{(j)}-\theta\right)^2,$$
respectively, where $\hat\theta^{(j)}$ is the estimate of the parameter $\theta$ for the $j$th sample, for $j = 1,\ldots,2000$. The ML estimates of the parameters were calculated by using the optim function of R (R Development Core Team 2018). The optimization of the likelihood function was done using iterative methods based on the Newton-Raphson algorithm by using the score functions.
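The two performance measures used in this study can be computed from the replicated estimates as in the following Python sketch. It assumes that RB is the absolute value of the average relative deviation from the true value and that MSE is the average squared deviation. The function names and the toy estimates are hypothetical and do not come from the simulation reported here.

```python
def relative_bias(estimates, true_value):
    """Absolute value of the mean relative deviation from the true value."""
    m = len(estimates)
    return abs(sum((e - true_value) / true_value for e in estimates) / m)

def mean_square_error(estimates, true_value):
    """Mean of the squared deviations from the true value."""
    m = len(estimates)
    return sum((e - true_value) ** 2 for e in estimates) / m

# Toy replicated estimates of a parameter whose true value is 2.0.
toy = [1.9, 2.1, 2.3]
rb = relative_bias(toy, 2.0)
mse = mean_square_error(toy, 2.0)
print(rb, mse)
```

Since the MSE combines squared bias and variance, a parameter can show a small RB and still a large MSE when its estimates are highly variable across replications.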
It can be seen from Table III that the RB and MSE tend to decrease as the sample size $n$ increases, indicating that the ML estimates have good asymptotic properties. The pattern is the same for the different censoring levels under consideration. Note that, when the sample size is , the estimates for the parameter are unstable (in terms of MSE) because they are affected by the bias of the asymmetry parameter $\alpha$; however, as the sample size increases, the estimates become more stable. In general, this problem is very common in these types of models, see, for example, Martínez-Flórez et al. 2013, so we recommend moderate to large sample sizes for these types of models.
REAL DATA APPLICATIONS
Application 1: Wage rate
To illustrate the proposed model, we consider a data set described by Mroz 1987. The data set consists of 753 married white women aged between 30 and 60 years in 1975, of whom 428 worked at some point during that year. The response variable used in this application is the wage rate, which represents a measure of the wage of the housewife known as the average hourly earnings. In the data set, the wage rate is set equal to zero for the 325 wives who did not work in 1975. Therefore, these observations are considered left-censored at zero.
The covariates considered were the wife's age, years of schooling, the number of children younger than six years old in the household, and the number of children between six and nineteen years old. These data were previously analyzed by Arellano-Valle et al. 2012 using a TCR model and later by Garay et al. 2017 using scale mixture of normal censored regression (SMNCR) models. We analyzed the data set by fitting a PTCR model and compared our proposal with the SMNCR models of Garay et al. 2017: the Student-$t$ censored regression (TCR) model (Arellano-Valle et al. 2012), the slash censored regression (SLCR) model, and the normal censored regression (NCR) model, that is, the usual Tobit model. Table IV shows the skewness and kurtosis indices for the complete data and for the uncensored observations. The values of these indices justify using the PTCR model.
Table V presents the parameter estimates together with their corresponding standard errors (SE) for the PTCR, TCR, SLCR and NCR models. To fit the PTCR model, we use the R code in Appendix C. Table VI presents some model selection criteria, together with the values of the log-likelihood. According to the AIC, BIC and AICc, the PTCR model seems to yield a better fit to Mroz's data than the SMNCR models (TCR and SLCR) and the usual Tobit model (NCR), supporting the contention of a departure from symmetry of the errors. Also, the SEs of the PTCR model are smaller than those of the SMNCR and NCR models.
Table V. Parameter estimates and standard errors (SE) of the PTCR, TCR, SLCR and NCR models fitted to the wage rate data.
A more formal indication that an asymmetric model should be considered comes from testing the hypothesis of a TCR model against an asymmetric alternative (the PTCR model), that is,
$$H_0: \alpha = 1 \quad \text{versus} \quad H_1: \alpha \neq 1,$$
by using the likelihood ratio (LR) statistic, $-2\log\Lambda$, where $\Lambda$ is the ratio of the maximized TCR and PTCR likelihoods. For the data set under study, the statistic is greater than the 5% critical value of the chi-square distribution with one degree of freedom, $\chi^2_{1,0.95} = 3.84$. This is an indication that the PTCR model fits Mroz's data better than the ordinary TCR model.
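The likelihood ratio comparison described above can be reproduced from the two maximized log-likelihoods alone. The Python sketch below assumes the usual large-sample reference distribution, chi-square with one degree of freedom, whose upper tail probability is $\mathrm{erfc}(\sqrt{x/2})$. The function name and the log-likelihood values are hypothetical and are not the values obtained for the wage rate data.

```python
import math

def lr_test_one_df(loglik_null, loglik_alt):
    """LR statistic and chi-square(1) p-value for nested models differing by one parameter."""
    stat = 2.0 * (loglik_alt - loglik_null)
    # For X ~ chi-square(1), P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value

# Hypothetical maximized log-likelihoods: restricted (alpha = 1) and unrestricted fits.
stat, p_value = lr_test_one_df(loglik_null=-105.0, loglik_alt=-102.0)
print(stat, round(p_value, 4))
print(stat > 3.84)  # compare with the 5% critical value quoted in the text
```

Because $\alpha = 1$ lies in the interior of the parameter space $\alpha > 0$, the one-degree-of-freedom chi-square reference is the natural choice for this test of symmetry.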
Finally, to check for any incorrect specification in the assumptions of the fitted model, simulated envelope plots for the transformed martingale residuals are shown in Figure 3. The figure indicates that the PTCR model is apparently more suitable for fitting these data than the SMNCR models. It can also be observed that the heavy-tailed SMNCR models fit the data better than the NCR model, since few observations fall outside the envelopes.
Figure 3. Wage rate data. Envelopes of transformed martingale residuals for PTCR, TCR, SLCR and NCR models.
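For readers unfamiliar with the residuals plotted in these envelopes, the sketch below uses a definition common in the censored regression literature: the martingale residual r_M = δ + log S(y), with δ = 0 for a censored and δ = 1 for an uncensored observation, and the transformation r_MT = sign(r_M) √(-2[r_M + δ log(δ - r_M)]). It is assumed, not confirmed by this section, that the paper uses this definition. The survival probability S(y) would come from the fitted model and is passed in directly here; the function name is hypothetical.

```python
import math


def transformed_martingale_residual(survival_prob, delta):
    """Transformed martingale residual for one observation.

    survival_prob: fitted survival probability S(y), strictly in (0, 1)
    delta:         0 if the observation is censored, 1 otherwise
    """
    if delta not in (0, 1):
        raise ValueError("delta must be 0 (censored) or 1 (uncensored)")
    if not 0.0 < survival_prob < 1.0:
        raise ValueError("survival_prob must lie strictly between 0 and 1")
    r_m = delta + math.log(survival_prob)           # martingale residual
    inner = r_m + delta * math.log(delta - r_m)     # always <= 0
    # max() guards against tiny negative values caused by rounding.
    return math.copysign(math.sqrt(max(0.0, -2.0 * inner)), r_m)


# Hypothetical fitted survival probabilities.
r_uncensored = transformed_martingale_residual(0.5, delta=1)
r_censored = transformed_martingale_residual(0.5, delta=0)
```

A simulated envelope is then obtained by repeatedly generating responses from the fitted model, refitting, and recording the ordered residuals; that step depends on the model-fitting code and is not sketched here.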
Application 2: Stellar Abundances Data
The second censored dataset is described in Santos et al. 2002 and is available in the R package astrodatR (Feigelson 2014) under the name Stellar abundances. These data were analyzed by Mattos et al. 2018 using the scale mixture of skew-normal censored regression (SMSNCR) models. We analyze the data set by fitting a PTCR model and, again, compare our proposal with the SMNCR models of Garay et al. 2017.
The dataset consists of measurements for 68 solar-type stars. For our analysis we follow Mattos et al. 2018 and consider:
- log N(Be) as the response variable, which represents the log of the abundance of beryllium scaled to the Sun’s abundance (i.e. the Sun has log N(Be) = 0.0);
- Teff as the explanatory variable, which represents the effective stellar surface temperature (in kelvin).
In astronomical research, a previously identified sample of objects (stars, galaxies, quasars, X-ray sources, etc.) is observed at some new wavebands. According to Feigelson 2014, due to limited sensitivities, some objects may be undetected, leading to upper limits in their derived luminosities. For this dataset we have 12 left-censored data points, i.e. 12 undetected beryllium measurements, which represent 19.35% of the observations.
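The censoring rate quoted above is a simple proportion, and a short check shows which denominator it corresponds to. With 12 censored points, 68 stars would give about 17.65%, whereas 19.35% is what 12 out of 62 gives. One possible reading, stated here only as an assumption, is that the fitted sample contains 62 stars with a usable beryllium record. The function name is hypothetical.

```python
def censoring_rate(n_censored, n_total):
    """Percentage of censored observations in a sample."""
    return 100.0 * n_censored / n_total


rate_68 = censoring_rate(12, 68)  # about 17.65
rate_62 = censoring_rate(12, 62)  # about 19.35
```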
Table VII presents the ML estimates for the parameters of the four models (PTCR, TCR, SLCR and NCR), together with their corresponding standard errors. Table VIII compares the fit of the four models using the model selection criteria (AIC, AICc and BIC). Note that the heavy-tailed PTCR model again provides a better fit than the TCR, SLCR and NCR models. The QQ-plots and envelopes for the martingale residuals are shown in Figure 4. This figure indicates that the PTCR, TCR and SLCR models are more suitable for modeling these data than the NCR model, since no observations fall outside their envelopes.
Table VII. Parameters and standard errors (SE) of the PTCR, TCR, SLCR and NCR models fitted to the stellar abundances data.
Figure 4. Stellar abundances data. Envelopes of transformed martingale residuals for PTCR, TCR, SLCR and NCR models.
CONCLUSIONS
In this paper, an asymmetric alternative to the Student-$t$ censored regression model of Arellano-Valle et al. 2012 and the SMNCR models of Garay et al. 2017 has been developed. It is based on the new family of asymmetric and heavy-tailed power-$t$ distributions (Zhao & Kim 2016). Moreover, the ordinary Tobit model and the Student-$t$ censored regression model are special cases. The observed and expected information matrices are obtained analytically, allowing direct implementation of inference for this type of model. Parameter estimation is carried out by maximum likelihood, which is also used to develop large-sample properties of the estimators. The likelihood ratio statistic can be used to test the TCR null hypothesis against the PTCR alternative, since the TCR model is a special case of the proposed model. Applications to the wage rate data and the stellar abundances data indicate that the PTCR model can be a useful alternative to the TCR and SMNCR models.
The proposed PT distribution can be used in statistical models based on the scale mixtures of normal family to improve their fit, such as those of Maleki & Nematollahi 2017. Also, applying this methodology for constructing asymmetric distributions to the symmetric version of the Skew-Reflected-Gompertz distribution, recently introduced by Hosseinzadeh et al. 2019, can be considered as future work.
ACKNOWLEDGMENTS
REFERENCES
- AKAIKE H. 1974. A new look at statistical model identification. IEEE Trans Automat Contr AU-19(4): 716–722.
- ARELLANO-VALLE RB, CASTRO LM, GONZÁLEZ-FARÍAS G & MUÑOZ-GAJARDO KA. 2012. Student-t Censored Regression Model: Properties and Inference. Stat Methods Appl 21(4): 453–473. doi:10.1007/s10260-012-0199-y.
- AZZALINI A & CAPITANIO A. 2003. Distributions generated by perturbation of symmetry with emphasis on a multivariate skew t-distribution. J R Stat Soc Series B Stat Methodol 65(2): 367–389.
- BARROS M, GALEA M, GONZÁLEZ M & LEIVA V. 2010. Influence Diagnostics in the Tobit Censored Response Model. Stat Methods Appl 19(3): 379–397. doi:10.1007/s10260-010-0135-y.
- FAIR RC. 1978. A theory of extramarital affairs. J Polit Econ 86(1): 45–61.
- FEIGELSON ED. 2014. astrodatR: Astronomical Data. R package v. 0.1. URL https://cran.r-project.org/web/packages/astrodatR/
- GARAY AM, LACHOS VH, BOLFARINE H & CABRAL CRB. 2017. Linear Censored Regression Models with Scale Mixtures of Normal Distributions. Stat Pap 58(1): 247–278. doi:10.1007/s00362-015-0696-9.
- GUPTA RD & GUPTA RC. 2008. Analyzing skewed data by power–normal model. Test 17: 197–210.
- HOSSEINZADEH A, MALEKI M, KHODADADI Z & CONTRERAS-REYES JE. 2019. The Skew-Reflected-Gompertz distribution for analyzing the symmetric and asymmetric data. J Comput Appl Math 349: 132–141.
- MALEKI M & NEMATOLLAHI AR. 2017. Autoregressive Models with Mixture of Scale Mixtures of Gaussian innovations. Iran J Sci Technol Trans A Sci 41(4): 1099–1107.
- MARTÍNEZ-FLÓREZ G, BOLFARINE H & GÓMEZ HW. 2013. The alpha–power tobit model. Commun Stat Theory Methods 42(4): 633–643.
- MATTOS T, GARAY AM & LACHOS VH. 2018. Likelihood-based inference for censored linear regression models with scale mixtures of skew-normal distributions. J Appl Stat 45(11): 2039–2066.
- MOULTON LH & HALSEY NA. 1995. A Mixture Model With Detection Limits for Regression Analyses of Antibody Response to Vaccine. Biometrics 51: 1570–1578.
- MROZ TA. 1987. The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions. Econometrica 55(4): 765–799. doi:10.2307/1911029.
- OLSEN RJ. 1978. Note on the Uniqueness of the Maximum Likelihood Estimator for the Tobit Model. Econometrica 46(5): 1211–1215.
- ORTEGA EM, BOLFARINE H & PAULA GA. 2003. Influence Diagnostics in Generalized Log-Gamma Regression Models. Comput Stat Data Anal 42: 165–186.
- PEWSEY A, GÓMEZ HW & BOLFARINE H. 2012. Likelihood–based inference for power distributions. Test 21(4): 775–789.
- R DEVELOPMENT CORE TEAM. 2018. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria. ISBN 3-900051-07-0. URL http://www.R-project.org
- SANTOS N, LÓPEZ RG, ISRAELIAN G, MAYOR M, REBOLO R, GARCÍA-GIL A, TAORO MP DE & RANDICH S. 2002. Beryllium Abundances in Stars Hosting Giant Planets. Astron Astrophys 386: 1028–1038.
- SCHWARZ G. 1978. Estimating the dimension of a model. Ann Stat 6(2): 461–464.
- SUGIURA N. 1978. Further Analysis of the Data by Akaike’s Information Criterion and the Finite Corrections. Commun Stat Theory Methods 7(1): 13–26. doi:10.1080/03610927808827599.
- TOBIN J. 1958. Estimation of relationship for limited dependent variables. Econometrica 26(1): 24–36.
- ZHAO J & KIM HM. 2016. Power t distribution. Commun Stat Appl Methods 23(4): 321–334.
APPENDIX A LEMMAS
Lemma 1. Let , then , with
where is the inverse of .
Lemma 2. Let , and define . Then
- (i) , where
- (ii) , where
- (iii) , where
- with .
The proofs of Lemmas 1 and 2 are straightforward and follow directly from the definition of expected value.
APPENDIX B INFORMATION MATRICES FOR PTCR MODEL
Observed Information Matrix
The elements of the observed information matrix for the PTCR model are given by
Expected Information Matrix
The expected information matrix is obtained by taking , which involves the calculation of truncated moments that have no closed form and must be obtained numerically. The elements of the expected information matrix are given by
where , and are given by (8) and must be obtained numerically. The components , and , are given in Lemma 2.
Publication Dates
- Publication in this collection: 11 Oct 2021
- Date of issue: 2021
History
- Received: 11 Aug 2019
- Accepted: 22 Oct 2019