Abstract
This paper estimates bivariate regressions for wages and hours worked as an alternative to the univariate Mincerian earnings equation. The bivariate vector of dependent variables included both common and specific covariates. Using individual level data from the Brazilian National Household Sample Survey (PNAD), the Student t distribution produced the best fit to the data according to information criteria and Mahalanobis distance. The bivariate estimation accounts for correlation between the dependent variables, identifies antagonistic effects from common covariates and allows assuming different bivariate distributions. Education, type of employment contract and geographical region affect wages and hours worked in opposite directions.
Keywords: Bivariate distributions; Mincerian equation; Wages; Hours worked; Bivariate regression
1. Introduction
The Mincerian earnings equation introduced by Mincer (1974) is the baseline for a broad empirical literature on labor economics, including contributions by Senna (1976), Card (1999), Resende and Wyllie (2006) and Aali-Bujari et al. (2019). These studies generally seek to estimate the returns to education and experience on the wage rate earned by the worker.1 Mincer proposed that the distribution of wages among different occupations is positively correlated to the amount of investment in human capital, which positively affects productivity and economic growth.2
The Mincerian earnings equation was originally represented by a linear regression in which the wage rate was explained by education and experience. Following this approach by Mincer (1974), other explanatory variables were included in the regression, such as individual characteristics of gender and race that are used to assess the presence of discrimination in the labor market.
When deciding to join the labor market, a worker chooses the number of hours to supply. Sedlacek and Santos (1991) used data from the Brazilian National Household Sample Survey (PNAD) to analyze the relationship between the husband’s income and the wife’s labor supply. They found that the higher the husband’s income, the higher the wife’s reservation wage and the less likely she is to work. Moreover, the more children the family has and the younger they are, the less likely the wife is to join the labor market or, when she does, the fewer hours she supplies.
As far as estimation methods are concerned, since Mincer (1974) the literature has used the traditional ordinary least squares (OLS) method and its variants with instrumental variables, quantile regression, sample selection, and procedures based on maximum likelihood estimation [Chatterjee and Price (1991), Heckman (1979), Buchinsky (2001)]. In Brazil, the greater availability of microdata and improvements in computational capacity contributed to the expansion of the empirical evidence, as highlighted by Maciel et al. (2001), Giuberti and Menezes-Filho (2005) and Madalozzo (2010).
A common feature in the literature is the use of earnings per hour as the dependent variable in the Mincerian equation. This variable is generally obtained by dividing the wage earned by the hours worked in the period. Such an approach, however, merges into a single variable two distinct components, earnings and hours worked, which should be modeled separately. The determinants of earnings and hours worked are not necessarily the same, and those that enter both regressions might differ in either quantitative (magnitudes) or qualitative (signs) terms.
This feature is not captured by traditional estimates of the Mincerian equation that use the wage rate as the sole dependent variable. The stock of human capital, measured by formal education and experience, for instance, tends to increase the workers’ remuneration, but it might also reduce the willingness to supply working hours in the labor market. Those who are more qualified might receive higher remuneration while working fewer hours than those who are less qualified. These antagonistic effects of education on wages and hours worked are not captured by the univariate estimates of the Mincerian earnings equation.
Therefore, there is a gap in the literature that this study seeks to fill. The common practice of using earnings per hour as the dependent variable might hide effects of covariates that would be distinct if assessed separately by regressions on wages and hours worked. In contrast to the classical approach, this paper estimates a bivariate regression for the Mincerian equation, considering earnings and hours worked as a bivariate vector of dependent variables. The regressions include both common and specific covariates for the bivariate vector of earnings and hours worked. The bivariate Normal, Student t, and Birnbaum-Saunders (BS)3 distributions are used in the estimation. For the sake of comparison, the univariate Mincerian earnings equation is also estimated, considering a single dependent variable represented by earnings per hour worked. The estimates refer to the Brazilian economy and use data extracted from the Brazilian National Household Sample Survey (PNAD) for the period from 2013 to 2015.
Advantages of the bivariate regression approach include the possibility of modeling a correlation structure among the dependent variables. If there is correlation, estimating separate univariate regressions for earnings and hours worked might provide biased results [Marchant et al. (2016)]. The bivariate framework also makes it possible to identify antagonistic effects of common covariates on the two dependent variables. Finally, there is flexibility to assume different bivariate distributions for the earnings and hours-worked model. As in Heckman (1976), the parameters are estimated by maximum likelihood, which is efficient according to Mittelhammer et al. (2000). Thus, the bivariate model emerges as an important alternative to the univariate equation that is traditionally estimated for the Mincerian earnings equation.
The results indicate that some common explanatory variables have estimated coefficients with different signs and magnitudes in the bivariate regression of earnings and hours worked. Specifically, the estimated coefficients for education, type of employment contract, and geographical region have distinct signs and different magnitudes in the wage and hours-worked regressions. Considering education, for instance, more years in school imply a higher average wage and a lower supply of hours to the labor market. In the univariate regression, however, only the positive effect of an additional year of study on the wage rate is observed. Furthermore, the bivariate model captures the correlation between the two dependent variables, which increases robustness relative to the estimation of separate univariate regressions. Thus, there are important advantages associated with the bivariate approach when compared to the univariate regression, suggesting that the former is more suitable for the estimation of the Mincerian earnings equation.
The paper is organized as follows. Section 2 describes the empirical model, presents the database, and reports and discusses the results. Section 3 presents the concluding remarks.
2. Econometric approach
2.1 Empirical model
The Mincerian earnings equation is typically described by the following univariate regression:
$$\log(w) = X\gamma + \varepsilon, \qquad (1)$$
where log(w) is a vector with the logarithm of the wage per hour (dependent variable), γ is a vector of coefficients, X is a matrix of explanatory variables, such as education, experience, race, gender and others, and ε is a random error vector, usually assumed to follow a normal distribution.4
The contribution of the present paper is to model the earnings equation (1) as a bivariate regression of wages and hours worked separately in order to capture different effects of the common explanatory variables on wages and labor supply. Furthermore, as earnings and hours worked are correlated, the bivariate regression is more appropriate than the univariate estimation of separate regressions.
In the bivariate setting, the model is estimated for the vector of dependent variables (Y1i, Y2i)ᵀ, where Y1i is the wage in the main job and Y2i represents the hours dedicated to the main job by each individual i. This vector is modeled as a function of a set x of explanatory variables using one of the bivariate distributions described in the Appendix, such that:
i) Bivariate Normal distribution: (log Y1i, log Y2i)ᵀ follows a bivariate Normal distribution;
ii) Bivariate t distribution: (log Y1i, log Y2i)ᵀ follows a bivariate t distribution;
iii) Bivariate BS distribution: (Y1i, Y2i)ᵀ follows a bivariate BS distribution.
Notice that in the Normal and t cases we assume that the dependent variables have bivariate log-normal and log-t distributions, which implies that the logarithms of the variables follow the bivariate Normal and t distributions, respectively [Vanegas and Paula (2016)]. For the bivariate BS distribution, it is not necessary to apply the logarithm because this distribution is parameterized in terms of its means [Saulo et al. (2020; 2021)]. Based on the literature, we defined the set of covariates used in the estimations and separated the covariates that affect both earnings and hours worked from those that affect only one of them.
The common covariates, which affect both earnings and hours worked, are:
- Gender: dummy variable that assumes value 1 for men and 0 for women;
- Race: dummy variable that assumes value 1 for Caucasians and 0 for non-Caucasians;
- Marital status: dummy variable that assumes value 1 for married and 0 for unmarried individuals;
- Age and age2: age, measured in years, and age squared are used as proxies for the labor market experience of the individual, following the literature;
- Years of schooling: a proxy for education, ranging from 0 to 16 years of study in the sample;
- Category (high, high mean, mean, low mean, low): binary variables used to designate the occupational category, segmented according to socioeconomic criteria and having the low category as the reference5;
- Type of employment contract (with employment record card, without employment record card, autonomous, civil servant): dummy variables that seek to capture the type of occupation of the individual in the labor market, having “with employment record card” as the base category;
- Metropolitan region (Belém-PA, Fortaleza-CE, Recife-PE, Salvador-BA, Belo Horizonte-MG, Rio de Janeiro-RJ, Curitiba-PR, Porto Alegre-RS, Brasília-DF and São Paulo-SP): dummy variables that designate the metropolitan region of residence of the individuals, taking São Paulo as the reference category;
- Year (2013, 2014, and 2015): time dummies for the years of the sample, having 2013 as the reference year;
- Sector of activity (agriculture, industry, construction, commerce, food and others, education, health, and social services): dummy variables used to capture cluster effects by sector of activity of the individuals, having individuals working in the public sector as the reference.
The covariates that affect only earnings are:
- Labor union: dummy variable that assumes value 1 for individuals affiliated with any labor union and 0 for those who were not affiliated;
- Social Security: dummy variable that assumes value 1 for individuals who contributed to social security in the reference period and 0 for those who did not;
- Time in job: number of years employed in the current main job, ranging from 0 to 56 years in the sample.
The covariates that affect only hours worked are:
- Head: dummy variable that assumes value 1 if the reference individual in the household is head of the family and 0 otherwise (non-head);
- Minor: dummy variable used to capture whether there are children under 10 years old in the household;
- Inactivity: dummy variable that assumes value 1 if there are unemployed individuals in the household and 0 if there are none.
The database was collected from the Brazilian National Household Sample Survey (PNAD) for the period from 2013 to 2015. This annual survey is produced and published by the Brazilian Institute of Geography and Statistics (IBGE). It provides a wide set of demographic and socioeconomic information about the Brazilian population at the individual and household levels. We considered a sample of individuals aged between 18 and 65 years with complete information on earnings and hours worked, totaling 167,271 observations. The sample refers to the 10 major metropolitan regions of the country, namely Belém-PA, Fortaleza-CE, Recife-PE, Salvador-BA, Belo Horizonte-MG, Rio de Janeiro-RJ, Curitiba-PR, Porto Alegre-RS, Brasília-DF, and São Paulo-SP. The nominal values of earnings were deflated by the National Consumer Price Index (INPC). Individuals are not tracked across years, so the data set is a pooled cross-section. All results were obtained in the R statistical software [https://www.r-project.org/].
Table 1 provides descriptive statistics for earnings and hours worked in level and logarithmic scales, including sample size, average (avg), median, standard deviation (SD), coefficient of variation (CV), asymmetry (CA), and kurtosis (CK). These statistics indicate that earnings in level have high asymmetry and substantial kurtosis, suggesting that an asymmetric distribution with heavy tails fits the data better. On the other hand, hours worked in level show low asymmetry and moderate kurtosis. The application of the logarithm tends to produce symmetry, especially in the case of earnings. Figure 1 shows histograms of earnings and hours worked in level and logarithmic scales.
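The statistics listed for Table 1 can be illustrated with a minimal Python sketch. The moment estimators below use the divisor n, which is an assumption since the exact estimators behind Table 1 are not stated, and the earnings values are hypothetical rather than PNAD data.

```python
import math

def describe(values):
    """Return sample size, mean, median, SD, CV, skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    ordered = sorted(values)
    mid = n // 2
    median = ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    # Central moments with divisor n (assumed convention).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2)
    return {
        "n": n, "avg": mean, "median": median, "SD": sd,
        "CV": sd / mean,        # coefficient of variation
        "CA": m3 / m2 ** 1.5,   # asymmetry (skewness)
        "CK": m4 / m2 ** 2,     # kurtosis (equals 3 for a Normal)
    }

# Hypothetical right-skewed earnings, in level and in logarithms.
earnings = [800.0, 950.0, 1100.0, 1300.0, 1800.0, 2500.0, 9000.0]
level = describe(earnings)
logs = describe([math.log(v) for v in earnings])
```

In this toy sample the asymmetry of the logarithms is smaller than that of the levels, which mirrors the qualitative pattern described for earnings.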
2.2 Investigation of the best fit
Initially, we estimate the Normal, t, and BS univariate regressions for earnings and hours worked, as well as their bivariate counterparts, to investigate the best fit to the data in each case. Table 2 reports the values of the Akaike (AIC) and Bayesian (BIC) information criteria, calculated as:
$$\mathrm{AIC} = -2\ell + 2k, \qquad \mathrm{BIC} = -2\ell + k\log(n),$$
where ℓ is the value of the log-likelihood function, k denotes the number of parameters, and n indicates the number of observations. According to Table 2, the univariate and bivariate models based on the t distribution yielded the best fits, as they resulted in the lowest values for both AIC and BIC. Thus, among the 3 distributions tested, the information criteria select the univariate and bivariate t models. Notice that the t distribution has heavier tails than the normal distribution, implying robustness against outlying observations [see Lucas (1997)].
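As a minimal sketch of the model comparison, the snippet below computes the two criteria in their standard forms, AIC = −2ℓ + 2k and BIC = −2ℓ + k log(n), and ranks candidate models. The sample size is the one reported in the text; the log-likelihood values and parameter counts are hypothetical placeholders, not the values in Table 2.

```python
import math

def aic(loglik, k):
    """Akaike information criterion: -2*loglik + 2*k."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Bayesian information criterion: -2*loglik + k*log(n)."""
    return -2.0 * loglik + k * math.log(n)

n = 167271  # number of observations reported in the text

# Hypothetical (log-likelihood, number of parameters) pairs.
candidates = {
    "Normal": (-250000.0, 60),
    "t": (-240000.0, 61),
    "BS": (-245000.0, 60),
}

# Lower values indicate a better trade-off between fit and complexity.
ranking = sorted(candidates, key=lambda name: bic(*candidates[name], n))
```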
Once the best univariate and bivariate models were chosen, we applied the Mahalanobis distance to evaluate the quality of the fit to the data, as proposed by Marchant et al. (2016). In the case of the bivariate t distribution, this distance is given by:
$$d = \tfrac{1}{2}\,(U - \mu)^{\top}\psi^{-1}(U - \mu) \sim F_{2,\nu}, \qquad (5)$$
where U ~ t Biv(μ1,μ2,σ1,σ2,ν,ρ) according to equation (15) in the Appendix, μ = (μ1, μ2)ᵀ, and ψ is the 2 × 2 scale matrix with diagonal entries σ1² and σ2² and off-diagonal entry ρσ1σ2. According to (5), the Mahalanobis distance for the bivariate t distribution follows an F2,ν distribution, that is, an F distribution with 2 and ν degrees of freedom. In the univariate case, we have an F1,ν. In order to obtain the estimated values of the Mahalanobis distance, the parameters are replaced by their maximum likelihood estimates, which asymptotically results in the same distribution as (5) [Vilca et al. (2014)]. The Wilson-Hilferty approximation might then be applied to the Mahalanobis distance in (5) to obtain an approximately standard Normal variable. Thus, the quality of the fit of the univariate and bivariate t regression models might be evaluated by the distances transformed with the Wilson-Hilferty approximation [Ibacache-Pulgar et al. (2014)]. In this case, the distances in (5) are adapted to accommodate the regression structure and the univariate or bivariate condition.
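The two steps described above can be sketched in Python. Two assumptions are made here: the quadratic form is divided by 2 so that it is comparable with the F distribution with 2 and ν degrees of freedom, and the Wilson-Hilferty step uses the usual cube-root Normal approximation for an F variable, which may differ in detail from the version used in the cited references. All function names are illustrative.

```python
import math

def mahalanobis_sq(u1, u2, mu1, mu2, s1, s2, rho):
    """Quadratic form (u - mu)' psi^{-1} (u - mu) for the scale matrix
    psi = [[s1^2, rho*s1*s2], [rho*s1*s2, s2^2]]."""
    z1 = (u1 - mu1) / s1
    z2 = (u2 - mu2) / s2
    return (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)

def wilson_hilferty_f(f, d1, d2):
    """Cube-root Normal approximation for f drawn from F(d1, d2)."""
    a = 2.0 / (9.0 * d1)
    b = 2.0 / (9.0 * d2)
    cube = f ** (1.0 / 3.0)
    return ((1.0 - b) * cube - (1.0 - a)) / math.sqrt(b * cube * cube + a)

def transformed_distance(u1, u2, mu1, mu2, s1, s2, nu, rho):
    """Distance scaled to F(2, nu) (assumed scaling), then mapped to an
    approximately standard Normal value."""
    d = mahalanobis_sq(u1, u2, mu1, mu2, s1, s2, rho) / 2.0
    return wilson_hilferty_f(d, 2.0, nu)
```

In a regression setting, `mu1` and `mu2` would be the fitted values for each observation and the remaining arguments would be the maximum likelihood estimates.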
Figure 2 displays the probability-probability (PP) plots of the transformed Mahalanobis distance for the univariate t regressions of earnings and hours worked. The PP plot is commonly used to assess how close 2 sets of data are by plotting the 2 corresponding cumulative distribution functions against each other. The closer the points are to the 45° line from (0,0) to (1,1), the better the fit. Figure 2 shows the cumulative distribution function of the standard Normal versus the empirical cumulative distribution function of the transformed Mahalanobis distance. The results reveal an excellent fit of the univariate models.
PP plots of the transformed distances for the univariate t regression models of earnings (left) and hours worked (right). Legend: EC = empirical probability, TC = theoretical probability.
Figure 3 shows the PP plot of the transformed Mahalanobis distance for the bivariate t regressions of earnings and hours worked. The results also suggest an excellent fit to the data for the bivariate case. Thus, for both univariate and bivariate cases, the t regression models provided excellent fits to the data and might therefore be used.
PP plots of the transformed distance for the bivariate t regression model of earnings and hours worked. Legend: EC = empirical probability, TC = theoretical probability.
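The coordinates of a PP plot such as those in Figures 2 and 3 can be obtained as in the following sketch. The plotting position (i − 0.5)/n is one common convention and is an assumption here, and the input values are hypothetical transformed distances.

```python
from statistics import NormalDist

def pp_points(z_values):
    """Return (empirical, theoretical) probability pairs for a PP plot
    against the standard Normal distribution."""
    std_normal = NormalDist()
    ordered = sorted(z_values)
    n = len(ordered)
    # enumerate starts at 0, so (i + 0.5) / n equals (rank - 0.5) / n.
    return [((i + 0.5) / n, std_normal.cdf(z)) for i, z in enumerate(ordered)]

# Hypothetical transformed distances.
z = [-1.2, -0.4, 0.0, 0.3, 1.1]
points = pp_points(z)

# Largest vertical gap between the points and the 45-degree line.
max_gap = max(abs(empirical - theoretical) for empirical, theoretical in points)
```

Points lying close to the 45° line, that is, a small `max_gap`, indicate that the transformed distances behave like a standard Normal sample.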
2.3 Estimations and analysis
Table 3 reports the results of the maximum likelihood estimation for the bivariate t distribution regression model of earnings and hours worked, with the respective standard errors, Wald statistics, and p-values. The model based on the t distribution presented the best fit to the data according to the AIC and BIC information criteria and the PP plot of the Mahalanobis distance reported in the previous section. The Wald statistic is used to test the hypotheses H0: θ = θ0 versus H1: θ ≠ θ0. It is defined by:
$$W = \frac{\hat{\theta} - \theta_0}{\mathrm{SE}(\hat{\theta})},$$
which is approximately distributed as a standard Normal under H0, where θ^ and θ0 are the estimate and its proposed value under H0, respectively, and SE(θ^) is the standard error of the estimate. In this case, our interest lies in testing H0: θ = 0 versus H1: θ ≠ 0 at a given significance level α.
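A minimal sketch of the two-sided Wald test follows, assuming the statistic is the estimate minus its null value divided by the standard error and compared with the standard Normal. The estimate and standard error in the example are hypothetical, and the default level of 5% matches the level used in the discussion of the results.

```python
from statistics import NormalDist

def wald_test(estimate, std_error, null_value=0.0, alpha=0.05):
    """Return the Wald statistic, the two-sided p-value and the decision."""
    w = (estimate - null_value) / std_error
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(w)))
    return w, p_value, p_value < alpha

# Hypothetical coefficient estimate and standard error.
w, p_value, reject = wald_test(0.05, 0.02)
```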
Regarding the interpretation of the estimated coefficients, the following cases deserve special attention:
- When the independent variable x is quantitative (for instance, number of years in school) and the estimated coefficient β^ is: (i) outside a small range around zero, there is an increase (or decrease if the estimate is negative) of 100 × (exp(β^) − 1)% in the expected value (mean) of the dependent variable due to an increase of 1 unit in x; (ii) within that range, there is an increase (or decrease if the estimate is negative) of approximately 100 × β^% in the expected value (mean) of the dependent variable when x increases by one unit.
- When the independent variable x is a dummy (for instance, gender) and the estimated coefficient β^ is: (i) outside a small range around zero, there is an increase (or decrease if the estimate is negative) of 100 × (exp(β^) − 1)% in the expected value (mean) of the dependent variable when x changes from 0 (women) to 1 (men); (ii) within that range, there is an increase (or decrease if the estimate is negative) of approximately 100 × β^% in the expected value (mean) of the dependent variable when x changes from 0 (women) to 1 (men).
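The interpretation rules above can be illustrated with a short sketch, assuming the dependent variable is modeled in logarithms so that a coefficient b corresponds to an exact change of 100 × (exp(b) − 1)% and, for small b, to an approximate change of 100 × b%. The coefficient values are hypothetical and the function name is illustrative.

```python
import math

def percent_effect(beta, exact=True):
    """Percentage change in the mean of the level variable implied by a
    coefficient from a regression on the logarithm of that variable."""
    if exact:
        return 100.0 * (math.exp(beta) - 1.0)
    return 100.0 * beta  # first-order approximation, adequate near zero

small = 0.05   # hypothetical coefficient close to zero
large = 0.30   # hypothetical coefficient far from zero

gap_small = percent_effect(small) - percent_effect(small, exact=False)
gap_large = percent_effect(large) - percent_effect(large, exact=False)
```

The gap between the exact and approximate effects grows with the size of the coefficient, which is why the two cases are treated separately.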
Table 3 indicates that the estimated correlation of 0.1877 between earnings and hours worked is statistically significant at the 5% significance level. This means that the bivariate model is more appropriate than the univariate estimation of independent regressions, which might lead to biased results due to the untreated correlation between the two dependent variables.
Considering the estimated coefficients, the variable “Gender” indicates that men have an average income that is 34.58% higher than that of women and supply 8.62% more hours worked, on average, than women. The variable “Race” reveals that the wages of Caucasian individuals are, on average, 10.31% higher than the wages of non-Caucasians. However, when it comes to hours worked, Caucasians supply only 0.02% more hours than non-Caucasians. These results are consistent with the presence of discrimination in the Brazilian labor market. Cavalieri and Fernandes (1998), for instance, found wage discrimination using data from the PNAD of 1989. They also found higher wages for men than for women and for Caucasian individuals than for non-Caucasians, even after controlling for age, years in school, and geographical region of residence.
The “Age” variable indicates that one additional year of experience in the labor market increases wages by 5.27%, while hours worked rise by only 1.28%. Considering the variable “Education”, an increase of one year of study raises the average wage by 4.68%. However, this same increase in schooling leads to an average decrease of 0.09% in hours worked. Thus, the higher the individual’s schooling, the higher the average wage and the lower the supply of hours in the labor market. This finding illustrates a fundamental advantage of the bivariate regression, since the effects of “Education” go in opposite directions in the two equations and this cannot be captured by the traditional univariate estimation that considers the wage per hour as the only dependent variable. Lau et al. (1993) also found a positive effect of “Education” on earnings (per capita) due to the higher level of schooling. Gonzaga et al. (2002) argued that, in Brazil, the level of schooling is inversely related to hours worked.
Taking into account the metropolitan regions, Brasília-DF presents an average wage 8.95% higher than São Paulo-SP, while its workers supply 2.70% fewer hours of work. Again, it is possible to identify distinct effects of an explanatory variable in the bivariate regression that cannot be captured by the traditional univariate model. In order to explain this finding, the unobservable characteristics of the workers, such as skill and motivation, as well as specific differences among the sectors of activity and the geographical regions of the country, should be taken into account. In the specific case of Brasília, the differential is due to the location of the federal public administration, which pays higher average wages than the private sector.
Regarding the types of labor contracts, the estimates indicate that those with “no employment record card” have an average wage that is 17.73% lower than the wages of individuals “with employment record card”. In addition, they supply about 15.08% fewer hours than their peers “with employment record card”. The “civil servant” category is associated, on average, with a 26.90% higher wage and 4.10% fewer hours worked relative to the workers “with employment record card”. Meanwhile, those who are “autonomous” have an average wage 12.82% lower and supply 16.04% fewer hours to the labor market than the workers “with employment record card”. It is worth mentioning that the “civil servant” category also presents antagonistic effects on wages and hours worked that can be captured only by the bivariate estimation.
For variables that affect only wages or hours worked separately, Table 3 illustrates that individuals who contribute to social security have an average wage 42.70% higher than those who do not contribute. The “head” variable, which affects only hours worked, confirms that the head of the household supplies 1.80% more hours to the labor market on average than those who are not in this condition.6
For comparison purposes, Table 4 presents the results of the traditional univariate regression in which the dependent variable is the wage rate (or wage per hour). Some results are similar in magnitude to the estimates of the bivariate regression model. However, the univariate model cannot disentangle the effects of a given explanatory variable on wages and hours worked, as in the cases of “Education”, “Brasília-DF”, and “Civil Servant” discussed above. These variables displayed different signs in the estimated coefficients for the wage and hours-worked regressions. For “Education”, for instance, the higher the individual’s level of schooling, the higher the average wage and the lower the supply of hours of work. However, in the univariate regression reported in Table 4, only the positive effect of an additional year of study on the average wage rate can be estimated. In addition, the bivariate model captured a positive and statistically significant correlation between wages and hours worked, allowing for a more robust estimation than the simple fit of two independent regressions. Therefore, there are important advantages coming from the bivariate model, including the evidence that the determinants of wages and hours worked might not be the same in both quantitative and qualitative terms. In this context, the bivariate estimation emerges as an important alternative for the estimation of the Mincerian earnings equation.
3. Conclusion
This paper proposed an alternative approach to estimate the Mincerian earnings equation based on bivariate regression modeling. The combination of wages and hours worked in a single dependent variable, as is traditionally done in the empirical literature, prevents capturing the distinct effects of common covariates on each of those variables. On the other hand, the univariate estimation of independent regressions for earnings and hours worked is not advisable due to the correlation between these variables, which might bias the estimates. We proposed the estimation of a regression for wages and hours worked as a bivariate vector of dependent variables, including common and specific covariates among the explanatory variables and using the Normal, Student t and BS bivariate distributions. The estimates used data at the individual level extracted from the Brazilian National Household Sample Survey (PNAD) for the years from 2013 to 2015.
In the bivariate case, the Normal, t and BS distributions were used to jointly model wages and hours worked. The AIC and BIC information criteria and the Mahalanobis distance indicated that the Student t distribution yielded the best fit to the data. In addition, a positive and statistically significant correlation between wages and hours worked justified the use of the bivariate regression instead of two separate regressions for those variables, which would yield biased estimates.
The bivariate estimation indicated that a given common covariate might have distinct effects on wages and hours worked. The results for “Education”, for instance, indicated that an additional year of study leads to an average increase of 4.68% in wages and an average decrease of 0.9% in hours worked. This suggests that individuals with more years of schooling, on average, have higher wages and work fewer hours than those with fewer years of schooling. Other covariates common to the bivariate vector, such as type of employment contract and geographic region of residence, also had antagonistic effects on earnings and hours worked. This evidence illustrates a fundamental advantage of the bivariate regression, which makes it possible to disentangle the distinct effects of a given common covariate on wages and hours worked. This cannot be done by the traditional estimation of the univariate regression that considers the wage per hour as the dependent variable.
Thus, the bivariate regression might be considered as an alternative approach for the estimation of the Mincerian earnings equation. As further work, one might implement the Heckman two-step correction for selection bias (Heckman, 1979), since the PNAD sample used here refers to individuals who were actually working in the sample period. The individual’s earnings are associated with the decision to supply work, which ultimately depends on the opportunity cost. It is advantageous to work if the wage (or potential earnings) is greater than the opportunity cost (reservation wage). In addition, other bivariate probability distributions might be fitted to model wages and hours worked, such as the Pareto and its extensions, which are commonly used in income modeling. Finally, a bivariate logistic regression model might be used to estimate the influence of individual characteristics on the probability that a given worker belongs to a particular income group and type of work. Some of these extensions are the subject of our ongoing research.
It is also worth mentioning that, in further research, the study might benefit by moving towards a structural approach, with a careful modeling strategy of the labor market and the resulting wage equation. Here, our focus was just on the application of an alternative bivariate approach to estimate the traditional Mincerian wage equation by using Brazilian micro data. In addition, due to well-known distortions in the Brazilian labor market, further extensions should consider an empirical analysis by sector of activity and type of occupation. We leave these issues for future research.
Appendix A Distributions and bivariate regression models
In the symmetric context, the bivariate Normal distribution has been used intensively in the literature [Johnson et al. (1995)]. A symmetric alternative to the bivariate Normal distribution is the Student t model, as in Johnson et al. (1995) and Balakrishnan and Lai (2009), which has heavier tails than the bivariate Normal distribution. This flexibility is important to accommodate data with more outliers, which makes the t an alternative of interest. In the univariate asymmetric context, a distribution that has received considerable attention is the BS model, introduced by Birnbaum and Saunders (1969) and motivated by material fatigue problems [Johnson et al. (1995)]. Recently, Saulo et al. (2020; 2021) proposed a bivariate BS distribution and its corresponding regression model, based on the univariate BS distribution reparameterized by the mean proposed by Santos-Neto et al. (2012). With this reparameterization, there is no need to transform the dependent variable to a logarithmic scale, which is an advantage since the transformation can lead to difficulties in interpretation. In general terms, the Normal, Student t, and BS bivariate distributions are good candidates for modeling earnings and hours worked: in the Normal and t cases the logarithm of the data is used, i.e., the log-normal and log-t are considered for the level variables [Vanegas and Paula (2016)], and in the BS case the data (right-skewed) are used in the original scale. The Normal, Student t, and BS bivariate distributions and their respective regression models are presented in sequence.
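Bivariate Normal distribution
Let Y = (Y1, Y2)ᵀ be a bivariate random vector following a bivariate Normal distribution with means μ1 and μ2 and standard deviations σ1 and σ2. In addition to these 4 parameters, there is a correlation coefficient ρ between Y1 and Y2, with −1 < ρ < 1. This is denoted by Y ~ N Biv(μ1,μ2,σ1,σ2,ρ). The joint probability density function (PDF) of Y1 and Y2 can be written as:
$$f(y_1,y_2) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left\{-\frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{2(1-\rho^2)}\right\}, \quad z_k = \frac{y_k-\mu_k}{\sigma_k},\ k = 1,2. \qquad (6)$$
The joint PDF of the random vector Z = (Z1, Z2)ᵀ following a bivariate standard Normal distribution (means zero and variances one) is given by:
$$\phi_2(z_1,z_2;\rho) = \frac{1}{2\pi\sqrt{1-\rho^2}}\exp\left\{-\frac{z_1^2 - 2\rho z_1 z_2 + z_2^2}{2(1-\rho^2)}\right\}. \qquad (7)$$
The corresponding joint cumulative distribution function (CDF) associated with (7) is given by:
$$\Phi_2(z_1,z_2;\rho) = \int_{-\infty}^{z_1}\int_{-\infty}^{z_2}\phi_2(u,v;\rho)\,dv\,du. \qquad (8)$$
When ρ = 0, i.e., when the Normal variables are uncorrelated, the joint CDF can be expressed as the product of 2 univariate Normal CDFs.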
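Normal bivariate regression model
Consider Yi = (Y1i, Y2i)ᵀ, for i = 1,...,n, such that Yi follows a Normal bivariate model, i.e., Yi ~ N Biv(μ1i,μ2i,σ1,σ2,ρ). Consider that there are r and s covariates, say x_i^(1) and x_i^(2), associated with Y1i and Y2i, respectively, such that
$$\mu_{ki} = x_i^{(k)\top}\beta_k, \quad k = 1,2, \qquad (11)$$
where βk is a vector of l unknown parameters and x_i^(k)ᵀ is the i-th row of the matrix X(k), whose dimension is n × l, for k = 1,2 and l = r,s. Thus, we have the following Normal bivariate model:
$$Y_{ki} = x_i^{(k)\top}\beta_k + \varepsilon_{ki}, \quad k = 1,2,\ i = 1,\dots,n,$$
where (ε1i, ε2i)ᵀ ~ N Biv(0,0,σ1,σ2,ρ), and they are independently distributed.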
To estimate the unknown parameters σ1, σ2, β1, β2 and ρ based on a random sample of size n, i.e., (y11, y21), ..., (y1n, y2n), the maximum likelihood method is used. The likelihood and log-likelihood functions of the observed sample can be written respectively as
$$L(\theta) = \prod_{i=1}^{n} f(y_{1i},y_{2i};\theta) \qquad (13)$$
and
$$\ell(\theta) = \sum_{i=1}^{n} \log f(y_{1i},y_{2i};\theta), \qquad (14)$$
where θ collects the parameters σ1, σ2, β1, β2 and ρ, and f is the joint PDF of the bivariate normal distribution. The model parameter estimates are obtained by maximizing the log-likelihood function (14). This requires an iterative nonlinear optimization procedure, such as the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The BFGS method is implemented in the R software (http://cran.r-project.org) through the optim and optimx functions.
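The log-likelihood to be maximized can be sketched as follows. The paper's computations were done in R; the Python function below is only an illustrative stand-in that assumes linear means for both responses, and the data and parameter values are hypothetical.

```python
import math

def bvn_loglik(y1, y2, x1, x2, beta1, beta2, s1, s2, rho):
    """Log-likelihood of a bivariate Normal regression.

    y1, y2: responses; x1, x2: lists of covariate rows;
    beta1, beta2: coefficient lists; s1, s2: standard deviations;
    rho: correlation between the two responses.
    """
    one_minus_r2 = 1.0 - rho * rho
    const = -math.log(2.0 * math.pi * s1 * s2) - 0.5 * math.log(one_minus_r2)
    total = 0.0
    for a, b, row1, row2 in zip(y1, y2, x1, x2):
        # Standardized residuals from the two linear predictors.
        z1 = (a - sum(c * v for c, v in zip(beta1, row1))) / s1
        z2 = (b - sum(c * v for c, v in zip(beta2, row2))) / s2
        q = z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2
        total += const - q / (2.0 * one_minus_r2)
    return total

# Hypothetical data: two observations, an intercept plus one covariate for
# the first response and an intercept only for the second.
ll = bvn_loglik(
    y1=[1.0, 2.0], y2=[0.5, 1.5],
    x1=[[1.0, 0.0], [1.0, 1.0]], x2=[[1.0], [1.0]],
    beta1=[1.0, 0.5], beta2=[1.0],
    s1=1.0, s2=2.0, rho=0.0,
)
```

In practice the negative of this function would be passed to a quasi-Newton optimizer, with the scale parameters and ρ transformed so that they stay inside their admissible ranges during the search.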
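Bivariate t distribution

Let $\mathbf{U} = (U_1, U_2)^\top$ be a bivariate random vector following a bivariate t distribution with location parameters $\mu_1$ and $\mu_2$, scale parameters $\sigma_1$ and $\sigma_2$, degrees of freedom $\nu$, and correlation coefficient $-1 < \rho < 1$ between $U_1$ and $U_2$, denoted by $\mathbf{U} \sim t_2(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho, \nu)$. The joint PDF of $U_1$ and $U_2$ is given by:

$$
f(u_1, u_2; \mu_1, \mu_2, \sigma_1, \sigma_2, \rho, \nu) = \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}} \left\{ 1 + \frac{1}{\nu(1-\rho^2)} \left[ \left(\frac{u_1-\mu_1}{\sigma_1}\right)^2 - 2\rho \left(\frac{u_1-\mu_1}{\sigma_1}\right)\left(\frac{u_2-\mu_2}{\sigma_2}\right) + \left(\frac{u_2-\mu_2}{\sigma_2}\right)^2 \right] \right\}^{-(\nu+2)/2}. \tag{15}
$$

The joint PDF of the random vector $\mathbf{W} = (W_1, W_2)^\top$ following a standard bivariate t distribution is given by:

$$
f(w_1, w_2; \rho, \nu) = \frac{1}{2\pi\sqrt{1-\rho^2}} \left\{ 1 + \frac{w_1^2 - 2\rho w_1 w_2 + w_2^2}{\nu(1-\rho^2)} \right\}^{-(\nu+2)/2}. \tag{16}
$$

The corresponding joint CDF associated with (16) is:

$$
F(w_1, w_2; \rho, \nu) = \int_{-\infty}^{w_1} \int_{-\infty}^{w_2} f(u, v; \rho, \nu)\, dv\, du. \tag{17}
$$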
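The heavier tails of the bivariate t distribution relative to the bivariate Normal can be checked numerically. The minimal sketch below codes both densities in their standard bivariate forms and compares them at a point far from the center; the function names and evaluation points are illustrative assumptions.

```python
import math


def quad_form(z1, z2, rho):
    """Quadratic form shared by the bivariate Normal and t densities."""
    return (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)


def bvt_pdf(u1, u2, nu, mu1=0.0, mu2=0.0, sigma1=1.0, sigma2=1.0, rho=0.0):
    """Joint PDF of a bivariate t distribution with nu degrees of freedom."""
    q = quad_form((u1 - mu1) / sigma1, (u2 - mu2) / sigma2, rho)
    const = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return (1.0 + q / nu) ** (-(nu + 2.0) / 2.0) / const


def bvn_pdf(y1, y2, mu1=0.0, mu2=0.0, sigma1=1.0, sigma2=1.0, rho=0.0):
    """Joint PDF of a bivariate Normal distribution."""
    q = quad_form((y1 - mu1) / sigma1, (y2 - mu2) / sigma2, rho)
    const = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * q) / const


# Far from the center, the t density with nu = 4 is much larger than the Normal one.
tail_t = bvt_pdf(4.0, 4.0, nu=4, rho=0.2)
tail_n = bvn_pdf(4.0, 4.0, rho=0.2)
print(tail_t, tail_n)

# As nu grows, the t density approaches the Normal density.
near_t = bvt_pdf(0.5, -0.5, nu=1e6, rho=0.2)
near_n = bvn_pdf(0.5, -0.5, rho=0.2)
print(near_t, near_n)
```

The first comparison is the reason the t model is less sensitive to outlying observations: extreme points receive far more density than under the Normal model, so they pull the fitted parameters less.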
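Bivariate t regression model

Consider $\mathbf{U}_1, \ldots, \mathbf{U}_n$ such that $\mathbf{U}_i = (U_{1i}, U_{2i})^\top$ follows a bivariate t distribution, i.e., $\mathbf{U}_i \sim t_2(\mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \rho, \nu)$, with PDF (15). Assume that there are $r$ and $s$ covariates, $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, associated with $U_{1i}$ and $U_{2i}$, respectively, such that

$$
\mu_{1i} = \mathbf{x}_i^{(1)\top} \boldsymbol{\beta}_1, \qquad \mu_{2i} = \mathbf{x}_i^{(2)\top} \boldsymbol{\beta}_2, \tag{18}
$$

with $\mathbf{x}_i^{(k)}$ as in (11), where $\boldsymbol{\beta}_k$ is a vector of $l$ unknown parameters, and $\mathbf{x}_i^{(k)\top}$ is the $i$-th row of the matrix $\mathbf{X}^{(k)}$, whose dimension is $n \times l$, for $k = 1, 2$ and $l = r, s$. Therefore, we have the following bivariate t regression model:

$$
U_{1i} = \mathbf{x}_i^{(1)\top} \boldsymbol{\beta}_1 + \varepsilon_{1i}, \qquad U_{2i} = \mathbf{x}_i^{(2)\top} \boldsymbol{\beta}_2 + \varepsilon_{2i}, \qquad i = 1, \ldots, n, \tag{19}
$$

where $(\varepsilon_{1i}, \varepsilon_{2i})^\top \sim t_2(0, 0, \sigma_1, \sigma_2, \rho, \nu)$ are independently distributed.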
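The parameters of the model are estimated as in the bivariate Normal case. Writing $\boldsymbol{\theta} = (\boldsymbol{\beta}_1^\top, \boldsymbol{\beta}_2^\top, \sigma_1, \sigma_2, \rho)^\top$, the likelihood and log-likelihood functions are given by

$$
L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(u_{1i}, u_{2i}; \mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \rho, \nu), \tag{20}
$$

$$
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log f(u_{1i}, u_{2i}; \mu_{1i}, \mu_{2i}, \sigma_1, \sigma_2, \rho, \nu), \tag{21}
$$

respectively, where $f$ is the joint PDF of the bivariate t distribution. The estimates of the model parameters $\sigma_1$, $\sigma_2$, $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$ and $\rho$ are obtained by maximizing the log-likelihood function through an iterative nonlinear optimization process, in particular the quasi-Newton BFGS method. The parameter $\nu$ is estimated following Barros et al. (2008), using the profile log-likelihood and the following steps:

- For each $\nu_k = k$, $k = 1, \ldots, 20$, compute the estimates of $\boldsymbol{\theta}$ by the maximum likelihood method and evaluate the log-likelihood function at these estimates;
- The final estimate of $\nu$ is the one that maximizes the log-likelihood function, and the associated estimates of $\boldsymbol{\theta}$ are the final ones.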
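The two steps above can be sketched in Python as a grid search over ν = 1, ..., 20. To keep the example dependency-free it makes two simplifications that are assumptions of this sketch and not the paper's procedure: the model has no covariates (location and scale only), and the maximization for each fixed ν uses standard EM-type updates for the t distribution instead of BFGS. The simulated data, seed and function names are likewise hypothetical.

```python
import math
import random


def mahalanobis_sq(y, mu, S):
    """Squared Mahalanobis distance; S = (s11, s12, s22) is the scale matrix."""
    s11, s12, s22 = S
    d1, d2 = y[0] - mu[0], y[1] - mu[1]
    det = s11 * s22 - s12 * s12
    return (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det


def t2_loglik(data, mu, S, nu):
    """Log-likelihood of a bivariate t sample for fixed degrees of freedom nu."""
    s11, s12, s22 = S
    const = -math.log(2.0 * math.pi) - 0.5 * math.log(s11 * s22 - s12 * s12)
    return sum(const - 0.5 * (nu + 2.0) * math.log1p(mahalanobis_sq(y, mu, S) / nu)
               for y in data)


def fit_fixed_nu(data, nu, n_iter=30):
    """Maximize the likelihood over location and scale for fixed nu (EM updates)."""
    n = len(data)
    mu = (sum(y[0] for y in data) / n, sum(y[1] for y in data) / n)
    S = (sum((y[0] - mu[0]) ** 2 for y in data) / n, 0.0,
         sum((y[1] - mu[1]) ** 2 for y in data) / n)
    for _ in range(n_iter):
        # E-step: observations far from the center receive smaller weights
        w = [(nu + 2.0) / (nu + mahalanobis_sq(y, mu, S)) for y in data]
        sw = sum(w)
        # M-step: weighted location and scale updates
        mu = (sum(wi * y[0] for wi, y in zip(w, data)) / sw,
              sum(wi * y[1] for wi, y in zip(w, data)) / sw)
        s11 = sum(wi * (y[0] - mu[0]) ** 2 for wi, y in zip(w, data)) / n
        s12 = sum(wi * (y[0] - mu[0]) * (y[1] - mu[1]) for wi, y in zip(w, data)) / n
        s22 = sum(wi * (y[1] - mu[1]) ** 2 for wi, y in zip(w, data)) / n
        S = (s11, s12, s22)
    return mu, S


def simulate_t2(n, nu, rho, rng):
    """Draw n points from a standard bivariate t (Normal over sqrt of chi2/nu)."""
    out = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        g = rng.gammavariate(nu / 2.0, 2.0 / nu)  # chi-square(nu) divided by nu
        out.append((z1 / math.sqrt(g), z2 / math.sqrt(g)))
    return out


rng = random.Random(2023)
data = simulate_t2(800, nu=4, rho=0.5, rng=rng)

# Step 1: for each nu on the grid, fit the remaining parameters and store the log-likelihood.
profile = {}
for nu in range(1, 21):
    mu_hat, S_hat = fit_fixed_nu(data, nu)
    profile[nu] = t2_loglik(data, mu_hat, S_hat, nu)

# Step 2: the estimate of nu is the grid value with the largest profile log-likelihood.
nu_hat = max(profile, key=profile.get)
print(nu_hat, profile[nu_hat])
```

Restricting ν to an integer grid keeps the optimization over the remaining parameters well behaved, at the cost of treating ν as fixed when standard errors for the other parameters are computed.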
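Bivariate Birnbaum-Saunders (BS) distribution

Let $\mathbf{T} = (T_1, T_2)^\top$ be a bivariate random vector following a bivariate BS distribution parameterized by its means, with parameters $\mu_1$, $\mu_2$, $\delta_1$, $\delta_2$ and $\rho$. The joint CDF of $T_1$ and $T_2$ can be written, for $t_1, t_2 > 0$, as (Saulo et al., 2020)

$$
F(t_1, t_2; \mu_1, \mu_2, \delta_1, \delta_2, \rho) = \Phi_2\big( a(t_1; \mu_1, \delta_1),\, a(t_2; \mu_2, \delta_2);\, \rho \big), \qquad a(t; \mu, \delta) = \sqrt{\frac{\delta}{2}} \left[ \sqrt{\frac{(\delta+1)\,t}{\delta\mu}} - \sqrt{\frac{\delta\mu}{(\delta+1)\,t}} \right], \tag{22}
$$

where $\Phi_2$ is the CDF of the standard bivariate Normal distribution given in (8). Therefore, the joint PDF associated with (22) is given by

$$
f(t_1, t_2; \mu_1, \mu_2, \delta_1, \delta_2, \rho) = \phi_2\big( a(t_1; \mu_1, \delta_1),\, a(t_2; \mu_2, \delta_2);\, \rho \big)\, a'(t_1; \mu_1, \delta_1)\, a'(t_2; \mu_2, \delta_2), \qquad a'(t; \mu, \delta) = \frac{1}{2t} \sqrt{\frac{\delta}{2}} \left[ \sqrt{\frac{(\delta+1)\,t}{\delta\mu}} + \sqrt{\frac{\delta\mu}{(\delta+1)\,t}} \right], \tag{23}
$$

where $\phi_2$ is the PDF of the standard bivariate Normal distribution given by (7) and $a'$ is the derivative of $a$ with respect to $t$. The bivariate BS distribution with PDF (23) is denoted by $\mathbf{T} \sim \mathrm{BS}_2(\mu_1, \mu_2, \delta_1, \delta_2, \rho)$.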
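The sense in which this distribution is "parameterized by the mean" can be verified numerically. The sketch below assumes that the mean-based parameters map to the classical BS shape and scale as α = sqrt(2/δ) and β = δμ/(δ+1), which is how the reparameterization of Santos-Neto et al. (2012) is understood here; the function names, parameter values and the simple trapezoidal integrator are illustrative choices.

```python
import math


def a_fun(t, mu, delta):
    """Normalizing transformation of the mean-parameterized BS distribution."""
    beta = delta * mu / (delta + 1.0)  # assumed classical scale parameter
    return math.sqrt(delta / 2.0) * (math.sqrt(t / beta) - math.sqrt(beta / t))


def a_prime(t, mu, delta):
    """Derivative of a_fun with respect to t."""
    beta = delta * mu / (delta + 1.0)
    return math.sqrt(delta / 2.0) * (math.sqrt(t / beta) + math.sqrt(beta / t)) / (2.0 * t)


def rbs_pdf(t, mu, delta):
    """Univariate (marginal) mean-parameterized BS density."""
    a = a_fun(t, mu, delta)
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi) * a_prime(t, mu, delta)


def bbs_pdf(t1, t2, mu1, mu2, delta1, delta2, rho):
    """Bivariate BS density: standard bivariate Normal PDF times the Jacobian terms."""
    a1, a2 = a_fun(t1, mu1, delta1), a_fun(t2, mu2, delta2)
    q = (a1 * a1 - 2.0 * rho * a1 * a2 + a2 * a2) / (1.0 - rho * rho)
    phi2 = math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(1.0 - rho * rho))
    return phi2 * a_prime(t1, mu1, delta1) * a_prime(t2, mu2, delta2)


def trapz(f, lo, hi, n):
    """Composite trapezoidal rule on [lo, hi] with n subintervals."""
    h = (hi - lo) / n
    return h * (0.5 * (f(lo) + f(hi)) + sum(f(lo + i * h) for i in range(1, n)))


mu, delta = 2.0, 5.0
total_mass = trapz(lambda t: rbs_pdf(t, mu, delta), 1e-9, 60.0, 60000)
mean = trapz(lambda t: t * rbs_pdf(t, mu, delta), 1e-9, 60.0, 60000)
print(total_mass, mean)  # the mean should be close to mu

# With rho = 0 the bivariate density is the product of the two marginals.
joint = bbs_pdf(1.5, 0.8, 2.0, 1.0, 5.0, 3.0, 0.0)
product = rbs_pdf(1.5, 2.0, 5.0) * rbs_pdf(0.8, 1.0, 3.0)
print(joint, product)
```

Because μ is the mean of the response on its original scale, regression coefficients attached to μ can be read directly in terms of wages or hours, without back-transforming from logarithms.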
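Bivariate Birnbaum-Saunders (BS) regression model

Consider $\mathbf{T}_1, \ldots, \mathbf{T}_n$ such that $\mathbf{T}_i = (T_{1i}, T_{2i})^\top$ follows a bivariate BS distribution, that is, $\mathbf{T}_i \sim \mathrm{BS}_2(\mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho)$. Assume that there are $r$ and $s$ covariates, $\mathbf{x}^{(1)}$ and $\mathbf{x}^{(2)}$, associated with $T_{1i}$ and $T_{2i}$, respectively. Therefore, from (22), we have, for $t_{1i}, t_{2i} > 0$ and $i = 1, \ldots, n$, the joint CDF (Saulo et al., 2020) $F(t_{1i}, t_{2i}; \mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho)$, obtained from (22) by replacing $\mu_1$ and $\mu_2$ with $\mu_{1i}$ and $\mu_{2i}$, where the means $\mu_{1i}$ and $\mu_{2i}$ depend on the covariates through the linear predictors $\mathbf{x}_i^{(1)\top} \boldsymbol{\beta}_1$ and $\mathbf{x}_i^{(2)\top} \boldsymbol{\beta}_2$, with $\mathbf{x}_i^{(k)}$ as in (11), where $\boldsymbol{\beta}_k$ is a vector of $l$ unknown parameters, and $\mathbf{x}_i^{(k)\top}$ is the $i$-th row of the matrix $\mathbf{X}^{(k)}$, whose dimension is $n \times l$, for $k = 1, 2$ and $l = r, s$. Here, unlike the BS regression model based on the classical parameterization (Rieck and Nedelman, 1991), there is no need for a logarithmic transformation, that is, the response data are modeled in their original scale.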
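To estimate the parameters, the maximum likelihood method is used, as in the bivariate Normal case. Consider a random sample of size $n$, say $\mathbf{t}_1, \ldots, \mathbf{t}_n$ with $\mathbf{t}_i = (t_{1i}, t_{2i})^\top$, and write $\boldsymbol{\theta} = (\boldsymbol{\beta}_1^\top, \boldsymbol{\beta}_2^\top, \delta_1, \delta_2, \rho)^\top$. The likelihood and log-likelihood functions of the observed sample are given respectively by

$$
L(\boldsymbol{\theta}) = \prod_{i=1}^{n} f(t_{1i}, t_{2i}; \mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho), \tag{27}
$$

$$
\ell(\boldsymbol{\theta}) = \sum_{i=1}^{n} \log f(t_{1i}, t_{2i}; \mu_{1i}, \mu_{2i}, \delta_1, \delta_2, \rho), \tag{28}
$$

where $f$ is the joint PDF of the bivariate BS distribution. The estimates of $\boldsymbol{\beta}_1$, $\boldsymbol{\beta}_2$, $\delta_1$, $\delta_2$ and $\rho$ are obtained by maximizing the log-likelihood function (28) using an iterative nonlinear optimization process, in this case the BFGS quasi-Newton method.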
Bivariate regression of wages and hours worked considering only common explanatory variables

Table A.1 Bivariate t distribution regression models for earnings and hours worked: including only common covariates (ν = 4).

| Dependent variable | Explanatory var. | Coefficient | Standard error | Wald stat. | p-value |
|---|---|---|---|---|---|
| | σ1 | 0.560 | 0.0018 | 305.35 | <0.0001 |
| | σ2 | 0.510 | 0.0018 | 305.35 | <0.0001 |
| | ρ | 0.192 | 0.0118 | 16.22 | <0.0001 |
| Wage | (Intercept) | 5.541 | 0.0198 | 280.01 | <0.0001 |
| | Gender | 0.319 | 0.0031 | 101.35 | <0.0001 |
| | Age | 0.055 | 0.0008 | 69.95 | <0.0001 |
| | Age² | −0.0005 | 0.0000 | −54.89 | <0.0001 |
| | Race | 0.103 | 0.0032 | 32.60 | <0.0001 |
| | Marital status | 0.003 | 0.0071 | 0.42 | 0.6770 |
| | Education | 0.050 | 0.0005 | 100.83 | <0.0001 |
| | Belém-PA | −0.319 | 0.0064 | −49.56 | <0.0001 |
| | Fortaleza-CE | −0.347 | 0.0061 | −57.23 | <0.0001 |
| | Recife-PE | −0.354 | 0.0059 | −60.45 | <0.0001 |
| | Salvador-BA | −0.308 | 0.0059 | −51.84 | <0.0001 |
| | Belo Horizonte-MG | −0.093 | 0.0056 | −16.57 | <0.0001 |
| | Rio de Janeiro-RJ | −0.086 | 0.0053 | −16.29 | <0.0001 |
| | Curitiba-PR | 0.013 | 0.0067 | 1.87 | 0.0619 |
| | Porto Alegre-RS | −0.086 | 0.0053 | −16.22 | <0.0001 |
| | Brasília-DF | 0.096 | 0.0065 | 14.80 | <0.0001 |
| | 2014 | 0.004 | 0.0034 | 1.16 | 0.2475 |
| | 2015 | −0.060 | 0.0034 | −17.31 | <0.0001 |
| | High | 0.945 | 0.0086 | 110.33 | <0.0001 |
| | High mean | 0.395 | 0.0075 | 52.70 | <0.0001 |
| | Mean | 0.186 | 0.0071 | 26.22 | <0.0001 |
| | Low mean | 0.035 | 0.0069 | 5.03 | <0.0001 |
| | Agriculture | −0.366 | 0.0181 | −20.21 | <0.0001 |
| | Industry | −0.200 | 0.0090 | −22.27 | <0.0001 |
| | Construction | −0.085 | 0.0096 | −8.90 | <0.0001 |
| | Commerce, food and others | −0.229 | 0.0085 | −26.86 | <0.0001 |
| | Education, health and soc. serv. | −0.265 | 0.0081 | −32.53 | <0.0001 |
| | Other services | −0.225 | 0.0085 | −26.31 | <0.0001 |
| | No employment record card | −0.227 | 0.0042 | −53.91 | <0.0001 |
| | Autonomous | −0.122 | 0.0042 | −29.19 | <0.0001 |
| | Civil servant | 0.313 | 0.0073 | 42.91 | <0.0001 |
| Hours worked | (Intercept) | 3.264 | 0.0164 | 198.60 | <0.0001 |
| | Gender | 0.086 | 0.0026 | 32.60 | <0.0001 |
| | Age | 0.014 | 0.0007 | 20.35 | <0.0001 |
| | Age² | −0.0001 | 0.0000 | −16.19 | <0.0001 |
| | Race | −0.0002 | 0.0027 | −0.06 | 0.9543 |
| | Marital status | −0.008 | 0.0061 | −1.34 | 0.1809 |
| | Education | −0.0009 | 0.0004 | −2.22 | 0.0261 |
| | Belém-PA | 0.003 | 0.0055 | 0.46 | 0.6471 |
| | Fortaleza-CE | 0.017 | 0.0052 | 3.31 | 0.0009 |
| | Recife-PE | −0.015 | 0.0050 | −3.07 | 0.0021 |
| | Salvador-BA | −0.044 | 0.0051 | −8.62 | <0.0001 |
| | Belo Horizonte-MG | 0.012 | 0.0047 | 2.54 | 0.0110 |
| | Rio de Janeiro-RJ | −0.087 | 0.0045 | −19.52 | <0.0001 |
| | Curitiba-PR | −0.012 | 0.0057 | −2.18 | 0.0295 |
| | Porto Alegre-RS | 0.018 | 0.0045 | 3.99 | 0.0001 |
| | Brasília-DF | −0.027 | 0.0054 | −4.97 | <0.0001 |
| | 2014 | 0.012 | 0.0029 | 4.10 | <0.0001 |
| | 2015 | −0.040 | 0.0029 | −13.62 | <0.0001 |
| | High | 0.123 | 0.0073 | 16.93 | <0.0001 |
| | High mean | 0.099 | 0.0066 | 15.00 | <0.0001 |
| | Mean | 0.146 | 0.0063 | 23.25 | <0.0001 |
| | Low mean | 0.155 | 0.0061 | 25.22 | <0.0001 |
| | Agriculture | 0.157 | 0.0151 | 10.37 | <0.0001 |
| | Industry | 0.020 | 0.0073 | 2.75 | 0.0060 |
| | Construction | 0.042 | 0.0078 | 5.35 | <0.0001 |
| | Commerce, food and others | 0.065 | 0.0069 | 9.38 | <0.0001 |
| | Education, health and soc. serv. | −0.067 | 0.0065 | −10.19 | <0.0001 |
| | Other services | −0.015 | 0.0069 | −2.21 | 0.0273 |
| | No employment record card | −0.164 | 0.0036 | −45.30 | <0.0001 |
| | Autonomous | −0.174 | 0.0034 | −50.64 | <0.0001 |
| | Civil servant | −0.041 | 0.0060 | −6.77 | <0.0001 |
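The Wald statistics and p-values in Table A.1 can be approximately reproduced from the rounded coefficients and standard errors. The sketch below assumes that the Wald statistic is the ratio of the coefficient to its standard error and that p-values are two-sided with a standard Normal reference, which matches the reported figures up to rounding. It also converts a coefficient from the log-scale wage equation into an approximate percentage effect; this reading assumes the dependent variable is in logarithms and the covariate is a 0/1 dummy. Function names are illustrative.

```python
import math


def wald_z(coef, se):
    """Wald statistic as the coefficient divided by its standard error."""
    return coef / se


def two_sided_p(z):
    """Two-sided p-value under a standard Normal reference distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def pct_effect(coef):
    """Percentage change in the level variable implied by a log-scale dummy coefficient."""
    return 100.0 * (math.exp(coef) - 1.0)


# Correlation between the two equations: rho = 0.192 with standard error 0.0118.
z_rho = wald_z(0.192, 0.0118)

# Curitiba-PR in the wage equation: reported Wald statistic of 1.87.
p_curitiba = two_sided_p(1.87)

# Gender dummy in the wage equation: coefficient 0.319 on the log scale.
gender_wage = pct_effect(0.319)

print(round(z_rho, 2), round(p_curitiba, 4), round(gender_wage, 1))
```

Small discrepancies with the table are expected, since the published coefficients and standard errors are rounded before the ratio is taken.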
- 1 The wage rate is usually defined as the wage per hour earned by workers.
- 2 Human capital is understood as the set of attributes acquired by a worker by means of education, skill, and experience that improve productivity. This term was introduced by Mincer (1958) and later explored by Becker (1993) and Heckman et al. (2000), among others.
- 3
- 4 The variable wage per hour is commonly calculated as
- 5 The occupational classification is based on Jannuzzi (2001).
- 6 As a sensitivity analysis, we also estimated the bivariate regression including only common explanatory variables. The results, reported in the Appendix, indicate that there is no significant change in the previous findings.

JEL classification. J20, C50.

The authors thank participants in the XLVII meeting of the Brazilian Economic Association (ANPEC) for valuable comments and suggestions. Danúbia Rodrigues thanks CAPES/PROEX for the doctorate scholarship. Helton Saulo acknowledges financial support from CNPq and FAP-DF. José A. Divino thanks CNPq for the financial support through grant number 302632-2019-0. The authors declare that they have no conflict of interest that could have appeared to influence the work reported in this paper. All remaining errors are the authors’ sole responsibility.
Bibliography
- Aali-Bujari, A., F. Venegas-Martínez, and A. García-Santillán (2019): “Schooling levels and wage gains in Mexico,” Economics and Sociology, 12, 74–83.
- Balakrishnan, N. and C.-D. Lai (2009): Continuous Bivariate Distributions, New York: Springer.
- Barros, M., G.A. Paula, and V. Leiva (2008): “A new class of survival regression models with heavy-tailed errors: robustness and diagnostics,” Lifetime Data Analysis, 14, 316–332.
- Becker, G.S. (1993): Human Capital: A Theoretical and Empirical Analysis, with Special Reference to Education, New York: NBER.
- Birnbaum, Z.W. and S.C. Saunders (1969): “A new family of life distributions,” Journal of Applied Probability, 6, 319–327.
- Buchinsky, M. (2001): “Quantile regression with sample selection: estimating women’s return to education in the U.S.,” Empirical Economics, 26, 87–113.
- Card, D. (1999): “The causal effect of education on earnings,” in Handbook of Labor Economics, ed. by O. Ashenfelter and D. Card, Elsevier, vol. 3, Part A, chap. 30, 1801–1863, 1 ed.
- Cavalieri, C.H. and R. Fernandes (1998): “Diferenciais por gênero e cor: uma comparação entre as regiões metropolitanas brasileiras,” Revista de Economia Política, 18.
- Chatterjee, S. and B. Price (1991): Regression Analysis by Example, New York: John Wiley.
- Giuberti, A.C. and N. Menezes-Filho (2005): “Discriminação de rendimentos por gênero: uma comparação entre o Brasil e os Estados Unidos,” Economia Aplicada, 9, 369–384.
- Gonzaga, G., P.G.G.P. Leite, and D.C. Machado (2002): “Quem trabalha muito e quem trabalha pouco no Brasil?” in Anais do XIII Encontro Nacional de Estudos Populacionais, ABEP.
- Heckman, J. (1976): “The common structure of statistical models of truncation, sample selection and limited dependent variables and a simple estimator for such models,” Annals of Economic and Social Measurement, 5, 475–492.
- Heckman, J. (1979): “Sample selection bias as a specification error,” Econometrica, 47, 153–161.
- Heckman, J., J.L. Tobias, and E. Vytlacil (2000): “Simple estimators for treatment parameters in a latent variable framework with an application to estimating the returns to schooling,” NBER Working Paper, 7950.
- Ibacache-Pulgar, G., G. Paula, and M. Galea (2014): “On influence diagnostics in elliptical multivariate regression models with equicorrelated random errors,” Statistical Modelling, 16, 14–21.
- Jannuzzi, P. de M. (2001): “Status socioeconômico das ocupações brasileiras: medidas aproximativas para 1980, 1991 e anos 90,” Revista Brasileira de Estatística, 2, 47–74.
- Johnson, N.L., S. Kotz, and N. Balakrishnan (1995): Continuous Univariate Distributions, vol. 2, New York, US: Wiley.
- Lau, L.J., D.T. Jamison, S. Liu, and S. Rivkin (1993): “Education and economic growth: Some cross-sectional evidence from Brazil,” Journal of Development Economics, 41, 45–70.
- Lucas, A. (1997): “Robustness of the student t based M-estimator,” Communications in Statistics: Theory and Methods, 41, 1165–1182.
- Maciel, M.C., A.C. Campêlo, and M.C.F. Raposo (2001): “A dinâmica das mudanças na distribuição salarial e no retorno em educação para mulheres: Uma aplicação de regressão quantílica,” in Anais do XXIX Encontro Nacional de Economia, Salvador: ANPEC.
- Madalozzo, R. (2010): “Occupational segregation and the gender wage gap in Brazil: an empirical analysis,” Economia Aplicada, 14, 147–168.
- Marchant, C., V. Leiva, and F.J.A. Cysneiros (2016): “A multivariate log-linear model for Birnbaum-Saunders distributions,” IEEE Transactions on Reliability, 65, 816–827.
- Mincer, J. (1958): “Investment in human capital and personal income distribution,” Journal of Political Economy, 66, 281–302.
- Mincer, J. (1974): Schooling, Experience, and Earnings, National Bureau of Economic Research, Inc.
- Mittelhammer, R.C., G.G. Judge, and D.J. Miller (2000): Econometric Foundations, New York, US: Cambridge University Press.
- Resende, M. and R. Wyllie (2006): “Retornos para educação no Brasil: evidências empíricas adicionais,” Economia Aplicada, 10, 349–365.
- Rieck, J.R. and J.R. Nedelman (1991): “A log-linear model for the Birnbaum-Saunders distribution,” Technometrics, 33, 51–60.
- Santos-Neto, M., F.J.A. Cysneiros, V. Leiva, and S.E. Ahmed (2012): “On new parameterizations of the Birnbaum-Saunders distribution,” Pakistan Journal of Statistics, 28, 1–26.
- Saulo, H., J. Leão, R. Vila, V. Leiva, and V. Tomazella (2020): “On mean-based bivariate Birnbaum-Saunders distributions: Properties, inference and application,” Communications in Statistics – Theory and Methods, 49, 6032–6056.
- Saulo, H., J. Leão, R. Vila, V. Leiva, and V. Tomazella (2021): “A bivariate fatigue-life regression model and its application to fracture of metallic tools,” Brazilian Journal of Probability and Statistics, 35, 119–137.
- Sedlacek, G. and E. Santos (1991): “A mulher cônjuge no mercado de trabalho como estratégia de geração da renda familiar,” Pesquisa e Planejamento Econômico, 21, 449–470.
- Senna, J.C. (1976): “Escolaridade, experiência no trabalho e salários no Brasil,” Revista Brasileira de Economia, 30, 163–193.
- Vanegas, L.H. and G.A. Paula (2016): “Log-symmetric distributions: statistical properties and parameter estimation,” Brazilian Journal of Probability and Statistics, 30, 196–220.
- Vilca, F., N. Balakrishnan, and C.B. Zeller (2014): “The bivariate sinh-elliptical distribution with applications to Birnbaum-Saunders distribution and associated regression and measurement error models,” Computational Statistics and Data Analysis, 80, 1–16.
Publication Dates
- Publication in this collection: 22 June 2023
- Date of issue: 2023