The mixpoissonreg package is useful for fitting and analysing regression models with overdispersed count responses. It provides two regression models for overdispersed count data:

- Negative Binomial regression models;
- Poisson Inverse Gaussian regression models.

The parameters of the above models may be estimated by the Expectation-Maximization (EM) algorithm or by direct maximization of the likelihood function. There are several functions to help the user check the adequacy of the fitted model, perform inference and study influential observations. For theoretical details we refer the reader to Barreto-Souza and Simas (2016).
Some reasons to use the mixpoissonreg package are:

- `glm`-like functions to fit and analyze fitted models;
- Useful plots for fitted models available using R's base graphics or `ggplot2`;
- Estimation of parameters via the EM-algorithm;
- Easy inference on fitted models with the help of the `lmtest` package;
- Regression structure for the precision parameter;
- Several global and local influence measures implemented (along with corresponding plots);
- Compatibility with the tidyverse.
We will now delve into the reasons above in more detail.
## `glm`-like functions to fit and analyze fitted models

If you know how to fit `glm` models, then you already know how to fit `mixpoissonreg` models. You use a `formula` expression for the regression structure with respect to the mean parameter:
library(mixpoissonreg)
fit <- mixpoissonreg(daysabs ~ gender + math + prog, data = Attendance)
If you do not provide further arguments, the mixpoissonreg package will fit a Negative Binomial model, with a log link function for the mean, estimated via the EM-algorithm. Just like you would do with a `glm` fit, you can simply type `fit` to get a simplified output of the fit, or get a summary by typing `summary(fit)`:
fit
#>
#> Negative Binomial Regression - Expectation-Maximization Algorithm
#>
#> Call:
#> mixpoissonreg(formula = daysabs ~ gender + math + prog, data = Attendance)
#>
#> Coefficients modeling the mean (with log link):
#> (Intercept) gendermale math progAcademic progVocational
#> 2.707563981 -0.211066195 -0.006235448 -0.424575877 -1.252812675
#> Coefficients modeling the precision (with identity link):
#> (Intercept)
#> 1.047289
summary(fit)
#>
#> Negative Binomial Regression - Expectation-Maximization Algorithm
#>
#> Call:
#> mixpoissonreg(formula = daysabs ~ gender + math + prog, data = Attendance)
#>
#>
#> Pearson residuals:
#> RSS Min 1Q Median 3Q Max
#> 345.7857 -0.9662 -0.7356 -0.3351 0.3132 5.5512
#>
#> Coefficients modeling the mean (with log link):
#> Estimate Std.error z-value Pr(>|z|)
#> (Intercept) 2.707564 0.203069 13.333 < 2e-16 ***
#> gendermale -0.211066 0.122373 -1.725 0.0846 .
#> math -0.006235 0.002491 -2.503 0.0123 *
#> progAcademic -0.424576 0.181772 -2.336 0.0195 *
#> progVocational -1.252813 0.201381 -6.221 4.94e-10 ***
#>
#> Coefficients modeling the precision (with identity link):
#> Estimate Std.error z-value Pr(>|z|)
#> (Intercept) 1.0473 0.1082 9.675 <2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Efron's pseudo R-squared: 0.1854615
#> Number of iterations of the EM algorithm = 1
If you want to fit a Poisson Inverse Gaussian regression model, just set the `model` argument to `"PIG"`:
fit_pig <- mixpoissonreg(daysabs ~ gender + math + prog, model = "PIG",
                         data = Attendance)
summary(fit_pig)
#>
#> Poisson Inverse Gaussian Regression - Expectation-Maximization Algorithm
#>
#> Call:
#> mixpoissonreg(formula = daysabs ~ gender + math + prog, data = Attendance,
#> model = "PIG")
#>
#>
#> Pearson residuals:
#> RSS Min 1Q Median 3Q Max
#> 276.4232 -0.8225 -0.6386 -0.3219 0.3038 5.4969
#>
#> Coefficients modeling the mean (with log link):
#> Estimate Std.error z-value Pr(>|z|)
#> (Intercept) 3.060166 0.227621 13.444 < 2e-16 ***
#> gendermale -0.244078 0.129011 -1.892 0.058502 .
#> math -0.007421 0.002641 -2.810 0.004951 **
#> progAcademic -0.719714 0.193040 -3.728 0.000193 ***
#> progVocational -1.591909 0.215375 -7.391 1.45e-13 ***
#>
#> Coefficients modeling the precision (with identity link):
#> Estimate Std.error z-value Pr(>|z|)
#> (Intercept) 0.7323 0.1121 6.533 6.46e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Efron's pseudo R-squared: 0.1758126
#> Number of iterations of the EM algorithm = 20
There are also several standard methods implemented for `mixpoissonreg` fitted objects: `plot`, `residuals`, `vcov`, `influence`, `cooks.distance`, `hatvalues`, `predict`, etc.
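For instance, the calls below illustrate a few of these methods on the fitted object `fit`. This is a minimal sketch; in particular, the `type = "response"` argument of `predict` follows the usual R conventions and is an assumption here, so check the corresponding help pages.

```r
# A few of the standard methods applied to the fitted model `fit`
head(residuals(fit))       # residuals, one value per observation
vcov(fit)                  # estimated covariance matrix of the parameter estimates
head(hatvalues(fit))       # hat values (leverage) for each observation
head(cooks.distance(fit))  # Cook's distance for each observation

# Predicted means on the response scale
# (the type = "response" argument is an assumption of this sketch)
head(predict(fit, type = "response"))
```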
A fitted `mixpoissonreg` object is compatible with the `lrtest`, `waldtest`, `coeftest` and `coefci` methods from the `lmtest` package, which, for instance, allows one to easily perform ANOVA-like tests for `mixpoissonreg` models.

We refer the reader to the *Analyzing overdispersed count data with the mixpoissonreg package* vignette for further details on fitting `mixpoissonreg` objects.
## Useful plots for fitted models available using R's base graphics or `ggplot2`

Once you have a fitted `mixpoissonreg` object, say `fit`, you can easily produce useful plots by typing `plot(fit)`:
plot(fit)
You can also get the `ggplot2` version of the above plots by typing `autoplot(fit)`:
autoplot(fit)
For further details on plots of `mixpoissonreg` objects, we refer the reader to the *Building and customizing base-R diagnostic plots with the mixpoissonreg package* and *Building and customizing ggplot2-based diagnostic plots with the mixpoissonreg package* vignettes.
## Estimation of parameters via the EM-algorithm

The EM-algorithm also aims to find the maximum-likelihood estimates of the parameters, so why is there an advantage in using it? The reason is purely numerical. Sometimes the log-likelihood function has several local maxima, or is almost flat, which makes direct maximization of the likelihood function converge prematurely, before reaching the maximum (early convergence is then very difficult to avoid). In such situations the EM-algorithm may be able to obtain estimates closer to the maximum-likelihood estimate. The drawback is that it may take much longer to converge in such situations.
We refer the reader to the Supplementary Material of Barreto-Souza and Simas (2016) and to Barreto-Souza and Simas (2017) for numerical studies.
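As an illustration of the trade-off, one can fit the same model with both estimation methods and compare the attained log-likelihoods. The sketch below assumes that direct maximization is available through a companion function `mixpoissonregML()`; this function name is an assumption of the sketch, so consult `?mixpoissonreg` for the actual interface.

```r
# Fit the same model via the EM-algorithm (default) and via direct
# maximization of the likelihood, then compare the attained log-likelihoods.
# NOTE: mixpoissonregML() is assumed to be the direct-maximization counterpart
# of mixpoissonreg(); check the package documentation for the exact interface.
fit_em <- mixpoissonreg(daysabs ~ gender + math + prog, data = Attendance)
fit_ml <- mixpoissonregML(daysabs ~ gender + math + prog, data = Attendance)

logLik(fit_em)
logLik(fit_ml)
```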
## Easy inference on fitted models with the help of the `lmtest` package

We can easily perform ANOVA-like tests using the `lmtest` package. Let us perform a likelihood ratio test against the null model that only contains the intercept:
lmtest::lrtest(fit)
#> Likelihood ratio test
#>
#> Model 1: daysabs ~ gender + math + prog | 1
#> Model 2: daysabs ~ 1 | 1
#> #Df LogLik Df Chisq Pr(>Chisq)
#> 1 6 -864.15
#> 2 2 -896.47 -4 64.638 3.068e-13 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now let us perform a Wald test against the null model that only contains the intercept:
lmtest::waldtest(fit)
#> Wald test
#>
#> Model 1: daysabs ~ gender + math + prog | 1
#> Model 2: daysabs ~ 1 | 1
#> Res.Df Df Chisq Pr(>Chisq)
#> 1 308
#> 2 312 -4 74.257 2.861e-15 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let us now consider a reduced model:
fit_red <- mixpoissonreg(daysabs ~ math, data = Attendance)
We can perform a likelihood ratio test of the full `fit` model against the null given by the reduced model:
lmtest::lrtest(fit, fit_red)
#> Likelihood ratio test
#>
#> Model 1: daysabs ~ gender + math + prog | 1
#> Model 2: daysabs ~ math | 1
#> #Df LogLik Df Chisq Pr(>Chisq)
#> 1 6 -864.15
#> 2 3 -888.15 -3 47.999 2.131e-10 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In the same fashion, we can perform a Wald test of the full `fit` model against the null given by the reduced model:
lmtest::waldtest(fit, fit_red)
#> Wald test
#>
#> Model 1: daysabs ~ gender + math + prog | 1
#> Model 2: daysabs ~ math | 1
#> Res.Df Df Chisq Pr(>Chisq)
#> 1 308
#> 2 311 -3 52.537 2.301e-11 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can obtain confidence intervals for the estimated parameters by using the `coefci` method:
lmtest::coefci(fit)
#> 2.5 % 97.5 %
#> (Intercept) 2.30955534 3.105572621
#> gendermale -0.45091355 0.028781163
#> math -0.01111851 -0.001352381
#> progAcademic -0.78084150 -0.068310253
#> progVocational -1.64751140 -0.858113947
#> (Intercept).precision 0.83512850 1.259450101
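The `coeftest` method works analogously, returning the table of estimates, standard errors, z statistics and p-values in the standard `lmtest` format:

```r
# z tests for the individual coefficients of the fitted model
lmtest::coeftest(fit)
```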
## Regression structure for the precision parameter

The mixpoissonreg package also allows the regression models to have a regression structure on the precision parameter. Fitting such models is very simple: one writes the `formula` as `formula_1 | formula_2`, where `formula_1` corresponds to the regression structure with respect to the mean parameter and contains the response variable, and `formula_2` corresponds to the regression structure with respect to the precision parameter and does not contain the response variable.
Consider the following example:
fit_prec <- mixpoissonreg(daysabs ~ gender + math + prog | gender + math + prog,
                          data = Attendance)
summary(fit_prec)
#>
#> Negative Binomial Regression - Expectation-Maximization Algorithm
#>
#> Call:
#> mixpoissonreg(formula = daysabs ~ gender + math + prog | gender +
#> math + prog, data = Attendance)
#>
#>
#> Pearson residuals:
#> RSS Min 1Q Median 3Q Max
#> 322.0142 -1.1751 -0.6992 -0.3600 0.3014 4.7178
#>
#> Coefficients modeling the mean (with log link):
#> Estimate Std.error z-value Pr(>|z|)
#> (Intercept) 2.746123 0.147412 18.629 < 2e-16 ***
#> gendermale -0.245113 0.117967 -2.078 0.03773 *
#> math -0.006617 0.002317 -2.856 0.00429 **
#> progAcademic -0.425983 0.132144 -3.224 0.00127 **
#> progVocational -1.269755 0.174444 -7.279 3.37e-13 ***
#>
#> Coefficients modeling the precision (with log link):
#> Estimate Std.error z-value Pr(>|z|)
#> (Intercept) 1.414227 0.343243 4.120 3.79e-05 ***
#> gendermale -0.208397 0.203692 -1.023 0.306262
#> math -0.005123 0.004181 -1.225 0.220458
#> progAcademic -1.084418 0.305479 -3.550 0.000385 ***
#> progVocational -1.422051 0.343812 -4.136 3.53e-05 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Efron's pseudo R-squared: 0.1860886
#> Number of iterations of the EM algorithm = 11
In the absence of the `link.precision` argument, a log link is used as the link function for the precision parameter.
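If you prefer a different link for the precision parameter, you can set it explicitly through `link.precision`. A minimal sketch, assuming that `"identity"` is an accepted value (as suggested by the identity link reported for the default fit above); see `?mixpoissonreg` for the links that are actually supported:

```r
# Precision submodel with an explicitly chosen identity link
# ("identity" as a value of link.precision is an assumption of this sketch)
fit_prec_id <- mixpoissonreg(daysabs ~ gender + math + prog | math,
                             link.precision = "identity",
                             data = Attendance)
summary(fit_prec_id)
```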
## Several global and local influence measures

There are several global influence measures implemented. See the `influence`, `hatvalues` and `cooks.distance` methods and their arguments.
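For example, these measures can be used directly to rank observations and spot potentially influential cases:

```r
# Observations with the largest Cook's distances and the largest hat values
head(sort(cooks.distance(fit_prec), decreasing = TRUE), 5)
head(sort(hatvalues(fit_prec), decreasing = TRUE), 5)
```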
The easiest way to perform global influence analysis on a `mixpoissonreg` fitted object is to obtain the plots related to global influence:
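A sketch of such a call is given below; it assumes that the `plot` method accepts a `which` argument selecting influence-related panels (both the argument and the panel numbers are assumptions, so see `?plot.mixpoissonreg` for the actual options):

```r
# Base-R plots focusing on the global influence measures
# (the which argument and the panel numbers are assumptions of this sketch)
plot(fit_prec, which = c(3, 4, 5))
```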
One can also get the `ggplot2` version of the above plots:
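Under the same assumptions about the `which` argument:

```r
# ggplot2 versions of the influence-related plots
# (again, the which argument and panel numbers are assumptions of this sketch)
autoplot(fit_prec, which = c(3, 4, 5))
```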
In a similar fashion, there are several local influence measures implemented in the `local_influence` method. In the same spirit as above, the easiest way to perform local influence analysis on a `mixpoissonreg` fitted object is to obtain plots by using the `local_influence_plot` method:
local_influence_plot(fit_prec)
One can also obtain `ggplot2` versions of the above plots by using the `local_influence_autoplot` method:
local_influence_autoplot(fit_prec)
For further details on global and local influence analyses of `mixpoissonreg` objects, we refer the reader to the *Global and local influence analysis with the mixpoissonreg package* vignette.
## Compatibility with the tidyverse

We provide the standard methods from the `broom` package, namely `augment`, `glance` and `tidy`, for `mixpoissonreg` objects:
augment(fit_prec)
#> # A tibble: 314 x 12
#> daysabs gender math prog .fitted .fittedlwrconf .fitteduprconf .resid
#> <dbl> <fct> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 4 male 63 Academic 5.25 4.14 6.66 -0.200
#> 2 4 male 27 Academic 6.66 5.40 8.22 -0.370
#> 3 2 female 20 Academic 8.92 7.21 11.0 -0.814
#> 4 3 female 16 Academic 9.15 7.34 11.4 -0.713
#> 5 3 female 2 Academic 10.0 7.76 13.0 -0.772
#> 6 13 female 71 Academic 6.36 5.02 8.07 0.956
#> 7 11 female 63 Academic 6.71 5.39 8.34 0.599
#> 8 7 male 3 Academic 7.81 6.04 10.1 -0.102
#> 9 10 male 51 Academic 5.68 4.58 7.05 0.660
#> 10 9 male 49 Vocational 2.48 1.86 3.30 1.86
#> # … with 304 more rows, and 4 more variables: .resfit <dbl>, .hat <dbl>,
#> # .cooksd <dbl>, .gencooksd <dbl>
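Since `augment` returns a tibble containing both the response and the fitted values, it combines naturally with `ggplot2`. For instance, one can compare observed counts with fitted means:

```r
# Observed counts versus fitted means, built from the output of augment()
library(ggplot2)

ggplot(augment(fit_prec), aes(x = .fitted, y = daysabs)) +
  geom_point(alpha = 0.5) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Fitted mean", y = "Observed days absent")
```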
glance(fit_prec)
#> # A tibble: 1 x 9
#> efron.pseudo.r2 df.null logLik AIC BIC df.residual nobs model.type
#> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int> <chr>
#> 1 0.186 312 -853. 1726. 1764. 304 314 NB
#> # … with 1 more variable: est.method <chr>
tidy(fit_prec)
#> # A tibble: 10 x 6
#> component term estimate std.error statistic p.value
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 mean (Intercept) 2.75 0.147 18.6 1.87e-77
#> 2 mean gendermale -0.245 0.118 -2.08 3.77e- 2
#> 3 mean math -0.00662 0.00232 -2.86 4.29e- 3
#> 4 mean progAcademic -0.426 0.132 -3.22 1.27e- 3
#> 5 mean progVocational -1.27 0.174 -7.28 3.37e-13
#> 6 precision (Intercept) 1.41 0.343 4.12 3.79e- 5
#> 7 precision gendermale -0.208 0.204 -1.02 3.06e- 1
#> 8 precision math -0.00512 0.00418 -1.23 2.20e- 1
#> 9 precision progAcademic -1.08 0.305 -3.55 3.85e- 4
#> 10 precision progVocational -1.42 0.344 -4.14 3.53e- 5
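The tidy output also makes it easy to manipulate the coefficient table with standard tidyverse tools, for instance (assuming `dplyr` is installed):

```r
# Keep only the mean-submodel coefficients with p-values below 0.05
library(dplyr)

tidy(fit_prec) %>%
  filter(component == "mean", p.value < 0.05)
```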
As seen above, the `autoplot` method from the `ggplot2` package is also implemented for `mixpoissonreg` objects.
autoplot(fit_prec)
Finally, we provide additional methods related to local influence analysis: `tidy_local_influence`, `local_influence_benchmarks` and `local_influence_autoplot`:
tidy_local_influence(fit_prec)
#> # A tibble: 314 x 5
#> case_weights hidden_variable mean_explanatory precision_explanatory
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.00146 0.000353 0.000533 0.00217
#> 2 0.000970 0.000268 0.000782 0.00129
#> 3 0.000712 0.000369 0.000535 0.00132
#> 4 0.000350 0.000328 0.000875 0.000130
#> 5 0.000680 0.000562 0.00101 0.000701
#> 6 0.00118 0.00387 0.00494 0.000740
#> 7 0.00125 0.00230 0.00396 0.00139
#> 8 0.00243 0.00127 0.00230 0.00185
#> 9 0.00107 0.00180 0.00266 0.00160
#> 10 0.00249 0.00325 0.000340 0.000788
#> # … with 304 more rows, and 1 more variable: simultaneous_explanatory <dbl>
local_influence_benchmarks(fit_prec)
#> # A tibble: 1 x 5
#> case_weights hidden_variable mean_explanatory precision_explanatory
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.0140 0.0206 0.0182 0.0195
#> # … with 1 more variable: simultaneous_explanatory <dbl>
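These two tibbles can be combined to flag observations whose local influence exceeds the corresponding benchmark. A minimal sketch based on the columns shown above:

```r
# Indices of observations whose case-weights perturbation exceeds its benchmark
li <- tidy_local_influence(fit_prec)
bm <- local_influence_benchmarks(fit_prec)

which(li$case_weights > bm$case_weights)
```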
local_influence_autoplot(fit_prec)
For further details on the tidyverse compatibility of `mixpoissonreg` objects and how to use it to produce nice diagnostics and plots, we refer the reader to the *mixpoissonreg in the tidyverse* vignette.