Dear S-Plus Users:
I'm posting this again as I've had no response. The first part
relates to model selection in glm when comparing different link
functions for proportional data. The second part is a more general
computing question about differences in glm computations in various
statistics packages, and why they might yield such different estimates
and standard errors.
(1)
This is related to a recent discussion about better ways to model data
that are proportions. It has been suggested that rather than doing
logit transformations on proportions and then treating them as
dependent variables in a least squares regression, that we use a logit
link function in a generalized linear model (glm), as this avoids
adhoc approaches to dealing with 0's and 1's (McCullagh and Nelder
1989). I like this aspect. I have a problem where my dependent
variable is proportion of seedlings surviving in a quadrat (fixed
area) that I want to relate to 4 independent variables. Using glm I
fit both a linear least square regression, glm(surv~ x1 + x2 + x3 +
x4,family="gaussian"), and a linear model with logit link and constant
variance, glm(surv~ x1 + x2 + x3 + x4,family ="quasi"(link = "logit",
variance="constant"). Residual plots suggested both models achieved
similar homogeneity of residuals and there seemed to be little
justification to model the variance function differently than constant
(e.g., mu(1-mu) for logit was no improvement. I thought I might be
able to use AIC to pick between these models, but then wondered what
"AIC" computation to use as several alternatives have been offered.
Below is a summary of deviances and AIC for the 2 models
identity link logit link
residual deviance 17.245 16.536
df 125 125
dispersion 0.138 0.132
parameter
AIC -252.601 -258.059
"quasi" AIC 18.625 17.856
Now as these really are both least squares regressions, one a linear
model (y=b0 + b1x1+ b2x2 + ..)and the other a nonlinear model (y =
exp(b0 + b1x1 + b2x2 + ...)/(1+exp(b0 + b1x1 + b2x2 + ...)) where the
loglikelihood is for a normal error distribution, I would think to use
an AIC based on -2loglikelihood + 2p, where -2loglikelihood =
n*log(RSS/n), RSS = residual sums of squares = residual deviance, and
p is number of parameters. But often in the glm literature related to
quasi likelihoods where a dispersion parameter is estimated I see a
quantity quasi AIC = deviance/estimate of dispersion parameter + 2p or
in Venables and Ripley (1994) quasi AIC = deviance + 2p(estimate of
dispersion). Venables and Ripley suggest the estimate of dispersion
parameter should be held constant to compare models, but that would
make it a constant that effectively could be removed from the formula.
All three of these variants of AIC order the models the same (as does
just looking at deviance because number of model parameters is the
same) but the scale of relative differences is different for the
different variants of AIC. It is the relative differences of AIC that
are supposed to be important when comparing models. What is the best
way to compare glm's with different link functions and estimated
dispersions (and possibly different number of parameters) using AIC?
Or should this not be done?
(2)
An interesting issue occurred with estimating the logit link model in
3 different packages, S-Plus 4.5, SAS, SYSTAT 7; different parameter
estimates or standard errors. Results below. Note that deviances
(RSS = 16.5365) and estimates of dispersion (MSE = 0.1323) were the
same for all packages. Anyone have any comment on what S-Plus is
doing differently than SAS and SYSTAT?
S-Plus SAS Insight SYSTAT 7
glm( family=quasi(link glm (quasi,link=logit, Nonlinear
=logit,variance=const) (variance=normal, model
deviance) (loss = RSS,
Gauss-Newton)
b0(se) -0.098 (6.707) -0.383 (8.295) -0.383 (6.767)
b1(se) 0.002 (0.003) 0.002 (0.003) 0.002 (0.003)
b2(se) -0.003 (0.002) -0.003 (0.002) -0.003 (0.002)
b3(se) 0.557 (0.208) 0.567 (0.262) 0.567 (0.211)
b4(se) -0.033 (0.009) -0.034 (0.011) -0.034 (0.009)
SYSTAT nonlinear least squares parameter estimates are similar to
those from SAS glm but SYSTAT standard errors are more similar to
S-plus glm standard errors. Great discrepancy between SAS/SYSTAT b0
estimate and that of S-Plus is disturbing.
Brian Cade (USGS)
brian_cade@usgs.gov
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news
|