I am afraid this is a tale of woe. For some reason Insightful continues
to supply a function, step.glm, with show-stopping bugs. (I looked at the
code in S-PLUS 7, not having 6.1 to hand.)
1) I think the crux here is the offset. First, step.glm returns the
results of glm.fit, whereas glm recomputes the null deviance for models
with offsets, since glm.fit gets it wrong. So step.glm is wrong.
Second, step.glm calls add1.lm and drop1.lm which do not know about
offsets, and so its approximations to the AIC changes are in a completely
incorrect model.
This has been known for many years.
2) You have a conceptual problem here. step.glm claims to minimize AIC.
For AIC to be defined you need a log-likelihood, and the point about
quasi-likelihood fits is that you do not have a likelihood. Your
quasi-likelihood is that of a quasi-Poisson model with a lot of
over-dispersion, but does not correspond to the likelihood of any
over-dispersed Poisson model. If your data are counts I would consider
using a negative-binomial GLM.
Model selection in quasi-likelihood GLMs is AFAIK a research topic.
3) It is well known that step.glm does not do what it claims, both because
it uses a too-crude approximation and because it has a different
definition of AIC from Akaike (and almost everyone else). That's the
reason behind stepAIC in the MASS library section.
On Mon, 3 Oct 2005, Megan Ferguson wrote:
Hi,
I'm using S-PLUS Professional Edition Version 6.1.2 Release 1 for Windows to
build a quasi-likelihood glm with a log link and variance proportional to the
mean. I've been using the "glm" command to build a null model, which I then
feed to "step.glm" to test a range of variables for inclusion in the model.
My code is as follows:
####
null.indglm.stripe10 <- glm(formula = indstripe ~offset(log(oneffort20)),
family = quasi(link = log, variance = "mu"), data = all.10.ind13)
indglm1.stripe10 <- step.glm(null.indglm.stripe10, direction="both", scope=~
frontsd2
+ distarc + distarc2
+ depth + depth2
+ pslope + pslope2
+ temp + temp2
+ sal + sal2
+ tcd + tcd2
+ tcs + tcs2
+ ezd + ezd2
+ log(chcon) + (log(chcon))2
+ beauf + beauf2,
keep=keepit)
####
I would like to look at the percent change in deviance between the null model
and the model that step.glm selects. To do this, I used the "summary"
command on indglm1.stripe10. The value returned for the null deviance is
LESS than that for the residual deviance:
###
Null Deviance: 238718.1 on 14522 degrees of freedom
Residual Deviance: 258702.1 on 14508 degrees of freedom
###
Then I called for a summary of the null model (null.indglm.stripe10) and a
much larger value was returned for the null deviance:
###
Null Deviance: 278765.6 on 14522 degrees of freedom
Residual Deviance: 278765.6 on 14522 degrees of freedom
###
It looks like step.glm is using the larger value for the null deviance in
computing AIC (I see this when I get the printout from
indglm1.stripe10$anova). Why is the estimated value of the null deviance
different between the summary of the null model and the summary of the best
model from step.glm? Does this have to do with my specification of the
quasi-likelihood family? Is the second value of the null deviance derived
from a revised estimate of the scale (dispersion) parameter? When does the
estimated value of the null deviance change in the step.glm algorithm?
Thanks for your time.
Megan
--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK Fax: +44 1865 272595
|