s-news
[Top] [All Lists]

[S] model selection in glm - logit link vs identity

To: s-news@wubios.wustl.edu
Subject: [S] model selection in glm - logit link vs identity
From: Brian_Cade@usgs.gov (Brian Cade)
Date: Wed, 27 Jan 1999 12:25:34 -0700
Sender: owner-s-news@wubios.wustl.edu
     Dear S-Plus Users:
     
     I'm posting this again as I've had no response.  The first part 
     relates to model selection in glm when comparing different link 
     functions for proportional data.  The second part is a more general 
     computing question about differences in glm computations in various 
     statistics packages, and why they might yield such different estimates 
     and standard errors. 
     
     (1)
     This is related to a recent discussion about better ways to model data 
     that are proportions.  It has been suggested that rather than doing 
     logit transformations on proportions and then treating them as 
     dependent variables in a least squares regression, that we use a logit 
     link function in a generalized linear model (glm), as this avoids 
     adhoc approaches to dealing with 0's and 1's (McCullagh and Nelder 
     1989).  I like this aspect.  I have a problem where my dependent 
     variable is proportion of seedlings surviving in a quadrat (fixed 
     area) that I want to relate to 4 independent variables.  Using glm I 
     fit both a linear least square regression, glm(surv~ x1 + x2 + x3 + 
     x4,family="gaussian"), and a linear model with logit link and constant 
     variance, glm(surv~ x1 + x2 + x3 + x4,family ="quasi"(link = "logit", 
     variance="constant").  Residual plots suggested both models achieved 
     similar homogeneity of residuals and there seemed to be little 
     justification to model the variance function differently than constant 
     (e.g., mu(1-mu) for logit was no improvement.  I thought I might be 
     able to use AIC to pick between these models, but then wondered what 
     "AIC" computation to use as several alternatives have been offered.  
     Below is a summary of deviances and AIC for the 2 models
     
                        identity link               logit link
     
     
     residual deviance      17.245                      16.536
                  df      125                         125      
     
     dispersion              0.138                       0.132 
     parameter
     
     AIC                   -252.601                   -258.059  
     
     "quasi" AIC             18.625                     17.856
     
     Now as these really are both least squares regressions, one a linear 
     model (y=b0 + b1x1+ b2x2 + ..)and the other a nonlinear model (y = 
     exp(b0 + b1x1 + b2x2 + ...)/(1+exp(b0 + b1x1 + b2x2 + ...)) where the 
     loglikelihood is for a normal error distribution, I would think to use 
     an AIC based on -2loglikelihood + 2p, where -2loglikelihood = 
     n*log(RSS/n), RSS = residual sums of squares = residual deviance, and 
     p is number of parameters.  But often in the glm literature related to 
     quasi likelihoods where a dispersion parameter is estimated I see a 
     quantity quasi AIC = deviance/estimate of dispersion parameter + 2p or 
     in Venables and Ripley (1994) quasi AIC = deviance + 2p(estimate of 
     dispersion).  Venables and Ripley suggest the estimate of dispersion 
     parameter should be held constant to compare models, but that would 
     make it a constant that effectively could be removed from the formula. 
     All three of these variants of AIC order the models the same (as does 
     just looking at deviance because number of model parameters is the 
     same) but the scale of relative differences is different for the 
     different variants of AIC.  It is the relative differences of AIC that 
     are supposed to be important when comparing models.  What is the best 
     way to compare glm's with different link functions and estimated 
     dispersions (and possibly different number of parameters) using AIC?  
     Or should this not be done?
     
     (2)
     An interesting issue occurred with estimating the logit link model in 
     3 different packages, S-Plus 4.5, SAS, SYSTAT 7; different parameter 
     estimates or standard errors.  Results below.  Note that deviances 
     (RSS = 16.5365) and estimates of dispersion (MSE = 0.1323) were the 
     same for all packages.  Anyone have any comment on what S-Plus is 
     doing differently than SAS and SYSTAT?
     
     S-Plus                      SAS Insight                   SYSTAT 7 
     glm( family=quasi(link         glm (quasi,link=logit,      Nonlinear
      =logit,variance=const)            (variance=normal,        model     
                                         deviance)            (loss = RSS,
                                                              Gauss-Newton)
     
     b0(se)  -0.098 (6.707)        -0.383 (8.295)           -0.383 (6.767)  
     
     b1(se)   0.002 (0.003)         0.002 (0.003)            0.002 (0.003)  
     
     b2(se)  -0.003 (0.002)        -0.003 (0.002)           -0.003 (0.002)  
     
     b3(se)   0.557 (0.208)         0.567 (0.262)            0.567 (0.211)
     
     b4(se)  -0.033 (0.009)        -0.034 (0.011)           -0.034 (0.009)
     
     SYSTAT nonlinear least squares parameter estimates are similar to 
     those from SAS glm but SYSTAT standard errors are more similar to 
     S-plus glm standard errors.  Great discrepancy between SAS/SYSTAT b0 
     estimate and that of S-Plus is disturbing.
     
     Brian Cade (USGS)
     
     brian_cade@usgs.gov
     
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu.  To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message:  unsubscribe s-news

<Prev in Thread] Current Thread [Next in Thread>