s-news

Re: Developing a parsimonious mixed logistic regression model

To: "Hunsicker, Lawrence" <lawrence-hunsicker@uiowa.edu>
Subject: Re: Developing a parsimonious mixed logistic regression model
From: Frank E Harrell Jr <f.harrell@vanderbilt.edu>
Date: Mon, 05 Jan 2009 11:22:52 -0600
Cc: s-news@lists.biostat.wustl.edu
In-reply-to: <2B80F69A8A189D48B0E668B0BBC6BA4201E0113D@HC-MAIL13.healthcare.uiowa.edu>
References: <2B80F69A8A189D48B0E668B0BBC6BA4201E0113D@HC-MAIL13.healthcare.uiowa.edu>
User-agent: Thunderbird 2.0.0.18 (X11/20081125)
Hunsicker, Lawrence wrote:


Good morning and Happy New Year, folks:

Today I have a statistical, rather than a programming, problem for you folks. I am trying to develop a mixed logistic model to predict (and perhaps to explain) which patients in a cohort get a certain test done. The patients are nested within center within country within region. There are about 25,000 patients and about half have had the test done. I’d like to test the importance of a small number of covariates (definable a priori), but because this is a voluntary data set with variable inclusion criteria from center to center, I’d like to correct for a batch of other variables that may contribute “noise” to the analysis.

First, I understand that with 25,000 patients, it is probably really not necessary to develop a parsimonious model. I could just throw all the nuisance covariates into the model and then ignore them. (And I may well choose to do just this.) But it is traditional to generate a “parsimonious” model.

Traditional, perhaps, but why? See:

@Article{gre00whe,
  author =               {Greenland, Sander},
  title =                {When should epidemiologic regressions use random coefficients?},
  journal =              {Biometrics},
  year =                 2000,
  volume =               56,
  pages =                {915-921},
  annote =               {Bayesian methods;causal inference;empirical Bayes
estimators;epidemiologic method;hierarchical regression;mixed
models;multilevel modeling;random-coefficient
regression;shrinkage;variance components;use of statistics in
epidemiology is largely primitive;stepwise variable selection on
confounders leaves important confounders uncontrolled;composition
matrix;example with far too many significant predictors with many
regression coefficients absurdly inflated when
overfit;lack of evidence for dietary effects mediated through
constituents;shrinkage instead of variable selection;larger effect on
confidence interval width than on point estimates with variable
selection;uncertainty about variance of random effects is just
uncertainty about prior opinion;estimation of variance is
pointless;instead the analysis should be repeated using different
values;"if one feels compelled to estimate $\tau^2$, I would recommend
giving it a proper prior concentrated amount contextually reasonable
values";claim about ordinary MLE being unbiased is misleading because
it assumes the model is correct and is the only model
entertained;shrinkage towards compositional model;"models need to be
complex to capture uncertainty about the relations...an honest
uncertainty assessment requires parameters for all effects that we
know may be present.  This advice is implicit in an antiparsimony
principle often attributed to L. J. Savage 'All models should be as
big as an elephant (see Draper, 1995)'".  See also gus06per.}
}
@Article{gus06per,
  author =               {Gustafson, Paul and Greenland, Sander},
  title =                {The performance of random coefficient regression in accounting for residual confounding},
  journal =              {Biometrics},
  year =                 2006,
  volume =               62,
  pages =                {760-768},
  annote =               {Bayesian
                  analysis;bias;confounding;epidemiology;residual
                  confounding;hierarchical
                  models;identifiability;mixed models;random
                  effects;observational studies;nutritional
                  epidemiology;random coefficient regression;random
                  slopes}
}

Frank


So now to the question. To optimize any specific model, I will be using either partial quasilikelihood, or partial likelihood, or some other criterion other than simple likelihood ratios. This is fine for optimizing the parameters within a model, but it doesn’t necessarily give me a criterion for choosing between alternative models. In particular, AIC is not defined for most of these models, and even the log likelihood may not be available. So neither stepAIC, nor step, nor stepwise works to develop a parsimonious mixed logistic model. (In fact, if you call one of these methods, you get an error return saying that the function won’t work on this sort of base model.)

So how does one choose amongst models optimized using a criterion other than log likelihood? Is there any mathematically accurate way? And if there is (and this, I suppose, is the programming question) does an S or R function exist that will automate development of a parsimonious model?

Thanks in advance to any of you that can give me help on this question. Again, Happy New Year to all.

Larry Hunsicker


--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University