Hunsicker, Lawrence wrote:
Good morning and Happy New Year, folks:
Today I have a statistical, rather than a programming, problem for you
folks. I am trying to develop a mixed logistic model to predict (and
perhaps to explain) which patients in a cohort get a certain test done.
The patients are nested within center within country within region.
There are about 25,000 patients and about half have had the test done.
I’d like to test the importance of a small number of covariates
(definable a priori), but because this is a voluntary data set with
variable inclusion criteria from center to center, I’d like to correct
for a batch of other variables that may contribute “noise” to the
analysis.
First, I understand that with 25,000 patients, it is probably really not
necessary to develop a parsimonious model. I could just throw all the
nuisance covariates into the model and then ignore them. (And I may
well choose to do just this.) But it is traditional to generate a
“parsimonious” model.
Traditional, perhaps, but why? See
@Article{gre00whe,
author = {Greenland, Sander},
title = {When should epidemiologic regressions use random
coefficients?},
journal = Biometrics,
year = 2000,
volume = 56,
pages = {915-921},
annote = {Bayesian methods;causal inference;empirical Bayes
estimators;epidemiologic method;hierarchical regression;mixed
models;multilevel modeling;random-coefficient
regression;shrinkage;variance components;use of statistics in
epidemiology is largely primitive;stepwise variable selection on
confounders leaves important confounders uncontrolled;composition
matrix;example with far too many significant predictors with many
regression coefficients absurdly inflated when
overfit;lack of evidence for dietary effects mediated through
constituents;shrinkage instead of variable selection;larger effect on
confidence interval width than on point estimates with variable
selection;uncertainty about variance of random effects is just
uncertainty about prior opinion;estimation of variance is
pointless;instead the analysis should be repeated using different
values;"if one feels compelled to estimate $\tau^2$, I would recommend
giving it a proper prior concentrated around contextually reasonable
values";claim about ordinary MLE being unbiased is misleading because
it assumes the model is correct and is the only model
entertained;shrinkage towards compositional model;"models need to be
complex to capture uncertainty about the relations...an honest
uncertainty assessment requires parameters for all effects that we
know may be present. This advice is implicit in an antiparsimony
principle often attributed to L. J. Savage 'All models should be as
big as an elephant (see Draper, 1995)'". See also gus06per.}
}
@Article{gus06per,
author = {Gustafson, Paul and Greenland, Sander},
title = {The performance of random coefficient regression in
accounting for residual confounding},
journal = Biometrics,
year = 2006,
volume = 62,
pages = {760-768},
annote = {Bayesian
analysis;bias;confounding;epidemiology;residual
confounding;hierarchical
models;identifiability;mixed models;random
effects;observational studies;nutritional
epidemiology;random coefficient regression;random
slopes}
}
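The "shrinkage instead of variable selection" idea in the first annotation can be illustrated with a minimal sketch. This is not the method of either paper: it uses a plain ridge penalty as a stand-in for a normal prior with variance tau^2 on the nuisance coefficients, fits an ordinary (non-mixed) logistic model, and repeats the fit for several fixed values of tau^2 rather than estimating it. The data are simulated, all names are hypothetical, and leaving the intercept and the covariate of interest unpenalized is an assumption made for the example.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(A, b):
    # Solve A x = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fit_penalized_logistic(X, y, penalize, tau2, n_iter=25):
    # Newton-Raphson for: loglik(beta) - sum_j beta_j^2 / (2 * tau2),
    # where the sum runs only over coefficients flagged in `penalize`.
    p = len(X[0])
    lam = [1.0 / tau2 if flag else 0.0 for flag in penalize]
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [-lam[j] * beta[j] for j in range(p)]
        hess = [[lam[j] if j == k else 0.0 for k in range(p)] for j in range(p)]
        for xi, yi in zip(X, y):
            pi = sigmoid(sum(b * v for b, v in zip(beta, xi)))
            w = pi * (1.0 - pi)
            for j in range(p):
                grad[j] += (yi - pi) * xi[j]
                for k in range(p):
                    hess[j][k] += w * xi[j] * xi[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Simulated data (assumption): intercept, one covariate of interest,
# and four nuisance covariates that all stay in the model.
random.seed(1)
true_beta = [0.0, 1.0, 0.3, -0.3, 0.2, 0.0]
X, y = [], []
for _ in range(400):
    row = [1.0] + [random.gauss(0.0, 1.0) for _ in range(5)]
    prob = sigmoid(sum(b * v for b, v in zip(true_beta, row)))
    X.append(row)
    y.append(1 if random.random() < prob else 0)

# Shrink only the nuisance coefficients; repeat for several tau^2 values.
penalize = [False, False, True, True, True, True]
fits = {tau2: fit_penalized_logistic(X, y, penalize, tau2)
        for tau2 in (0.01, 1.0, 100.0)}

for tau2, beta in fits.items():
    print(f"tau^2={tau2:>6}: " + " ".join(f"{b:+.3f}" for b in beta))
```

Small values of tau^2 pull the nuisance coefficients toward zero without dropping any of them, while large values approach the unpenalized fit. Comparing the estimate for the covariate of interest across the rows gives a rough sensitivity check of the "repeat the analysis using different values" kind.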
Frank
So now to the question. To optimize any specific model, I will be using
either partial quasilikelihood, or partial likelihood, or some other
criterion other than simple likelihood ratios. This is fine for
optimizing the parameters within a model, but it doesn’t necessarily
give me a criterion for choosing between alternative models. In
particular, AIC is not defined for most of these models, and even the
log likelihood may not be available. So neither stepAIC, nor step, nor
stepwise works to develop a parsimonious mixed logistic model. (In
fact, if you call one of these methods, you get an error return saying
that the function won’t work on this sort of base model.)
So how does one choose amongst models optimized using a criterion other
than log likelihood? Is there any mathematically accurate way? And if
there is (and this, I suppose, is the programming question) does an S or
R function exist that will automate development of a parsimonious model?
Thanks in advance to any of you that can give me help on this question.
Again, Happy New Year to all.
Larry Hunsicker
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University