Donald Catanzaro, PhD wrote:
Hi All,
My apologies to the list as I lurch forward in my humble quest to
cross-validate my dataset. As folks have seen, it is going rather slower
than I had hoped, which is due more to my own shortcomings than to anything else.
I've been working on subsetting my dataset into an 80/20 split and
creating a model with the 80% data and then using the remaining 20% for
model validation. For performance measures of the 80% model I'd like to
use the AIC and BIC coming from the 20% validation dataset.
It is rather nice that glm includes a subset option so I can create my
model using 80% of the data when supplied with the correct vector.
Is there a similar option where I can run the 20% data through the
80%-derived GLM and thus pull out the deviance and log-likelihood without
additional calculations?
If not, if I understand correctly, my other option would be to:
A) predict the 80% data points from the 80% model
B) find mu and size of the 80% predicted data points using fitdistr
C) calculate the log likelihood of the 20% validation dataset using mu
and size from the 80% predicted data points
D) calculate AIC and BIC from that log-likelihood
If I can't run the 20% data through my 80% model, would A-D get me where
I'd like to be?
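
For readers following the thread, here is a minimal sketch of the direct route asked about above: fit on the 80% subset, then score the 20% hold-out with predict() and the model's own density. It assumes a negative binomial model fitted with MASS::glm.nb (suggested by the "mu and size" wording, but not confirmed in the post). The data are simulated, and every object name (dat, train, fit, mu_test, ll_test, dev_test) is hypothetical.

```r
# Sketch only: simulated data, hypothetical names, negative binomial assumed.
library(MASS)  # glm.nb()

set.seed(1)
n <- 500
dat <- data.frame(x = rnorm(n))
dat$y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.8 * dat$x))

# Logical vector marking the 80% training rows
train <- seq_len(n) %in% sample(n, size = 0.8 * n)

# Fit on the 80% subset via the subset argument
fit <- glm.nb(y ~ x, data = dat, subset = train)

# Predicted means for the held-out 20%
mu_test <- predict(fit, newdata = dat[!train, ], type = "response")

# Log-likelihood of the held-out counts under the fitted model,
# using the dispersion (theta) estimated from the training data
ll_test <- sum(dnbinom(dat$y[!train], size = fit$theta, mu = mu_test, log = TRUE))

# Minus twice the held-out log-likelihood, a deviance-like score
dev_test <- -2 * ll_test
```

The sketch reports the unpenalized held-out value because the penalty terms in AIC and BIC are meant to correct an in-sample log-likelihood; whether to add them to a hold-out score is a judgment call left open here.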
That approach to model validation has an extremely wide margin of error,
i.e., a large mean squared error in any estimate of predictive accuracy
arising from it.
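
One rough way to see the instability described above is to repeat the random 80/20 split many times on the same data and look at how much the hold-out score moves. The sketch below does this on simulated data with an assumed negative binomial model. The helper heldout_ll and all other names are hypothetical, and the spread it shows is only split-to-split variation, not a formal estimate of mean squared error.

```r
# Sketch only: simulated data, hypothetical names, negative binomial assumed.
library(MASS)  # glm.nb()

set.seed(2)
n <- 200
dat <- data.frame(x = rnorm(n))
dat$y <- rnbinom(n, size = 2, mu = exp(0.5 + 0.8 * dat$x))

# One random 80/20 split: fit on 80%, return the mean held-out log-likelihood
heldout_ll <- function(d) {
  n_d <- nrow(d)
  train <- seq_len(n_d) %in% sample(n_d, size = floor(0.8 * n_d))
  fit <- glm.nb(y ~ x, data = d[train, ])
  mu <- predict(fit, newdata = d[!train, ], type = "response")
  mean(dnbinom(d$y[!train], size = fit$theta, mu = mu, log = TRUE))
}

# Repeat the split 50 times and inspect the spread of the score
ll <- replicate(50, heldout_ll(dat))
summary(ll)
sd(ll)
```

If the scores from different splits differ by about as much as the models being compared, a single split cannot separate those models reliably.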
Frank
--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University