
Re:

To: dgcatanzaro@gmail.com
Subject: Re:
From: Frank E Harrell Jr <f.harrell@vanderbilt.edu>
Date: Thu, 25 Sep 2008 21:39:08 -0500
Cc: s-news@lists.biostat.wustl.edu
In-reply-to: <48DBF4D7.3020905@gmail.com>
References: <48DBF4D7.3020905@gmail.com>
User-agent: Thunderbird 2.0.0.16 (X11/20080724)
Donald Catanzaro, PhD wrote:
Hi All,

My apologies to the list as I lurch forward in my humble quest to cross-validate my dataset. As folks have seen, it is going rather slower than I had hoped, which is due more to my own shortcomings than to anything else.

I've been working on subsetting my dataset into an 80/20 split and creating a model with the 80% data and then using the remaining 20% for model validation. For performance measures of the 80% model I'd like to use the AIC and BIC coming from the 20% validation dataset.

It is rather nice that glm includes a subset option, so I can create my model using 80% of the data when supplied with the correct vector. Is there a similar option where I can run the 20% data through the 80%-derived GLM and thus pull out the deviance and log-likelihood without additional calculations?

If not, if I understand correctly, my other option would be to:
A) predict the 80% data points from the 80% model
B) find mu and size of the 80% predicted data points using fitdistr
C) calculate the log-likelihood of the 20% validation dataset using mu and size from the 80% predicted data points
D) calculate AIC and BIC from that log-likelihood

If I can't run the 20% data through my 80% model, would A-D get me where I'd like to be?
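
For concreteness, steps C and D can be sketched as follows. This is a minimal illustration in Python rather than S, assuming the counts follow a negative binomial distribution (which is what the mu and size parameters in step B suggest). The validation counts, the values of mu_hat and size_hat, the parameter count k = 2, and all function names are hypothetical and not taken from the thread.

```python
import math

def nb_logpmf(y, mu, size):
    """Log-probability of count y under a negative binomial
    with mean mu and dispersion parameter size (both > 0)."""
    return (math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
            + size * math.log(size / (size + mu))
            + y * math.log(mu / (size + mu)))

def heldout_loglik(counts, mu, size):
    """Step C: log-likelihood of the validation counts for fixed mu and size."""
    return sum(nb_logpmf(y, mu, size) for y in counts)

def aic(loglik, k):
    """Step D: AIC = -2 logL + 2k, with k the number of estimated parameters."""
    return -2.0 * loglik + 2.0 * k

def bic(loglik, k, n):
    """Step D: BIC = -2 logL + k log(n), with n the number of observations."""
    return -2.0 * loglik + k * math.log(n)

# Hypothetical 20% validation counts and hypothetical estimates from step B.
validation_counts = [0, 3, 1, 7, 2, 0, 4]
mu_hat, size_hat = 2.5, 1.2

ll = heldout_loglik(validation_counts, mu_hat, size_hat)
print("log-likelihood:", ll)
print("AIC:", aic(ll, k=2))
print("BIC:", bic(ll, k=2, n=len(validation_counts)))
```

The sketch only shows the arithmetic of steps C and D for a single mu and size; it says nothing about how reliable the resulting numbers are.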


That approach to model validation (a single 80/20 split of the data) has an extremely wide margin of error; that is, any estimate of predictive accuracy arising from it has a large mean squared error.
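
One way to see this is to repeat the 80/20 split many times on the same data and watch how much the held-out accuracy estimate moves. The toy simulation below is only a sketch under stated assumptions: the data are simulated, the model is a simple least-squares line rather than a GLM, and the sample size, coefficients, seed, and function names are all hypothetical.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope

def split_sample_mse(xs, ys, rng, train_frac=0.8):
    """Fit on a random 80% of the data, return squared-error loss on the other 20%."""
    idx = list(range(len(xs)))
    rng.shuffle(idx)
    cut = int(train_frac * len(idx))
    train, test = idx[:cut], idx[cut:]
    a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
    return statistics.fmean([(ys[i] - (a + b * xs[i])) ** 2 for i in test])

rng = random.Random(1)   # fixed seed so the run is repeatable
n = 100                  # hypothetical sample size
xs = [rng.uniform(0, 10) for _ in range(n)]
ys = [1.0 + 0.5 * x + rng.gauss(0, 2) for x in xs]

# Same data, 200 different random 80/20 splits.
estimates = [split_sample_mse(xs, ys, rng) for _ in range(200)]
print("min :", min(estimates))
print("max :", max(estimates))
print("sd  :", statistics.stdev(estimates))
```

Each entry in estimates is what a single split would have reported; the gap between the smallest and largest value gives a rough sense of how much one split's answer depends on which observations happened to land in the 20%.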

Frank

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
