Re: Logistic regression model building

To: "fmedeiros" <fmedeiros@uol.com.br>
Subject: Re: Logistic regression model building
From: Frank E Harrell Jr <feh3k@spamcop.net>
Date: Thu, 29 Jan 2004 14:49:19 -0600
Cc: s-news@wubios.wustl.edu
In-reply-to: <HS9LZV$IT9hYHi18I1Iybybu1XI03hYBOFRqhJcXQlR8@uol.com.br>
Organization: Vanderbilt University
References: <HS9LZV$IT9hYHi18I1Iybybu1XI03hYBOFRqhJcXQlR8@uol.com.br>
On Thu, 29 Jan 2004 16:57:31 -0200
"fmedeiros" <fmedeiros@uol.com.br> wrote:

> Dear All, I was asked to criticize the following strategy for logistic
> regression model building and I would appreciate any comments from the
> list (positive or negative):
> 
> 1. After exclusion of variables based on subject knowledge, 25 variables
> were considered

reasonable

> as possible candidates (sample size ~200, with smaller group ~90)
> 2. An all-subsets regression technique was employed and the model chosen
> was the one with the largest bias-corrected (bootstrap, B=200) ROC curve
> area

not very reasonable.  Better to do data reduction (ignoring Y) and fit a
full model on the reduced set.
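
A minimal sketch of what data reduction that ignores Y could look like, assuming simple correlation-based variable clustering with one summary score per cluster. The function names, the greedy clustering rule, and the 0.7 threshold are illustrative assumptions, not something prescribed in this message; the point is only that the outcome never enters the reduction step.

```python
import random
import statistics

def correlation(a, b):
    """Pearson correlation of two equal-length numeric lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5

def cluster_columns(cols, threshold=0.7):
    """Greedy clustering on predictors only: a column joins the first cluster
    whose seed column it correlates with at |r| >= threshold, otherwise it
    seeds a new cluster. The outcome Y is never consulted."""
    clusters = []
    for j, col in enumerate(cols):
        for members in clusters:
            if abs(correlation(cols[members[0]], col)) >= threshold:
                members.append(j)
                break
        else:
            clusters.append([j])
    return clusters

def standardize(col):
    """Center and scale a column to mean 0 and standard deviation 1."""
    m, s = statistics.fmean(col), statistics.pstdev(col)
    return [(x - m) / s for x in col]

def cluster_scores(cols, clusters):
    """One summary score per cluster: the mean of the standardized member
    columns, each sign-aligned with the cluster's seed column."""
    scores = []
    for members in clusters:
        seed = cols[members[0]]
        aligned = []
        for j in members:
            sign = 1.0 if correlation(seed, cols[j]) >= 0 else -1.0
            aligned.append([sign * z for z in standardize(cols[j])])
        scores.append([statistics.fmean(vals) for vals in zip(*aligned)])
    return scores

if __name__ == "__main__":
    rng = random.Random(0)
    n = 200
    # Hypothetical data: two noisy measurements of each of three latent traits.
    latents = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(3)]
    cols = [[z + rng.gauss(0, 0.3) for z in lat] for lat in latents for _ in range(2)]
    clusters = cluster_columns(cols)
    scores = cluster_scores(cols, clusters)
    print(len(cols), "candidate columns reduced to", len(scores), "scores")
```

The reduced scores would then all go into a single pre-specified logistic model, with no selection against Y.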

> 3. _To check stability of the model_, another bootstrap procedure was
> performed (B=200 again) and step 2 was included in every bootstrap
> sample (? Double bootstrap). So, each bootstrap resample had its _best
> model_ with its associated optimism (obtained from step 2).

Good idea to do a double bootstrap for this type of strategy
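
A rough sketch of the key idea, which is that the selection step has to be repeated inside every resample for the optimism estimate to be honest. This is the ordinary single-level optimism bootstrap rather than the full nested procedure described above, and `select_best` is a hypothetical stand-in for all-subsets selection: it simply picks the one candidate column with the highest apparent ROC area.

```python
import random

def auc(scores, labels):
    """ROC area from midranks (Mann-Whitney form); ties count one half."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        k = i
        while k + 1 < n and scores[order[k + 1]] == scores[order[i]]:
            k += 1
        for t in range(i, k + 1):
            ranks[order[t]] = (i + k) / 2 + 1  # average rank of the tied run
        i = k + 1
    n1 = sum(labels)
    n0 = n - n1
    r1 = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (r1 - n1 * (n1 + 1) / 2) / (n1 * n0)

def select_best(X, y):
    """Stand-in for the selection step: choose the column (and direction)
    with the highest apparent AUC. Returns (column, sign, apparent AUC)."""
    best = None
    for j in range(len(X[0])):
        a = auc([row[j] for row in X], y)
        sign = 1 if a >= 0.5 else -1
        a = max(a, 1 - a)
        if best is None or a > best[2]:
            best = (j, sign, a)
    return best

def optimism_corrected_auc(X, y, B=200, seed=0):
    """Apparent AUC, mean bootstrap optimism, and their difference."""
    rng = random.Random(seed)
    n = len(y)
    _, _, apparent = select_best(X, y)
    optimisms = []
    while len(optimisms) < B:
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        if len(set(yb)) < 2:
            continue  # resample lacks one class; draw again
        Xb = [X[i] for i in idx]
        jb, sb, app_b = select_best(Xb, yb)         # selection redone on the resample
        test = auc([sb * row[jb] for row in X], y)  # that choice scored on the original data
        optimisms.append(app_b - test)
    mean_opt = sum(optimisms) / B
    return apparent, mean_opt, apparent - mean_opt

if __name__ == "__main__":
    rng = random.Random(1)
    # Hypothetical pure-noise data: 200 subjects, 90 events, 25 candidates.
    y = [1] * 90 + [0] * 110
    X = [[rng.gauss(0, 1) for _ in range(25)] for _ in y]
    apparent, optimism, corrected = optimism_corrected_auc(X, y)
    print("apparent %.3f  optimism %.3f  corrected %.3f" % (apparent, optimism, corrected))
```

On pure-noise candidates like these, the apparent area of the selected variable sits above 0.5 purely because of the selection, which is what the optimism term is meant to subtract.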

> 4. The number of times that each variable appeared in the 200 bootstrap
> samples (200 _best models_, but not necessarily all different) from step
> 3 was reported.

Why?  This has severe problems with collinearity and tells you nothing
different than P-values.
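
To illustrate the collinearity problem, here is a small simulation sketch under assumed conditions: two predictors that are noisy copies of the same underlying signal, one pure-noise predictor, and a hypothetical selection rule (`pick_one`) that keeps the single column with the largest absolute correlation with Y. All names and settings are made up for the example.

```python
import random
from collections import Counter

def abs_corr(x, y):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5

def pick_one(cols, y):
    """Stand-in selection rule: index of the column with the largest |r| with y."""
    return max(range(len(cols)), key=lambda j: abs_corr(cols[j], y))

def selection_frequency(cols, y, B=200, seed=0):
    """Count how often each column index is selected across B bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y)
    counts = Counter()
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        yb = [y[i] for i in idx]
        colsb = [[c[i] for i in idx] for c in cols]
        counts[pick_one(colsb, yb)] += 1
    return counts

def make_data(n=200, seed=1):
    """Hypothetical data: x1 and x2 measure the same latent signal z with
    independent noise, x3 is unrelated noise, y depends on z only."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x1 = [v + rng.gauss(0, 0.5) for v in z]
    x2 = [v + rng.gauss(0, 0.5) for v in z]
    x3 = [rng.gauss(0, 1) for _ in range(n)]
    y = [1 if v + rng.gauss(0, 1) > 0 else 0 for v in z]
    return [x1, x2, x3], y

if __name__ == "__main__":
    cols, y = make_data()
    counts = selection_frequency(cols, y)
    for j, name in enumerate(["x1", "x2", "x3"]):
        print(name, "selected in", counts[j], "of 200 resamples")
```

Because x1 and x2 carry the same information, whichever happens to edge ahead in a given resample takes the slot, so the selections are shared between them and neither count reflects how strongly the underlying signal predicts Y.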

> 5. The final model reported was the one obtained after all-subsets
> regression was applied to the original sample, and although the
> variables in this model were the ones that showed the highest frequency
> in the 200 bootstrap samples, it was recognized that several other "best
> models" were possible, which was illustrated by the frequency of
> variables in the 200 bootstrap resamples. The _optimism_ was reported as
> the average of the 200 _optimisms_.

Good idea to look at "close misses" among the models, but I would use a
completely different and simpler strategy.  See my book Regression
Modeling Strategies for more info.  -Frank Harrell


> 
> Fernando Medeiros.
> 
> 
> Fernando Medeiros, M.D., Ph.D.
> Department of Neurosciences
> University of Sao Paulo
> 

---
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
