stepwise regression

To: s-news@lists.biostat.wustl.edu
Subject: stepwise regression
From: Jeff Simonoff <jsimonof@stern.nyu.edu>
Date: Sat, 18 Nov 2000 17:04:30 -0500 (EST)

Dear Splusers,
The recent message about stepwise regression, and Frank Harrell's reply to
it, prompted an exchange of e-mails between Frank and me on stepwise
regression in particular, and automatic data processing methods in
general. We found that we are in a great deal of agreement on these
issues, and we thought that it might be beneficial to the group to
summarize those feelings here. I believe that what is here is an accurate
reflection of both of our feelings, but just in case I'm mistaken, blame
me, not Frank! This message is also longer than your typical S-News
message, so if you're not really interested in these philosophical
questions of data analysis, you should press your delete key right now.

In Frank's message he pointed people to a web page with a discussion of
the pitfalls of stepwise regression (sorry, I don't have the address
handy). Basically, the advice given there was "never do it." The reasons
given fall into two broad categories:

(1) Stepwise methods (even those incorporating both forward and
    backward selection) are often fooled, and don't actually end
    up with the "best" model, however you might define it. There is
    a direct parallel here with function maximization - local
    maximization does not necessarily lead to the global maximum. 
    Similarly, just because adding a predictor to a current version
    of the model is the "best" thing to do, that doesn't mean that
    there isn't another model involving totally different predictors that
    is better still.
(2) More seriously, stepwise regression, and other automatic model
    selection methods like all subsets regression, result in inferential
    measures (F-test, t-tests, R-squared, standard error of the estimate,
    prediction intervals) that are biased towards indicating too much
    strength in the relationship between the target and the chosen
    predictors.
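
Point (2) can be illustrated with a small simulation. The following is
only a minimal Python sketch under assumed settings (50 observations, 20
candidate predictors that are all pure noise; the function names, sample
sizes, and seed are hypothetical choices, not anything from the web page
mentioned above). It compares the R-squared of a predictor fixed in
advance with the R-squared of the predictor picked because it fit best.

```python
import random


def r_squared(x, y):
    """R-squared from a simple least squares regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def simulate(n=50, q=20, reps=200, seed=0):
    """Average R-squared of a prespecified vs. a selected noise predictor."""
    rng = random.Random(seed)
    prespecified, selected = [], []
    for _ in range(reps):
        # Target and all q candidate predictors are independent noise.
        y = [rng.gauss(0, 1) for _ in range(n)]
        r2 = [r_squared([rng.gauss(0, 1) for _ in range(n)], y)
              for _ in range(q)]
        prespecified.append(r2[0])  # predictor chosen before seeing data
        selected.append(max(r2))    # predictor chosen because it fit best
    return sum(prespecified) / reps, sum(selected) / reps


mean_prespecified, mean_selected = simulate()
print("mean R-squared, prespecified predictor:", round(mean_prespecified, 3))
print("mean R-squared, selected predictor:    ", round(mean_selected, 3))
```

Since every predictor here is unrelated to the target, any excess of the
selected R-squared over the prespecified one comes from selection alone.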

The first objection becomes pretty obvious if you use stepwise methods.
They were invented because of slow computers and, particularly in the
least squares regression case, are in my view obsolete. I would never
teach stepwise methods for least squares regression - besides all of the
drawbacks of automatic methods (point (2) above), they simply don't do
what you want when there is collinearity (and if there isn't
collinearity, ordinary t-tests work fine).
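
The way forward selection can go wrong under collinearity can be sketched
as follows. This is a hypothetical constructed example in Python, not
data from any real study: x1 and x2 are strongly negatively correlated
and jointly determine y almost exactly, while x3 is a noisy copy of y.
The helper names (rss, forward_select, best_subset) and the seed are
assumptions made for illustration.

```python
import itertools
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def rss(y, columns):
    """Residual sum of squares of y regressed on columns plus an intercept."""
    basis = []
    for col in columns:
        v = center(col)
        for b in basis:  # Gram-Schmidt: remove what earlier columns explain
            coef = dot(v, b) / dot(b, b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        if dot(v, v) > 1e-12:
            basis.append(v)
    resid = center(y)
    for b in basis:
        coef = dot(resid, b) / dot(b, b)
        resid = [r - coef * bi for r, bi in zip(resid, b)]
    return dot(resid, resid)


def forward_select(y, predictors, size):
    """Greedy forward selection: add the best single predictor each step."""
    chosen = []
    while len(chosen) < size:
        remaining = [name for name in predictors if name not in chosen]
        best = min(remaining, key=lambda name: rss(
            y, [predictors[c] for c in chosen + [name]]))
        chosen.append(best)
    return chosen


def best_subset(y, predictors, size):
    """All subsets: examine every model with the given number of predictors."""
    return min(itertools.combinations(predictors, size),
               key=lambda names: rss(y, [predictors[c] for c in names]))


rng = random.Random(1)
n = 200
z = [rng.gauss(0, 1) for _ in range(n)]
x1 = [zi + 0.3 * rng.gauss(0, 1) for zi in z]
x2 = [-zi + 0.3 * rng.gauss(0, 1) for zi in z]
y = [a + b + 0.05 * rng.gauss(0, 1) for a, b in zip(x1, x2)]
x3 = [yi + 0.3 * rng.gauss(0, 1) for yi in y]
predictors = {"x1": x1, "x2": x2, "x3": x3}

forward = forward_select(y, predictors, 2)
best = best_subset(y, predictors, 2)
rss_forward = rss(y, [predictors[name] for name in forward])
rss_best = rss(y, [predictors[name] for name in best])
print("forward selection:", forward, "RSS =", round(rss_forward, 2))
print("best pair:        ", list(best), "RSS =", round(rss_best, 2))
```

In this construction the greedy first step takes x3, because neither x1
nor x2 looks useful on its own, so the forward path never reaches the
pair (x1, x2) that an all subsets search finds.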

I do, however, talk about all subsets. I talk about all of the problems of
inflation of apparent strength. I also talk about a somewhat conservative
correction factor for the standard error of the estimate related to
Jimmy Ye's 1998 JASA article on this subject, which involves treating
the residual degrees of freedom as n-q-1, where q is the maximum number 
of predictors considered in the model selection (rather than the number 
of predictors in the final chosen model). I also say in no uncertain terms
that in any situation with a reasonably large sample size, you absolutely
should validate on new data (that is, hold some data out of the
model-building phase, and apply your chosen model to the held-out sample
to see what its predictive accuracy really is). There has also been some
research on using bootstrap methods for this same purpose.
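
The two remedies just described, the n-q-1 degrees-of-freedom correction
and validation on held-out data, can be put side by side in a rough
Python sketch. Everything specific here is an assumption made for
illustration: pure-noise data, 50 training and 50 held-out observations,
q = 20 candidate predictors, a final model with one selected predictor,
and the function names. It is a simplified use of the n-q-1 idea, not a
reproduction of the procedure in the JASA article.

```python
import math
import random


def fit_line(x, y):
    """Simple least squares fit; returns intercept, slope and RSS."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((b - intercept - slope * a) ** 2 for a, b in zip(x, y))
    return intercept, slope, rss


def one_replication(rng, n_train=50, n_hold=50, q=20):
    n = n_train + n_hold
    y = [rng.gauss(0, 1) for _ in range(n)]
    xs = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(q)]
    y_train, y_hold = y[:n_train], y[n_train:]

    # Model selection uses the training half only.
    fits = [fit_line(x[:n_train], y_train) for x in xs]
    best = min(range(q), key=lambda j: fits[j][2])
    intercept, slope, rss = fits[best]

    p = 1  # predictors in the final chosen model
    naive = math.sqrt(rss / (n_train - p - 1))         # usual n-p-1
    conservative = math.sqrt(rss / (n_train - q - 1))  # n-q-1 correction

    # Apply the chosen model, unchanged, to the held-out half.
    x_hold = xs[best][n_train:]
    sq_err = sum((b - intercept - slope * a) ** 2
                 for a, b in zip(x_hold, y_hold))
    holdout_rmse = math.sqrt(sq_err / n_hold)
    return naive, conservative, holdout_rmse


rng = random.Random(0)
reps = 200
results = [one_replication(rng) for _ in range(reps)]
mean_naive = sum(r[0] for r in results) / reps
mean_conservative = sum(r[1] for r in results) / reps
mean_holdout = sum(r[2] for r in results) / reps
print("naive standard error of the estimate:", round(mean_naive, 3))
print("with n-q-1 degrees of freedom:       ", round(mean_conservative, 3))
print("root mean squared error on holdout:  ", round(mean_holdout, 3))
```

The held-out error is the honest yardstick here; the comparison shows
how far the naive standard error of the estimate falls below it after
selection, and on which side of it the corrected version lands.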

I can't agree, however, with the comments on the previously mentioned
web page that state that these problems with inference measures 
imply "never do it." The arguments that inference methods are
based on prespecified hypotheses didn't impress me 25 years ago (when I
was learning statistics), and they still don't. Nobody *ever* does
statistics this way; if we did, we would never identify outliers, look for
transformations, enrich the model in response to patterns in residual
plots, and so on (all of which also can increase the apparent strength of 
a regression). Further, I would argue that with the explosion of
methods commonly called "data mining," these pieces of advice are
ludicrously anachronistic. All subsets regression is nothing compared to
those kinds of methods. We are no longer in the era of small data sets
isolated from each other in time; we are now in an era of large (or even
massive) data sets that are part of an ongoing process. In that context,
I would argue that automatic methods are crucial, and the key for
statisticians should be to get people to validate their models and
correct for selection effects, not to tell them what nobody ever believed
anyway.

Now, getting my students to actually do this in their own work is another
matter ...

           Best,

           Jeff


==============================================================================
Jeffrey S. Simonoff                            Phone:  (212) 998-0452
Dept. of Statistics & Oper. Res.
Leonard N. Stern School of Business            Fax:    (212) 995-4003
New York University
44 West 4th Street, Rm. 8-54                   e-mail: jsimonoff@stern.nyu.edu
New York, NY 10012-0258
USA                                    WWW: http://www.stern.nyu.edu/~jsimonof
==============================================================================




