s-news

Re: Choosing the best model

To: Brenton Chatfield <chatfb01@student.uwa.edu.au>
Subject: Re: Choosing the best model
From: Frank E Harrell Jr <f.harrell@vanderbilt.edu>
Date: Wed, 09 Aug 2006 09:45:53 -0500
Cc: s-news@lists.biostat.wustl.edu
In-reply-to: <002101c6bb79$5f5f35f0$172c5f82@segs.uwa.edu.au>
References: <002101c6bb79$5f5f35f0$172c5f82@segs.uwa.edu.au>
User-agent: Thunderbird 1.5.0.5 (X11/20060728)
Brenton Chatfield wrote:


Greetings to you all,

I am looking for advice on how to select the ‘best’ of multiple models.

I have trawled the archives, but have not had any joy.

I am using S-plus 6.2 with Windows XP and looking at species presence absence data.

I have used GLMs and GAMs to investigate which environmental variables are influencing distribution.

I am also comparing whether a simple or more detailed classification of habitat (1 of the predictor variables) provides better model fits.

For each species, I have 4 models:

- 2 GLMs (1 using simple habitat classification and 1 using detailed classification) and

- 2 GAMs (simple habitat and detailed habitat).

Each final model is selected using backward stepwise selection, with AIC as the criterion for variable selection.

While I understand that D squared (1 – residual deviance / null deviance) should not be used to compare models based on the binomial distribution (because it does not follow the chi-squared distribution), I have taken the view that these 4 models are nested, so I have calculated D squared and adjusted D squared values for them and have been using these to compare the 4 models. Please correct me if this is not the case.
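For reference, the D squared quantity described above can be written out as a short calculation. This is only a sketch in Python rather than S-PLUS. The function names are hypothetical, the presence/absence values and fitted probabilities are made up for illustration, and the adjusted form 1 - ((n - 1) / (n - p)) * (1 - D squared) is an assumed, commonly quoted definition, since the message does not say which adjustment was used.

```python
import math


def binomial_deviance(y, fitted):
    """Deviance for 0/1 responses: -2 * Bernoulli log-likelihood."""
    total = 0.0
    for yi, pi in zip(y, fitted):
        total += math.log(pi) if yi == 1 else math.log(1.0 - pi)
    return -2.0 * total


def null_deviance(y):
    """Deviance of the intercept-only model (fitted value = prevalence)."""
    prevalence = sum(y) / len(y)
    return binomial_deviance(y, [prevalence] * len(y))


def d_squared(residual_dev, null_dev):
    """D squared = 1 - residual deviance / null deviance."""
    return 1.0 - residual_dev / null_dev


def adjusted_d_squared(d2, n_obs, n_params):
    """Assumed adjustment: 1 - ((n - 1) / (n - p)) * (1 - D squared)."""
    return 1.0 - (n_obs - 1) / (n_obs - n_params) * (1.0 - d2)


# Hypothetical presence/absence data and fitted probabilities.
y = [1, 1, 1, 0, 0, 0, 1, 0]
fitted = [0.8, 0.7, 0.6, 0.3, 0.2, 0.4, 0.9, 0.1]

d2 = d_squared(binomial_deviance(y, fitted), null_deviance(y))
print("D squared:", round(d2, 3))
print("adjusted D squared:", round(adjusted_d_squared(d2, len(y), 2), 3))
```

Both values are computed on the data used for fitting, so they describe apparent fit only.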

At present I am just eyeballing these values to see which model explains more deviance, but I was hoping that someone could suggest a more rigorous way of determining which model is best.

Any thoughts or suggested reading would be appreciated.

Look forward to comments.

Brenton

Brenton Chatfield
PhD Candidate
UWA & Coastal CRC
School of Earth and Geographical Sciences
University of Western Australia (M004)
35 Stirling Highway
Crawley WA 6009
Email: chatfb01@student.uwa.edu.au
Telephone: +61 8 6488 4235
Fax: +61 8 6488 1037

Most indexes are not corrected for 'data mining' (e.g., variable selection), so there is a good chance they will give you a misleading picture. For example, if one model were drastically more overfit than another, it might seem to be better but might not predict better on new data.

Also, you haven't explained why you need to eliminate variables from your original list. Without simultaneous shrinkage (as with the lasso), this may not be a good idea. A better approach is to fit a full flexible model (using, e.g., regression splines) and to penalize it down to what the data will support, if any penalization is needed. The need for penalization (shrinkage) will depend on the amount of information in your data (e.g., effective sample size) and the complexity of the postulated model.
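A rough illustration of the shrinkage idea, sketched in Python with NumPy rather than S-PLUS: a full logistic model is fitted to simulated presence/absence data with increasing amounts of penalization, and the deviance on the fitting data is printed next to the deviance on fresh data. Everything here is an assumption made for the example: the simulated data, the function names, the penalty values, the use of linear terms instead of regression splines, and the choice of a quadratic (ridge) penalty, which is only one possible form of shrinkage and is not named in the reply.

```python
import numpy as np


def predict(X, beta):
    """Fitted probabilities from a logistic model."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))


def binomial_deviance(y, mu):
    """-2 * Bernoulli log-likelihood for 0/1 responses."""
    mu = np.clip(mu, 1e-12, 1.0 - 1e-12)
    return float(-2.0 * np.sum(y * np.log(mu) + (1.0 - y) * np.log(1.0 - mu)))


def fit_penalized_logistic(X, y, penalty, n_iter=100, tol=1e-10):
    """Newton-Raphson maximization of loglik - (penalty / 2) * sum(slopes**2)."""
    n_coef = X.shape[1]
    P = penalty * np.eye(n_coef)
    P[0, 0] = 0.0  # the intercept (column 0) is not shrunk
    beta = np.zeros(n_coef)
    for _ in range(n_iter):
        mu = predict(X, beta)
        weights = mu * (1.0 - mu)
        gradient = X.T @ (y - mu) - P @ beta
        hessian = X.T @ (X * weights[:, None]) + P
        step = np.linalg.solve(hessian, gradient)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated data: 10 candidate predictors, only 3 with a real effect.
rng = np.random.default_rng(2006)
n_train, n_new, n_pred = 150, 2000, 10
true_beta = np.zeros(n_pred + 1)
true_beta[1:4] = [1.0, -1.0, 0.5]


def simulate(n):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, n_pred))])
    y = (rng.random(n) < predict(X, true_beta)).astype(float)
    return X, y


X_train, y_train = simulate(n_train)
X_new, y_new = simulate(n_new)

results = []
for penalty in (0.0, 1.0, 10.0, 100.0):
    beta = fit_penalized_logistic(X_train, y_train, penalty)
    row = {
        "penalty": penalty,
        "slope_norm": float(np.linalg.norm(beta[1:])),
        "train_dev_per_obs": binomial_deviance(y_train, predict(X_train, beta)) / n_train,
        "new_dev_per_obs": binomial_deviance(y_new, predict(X_new, beta)) / n_new,
    }
    results.append(row)
    print(row)
```

The deviance on the fitting data can only get worse as the penalty grows, so it cannot be used to choose the penalty. The deviance on new data (or a cross-validated or bootstrap-corrected equivalent) is the quantity that shows whether any shrinkage helps for a given data set.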

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University
