| To: | Brenton Chatfield <chatfb01@student.uwa.edu.au> |
|---|---|
| Subject: | Re: Choosing the best model |
| From: | Frank E Harrell Jr <f.harrell@vanderbilt.edu> |
| Date: | Wed, 09 Aug 2006 09:45:53 -0500 |
| Cc: | s-news@lists.biostat.wustl.edu |
| In-reply-to: | <002101c6bb79$5f5f35f0$172c5f82@segs.uwa.edu.au> |
| References: | <002101c6bb79$5f5f35f0$172c5f82@segs.uwa.edu.au> |
| User-agent: | Thunderbird 1.5.0.5 (X11/20060728) |
Brenton Chatfield wrote:

> Greetings to you all,
>
> I am looking for advice on how to select the ‘best’ of multiple models. I have trawled the archives, but have not had any joy.
>
> I am using S-plus 6.2 with Windows XP and looking at species presence absence data. I have used GLMs and GAMs to investigate which environmental variables are influencing distribution. I am also comparing whether a simple or more detailed classification of habitat (1 of the predictor variables) provides better model fits.
>
> For each species, I have 4 models,
> - 2 GLMs (1 using simple habitat classification and 1 using detailed classification) and
> - 2 GAMs (simple habitat and detailed habitat).
>
> Each final model is selected using backward stepwise, with AIC for variable selection.
>
> While I understand that D squared (1 – residual deviance/null deviance) should not be used to compare models based on binomial distribution (because it does not follow the Chi squared distribution), I have taken the view that these 4 models are nested and so I have calculated D squared and Adjusted D squared values for them and have been using this to compare the 4 models – please correct me on this if it is not the case.
>
> At present I am just eye balling these values to see which model explains more deviation, but was hoping that someone could suggest a more rigorous way of determining which model is best?
>
> Any thoughts or suggested reading would be appreciated.
>
> Look forward to comments.
>
> Brenton
>
> Brenton Chatfield
> PhD Candidate UWA & Coastal CRC
> School of Earth and Geographical Sciences
> University of Western Australia (M004)
> 35 Stirling Highway
> Crawley WA 6009
> Email: chatfb01@student.uwa.edu.au
> Telephone: +61 8 6488 4235
> Fax: +61 8 6488 1037

Most indexes are not corrected for 'data mining' (e.g. variable selection) so there is a good chance they will give you a misleading picture. For example, if one model was drastically more overfit than another, it may seem to be better but may not predict better on new data.
Also you haven't explained why you need to eliminate variables from your original list. Without simultaneous shrinkage (as with the lasso) this may not be a good idea. A better approach is to fit a full flexible model (using e.g. regression splines) and to penalize it down to what the data will support, if any penalization is needed. The need for penalization (shrinkage) will depend on the amount of information in your data (e.g., effective sample size) and the complexity of the postulated model.
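As a rough illustration of "fit a full flexible model and penalize it down", here is a minimal Python sketch. It is not S-PLUS code and not any particular library's method: the simulated presence/absence data, the knot locations, the linear (hinge) spline basis, the quadratic (ridge-type) penalty, and the plain gradient-descent fitter are all assumptions chosen to keep the example self-contained.

```python
import math
import random

def spline_basis(x, knots):
    """Linear regression-spline basis: x plus one hinge term per knot."""
    return [x] + [max(0.0, x - k) for k in knots]

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_penalized_logistic(X, y, penalty, steps=1500, rate=1.0):
    """Gradient descent on mean negative log-likelihood
    plus (penalty / 2) * sum(beta^2). The intercept is not penalized."""
    n, p = len(X), len(X[0])
    intercept, beta = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        for row, yi in zip(X, y):
            eta = intercept + sum(b * v for b, v in zip(beta, row))
            err = sigmoid(eta) - yi
            g0 += err
            for j in range(p):
                g[j] += err * row[j]
        intercept -= rate * g0 / n
        beta = [b - rate * (gj / n + penalty * b) for b, gj in zip(beta, g)]
    return intercept, beta

def deviance(X, y, intercept, beta):
    """Binomial residual deviance: -2 * log-likelihood."""
    total = 0.0
    for row, yi in zip(X, y):
        prob = sigmoid(intercept + sum(b * v for b, v in zip(beta, row)))
        prob = min(max(prob, 1e-12), 1.0 - 1e-12)
        total += yi * math.log(prob) + (1 - yi) * math.log(1.0 - prob)
    return -2.0 * total

# Hypothetical data: one environmental gradient x with a nonlinear effect.
random.seed(1)
knots = [0.25, 0.5, 0.75]  # assumed knot locations
xs = [random.random() for _ in range(120)]
ys = [1 if random.random() < sigmoid(2.0 * math.sin(2 * math.pi * x)) else 0
      for x in xs]
X = [spline_basis(x, knots) for x in xs]

results = []
for penalty in (0.0, 0.02, 0.2):
    b0, beta = fit_penalized_logistic(X, ys, penalty)
    norm = math.sqrt(sum(b * b for b in beta))
    dev = deviance(X, ys, b0, beta)
    results.append((penalty, norm, dev))
    print(f"penalty={penalty:<5} coef norm={norm:.3f} in-sample deviance={dev:.2f}")
```

Larger penalties shrink the spline coefficients and raise the in-sample deviance. Because of that, the penalty should be chosen by a criterion that accounts for the fitting done (for example deviance on data not used in fitting), not by apparent fit.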
--
Frank E Harrell Jr Professor and Chair School of Medicine
Department of Biostatistics Vanderbilt University