On Thu, 21 Jun 2001, D. Mckenzie wrote:
> I have rather noisy data: presence/absence of tree species as functions
> of environmental variables. Sample sizes range from 1500 to
> 4000. tree() tends to overfit models, as has been observed by
> several authors. The misclassification error rate seems to go up almost
> linearly (with my data) with reduction in the number of terminal nodes.
> When pruning back to "sensible" sizes, I notice that some terminal splits
> have the same value (1 or 0) at both nodes. Using various criteria to
> prune the tree, or with an unpruned tree, this happens.
>
> Obviously (I think) a split is useless if it does not discriminate. Is
> this a common feature of classification trees, or is it characteristic of a
> certain kind of data, or is it another indicator of lack of fit? Has
> anyone tried hacking tree() or prune.tree() to circumvent this?
If you prune on misclassification rate (as recommended by most people)
this does not happen. The split is not useless, in that the predicted
probabilities differ, but it is not effective for just predicting a class
with the default 0-1 loss structure.
There's an example and comment in V&R3, specifically on page 327.
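
A small numeric sketch of the point above, written in Python with only the
standard library. The counts are invented for illustration and are not taken
from the poster's data or from V&R3; the helper names are hypothetical. It
shows a split whose two children both predict "absent", so the number of
misclassified cases is unchanged, while the deviance still drops because the
class proportions in the children differ.

```python
import math

def misclassified(n0, n1):
    """Errors made when a node predicts its majority class (0-1 loss)."""
    return min(n0, n1)

def deviance(n0, n1):
    """Binomial deviance of a node: -2 * sum over classes of n_k * log(p_k)."""
    n = n0 + n1
    return -2.0 * sum(k * math.log(k / n) for k in (n0, n1) if k > 0)

# Hypothetical (absent, present) counts, invented for this example.
parent = (80, 30)
left, right = (50, 10), (30, 20)   # the two children sum to the parent

# Both children have "absent" as the majority class.
errors_before = misclassified(*parent)
errors_after = misclassified(*left) + misclassified(*right)

dev_before = deviance(*parent)
dev_after = deviance(*left) + deviance(*right)

# Predicted probability of presence differs between the children.
p_left = left[1] / sum(left)
p_right = right[1] / sum(right)

print("misclassified before/after split:", errors_before, errors_after)
print("deviance before/after split:", round(dev_before, 2), round(dev_after, 2))
print("P(present) in children:", round(p_left, 3), round(p_right, 3))
```

Under these assumed counts, a pruning criterion based on misclassification
sees no gain from the split and would be expected to remove it, whereas a
deviance-based criterion sees an improvement and may keep it.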
--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595