s-news

Re: [S] confusion about prune.tree

To: Andrea Brown <abrown@credit.erin.utoronto.ca>
Subject: Re: [S] confusion about prune.tree
From: Prof Brian D Ripley <ripley@stats.ox.ac.uk>
Date: Wed, 22 Jul 1998 07:09:59 +0100 (BST)
Cc: s-news@wubios.wustl.edu
In-reply-to: <35B51106.2292@credit.erin.utoronto.ca>
Sender: owner-s-news@wubios.wustl.edu
On Tue, 21 Jul 1998, Andrea Brown wrote:

> I am calculating some regression trees, but when I use prune.tree() with
> the best= argument I don't always get back a tree of the size I
> indicated.  Instead the tree is grown to a larger size and then the
> deviances are calculated.  Is this an error in the code, or does this
> have something to do with a minimum default value related to the
> cost-complexity parameter?  I am aware that prune.tree has some bugs but
> these don't seem to be related to this problem.

You do need to tell us which version of S-PLUS you are using in questions
like this, as prune.tree in 3.4 is very different from that in 3.3, and
there have been minor changes since.  I do try to get MathSoft to sort
this out, but the code in my treefix library is always more recent.
If perchance you are using 3.3 or earlier, do use treefix; those bugs are
serious.

First, under no circumstances (AFAIK) does prune.tree `grow a tree to
a larger size'.  It chooses one of a set of pruned versions of the
original tree: for a cost-complexity index alpha (k in the S-PLUS code),
the one minimizing

        fit + alpha * size
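
As a rough illustration of that criterion (a sketch only, not the
prune.tree code): the deviances below are invented, and best_size is a
hypothetical helper.  It takes a nested sequence of subtrees and, for a
few values of alpha, picks the one minimizing fit + alpha * size.

```r
# Invented numbers for illustration only: sizes of a nested sequence of
# pruned subtrees and a made-up deviance ("fit") for each of them.
size <- c(19, 11, 5, 2, 1)
fit  <- c(100, 104, 116, 140, 170)

# Hypothetical helper: size of the subtree minimizing fit + alpha * size.
best_size <- function(alpha) size[which.min(fit + alpha * size)]

# Larger alpha penalizes size more heavily, so smaller subtrees win.
chosen <- sapply(c(0, 1, 3, 10, 50), best_size)
print(chosen)   # sizes 19, 11, 5, 2, 1 in turn
```

Whatever value of alpha is tried, the answer is always one of the
subtrees in the sequence.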

Those trees are not of all sizes, and really you should be selecting one by
choosing alpha (e.g. by cross-validation) and not size.  That said,
in their wisdom Pregibon and Clark introduced the best= parameter and it
has been maintained for backwards compatibility.  In the latest version
the code is

                if(!missing(best))
                        index <- ind[sum(best <= size)]

So, in the example in V&R

> prune.misclass(bwt.tr1)
$size:
[1] 19 11  5  2  1
   .....
> prune.misclass(bwt.tr1, best=8)

gives a tree of size 11, as the nearest (larger) match: two of the sizes
(19 and 11) are at least 8, so the second tree in the sequence is chosen.
But you should really only ask for one of these sizes.  I think the help
page is actually quite clear

best:     integer requesting the size (i.e. number of terminal
       nodes) of a specific subtree in the cost-complexity
       sequence to be returned. This is an alternative way to
       select a subtree than by supplying a scalar cost-complexity
       parameter k.  If there is no tree in the sequence of the
       requested size, the next largest is returned.

but maybe only in current versions (the last sentence is not in
the help page for 3.4).
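
The selection rule quoted earlier can be checked by hand.  This is a
minimal sketch, not the prune.tree source: it assumes that ind simply
indexes the sequence (its definition is not shown in the excerpt) and
that size is in decreasing order, as in the V&R example.  pick_size is
a hypothetical helper wrapping the quoted line.

```r
# Sizes in the cost-complexity sequence, from the V&R example.
size <- c(19, 11, 5, 2, 1)

# Assumption: ind just indexes the sequence; the excerpt does not define it.
ind <- seq(along = size)

# Hypothetical helper wrapping the quoted line
#     index <- ind[sum(best <= size)]
# With size decreasing, this counts the trees at least as large as best
# and so lands on the smallest of them.
pick_size <- function(best) size[ind[sum(best <= size)]]

pick_size(8)    # 11: no tree of size 8, so the next largest is taken
pick_size(5)    # 5:  an exact match is returned as is
pick_size(12)   # 19
```

In this sketch a request above the largest size (here 19) matches
nothing, and no attempt is made to handle that case.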

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

