s-news

Re: maximum factor levels in S+

To: Jennifer Miller <jmiller@rohan.sdsu.edu>
Subject: Re: maximum factor levels in S+
From: Prof Brian D Ripley <ripley@stats.ox.ac.uk>
Date: Wed, 19 Sep 2001 07:29:35 +0100 (BST)
Cc: <s-news@lists.biostat.wustl.edu>
In-reply-to: <Pine.SOL.4.02A.10109181540180.12999-100000@rohan.sdsu.edu>
On Tue, 18 Sep 2001, Jennifer Miller wrote:

>
> Hello,
>
> I'm using S+ (2000 on windows and version 5.0 on unix) to make
> classification trees for predictive modeling. One dataset I'm using
> contains a factor predictor variable with 31 levels (used to classify a
> factor variable with 9 levels). I read in the Splus manual (for
> windows, S+ 2000) that the maximum allowed levels for
> predictor variables is 32 and for response variables it is 128 levels. My
> dataset is within those limits, but in both the windows and unix
> platforms, the process "hangs" (for days). When I exclude this 31 level
> variable, the process runs fine.
> Additionally, I've used this 31 level predictor variable successfully in a
> classification tree with a 2 level response variable.
> My question is: are the computational problems a result of the combination
> of using 31 levels to help classify 9 levels, or is it something
> additional or beyond that?

Computational problems from having more than 2 levels of response, I
suspect.

Tree fitting for an unordered factor searches over all possible
attributes, that is, splits of the levels into 2 groups.  For 31 levels
there are 2^30 - 1 splits.  Now, I have not seen the internal code of
tree() (assuming that is what you are using) but the known theory
has a shortcut for 2 response levels that allows just 30 splits to be
considered, whereas for >=3 response levels there seems to be no way to
avoid an exhaustive search.  So I guess that's what tree() is up to. (I
have looked at rpart(), and am fairly sure that's what it does.)
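
As a rough illustration of the counting argument (a sketch in Python, not
the code inside tree() or rpart()), the following compares an exhaustive
search with the two-class ordering shortcut on a small made-up table.  The
function names, the use of the Gini index, and the toy counts are all
assumptions chosen for the example.

```python
from itertools import combinations


def n_binary_splits(n_levels):
    """Ways to divide n_levels unordered levels into 2 non-empty groups."""
    return 2 ** (n_levels - 1) - 1


def gini(counts):
    """Gini impurity of a node, given its per-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_impurity(left_levels, table):
    """Weighted Gini impurity of the split {left_levels, everything else}.

    table maps each factor level to a tuple of per-class counts.
    """
    n_classes = len(next(iter(table.values())))
    left, right = [0] * n_classes, [0] * n_classes
    for level, counts in table.items():
        side = left if level in left_levels else right
        for k, c in enumerate(counts):
            side[k] += c
    n = sum(left) + sum(right)
    return (sum(left) * gini(left) + sum(right) * gini(right)) / n


def best_split_exhaustive(table):
    """Try every split; works for any number of response classes."""
    levels = sorted(table)
    first, rest = levels[0], levels[1:]
    best = None
    # Pin the first level to the left group so each split is tried once.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = frozenset((first,) + combo)
            score = split_impurity(left, table)
            if best is None or score < best[0]:
                best = (score, left)
    return best


def best_split_ordered(table):
    """Two-class shortcut: order levels by the proportion of class 0,
    then try only the n_levels - 1 cut points along that ordering."""
    levels = sorted(table, key=lambda lv: table[lv][0] / sum(table[lv]))
    best = None
    for cut in range(1, len(levels)):
        left = frozenset(levels[:cut])
        score = split_impurity(left, table)
        if best is None or score < best[0]:
            best = (score, left)
    return best


# Made-up two-class counts for a 5-level factor (illustration only).
toy = {"a": (8, 2), "b": (1, 9), "c": (5, 5), "d": (7, 3), "e": (2, 8)}

print(n_binary_splits(31))  # 1073741823, i.e. 2^30 - 1
print(best_split_exhaustive(toy)[0], best_split_ordered(toy)[0])
```

With two classes the ordered search examines 4 splits here instead of 15
and should reach the same impurity; with three or more classes no single
ordering of the levels is available, so only the exhaustive routine applies.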

Solutions might be to use an ordering on the levels or to group the levels
into fewer levels.
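
To give a feel for how much either remedy shrinks the search, here is a
small Python sketch (not S-PLUS code).  The helper collapse_levels and its
rule of keeping the most frequent levels and pooling the rest as "other"
are hypothetical; any grouping that makes sense for the data would do.

```python
from collections import Counter


def collapse_levels(values, keep):
    """Keep the `keep` most frequent levels and relabel the rest 'other'.

    Hypothetical grouping rule; assumes 'other' is not already a level.
    """
    top = {lv for lv, _ in Counter(values).most_common(keep)}
    return [v if v in top else "other" for v in values]


def unordered_splits(n_levels):
    """Candidate splits when the levels are treated as unordered."""
    return 2 ** (n_levels - 1) - 1


def ordered_splits(n_levels):
    """Candidate splits when the levels are treated as ordered."""
    return n_levels - 1


# Made-up factor values (illustration only).
values = ["a"] * 5 + ["b"] * 4 + ["c"] * 2 + ["d"]
print(sorted(set(collapse_levels(values, keep=2))))  # ['a', 'b', 'other']

print(unordered_splits(31))  # 1073741823
print(unordered_splits(10))  # 511
print(ordered_splits(31))    # 30
```

Grouping 31 levels down to 10 cuts the candidate splits from 2^30 - 1 to
511, and treating the 31 levels as ordered leaves only 30.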

[Theoretical background in my Pattern Recognition and Neural Networks book
pages 217-8.]

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

