Re: Factors

To: s-news@lists.biostat.wustl.edu
Subject: Re: Factors
From: Terry Therneau <therneau@mayo.edu>
Date: Wed, 5 Mar 2008 12:17:44 -0600 (CST)
Cc: maustin@amgen.com
Reply-to: Terry Therneau <therneau@mayo.edu>

Alan H wrote:

> The notion of "factor" is built in to the
> statistical-modeling features of S in a way that can be extremely useful and
> convenient.

  The second half of the sentence is where I disagree.  Models work just fine
with character variables.  In fact, they work better.  For instance, consider a
model with a per-subject intercept that compares treatment slopes (I've used
this to evaluate pre/post pain treatments, for example), and then fit it on
particular subsets of patients.  The "all models have the same coefficients"
bias of factors is a major PITA in this case.
  
   Factors were made default because they made sense for the data set which 
happened to be under analysis at the time the authors decided on a default.  
(Look at the examples in the Chambers and Hastie book.)  I can't throw too many 
bricks at this, as lots of the defaults in my survival package have exactly the 
same origin.  The problem with factors is that they have so many consequences.
   
   I'll reiterate: we turned them off, we've never missed them.  Note that it
is very easy to create a factor when desired; what we've turned off is the
automatic conversion of data into factors "for your own good" by the package --
and auto conversion is something that I am very leery of in general.

  Matt Austin gave a nice synopsis of why factors are useful.  I agree.  For
the 1 in 20 or so character variables where a factor is useful, I make it a
factor.  For a treatment variable like 5-fu / methotrexate / placebo I'd have
to redo it myself anyway to get 'placebo' as the reference, so autoconvert did
me no good.

  "Autoconvert-char-to-factor", along with helmert contrasts, not listing NAs
in a table command, and na.action=na.fail, rank as the 4 poorest defaults ever
chosen in Splus.  Over time they've fixed 2 of them; we fix them all.
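
  A minimal sketch of the 'placebo as the reference' point, written in R-style
syntax.  The vector trt and its values are made up for illustration, and the
sketch assumes treatment contrasts (not the helmert default mentioned above),
so that the first level acts as the reference.

```r
# Hypothetical treatment variable stored as plain character data.
trt <- c("5-fu", "placebo", "methotrexate", "placebo", "5-fu")

# Automatic conversion sorts the levels, so "5-fu" comes first and
# would be the reference level under treatment contrasts.
auto <- factor(trt)
levels(auto)

# Listing the levels explicitly puts "placebo" first instead.
byhand <- factor(trt, levels = c("placebo", "5-fu", "methotrexate"))
levels(byhand)
```

  Either way the level order has to be stated by hand for this variable, which
is why the automatic conversion buys nothing here.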

   As to speed issues, I can't say.  We do all our large data set manipulations
in SAS.  (Particularly since my method for such problems is `give it to xxx
down the hall'.)
   
   Terry Therneau