Alan H wrote:
> The notion of "factor" is built in to the
> statistical-modeling features of S in a way that can be extremely useful and
> convenient.
The second half of the sentence is where I disagree. Models work just fine
with character variables. In fact, they work better. For instance, consider a
model with a per-subject intercept that compares treatment slopes (I've used
this for evaluation of pre/post pain treatments, for example), and then fit it
on particular subsets of patients. The "all models have the same coefficients"
bias of factors is a major PITA in this case.
Factors were made default because they made sense for the data set which
happened to be under analysis at the time the authors decided on a default.
(Look at the examples in the Chambers and Hastie book.) I can't throw too many
bricks at this, as lots of the defaults in my survival package have exactly the
same origin. The problem with factors is that they have so many consequences.
I'll reiterate: we turned them off, and we've never missed them. Note that it is
very easy to create a factor when desired; what we've turned off is the
automatic conversion of data into factors "for your own good" by the package --
and auto conversion is something that I am very leery of in general.
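
A minimal sketch of the "create a factor when desired" route, written in R syntax (S-Plus spells some of this differently). The data frame and its column names are made up for illustration; `stringsAsFactors = FALSE` is R's argument for switching off the automatic conversion.

```r
# Hypothetical data; the column names are illustrative only.
# stringsAsFactors = FALSE keeps character columns as character.
df <- data.frame(id  = c("a", "b", "c", "a"),
                 sex = c("f", "m", "f", "f"),
                 stringsAsFactors = FALSE)

# Nothing has been converted behind your back.
is.character(df$id)
is.character(df$sex)

# Convert only the variable where a factor is actually wanted.
df$sex <- factor(df$sex)
is.factor(df$sex)
```

The conversion is a one-liner, so turning off the automatic version costs very little.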
Matt Austin gave a nice synopsis of why factors are useful. I agree. For the
1 in 20 or so character variables where a factor is useful --- I make it a
factor. For a treatment variable like 5-fu / methotrexate / placebo I'd have to
redo it myself anyway to get 'placebo' as the reference, so autoconvert did me
no good.
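
To make the treatment example concrete, here is a sketch in R syntax with an invented vector `trt`. Automatic conversion orders the levels by sorting, so 'placebo' does not come out as the first (reference) level; either of the two explicit forms below puts it there.

```r
# Hypothetical treatment assignments.
trt <- c("placebo", "5-fu", "methotrexate", "5-fu", "placebo")

# What automatic conversion would give: levels in sorted order.
f_auto <- factor(trt)

# Option 1: state the level order explicitly, reference level first.
f1 <- factor(trt, levels = c("placebo", "5-fu", "methotrexate"))

# Option 2: move one level of an existing factor to the front.
f2 <- relevel(f_auto, ref = "placebo")

levels(f1)[1]
levels(f2)[1]
```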
"Autoconvert-char-to-factor", along with helmert contrasts, not listing NAs
in a table command, and na.action=na.fail, rank as the 4 poorest defaults ever
chosen in Splus. Over time they've fixed 2 of them; we fix them all.
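
For readers working in R rather than Splus, a rough sketch of where the corresponding knobs live. The option and argument names are R's, and the values shown are only one possible choice, not necessarily what we set locally.

```r
x <- c("a", "b", NA, "a")

# By default the NA is silently left out of the table.
t_default <- table(x)

# Ask for NAs to be listed whenever any are present.
t_with_na <- table(x, useNA = "ifany")

# Contrasts and missing-value handling are session options.
# Treatment contrasts and na.omit here are an assumption for illustration.
old <- options(contrasts = c("contr.treatment", "contr.poly"),
               na.action = "na.omit")
current_contrasts <- getOption("contrasts")

# Restore whatever was in effect before.
options(old)
```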
As to speed issues, I can't say. We do all our large data set manipulations
in SAS. (Particularly since my method for such problems is `give it to xxx
down the hall'.)
Terry Therneau