One last comment on factors from me on this topic--this one involving the
bigdata library.
I have included some information in a document I'm currently writing:
%% begin describing factor level issues in bigdata
For efficiency, the bigdata library stores strings as factors by default.
Storing data as factors is more efficient than storing it as character strings
because the vector can be held as integers, with a small number of unique
character labels stored as attributes. By default, the maximum number of
factor levels allowed for a given vector of data is 500. This can be
troublesome in many situations, for example a clinical trial with more than
500 subjects, or an adverse event dataset with more than 500 unique preferred
terms from a coding dictionary. By default the data would be read in with a
warning that the maximum number of levels has been exceeded, and the column
containing the subject identifier or the preferred terms would contain only
missing values (NA).
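The storage scheme described above can be seen on an ordinary (non-bigdata)
factor. A minimal sketch, using a made-up treatment vector trt that is not
part of any real dataset:
\begin{verbatim}
# trt is a hypothetical example vector
>trt <- factor(c("placebo", "active", "placebo", "active"))
# the unique character labels
>levels(trt)
# the underlying integer codes, with the labels attached as an attribute
>unclass(trt)
\end{verbatim}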
One option is to increase the number of allowed factor levels. For example, to
allow up to 1000 unique factor levels for a given vector of data, issue the
following statement:
\begin{verbatim}
>bd.options(max.levels=1000)
\end{verbatim}
Another useful option is "error.on.level.overflow", which defaults to FALSE.
If this parameter is set to TRUE and the maximum number of factor levels is
exceeded, the process immediately stops with an error. This can save a
considerable amount of time when reading large datasets, where the default
behavior of setting vectors of data to missing on a factor level overflow is
undesirable. It can be quite frustrating to wait for over an hour to load a
very large dataset and find that several key variables are unusable (all
missing).
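As a sketch, both settings can be put in place before a large import. This
assumes that "error.on.level.overflow" is set through bd.options in the same
way as max.levels; the value 1000 is only illustrative:
\begin{verbatim}
# raise the limit on unique factor levels per vector
>bd.options(max.levels=1000)
# assumption: this option is set via bd.options, like max.levels
>bd.options(error.on.level.overflow=TRUE)
\end{verbatim}
With these in effect, a column that still exceeds the limit should stop the
import with an error instead of being filled with missing values.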
%% end describing factor level issues in bigdata
Another "gotcha" I found with factor levels and the bigdata library:
Interestingly, the abbreviation for sodium ("NA") causes trouble with
non-expression language syntax when the data has been imported with
importData(..., bigdata=TRUE), but not when the data has been converted to a
bigdata object via bdcoerce().
Hope this helps someone,
--Matt
Matt Austin
Global Statistical Lead, PMO
Director, Biostatistics
Amgen, Inc.
maustin@amgen.com
-----Original Message-----
From: Terry Therneau [mailto:therneau@mayo.edu]
Sent: Wednesday, March 05, 2008 10:18 AM
To: s-news@lists.biostat.wustl.edu
Cc: Austin, Matt
Subject: RE: [S] Factors
Alan H wrote:
> The notion of "factor" is built in to the statistical-modeling
> features of S in a way that can be extremely useful and convenient.
The second half of the sentence is where I disagree. Models work just fine
with character variables. In fact, they work better. For instance, consider a
model with a per-subject intercept that compares treatment slopes (I've used
this for evaluation of pre/post pain treatments, for example), and then fit it
on particular subsets of patients. The "all models have the same coefficients"
bias of factors is a major PITA in this case.
Factors were made default because they made sense for the data set which
happened to be under analysis at the time the authors decided on a default.
(Look at the examples in the Chambers and Hastie book.) I can't throw too many
bricks at this, as lots of the defaults in my survival package have exactly the
same origin. The problem with factors is that they have so many consequences.
I'll reiterate: we turned them off, we've never missed them. Note that it
is very easy to create a factor when desired; what we've turned off is the
automatic conversion of data into factors "for your own good" by the package --
and auto conversion is something that I am very leery of in general.
Matt Austin gave a nice synopsis of why factors are useful. I agree. For the
1 in 20 or so character variables where a factor is useful --- I make it a
factor. For a treatment variable like 5-fu / methotrexate / placebo I'd have
to redo it myself anyway to get 'placebo' as the reference, so autoconvert did
me no good.
"Autoconvert-char-to-factor", along with helmert contrasts, not listing NAs
in a table command, and na.action=na.fail, rank as the 4 poorest defaults ever
chosen in Splus. Over time they've fixed 2 of them; we fix them all.
As to speed issues, I can't say. We do all our large data set manipulations
in SAS. (Particularly since my method for such problems is `give it to xxx
down the hall'.)
Terry Therneau