s-news

Re: Factors

To: "s-news@lists.biostat.wustl.edu" <s-news@lists.biostat.wustl.edu>
Subject: Re: Factors
From: "Austin, Matt" <maustin@amgen.com>
Date: Wed, 5 Mar 2008 15:09:48 -0800
Accept-language: en-US
In-reply-to: <200803051817.m25IHhF01791@hsrnfs-101.mayo.edu>
References: <200803051817.m25IHhF01791@hsrnfs-101.mayo.edu>
Thread-index: Ach+7Try2xDGX24cRc2O8Uuua8XE1wAJ1AVw
Thread-topic: [S] Factors
One last comment on factors from me on this topic--this one involving the 
bigdata library.

I have included some information in a document I'm currently writing:


%% begin describing factor level issues in bigdata

For efficiency, the bigdata library by default handles strings as factors.
Storing data as factors is more efficient than storing it as character
strings because the vector can be stored as integers, with a small number of
unique character labels stored as attributes.  By default, the maximum number
of factor levels allowed for a given vector of data is 500.  This can be
troublesome in many situations: for example, a clinical trial with more than
500 subjects, or an adverse event dataset with more than 500 unique preferred
terms from a coding dictionary.  By default, such data would be read in with a
warning that the maximum number of levels has been exceeded, and the column
containing the subject identifier or the preferred terms would contain only
missing values (NA).
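
The storage idea can be seen with an ordinary factor.  The lines below are
only a minimal sketch in plain S, independent of the bigdata library; the
vector x and its values are made up for illustration, and nothing here shows
how bigdata stores its columns internally.

\begin{verbatim}
# A character vector with many repeats but few distinct values
x <- c("placebo", "drug", "placebo", "drug", "drug")

# As a factor, the data are integer codes plus a short set of labels
f <- factor(x)

levels(f)       # the distinct labels: "drug" "placebo"
as.integer(f)   # the integer codes:   2 1 2 1 1
\end{verbatim}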

One option is to increase the number of allowed factor levels.  For example,
to allow up to 1000 unique factor levels for a given vector of data, issue the
following statement:

\begin{verbatim}
>bd.options(max.levels=1000)
\end{verbatim}

Another useful option is "error.on.level.overflow", which defaults to FALSE.
If this parameter is set to TRUE and the maximum number of factor levels is
exceeded, the process stops immediately with an error.  This can save a
considerable amount of time when reading large datasets where the default
behavior (setting a vector of data to missing when a factor level overflow
occurs) is undesirable.  It can be quite frustrating to wait for over an hour
to load a very large dataset and find that several key variables are unusable
(all missing).
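
As a rough sketch, the two options might be set together before a long
import.  The call form for "error.on.level.overflow" is assumed to mirror the
max.levels call shown earlier, and the file name ae.csv is hypothetical.

\begin{verbatim}
# Raise the ceiling on factor levels
bd.options(max.levels=1000)

# Assumed form: stop with an error instead of filling the column with NA
bd.options(error.on.level.overflow=TRUE)

# Hypothetical file name; the import now fails early on a level overflow
ae <- importData("ae.csv", bigdata=TRUE)
\end{verbatim}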

%% end describing factor level issues in bigdata


Another "gotcha" I found with factor levels and the bigdata library:

Interestingly, the sodium abbreviation ("NA") causes trouble with
non-expression-language syntax when the data has been imported with
importData(..., bigdata=TRUE), but not when the data has been converted to a
bigdata object via bdcoerce().

Hope this helps someone,

--Matt

Matt Austin
Global Statistical Lead, PMO
Director, Biostatistics
Amgen, Inc.
maustin@amgen.com


-----Original Message-----
From: Terry Therneau [mailto:therneau@mayo.edu]
Sent: Wednesday, March 05, 2008 10:18 AM
To: s-news@lists.biostat.wustl.edu
Cc: Austin, Matt
Subject: RE: [S] Factors

Alan H wrote:

> The notion of "factor" is built in to the statistical-modeling
> features of S in a way that can be extremely useful and convenient.

  The second half of the sentence is where I disagree.  Models work just fine 
with character variables.  In fact, they work better.  For instance, consider 
fitting a model with a per-subject intercept that compares treatment slopes 
(I've used this for evaluation of pre/post pain treatments, for example), then 
refitting it on particular subsets of patients.  The "all models have the same 
coefficients" bias of factors is a major PITA in this case.

   Factors were made default because they made sense for the data set which 
happened to be under analysis at the time the authors decided on a default.
(Look at the examples in the Chambers and Hastie book.)  I can't throw too many 
bricks at this, as lots of the defaults in my survival package have exactly the 
same origin.  The problem with factors is that they have so many consequences.

   I'll reiterate: we turned them off, we've never missed them.  Note that it 
is very easy to create a factor when desired; what we've turned off is the 
automatic conversion of data into factors "for your own good" by the package -- 
and auto conversion is something that I am very leery of in general.

  Matt Austin gave a nice synopsis of why factors are useful.  I agree.  For the
1 in 20 or so character variables where a factor is useful --- I make it a 
factor.  For a treatment variable like 5-fu / methotrexate / placebo I'd have 
to redo it myself anyway to get 'placebo' as the reference, so autoconvert did 
me no good.
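
  For illustration only, making 'placebo' the first level by hand might look 
like the sketch below.  The names trt and trt.f and the data are hypothetical, 
and the first level acts as the reference only under treatment contrasts.

    # Hypothetical treatment variable stored as character data
    trt <- c("5-fu", "placebo", "methotrexate", "placebo")

    # List the levels explicitly so that 'placebo' comes first
    trt.f <- factor(trt, levels = c("placebo", "5-fu", "methotrexate"))

    levels(trt.f)   # "placebo" "5-fu" "methotrexate"
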
    "Autoconvert-char-to-factor", along with helmert contrasts, not listing NAs 
in a table command, and na.action=na.fail, rank as the 4 poorest defaults ever 
chosen in Splus.  Over time they've fixed 2 of them, we fix them all.

   As to speed issues, I can't say.  We do all our large data set manipulations 
in SAS.  (Particularly since my method for such problems is `give it to xxx 
down the hall'.)

   Terry Therneau


