
Re: Burned by factors

To: <bates@stat.wisc.edu>, <kw.statr@gmail.com>
Subject: Re: Burned by factors
From: <Bill.Venables@csiro.au>
Date: Sun, 16 Mar 2008 10:47:43 +1000
Cc: <s-news@lists.biostat.wustl.edu>
References: <c968588d0803051012x4be861f7pbf489ae363b0983e@mail.gmail.com> <40e66e0b0803151310r3fb5c03fq2e50cb7892c60853@mail.gmail.com>
Thread-index: AciG2NM7uGJgFpT3T9iWTG/iAEFFZwAJBJbQ
Thread-topic: [S] Burned by factors

I agree with Doug 150%.  You should think of this as a learning
experience and grow from it rather than get all upset about S not being
"what you see is what you get".  S is more than that; other systems are
less.

Just to be a bit more constructive, though, here are a few comments.

Note that if you want to convert a factor into the numeric vector that
it *looks* like, then as.numeric(f) will not necessarily do it: it
returns the internal integer level codes rather than the labels, which
is the phenomenon shown below.  You need to use
as.numeric(as.character(f)).
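
A minimal sketch of the difference, written for R (S-PLUS is assumed,
not checked, to behave the same way).  The vectors are the illustrative
ones from the quoted example further down, and the names codes, values
and merged are made up here:

```r
# Illustrative data only: the same values as in the quoted example
age1 <- factor(c("20", "21", "22"))   # column that was read as a factor
age2 <- c(20, 21, 22)                 # column that was read as numeric

codes  <- as.numeric(age1)                # internal integer level codes
values <- as.numeric(as.character(age1))  # the numbers the labels spell out

# Converting the factor first keeps level codes out of the merged result
merged <- ifelse(c(TRUE, TRUE, FALSE), values, age2)

print(codes)
print(values)
print(merged)
```

Here codes holds 1, 2, 3 while values and merged hold 20, 21, 22.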

In S-PLUS 8 there is a 'str' function, but it lives in the pkgutils
library, part of the initiative to build compatibility with R.  This is
a welcome initiative that all S-PLUS users should embrace
enthusiastically.

str(object) and summary(object) are both good, but my very first step is
always to use

> sapply(myData, class)

and look for departures from what I expect.  This is often enough to
flag a problem.  It's quick, easy and costs nothing (much).
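
As a rough illustration of what this check shows, here is a sketch in R
with a made-up data frame.  The stray "unknown" entry and the column
names are assumptions for the example, and stringsAsFactors = TRUE is
set explicitly because recent versions of R no longer convert strings
to factors by default:

```r
# Hypothetical data frame: 'age' contains one non-numeric entry
myData <- data.frame(id  = 1:3,
                     age = c("20", "21", "unknown"),
                     stringsAsFactors = TRUE)

# One class per column; a "factor" where "numeric" was expected is the flag
classes <- sapply(myData, class)
print(classes)
```

The age column is reported as a factor, which is the anomaly to look for
before any arithmetic or merging is done on it.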

The R function read.table allows you to specify colClasses, giving full
control over how things are converted when read.  I do not notice such
an argument in the S-PLUS functions read.table or importData (in S-PLUS
8.0.1 at least), so this looks like a gap that needs to be plugged.  It
should be easy enough.
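
A sketch of the R side of this, assuming a small made-up input with one
non-numeric entry in 'age'.  The text, the column names and the choice
of "unknown" as a missing-value marker are all illustrative; in practice
the data would come from a file:

```r
# Hypothetical input; normally this would be a file on disk
txt <- "id age
1 20
2 21
3 unknown"

# Declare the column types up front and name the known missing-value marker
dat <- read.table(text = txt, header = TRUE,
                  colClasses = c("integer", "numeric"),
                  na.strings = c("NA", "unknown"))

print(sapply(dat, class))
print(dat$age)
```

With the classes declared, 'age' comes back as a numeric column with a
missing value in the third row rather than as a factor.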

Bill Venables.


-----Original Message-----
From: s-news-owner@lists.biostat.wustl.edu
[mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Douglas Bates
Sent: Sunday, 16 March 2008 6:10 AM
To: Kevin Wright
Cc: s-news@lists.biostat.wustl.edu
Subject: Re: [S] Burned by factors

On Wed, Mar 5, 2008 at 1:12 PM, Kevin Wright <kw.statr@gmail.com> wrote:
>  One place I find factors nice is when I create trellis plots of
>  subsets of data.  Factors keep the panels in the same place on the
>  page across different subsets of data, even in the situation that one
>  subset might have a panel with no data.

>  On the whole, however, I spend a significant amount of time fighting
>  factors and my conclusion is that S is far too eager to convert data
>  to factors.

>  Here's an example that burned me badly and cost me about six hours of
>  work. In essence, what happened was that data was read from two
>  different files. In one file, 'age' was read as numeric, while in the
>  second file 'age' had an unexpected, non-numeric value that caused a
>  behind-the-scenes conversion of age to a factor (instead of a numeric
>  with a missing value). Later in the code, merging these caused
>  unexpected results. Here is the essence of what happened:

>  age1 <- factor(c("20", "21", "22"))
>  age2 <- c(20, 21, 22)
>  ifelse(c(T, T, F), age1, age2)
>  [1]  1  2 22

>  The desired result was: 20, 21, 22.

>  Give a report to a client with erroneous results that traces back to
>  this phenomenon and you'll become paranoid about conversions to
>  factors.

Perhaps that is why it is a good idea to check the structure of a data
frame *before* beginning an analysis with it.  Even in my introductory
classes I emphasize that the first thing you always do with data is

str(myData)    # available in R, I'm not sure if it is in S-PLUS

and maybe

summary(myData)

I tend to preach this for the opposite reason, however.  I think it is
much more common to make the mistake of using a numeric vector when it
should be a factor than it is to use a factor when you think it should
be a numeric vector.
--------------------------------------------------------------------
This message was distributed by s-news@lists.biostat.wustl.edu.  To
unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
the BODY of the message:  unsubscribe s-news


