I agree with Doug 150%. You should think of this as a learning
experience and grow from it rather than get all upset about S being not
"what you see is what you get". S is more than that, other systems are
less.
Just to be a bit more constructive, though, here are a few comments.
Note that if you want to convert a factor into the numeric vector that
it *looks* like, then as.numeric(f) will not necessarily do it. You get
the phenomenon shown below. You need to use
as.numeric(as.character(f)).
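
A minimal sketch of the difference, written for R and using a made-up factor f (the values are illustrative only, not taken from anyone's data):

```r
# Hypothetical factor whose labels look like numbers.
# The levels are sorted as character strings: "10", "20", "30".
f <- factor(c("30", "10", "20"))

# as.numeric() on a factor returns the internal level codes,
# i.e. the position of each value among the sorted levels.
codes <- as.numeric(f)

# Going through character first recovers the numbers the labels spell out.
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

The codes come back as 3, 1, 2, while the two-step conversion gives 30, 10, 20.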
In S-PLUS 8 there is an 'str' function, but it lives in the pkgutils
library, part of the initiative to build compatibility with R. This is a
welcome initiative that all S-PLUS users should embrace
enthusiastically.
str(object) and summary(object) are both good, but my very first step is
always to use
> sapply(myData, class)
and look for anomalies with what I expect. This is often enough to flag
a problem. It's quick, easy and costs nothing (much).
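
One way this check might be automated, as a rough sketch in R: compare the classes found against the classes you expect. The data frame and the 'expected' vector here are hypothetical, invented only for illustration.

```r
# Hypothetical data frame in which 'age' has silently become a factor
# because of one non-numeric entry.
myData <- data.frame(
  age    = factor(c("20", "21", "unknown")),
  weight = c(70.5, 82.1, 64.3)
)

# Classes actually present, one per column.
# (If a column has more than one class, sapply returns a list instead
# of a character vector, and this comparison would need adjusting.)
found <- sapply(myData, class)

# Classes we expected to see (an assumption made for this example).
expected <- c(age = "numeric", weight = "numeric")

# Names of the columns whose class differs from the expectation.
anomalies <- names(found)[found != expected[names(found)]]
print(anomalies)
```

Here only "age" is flagged, which is exactly the column that needs a second look.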
The R function read.table allows you to specify colClasses, giving full
control over how things are converted when read. I do not notice such
an argument in the S-PLUS functions read.table or importData (in S-PLUS
8.0.1 at least), so this looks like a bit of a lack that needs to be
plugged. It should be easy enough.
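
For what it is worth, a small sketch of the R usage, with an invented in-line data set standing in for a file. The column names and the "unknown" marker are assumptions made for this example, and none of this applies to the S-PLUS functions.

```r
# Hypothetical file contents; 'age' contains one non-numeric entry.
raw <- "id age\n1 20\n2 21\n3 unknown\n"

# Declare the column classes up front, and tell read.table which
# strings stand for missing values. With colClasses given, 'age' is
# read as numeric rather than being converted to a factor.
dat <- read.table(
  text       = raw,
  header     = TRUE,
  colClasses = c("integer", "numeric"),
  na.strings = c("NA", "unknown")
)

print(sapply(dat, class))
print(dat$age)
```

Without the na.strings entry, the stray value should cause read.table to stop with an error rather than quietly changing the column type, which is arguably the safer failure.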
Bill Venables.
-----Original Message-----
From: s-news-owner@lists.biostat.wustl.edu
[mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Douglas Bates
Sent: Sunday, 16 March 2008 6:10 AM
To: Kevin Wright
Cc: s-news@lists.biostat.wustl.edu
Subject: Re: [S] Burned by factors
On Wed, Mar 5, 2008 at 1:12 PM, Kevin Wright <kw.statr@gmail.com> wrote:
> One place I find factors nice is when I create trellis plots of
> subsets of data. Factors keep the panels in the same place on the
> page across different subsets of data, even in the situation that one
> subset might have a panel with no data.
> On the whole, however, I spend a significant amount of time fighting
> factors and my conclusion is that S is far too eager to convert data
> to factors.
> Here's an example that burned me badly and cost me about six hours of
> work. In essence, what happened was that data was read from two
> different files. In one file, 'age' was read as numeric, while in the
> second file 'age' had an unexpected, non-numeric value that caused a
> behind-the-scenes conversion of age to a factor (instead of a numeric
> with a missing value). Later in the code, merging these caused
> unexpected results. Here is the essence of what happened:
> age1 <- factor(c("20", "21", "22"))
> age2 <- c(20, 21, 22)
> ifelse(c(T, T, F), age1, age2)
> [1] 1 2 22
> The desired result was: 20, 21, 22.
> Give a report to a client with erroneous results that traces back to
> this phenomenon and you'll become paranoid about conversions to
> factors.
Perhaps that is why it is a good idea to check the structure of a data
frame *before* beginning an analysis with it. Even in my introductory
classes I emphasize that the first thing you always do with data is
str(myData) # available in R, I'm not sure if it is in S-PLUS
and maybe
summary(myData)
I tend to preach this for the opposite reason, however. I think it is
much more common to make the mistake of using a numeric vector when it
should be a factor than it is to use a factor when you think it should
be a numeric vector.