On Wed, Mar 5, 2008 at 1:12 PM, Kevin Wright <kw.statr@gmail.com> wrote:
> One place I find factors nice is when I create trellis plots of
> subsets of data. Factors keep the panels in the same place on the
> page across different subsets of data, even in the situation that one
> subset might have a panel with no data.
> On the whole, however, I spend a significant amount of time fighting
> factors and my conclusion is that S is far too eager to convert data
> to factors.
> Here's an example that burned me badly and cost me about six hours of
> work. In essence, what happened was that data was read from two
> different files. In one file, 'age' was read as numeric, while in the
> second file 'age' had an unexpected, non-numeric value that caused a
> behind-the-scenes conversion of age to a factor (instead of a numeric
> with a missing value). Later in the code, merging these caused
> unexpected results. Here is the essence of what happened:
> age1 <- factor(c("20", "21", "22"))
> age2 <- c(20, 21, 22)
> ifelse(c(T, T, F), age1, age2)
> [1] 1 2 22
> The desired result was: 20, 21, 22.
> Give a report to a client with erroneous results that traces back to
> this phenomenon and you'll become paranoid about conversions to
> factors.
Perhaps that is why it is a good idea to check the structure of a data
frame *before* beginning an analysis with it. Even in my introductory
classes I emphasize that the first thing you always do with data is
str(myData) # available in R, I'm not sure if it is in S-PLUS
and maybe
summary(myData)
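As a minimal sketch of that check, using the values from the quoted example (the data frame names d1 and d2 are made up for illustration and stand in for the two files), str() shows the type mismatch right away. The factor can then be converted through its labels rather than its integer codes:

```r
# Hypothetical data frames standing in for the two files in the quoted example
d1 <- data.frame(age = c(20, 21, 22))
d2 <- data.frame(age = factor(c("20", "21", "22")))

str(d1)  # age is reported as num
str(d2)  # age is reported as a Factor with 3 levels

# as.numeric() on a factor returns the internal integer codes (1, 2, 3),
# so go through the character labels to recover the original values
age_ok <- as.numeric(as.character(d2$age))

ifelse(c(TRUE, TRUE, FALSE), age_ok, d1$age)
# [1] 20 21 22
```

Running str() on both data frames before combining them is what exposes the problem; the conversion is just one way to repair it once it has been seen.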
I tend to preach this for the opposite reason, however. I think it is
much more common to make the mistake of using a numeric vector when it
should be a factor than it is to use a factor when you think it should
be a numeric vector.
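To illustrate that opposite mistake, here is a small sketch with made-up data in which 'group' is really a label but happens to be stored as the numbers 1, 2, 3. The data frame d and its values are invented for this example:

```r
# Made-up data: 'group' is a category label stored as the numbers 1, 2, 3
d <- data.frame(group = rep(1:3, each = 4),
                y = c(5, 6, 5, 6, 9, 10, 9, 10, 4, 5, 4, 5))

# Numeric 'group': lm() fits a single slope, a straight-line trend in the codes
fit_num <- lm(y ~ group, data = d)
length(coef(fit_num))  # 2 coefficients: intercept and slope

# Factor 'group': lm() fits a separate mean for each level
fit_fac <- lm(y ~ factor(group), data = d)
length(coef(fit_fac))  # 3 coefficients: intercept and two level contrasts
```

Nothing warns you in the numeric case; the model simply answers a different question than the one you meant to ask, which is why looking at str(d) first is worthwhile.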