Doug and Bill both offer useful suggestions, though I am already a
dedicated user of str(data) and lapply(data, is.factor). The
colClasses argument was not available in the version of S-Plus where
the problem happened. My best problem-preventers are str(), head(),
and auto-indenting/paren-matching/fontification via ESS.

The subtle but key word in my previous post was "unexpected". I had
multiple data sets that were used to develop a script. It was only
after the script had been put into production that the unexpected
non-numeric value appeared. The postmortem analysis of the situation
pointed to multiple faults, one of which was that S does the "right"
thing with factors 99% of the time (but it's the remaining 1% that
causes 99% of the grief...).
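
One way to catch such a value before it spreads is to check, at read time, which entries fail to parse as numbers. The following is a minimal sketch in R, not the production script described above; the vector raw_age and the helper find_non_numeric are hypothetical names used only for illustration.

```r
# Hypothetical helper: return the entries of a vector that are not
# missing but cannot be parsed as numbers. as.character() is applied
# first so that a factor is checked by its labels, not its codes.
find_non_numeric <- function(x) {
  x <- as.character(x)
  parsed <- suppressWarnings(as.numeric(x))
  unique(x[!is.na(x) & is.na(parsed)])
}

raw_age <- c("20", "21", "unknown", NA)   # made-up data
bad <- find_non_numeric(raw_age)
if (length(bad) > 0) {
  message("Non-numeric values found in age: ", paste(bad, collapse = ", "))
}
```

Whether to stop, warn, or recode at this point is a judgment call that depends on the script; the sketch only reports what it found.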

For example, is the following the "right" performance of factors?

  ifelse(c(T,F,T),factor(4:6),7:9)
  [1] 1 8 3

Let's not discuss coercion, precedence, inheritance, and so forth, but
just stop with "factors are useful, but occasionally risky and each
person must determine his/her risk tolerance."

Kevin Wright

On Sat, Mar 15, 2008 at 7:47 PM, <Bill.Venables@csiro.au> wrote:
> I agree with Doug 150%. You should think of this as a learning
> experience and grow from it rather than get all upset about S not being
> "what you see is what you get". S is more than that, other systems are
> less.
>
> Just to be a bit more constructive, though, here are a few comments.
>
> Note that if you want to convert a factor into the numeric vector that
> it *looks* like, then as.numeric(f) will not necessarily do it. You get
> the phenomenon shown below. You need to use
> as.numeric(as.character(f)).
>
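
To make the difference concrete, here is a small sketch in R (behavior in a given S-PLUS version may differ and is not checked here). The data are made up, and the age1/age2 lines reuse the example from earlier in this thread.

```r
f <- factor(c("20", "22", "21", "22"))   # made-up data

as.numeric(f)                 # internal level codes: 1 3 2 3
as.numeric(as.character(f))   # the values the factor looks like: 20 22 21 22
as.numeric(levels(f))[f]      # same values, converting each level only once

# Converting before ifelse() avoids mixing level codes with real values.
age1 <- factor(c("20", "21", "22"))
age2 <- c(20, 21, 22)
fixed <- ifelse(c(TRUE, TRUE, FALSE), as.numeric(as.character(age1)), age2)
fixed                         # 20 21 22
```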
> In S-PLUS 8 there is an 'str' function, but it is in the pkgutils library,
> part of the initiative to build compatibility with R. This is a
> welcome initiative that all S-PLUS users should embrace
> enthusiastically.
>
> str(object) and summary(object) are both good, but my very first step is
> always to use
>
> > sapply(myData, class)
>
> and look for anomalies with what I expect. This is often enough to flag
> a problem. It's quick, easy and costs nothing (much).
>
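
The "look for anomalies" step can also be scripted, so that it runs every time rather than only when someone remembers to look. This is a rough sketch in R; the data frame and the vector of expected classes are assumptions made up for illustration.

```r
# Made-up data in which 'age' has silently become a factor.
myData <- data.frame(id = 1:3,
                     age = factor(c("20", "21", "unknown")))

# Hypothetical statement of what each column is supposed to be.
expected <- c(id = "integer", age = "numeric")

# Take only the first class so that every column yields one string.
actual <- sapply(myData, function(x) class(x)[1])

# Names of the columns whose class differs from the expectation.
mismatch <- names(expected)[actual[names(expected)] != expected]
mismatch   # "age"
```

A production script might call stop() when mismatch is non-empty; that choice is left open here.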
> The R function read.table allows you to specify colClasses, giving full
> control over how things are converted when read. I do not notice such
> an argument in the S-PLUS functions read.table or importData (in S-PLUS
> 8.0.1 at least), so this looks like a bit of a lack that needs to be
> plugged. It should be easy enough.
>
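
For readers on R, a short sketch of what colClasses buys you. The file contents are made up, and the text= argument is used only to keep the example self-contained; with a real file you would pass the file name instead.

```r
txt <- "id age\n1 20\n2 21\n3 unknown"   # made-up file contents

# Declaring the classes turns the silent conversion into an error,
# because "unknown" cannot be read as a number.
d2 <- try(read.table(text = txt, header = TRUE,
                     colClasses = c("integer", "numeric")),
          silent = TRUE)
inherits(d2, "try-error")   # TRUE

# If the marker is a known code for "missing", declare it as such and
# the column stays numeric with an NA.
d3 <- read.table(text = txt, header = TRUE,
                 colClasses = c("integer", "numeric"),
                 na.strings = "unknown")
d3$age                      # 20 21 NA
```

The first form fails loudly on an unexpected value, which is arguably what a production script wants; the second is only appropriate once the value is no longer unexpected.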
> Bill Venables.
>
> -----Original Message-----
> From: s-news-owner@lists.biostat.wustl.edu
> [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Douglas Bates
> Sent: Sunday, 16 March 2008 6:10 AM
> To: Kevin Wright
> Cc: s-news@lists.biostat.wustl.edu
> Subject: Re: [S] Burned by factors
>
> On Wed, Mar 5, 2008 at 1:12 PM, Kevin Wright <kw.statr@gmail.com> wrote:
> > One place I find factors nice is when I create trellis plots of
> > subsets of data. Factors keep the panels in the same place on the
> > page across different subsets of data, even in the situation that one
> > subset might have a panel with no data.
>
> > On the whole, however, I spend a significant amount of time fighting
> > factors and my conclusion is that S is far too eager to convert data
> > to factors.
>
> > Here's an example that burned me badly and cost me about six hours of
> > work. In essence, what happened was that data was read from two
> > different files. In one file, 'age' was read as numeric, while in the
> > second file 'age' had an unexpected, non-numeric value that caused a
> > behind-the-scenes conversion of age to a factor (instead of a numeric
> > with a missing value). Later in the code, merging these caused
> > unexpected results. Here is the essence of what happened:
>
> > age1 <- factor(c("20", "21", "22"))
> > age2 <- c(20, 21, 22)
> > ifelse(c(T, T, F), age1, age2)
> > [1] 1 2 22
>
> > The desired result was: 20, 21, 22.
>
> > Give a report to a client with erroneous results that traces back to
> > this phenomenon and you'll become paranoid about conversions to
> > factors.
>
> Perhaps that is why it is a good idea to check the structure of a data
> frame *before* beginning an analysis with it. Even in my introductory
> classes I emphasize that the first thing you always do with data is
>
> str(myData) # available in R, I'm not sure if it is in S-PLUS
>
> and maybe
>
> summary(myData)
>
> I tend to preach this for the opposite reason, however. I think it is
> much more common to make the mistake of using a numeric vector when it
> should be a factor than it is to use a factor when you think it should
> be a numeric vector.