Re: Burned by factors

To: Bill.Venables@csiro.au
Subject: Re: Burned by factors
From: "Kevin Wright" <kw.statr@gmail.com>
Date: Mon, 17 Mar 2008 09:00:49 -0500
Cc: bates@stat.wisc.edu, s-news@lists.biostat.wustl.edu
Doug and Bill both offer a useful suggestion, though I am already a
dedicated user of str(data) and lapply(data, is.factor).  The
colClasses argument was not available in the version of S-Plus where
the problem happened.  My best problem-preventers are str(), head(),
and auto-indenting/paren-matching/fontification via ESS.
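
A minimal sketch of these checks, written for R rather than the S-Plus
version discussed here.  The CSV text and the names `csv`, `dat`, and
`strict` are hypothetical, and stringsAsFactors = TRUE is set explicitly
as an assumption so that the silent conversion to a factor is reproduced:

```r
# Hypothetical input: one unexpected non-numeric value ("unk") in 'age'.
csv <- "id,age\n1,20\n2,21\n3,unk\n"

# Assumption: stringsAsFactors = TRUE, so a column that cannot be parsed
# as numeric is silently converted to a factor.
dat <- read.csv(text = csv, stringsAsFactors = TRUE)

str(dat)                           # 'age' is reported as a Factor
is_fac <- sapply(dat, is.factor)   # named logical, one entry per column
col_classes <- sapply(dat, class)  # class of each column

# With colClasses (R's read.table/read.csv), the bad value stops the
# read with an error instead of changing the column type.
strict <- try(read.csv(text = csv, colClasses = c("integer", "numeric")),
              silent = TRUE)
failed <- inherits(strict, "try-error")
```

The point of the last step is only that a declared column type turns a
silent conversion into a visible failure.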

The subtle, but key word in my previous post was "unexpected".  I had
multiple data sets that were used to develop a script.  It was only
after the script had been put into production that the unexpected
non-numeric value appeared.  The postmortem analysis of the situation
pointed to multiple faults, one of which was that S does the "right"
thing with factors 99% of the time (but it's the remaining 1% that
causes 99% of the grief...).
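
One way a production script might guard against this kind of surprise is
to check column types before any computation.  The sketch below is R,
and the function name `check_numeric` and the example data frames are
hypothetical, not something taken from the script in question:

```r
# Hypothetical guard: stop early if any of the named columns is not numeric.
check_numeric <- function(data, cols) {
  bad <- cols[!sapply(data[cols], is.numeric)]
  if (length(bad) > 0) {
    stop("non-numeric column(s): ", paste(bad, collapse = ", "))
  }
  invisible(TRUE)
}

# Hypothetical data: one clean column, one that became a factor.
ok_data  <- data.frame(age = c(20, 21, 22))
bad_data <- data.frame(age = factor(c("20", "21", "unk")))

ok_result  <- check_numeric(ok_data, "age")                       # passes
bad_result <- try(check_numeric(bad_data, "age"), silent = TRUE)  # stops
```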

For example, is the following the "right" behavior for factors?

ifelse(c(T,F,T),factor(4:6),7:9)
[1] 1 8 3
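
For reference, a short sketch in R of where the 1 and the 3 come from,
and one way to avoid them.  ifelse() does not keep the factor attributes,
so the values taken from the factor are its integer codes rather than
its labels.  The helper name `as_number` is hypothetical:

```r
f <- factor(4:6)

# The integer codes stored in the factor, not the labels "4", "5", "6".
codes <- as.integer(f)

# ifelse() takes the codes wherever the test is TRUE.
raw <- ifelse(c(TRUE, FALSE, TRUE), f, 7:9)

# Hypothetical helper: go through the labels when x is a factor.
as_number <- function(x) {
  if (is.factor(x)) as.numeric(as.character(x)) else as.numeric(x)
}

# Converting first gives the values the factor *looks* like.
fixed <- ifelse(c(TRUE, FALSE, TRUE), as_number(f), 7:9)
```

Here `raw` reproduces the 1 8 3 shown above, while `fixed` is 4 8 6.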

Let's not discuss coercion, precedence, inheritance, and so forth, but
just stop with "factors are useful, but occasionally risky and each
person must determine his/her risk tolerance."

Kevin Wright


On Sat, Mar 15, 2008 at 7:47 PM,  <Bill.Venables@csiro.au> wrote:
> I agree with Doug 150%.  You should think of this as a learning
>  experience and grow from it rather than get all upset about S being not
>  "what you see is what you get".  S is more than that, other systems are
>  less.
>
>  Just to be a bit more constructive, though, here are a few comments.
>
>  Note that if you want to convert a factor into the numeric vector that
>  it *looks* like, then as.numeric(f) will not necessarily do it.  You get
>  the phenomenon shown below.  You need to use
>  as.numeric(as.character(f)).
>
>  In S-PLUS 8 there is an 'str' function, but it is in the pkgutils library,
>  part of the initiative to build compatibility with R.  This is a
>  welcome initiative that all S-PLUS users should embrace
>  enthusiastically.
>
>  str(object) and summary(object) are both good, but my very first step is
>  always to use
>
>  > sapply(myData, class)
>
>  and look for anomalies with what I expect.  This is often enough to flag
>  a problem.  It's quick, easy and costs nothing (much).
>
>  The R function read.table allows you to specify colClasses, giving full
>  control over how things are converted when read.  I do not notice such
>  an argument in the S-PLUS functions read.table or importData (in S-PLUS
>  8.0.1 at least), so this looks like a bit of a lack that needs to be
>  plugged.  It should be easy enough.
>
>  Bill Venables.
>
>
>
>
>  -----Original Message-----
>  From: s-news-owner@lists.biostat.wustl.edu
>  [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Douglas Bates
>  Sent: Sunday, 16 March 2008 6:10 AM
>  To: Kevin Wright
>  Cc: s-news@lists.biostat.wustl.edu
>  Subject: Re: [S] Burned by factors
>
>  On Wed, Mar 5, 2008 at 1:12 PM, Kevin Wright <kw.statr@gmail.com> wrote:
>  >  One place I find factors nice is when I create trellis plots of
>  >  subsets of data.  Factors keep the panels in the same place on the
>  >  page across different subsets of data, even in the situation that one
>  >  subset might have a panel with no data.
>
>  >  On the whole, however, I spend a significant amount of time fighting
>  >  factors and my conclusion is that S is far too eager to convert data
>  >  to factors.
>
>  >  Here's an example that burned me badly and cost me about six hours of
>  >  work. In essence, what happened was that data was read from two
>  >  different files. In one file, 'age' was read as numeric, while in the
>  >  second file 'age' had an unexpected, non-numeric value that caused a
>  >  behind-the-scenes conversion of age to a factor (instead of a numeric
>  >  with a missing value). Later in the code, merging these caused
>  >  unexpected results. Here is the essence of what happened:
>
>  >  age1 <- factor(c("20", "21", "22"))
>  >  age2 <- c(20, 21, 22)
>  >  ifelse(c(T, T, F), age1, age2)
>  >  [1]  1  2 22
>
>  >  The desired result was: 20, 21, 22.
>
>  >  Give a report to a client with erroneous results that traces back to
>  >  this phenomenon and you'll become paranoid about conversions to
>  >  factors.
>
>  Perhaps that is why it is a good idea to check the structure of a data
>  frame *before* beginning an analysis with it.  Even in my introductory
>  classes I emphasize that the first thing you always do with data is
>
>  str(myData)    # available in R, I'm not sure if it is in S-PLUS
>
>  and maybe
>
>  summary(myData)
>
>  I tend to preach this for the opposite reason, however.  I think it is
>  much more common to make the mistake of using a numeric vector when it
>  should be a factor than it is to use a factor when you think it should
>  be a numeric vector.
