Re: Burned by factors

To: "Kevin Wright" <kw.statr@gmail.com>
Subject: Re: Burned by factors
From: "Douglas Bates" <bates@stat.wisc.edu>
Date: Sat, 15 Mar 2008 15:10:11 -0500
Cc: s-news@lists.biostat.wustl.edu
On Wed, Mar 5, 2008 at 1:12 PM, Kevin Wright <kw.statr@gmail.com> wrote:
> One place I find factors nice is when I create trellis plots of
>  subsets of data.  Factors keep the panels in the same place on the
>  page across different subsets of data, even in the situation that one
>  subset might have a panel with no data.

>  On the whole, however, I spend a significant amount of time fighting
>  factors and my conclusion is that S is far too eager to convert data
>  to factors.

>  Here's an example that burned me badly and cost me about six hours of
>  work. In essence, what happened was that data was read from two
>  different files. In one file, 'age' was read as numeric, while in the
>  second file 'age' had an unexpected, non-numeric value that caused a
>  behind-the-scenes conversion of age to a factor (instead of a numeric
>  with a missing value). Later in the code, merging these caused
>  unexpected results. Here is the essence of what happened:

>  age1 <- factor(c("20", "21", "22"))
>  age2 <- c(20, 21, 22)
>  ifelse(c(T, T, F), age1, age2)
>  [1]  1  2 22

>  The desired result was: 20, 21, 22.

>  Give a report to a client with erroneous results that traces back to
>  this phenomenon and you'll become paranoid about conversions to
>  factors.

Perhaps that is why it is a good idea to check the structure of a data
frame *before* beginning an analysis with it.  Even in my introductory
classes I emphasize that the first thing you always do with data is

str(myData)    # available in R; I'm not sure whether it is in S-PLUS

and maybe

summary(myData)
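
As a rough sketch of how such a check would have flagged the problem in the example quoted above (written for R; the name age1.num is made up for illustration, and I have not tried this in S-PLUS):

```r
age1 <- factor(c("20", "21", "22"))   # became a factor behind the scenes
age2 <- c(20, 21, 22)                 # numeric, as intended

str(age1)   # reports a Factor with 3 levels rather than a numeric vector
str(age2)   # reports a numeric vector

# Convert through character so that the labels, not the integer codes, are used
age1.num <- as.numeric(as.character(age1))

ifelse(c(TRUE, TRUE, FALSE), age1.num, age2)
# [1] 20 21 22
```

Calling as.numeric() directly on the factor would return the internal codes 1, 2, 3, which is the same surprise that ifelse() produced.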

I tend to preach this for the opposite reason, however.  I think it is
much more common to make the mistake of using a numeric vector when it
should be a factor than it is to use a factor when you think it should
be a numeric vector.
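
A small illustration of that opposite mistake, assuming R and using made-up numbers (the names trt, y, fit.num and fit.fac are hypothetical): a treatment code stored as 1, 2, 3 is silently fitted as a single slope unless it is converted to a factor.

```r
# Made-up data: three treatment groups coded 1, 2, 3
trt <- rep(1:3, each = 4)
y <- c(5.1, 4.8, 5.3, 5.0,
       6.9, 7.2, 7.0, 6.8,
       5.9, 6.1, 6.0, 5.8)

fit.num <- lm(y ~ trt)           # trt taken as numeric: intercept plus one slope
fit.fac <- lm(y ~ factor(trt))   # trt taken as a factor: one mean per group

length(coef(fit.num))   # 2 coefficients
length(coef(fit.fac))   # 3 coefficients
```

Neither call produces a warning, which is why looking at str() first is the only reliable way to catch it.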
