s-news

Re: Removing levels from a factor

To: "Frank E Harrell Jr" <f.harrell@vanderbilt.edu>
Subject: Re: Removing levels from a factor
From: "Douglas Bates" <bates@stat.wisc.edu>
Date: Sat, 15 Mar 2008 15:30:02 -0500
Cc: "Austin, Matt" <maustin@amgen.com>, "s-news@lists.biostat.wustl.edu" <s-news@lists.biostat.wustl.edu>
In-reply-to: <47CEE95B.9010202@vanderbilt.edu>
References: <A413DCB0A7390F41B86A3EF6E2C9743A37BDAC6C69@usto-pmsg-mbs02.am.corp.amgen.com> <47CEE95B.9010202@vanderbilt.edu>
On Wed, Mar 5, 2008 at 1:41 PM, Frank E Harrell Jr
<f.harrell@vanderbilt.edu> wrote:
> Austin, Matt wrote:
>  > I'm not Terry, but I'll give my opinion.  I find factors quite useful and
>  > very natural--but I started programming using S and didn't migrate from
>  > another language/application.
>  >
>  > When a modeling application implicitly coerces a vector from character to
>  > factor, how does it choose the reference level when using treatment style
>  > contrasts? It sorts alphabetically, which means in treatment contrasts you
>  > would be comparing all other levels to the level that sorts alphabetically
>  > first. In most applications of medical research, we have a preconceived
>  > notion of what should be the reference (placebo, active comparator, lowest
>  > dose).  By using factors we can explicitly state the ordering.  This is also
>  > very useful for displaying data in either tables or figures--ordering is
>  > important and this is taken care of when constructing the factor objects.
>  >
>  > I suppose it's a tradeoff of having to explicitly dump levels when needed
>  > or take care of the ordering at the appropriate time just before the
>  > modeling/display.  I chose the first method.

>  I have to agree with Matt.  In my experience and in the experience of a
>  large number of R users in our department, the advantages of factors far
>  outweigh the disadvantages.

Yes.  This tends to be a religious issue with fundamentalists in both
camps.  I believe that the "factors are beneficial" camp is larger
than the "factors are the work of the devil" camp.

However, as far as I can tell no one answered the original question
which was how do you drop unused levels from a factor - assuming that
you do not want to do this by banishing the concept of a factor and
exiling any person who dares to speak the name. The answer in R, and I
believe also in S-PLUS, is to use drop = TRUE in a subscripting
expression.  Thus

newdata <- subset(data, Treatment %in% c("B", "C", "D"))

produces a data frame in which Treatment still has 5 levels, but

newdata$Treatment[drop = TRUE]

has 3 levels.  An example run in R (I don't have access to a copy of S-PLUS) is

> set.seed(123454321)
> (dd <- data.frame(Response = rnorm(20, mean = 100, sd = 20),
+                   Treatment = gl(5, 4, labels = LETTERS[1:5])))
    Response Treatment
1   61.14481         A
2  128.50892         A
3   81.93145         A
4   87.32405         A
5  134.72183         B
6   58.98118         B
7  122.37794         B
8  135.43260         B
9   91.19961         C
10  76.57837         C
11  76.84436         C
12 108.38061         C
13  73.45558         D
14  84.02234         D
15  81.52792         D
16  88.71802         D
17  84.74533         E
18 104.90819         E
19 116.16492         E
20  88.32498         E
> str(newdata <- subset(dd, Treatment %in% LETTERS[2:4]))
'data.frame':   12 obs. of  2 variables:
 $ Response : num  134.7  59.0 122.4 135.4  91.2 ...
 $ Treatment: Factor w/ 5 levels "A","B","C","D",..: 2 2 2 2 3 3 3 3 4 4 ...
> newdata$Treatment[drop = TRUE]
 [1] B B B B C C C C D D D D
Levels: B C D
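
Note that the subscripting expression above only prints the reduced factor; the column in newdata is unchanged until the result is assigned back. A minimal sketch of that step, plus two alternatives, follows. The Response values here are placeholders rather than the simulated data above, and droplevels() is an assumption about the reader's R version: it exists in more recent versions of R than the one used for this example.

```r
# Placeholder data with the same layout as the example above.
dd <- data.frame(Response = seq_len(20),
                 Treatment = gl(5, 4, labels = LETTERS[1:5]))
newdata <- subset(dd, Treatment %in% LETTERS[2:4])
levels_before <- nlevels(newdata$Treatment)  # unused levels A and E remain

# Assign the result back so the data frame itself carries only used levels.
newdata$Treatment <- newdata$Treatment[drop = TRUE]
levels_after <- nlevels(newdata$Treatment)

# Alternative 1: rebuilding the factor also discards unused levels.
via_factor <- factor(subset(dd, Treatment %in% LETTERS[2:4])$Treatment)

# Alternative 2 (more recent R only): droplevels() on a whole data frame
# drops unused levels from every factor column.
via_droplevels <- droplevels(subset(dd, Treatment %in% LETTERS[2:4]))
levels(via_droplevels$Treatment)
```

All three approaches keep the original relative order of the remaining levels.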

>  > -----Original Message-----
>  > From: s-news-owner@lists.biostat.wustl.edu
>  > [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Thompson, David
>  > (MNR)
>  > Sent: Wednesday, March 05, 2008 7:07 AM
>  > To: Terry Therneau; s-news@lists.biostat.wustl.edu; Mark.Hearnden@nt.gov.au
>  > Subject: Re: [S] Removing levels from a factor
>  >
>  > Terry,
>  >
>  > Regarding the 'factors are only occasionally useful' comment, what are the
>  > situations where factors are _actually_ useful?
>  > Do not (most?) modelling functions that require factors usually coerce
>  > character values as required?

>  >> -----Original Message-----
>  >> From: s-news-owner@lists.biostat.wustl.edu
>  >> [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Terry
>  >> Therneau
>  >> Sent: March 5, 2008 09:05 AM
>  >> To: s-news@lists.biostat.wustl.edu; Mark.Hearnden@nt.gov.au
>  >> Subject: Re: [S] Removing levels from a factor
>  >>
>  >> The easiest thing is to turn off factors:
>  >>
>  >>> options(stringsAsFactors=F)
>  >>> data$Treatment <- as.character(data$Treatment)
>  >> Now you can subset the data frame and things will work as you would
>  >> anticipate.
>  >> Comment: factors are occasionally useful, but only occasionally.  Much
>  >> grief can be avoided by turning them off by default.  Our biostat group
>  >> (>100 people, over 1200 projects a year) has had the above option as a
>  >> part of our global defaults for many years, and has not yet seen a
>  >> downside to the decision.
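
For comparison, the character-based approach in the quoted message can be sketched in R as below. The data are placeholders, and stringsAsFactors = FALSE is passed explicitly to data.frame() rather than set through options(), since the global default differs between R versions (it is FALSE from R 4.0.0 onward).

```r
# Placeholder data; Treatment is kept as a character column.
dd <- data.frame(Response = seq_len(20),
                 Treatment = rep(LETTERS[1:5], each = 4),
                 stringsAsFactors = FALSE)

newdata <- subset(dd, Treatment %in% c("B", "C", "D"))

# With character data there are no levels to drop:
# only the values actually present are tabulated.
counts <- table(newdata$Treatment)
names(counts)

# The trade-off: any non-alphabetical ordering has to be imposed later,
# for example just before modelling or display.
ordered_arm <- factor(newdata$Treatment, levels = c("D", "C", "B"))
levels(ordered_arm)
```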
