On Wed, Mar 5, 2008 at 1:41 PM, Frank E Harrell Jr
<f.harrell@vanderbilt.edu> wrote:
> Austin, Matt wrote:
> > I'm not Terry, but I'll give my opinion. I find factors quite useful and
> very natural--but I started programming using S and didn't migrate from
> another language/application.
> >
> > When a modeling application implicitly coerces a vector from character to
> factor, how does it choose the reference level when using treatment style
> contrasts? It sorts alphabetically, which means in treatment contrasts you
> would be comparing all other levels to the level that sorts alphabetically
> first. In most applications of medical research, we have a preconceived
> notion of what should be the reference (placebo, active comparator, lowest
> dose). By using factors we can explicitly state the ordering. This is also
> very useful for displaying data in either tables or figures--ordering is
> important and this is taken care of when constructing the factor objects.
> >
> > I suppose it's a tradeoff between having to explicitly dump levels when needed
> and taking care of the ordering at the appropriate time, just before the
> modeling/display. I chose the former.
> I have to agree with Matt. In my experience and in the experience of a
> large number of R users in our department, the advantages of factors far
> outweigh the disadvantages.
Yes. This tends to be a religious issue with fundamentalists in both
camps. I believe that the "factors are beneficial" camp is larger
than the "factors are the work of the devil" camp.
However, as far as I can tell no one answered the original question
which was how do you drop unused levels from a factor - assuming that
you do not want to do this by banishing the concept of a factor and
exiling any person who dares to speak the name. The answer in R, and I
believe also in S-PLUS, is to use drop = TRUE in a subscripting
expression. Thus
newdata <- subset(data, Treatment %in% c("B", "C", "D"))
produces a data frame in which Treatment still has 5 levels, but
newdata$Treatment[drop = TRUE]
has 3 levels. An example run in R (I don't have access to a copy of S-PLUS) is
> set.seed(123454321)
> (dd <- data.frame(Response = rnorm(20, mean = 100, sd = 20),
+                   Treatment = gl(5, 4, labels = LETTERS[1:5])))
    Response Treatment
1   61.14481         A
2  128.50892         A
3   81.93145         A
4   87.32405         A
5  134.72183         B
6   58.98118         B
7  122.37794         B
8  135.43260         B
9   91.19961         C
10  76.57837         C
11  76.84436         C
12 108.38061         C
13  73.45558         D
14  84.02234         D
15  81.52792         D
16  88.71802         D
17  84.74533         E
18 104.90819         E
19 116.16492         E
20  88.32498         E
> str(newdata <- subset(dd, Treatment %in% LETTERS[2:4]))
'data.frame': 12 obs. of 2 variables:
$ Response : num 134.7 59.0 122.4 135.4 91.2 ...
$ Treatment: Factor w/ 5 levels "A","B","C","D",..: 2 2 2 2 3 3 3 3 4 4 ...
> newdata$Treatment[drop = TRUE]
[1] B B B B C C C C D D D D
Levels: B C D
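
For completeness, here is a minimal sketch of two other idioms that should give the same result in R. It rebuilds the same dd and newdata as above; the names f1, f2, f3 and newdata2 are only illustrative. Note that droplevels() was added in R 2.12.0, so it assumes a more recent R than the session above, and I have not checked any of this in S-PLUS.

```r
# Rebuild the example data from the session above
set.seed(123454321)
dd <- data.frame(Response = rnorm(20, mean = 100, sd = 20),
                 Treatment = gl(5, 4, labels = LETTERS[1:5]))
newdata <- subset(dd, Treatment %in% LETTERS[2:4])

# Three ways to drop the unused levels of a single factor
f1 <- newdata$Treatment[drop = TRUE]   # the subscripting approach shown above
f2 <- factor(newdata$Treatment)        # re-create the factor from its values
f3 <- droplevels(newdata$Treatment)    # assumes R >= 2.12.0

# droplevels() also has a data frame method that treats every factor column
newdata2 <- droplevels(newdata)

levels(f1)                     # "B" "C" "D"
nlevels(newdata2$Treatment)    # 3
```

The data frame method is the convenient one when several factor columns need cleaning after the same subset() call; the one-factor forms are enough for the case in the original question.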
> > -----Original Message-----
> > From: s-news-owner@lists.biostat.wustl.edu
> [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Thompson, David
> (MNR)
> > Sent: Wednesday, March 05, 2008 7:07 AM
> > To: Terry Therneau; s-news@lists.biostat.wustl.edu; Mark.Hearnden@nt.gov.au
> > Subject: Re: [S] Removing levels from a factor
> >
> > Terry,
> >
> > Regarding the 'factors are only occasionally useful' comment, what are the
> situations where factors are _actually_ useful?
> > Do not (most?) modelling functions that require factors usually coerce
> character values as required?
> >> -----Original Message-----
> >> From: s-news-owner@lists.biostat.wustl.edu
> >> [mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Terry
> >> Therneau
> >> Sent: March 5, 2008 09:05 AM
> >> To: s-news@lists.biostat.wustl.edu; Mark.Hearnden@nt.gov.au
> >> Subject: Re: [S] Removing levels from a factor
> >>
> >> The easiest thing is to turn off factors:
> >>
> >>> options(stringsAsFactors=FALSE)
> >>> data$Treatment <- as.character(data$Treatment)
> >> Now you can subset the data frame and things will work as you would
> > anticipate.
> >> Comment: factors are occasionally useful, but only occasionally. Much
> > grief
> >> can be avoided by turning them off by default. Our biostat group (>100
> > people,
> >> over 1200 projects a year) has had the above option as a part of our
> > global
> >> defaults for many years, and has not yet seen a downside to the
> > decision.