Dear S users,
Once again amazed by the response, I will endeavor to provide a useful
summary to the following question:
> ...I have a data set (not a data ;-) ) 40,000 records long, with a factor
> with 370 levels. About a dozen or more of the factor levels are bogus
> (mistakes) and I would like to exclude a hand full more for other
> reasons. How do I select the subset of the original data set that
> includes only the approximately 350 levels that I want? ....
>
> Original data frame
> Factor response
> 1 5.1
> 1 3.2
> 1 4.3
> 2 7.6
> ... ...
> 370 2.8
>
> factor level data frame
> Factor
> 1 T
> 2 F
> ... ...
> 370 T
A couple of key functions that I was unaware of were "is.element(el, set)"
and %in%, the latter available in R and in the Hmisc library of Frank Harrell
(usage: a %in% b is similar to "is.element(a, b)"); both are based on match().
Frank pointed me to several examples in Alzola and Harrell at
http://hesweb1.med.virginia.edu/biostat/s/doc/splus.pdf Section 4.3, where
one solution includes
levels(x) <- list(....).
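As a rough sketch of that levels(x) <- list(...) idiom (the vector x and the
level names below are made up for illustration, and the behaviour described
is that of current R): each list element names a new level and lists the old
levels mapped onto it, and any old level left out of the list becomes NA.

```r
# Hypothetical factor with one bogus level, "zz"
x <- factor(c("a", "b", "zz", "c", "a"))

# Keep "a" (renamed "A"), merge "b" and "c" into "BC";
# "zz" is not mentioned, so its records become NA.
levels(x) <- list(A = "a", BC = c("b", "c"))

x                    # A BC <NA> BC A
keep <- !is.na(x)    # logical index of the records to retain
```

The keep vector can then be used to subset the rows of the data frame that
the factor came from.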
Given these functions, I could select the factor levels I wanted to keep, or
those I wanted to remove (whichever was easier) and then subset my data
based on those criteria. For instance, to exclude factor levels 1, 2, 3, 5,
6, 8 and 9, Tom Richards suggested
newDF <- DF[!(as.numeric(DF$Factor) %in% c(1,2,3,5,6,8,9)),]
Doug Bates suggested:
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
the type of expression you want is
"%w/o%" <- function(x,y) x[!x %in% y] #-- x without y
(1:10) %w/o% c(3,7,12)
but you want to apply it to the rows as in
df[ ! df$Factor %in% badlevels, ]
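A small worked version of this row selection may help; the data frame DF and
the vector badlevels here are invented for illustration. It compares on the
level labels and then rebuilds the factor, since subsetting rows alone leaves
the unwanted levels in the levels attribute.

```r
# Hypothetical data: Factor has levels "1".."5"; levels "2" and "4" are unwanted
DF <- data.frame(Factor = factor(c(1, 1, 2, 3, 4, 5, 5)),
                 response = c(5.1, 3.2, 4.3, 7.6, 1.9, 2.8, 6.0))
badlevels <- c("2", "4")

# Compare on the level labels, then keep the rows that do not match
newDF <- DF[!(as.character(DF$Factor) %in% badlevels), ]

# The subset still carries all five levels; rebuilding the factor
# drops the ones that are no longer used
newDF$Factor <- factor(newDF$Factor)
levels(newDF$Factor)   # "1" "3" "5"
nrow(newDF)            # 5
```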
Don McKenzie suggested an explicit use of match() (written here with is.na(),
because comparing anything with != NA always yields NA):
new.df <- orig.df[!is.na(match(orig.df$factor,
                               fac.df$factor[fac.df$level=="T"])),]
where fac.df is a table of factor levels and a logical vector for those I
want to keep.
Sam Buttrey worked out the example using match explicitly:
Suppose you have these two data frames:
# your original, here 7 by 2
zap <- data.frame (factor (c(1,1,1,2,2,3,3)), num = 1:7)
# Keep 1 and 3, get rid of 2
zip <- data.frame (factor (c(1,2,3)), I(c(T, F, T)))
Now match (zap[,1], zip[,1]) gives you a vector of length 7 telling you
where, in "zip," each factor level can be found. Then you want to go to the
corresponding row in "zip" and see what's in column 2.
zip[match (zap[,1], zip[,1]),2] is a vector of length 7 telling you which
rows of zap should be kept. So, finally,
zap[zip[match (zap[,1], zip[,1]),2],]
does what you want.
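The same steps can be spelled out with the intermediate results visible. This
is only a sketch in R: the column names f and keep are added here for
readability and are not part of Sam's example.

```r
zap <- data.frame(f = factor(c(1, 1, 1, 2, 2, 3, 3)), num = 1:7)
zip <- data.frame(f = factor(c(1, 2, 3)), keep = c(TRUE, FALSE, TRUE))

pos  <- match(zap$f, zip$f)   # 1 1 1 2 2 3 3: row of zip for each row of zap
flag <- zip$keep[pos]         # TRUE TRUE TRUE FALSE FALSE TRUE TRUE

# A level of zap that is missing from zip would give NA here;
# treat such rows as "do not keep"
flag[is.na(flag)] <- FALSE

kept <- zap[flag, ]           # rows 1, 2, 3, 6, 7
```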
Sally Rodriguez worked out the following example where each record was
already labelled T or F:
Suppose your indicator variable, a vector of length x = 40,000, is
"Ind.include". It contains either "T" or "F" for include and exclude,
respectively. I'll call the data set you want to subset "my.data".
Then you can get the row numbers of those "F" and then simply drop those
cases and create a "keeper" data set containing only cases with "T" in
Ind.include by:
> Ind.include[1:10]
[1] T T T T F F T T F T
> drops <- (1:x)[Ind.include==F]
> drops[1:10]
[1] 5 6 9 15 17 18 20 21 22 24
> keeper <- my.data[-drops,]
Hope that helps,
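One caveat with negative indexing is worth a sketch (written for R, with
small made-up stand-ins for my.data and Ind.include): if no record is flagged
for exclusion, the vector of row numbers to drop is empty and the negative
index selects no rows at all, whereas indexing with the logical vector itself
behaves as expected.

```r
# Hypothetical stand-ins for my.data and Ind.include
my.data     <- data.frame(id = 1:6, response = c(5.1, 3.2, 4.3, 7.6, 2.8, 6.0))
Ind.include <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE)

# Logical indexing keeps the TRUE rows directly
keeper <- my.data[Ind.include, ]

# Negative indexing agrees when there is something to drop
drops <- which(!Ind.include)
stopifnot(identical(keeper$id, my.data[-drops, ]$id))

# ...but when nothing is flagged, the drop vector is empty
all.in <- rep(TRUE, 6)
nrow(my.data[-which(!all.in), ])   # 0, not 6
nrow(my.data[all.in, ])            # 6
```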
Finally, Tony Plate pointed out some apparent eccentricities of S+:
The behavior you describe happens because of S+'s habit of making frequent,
silent, willful and inconsistent (!:-) changes to the way data is stored.
> x <- data.frame(code=letters[c(1,2,3,4,1,2,3,4)], value=1:8)
> keep <- data.frame(code=letters[1:4],keep=c(T,F,T,F))
> x
code value
1 a 1
2 b 2
3 c 3
4 d 4
5 a 5
6 b 6
7 c 7
8 d 8
> keep
code keep
1 a TRUE
2 b FALSE
3 c TRUE
4 d FALSE
> x[as.logical(as.character(keep$keep[match(x$code, keep$code)])), ]
code value
1 a 1
3 c 3
5 a 5
7 c 7
>
The obvious thing to do doesn't work because keep$keep is not actually a
logical vector -- it is a factor (and "logical" indexing only works with
true logicals):
> x[keep$keep[match(x$code, keep$code)], ]
code value
X2 b 2
X1 a 1
X21 b 2
X12 a 1
X23 b 2
X14 a 1
X25 b 2
X16 a 1
> is.logical(keep$keep)
[1] F
> mode(keep$keep)
[1] "numeric"
>
However, if you are extra careful in the way you construct your data frame
with the logical vector, and make sure it really is logical, then everything
works as it should:
> keep2 <- data.frame(code=letters[1:4],keep=I(c(T,F,T,F)))
> mode(keep2$keep)
[1] "logical"
> x[keep2$keep[match(x$code, keep2$code)], ]
code value
1 a 1
3 c 3
5 a 5
7 c 7
>
Hope this helps!
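A defensive version of the same lookup might coerce the flag column before
using it as an index. The following is a sketch in R; the helper name as.flag
is hypothetical, and the factor-valued keep column is built on purpose to
imitate the S+ behaviour shown above.

```r
# A keep column that has ended up stored as a factor (simulated deliberately)
keep <- data.frame(code = letters[1:4],
                   keep = factor(c("TRUE", "FALSE", "TRUE", "FALSE")))
x <- data.frame(code = letters[c(1, 2, 3, 4, 1, 2, 3, 4)], value = 1:8)

# Hypothetical helper: return a true logical vector or stop with an error
as.flag <- function(v) {
  if (is.factor(v)) v <- as.character(v)   # go through the labels, not the codes
  out <- as.logical(v)
  if (any(is.na(out))) stop("flag column contains values that are not TRUE/FALSE")
  out
}

idx <- as.flag(keep$keep)[match(x$code, keep$code)]
res <- x[idx, ]   # rows 1, 3, 5, 7
```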
--------
Alternatively, I could use the factor function to specify the levels I
wanted (from Bert Gunter):
cleaned.factor<-factor(OF, levels=levels(OF), exclude=levels.exclude)
cleaned.factor<-cleaned.factor[!is.na(cleaned.factor)]
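The two lines above clean the factor on its own. To carry the selection over
to the rows of a data frame, something like the following sketch could be
used in R; DF, its columns, and the contents of levels.exclude are assumptions
made for illustration.

```r
# Hypothetical data frame and levels to exclude
DF <- data.frame(OF = factor(c("a", "b", "c", "b", "d")), response = 1:5)
levels.exclude <- c("b", "d")

# Excluded levels become NA in the rebuilt factor
cleaned <- factor(DF$OF, levels = levels(DF$OF), exclude = levels.exclude)

# Keep the rows whose level survived, and store the cleaned factor with them
newDF <- DF[!is.na(cleaned), ]
newDF$OF <- cleaned[!is.na(cleaned)]
levels(newDF$OF)   # "a" "c"
```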
--
Dr. M. Henry H. Stevens
Postdoctoral Associate
Department of Ecology, Evolution, & Natural Resources
14 College Farm Road
Cook College, Rutgers University
New Brunswick, NJ 08901-8551
email: hstevens@rci.rutgers.edu
phone: 732-932-9631
fax: 732-932-8746