Hello S-users,
Thanks to Bill Dunlap, Scott Chasalow and Leonid Gibiansky for their quick
replies. I tried both Bill's and Scott's solutions and I'm pretty
impressed! My "apply" solution was still running after more than an hour when
I received the replies, so I killed it and tried their solutions. Here is how
fast they are:
> # Bill's solution, see function at the end of message.
> ttt.time <- proc.time()
> ttt.rep <- which.duplicated(anc.all.no.doublons[, 'nosoum'])
> ElapsedTime(ttt.time, proc.time())
Elapsed time = 0 h. 0 min. 6.271484375 s.
> ttt.time <- proc.time()
> ttt.rep.2 <- seq(along = anc.all.no.doublons[,
+ 'nosoum'])[!is.na(match(anc.all.no.doublons[, 'nosoum'],
+ anc.all.no.doublons[, 'nosoum'][duplicated(anc.all.no.doublons[,
+ 'nosoum'])]))]
> ElapsedTime(ttt.time, proc.time())
Elapsed time = 0 h. 0 min. 6.109375 s.
> all.equal(ttt.rep, ttt.rep.2)  # and of course this is the "same" answer.
[1] T
Bill's function:
which.duplicated <- function(x) {
    tab <- table(x)
    tab <- tab[tab > 1]
    which(is.element(as.character(x), names(tab)))
}
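
To make the difference between duplicated() alone and the two solutions concrete, here is a minimal sketch on a made-up vector. The values in x and the names dup.idx.match, idx.later, idx.bill, idx.match and idx.r are my own, chosen for illustration, and the last variant assumes R, where duplicated() accepts a fromLast argument (it may not exist in S-PLUS 2000).

```r
# Toy data (hypothetical values): "b" and "d" each occur more than once.
x <- c("a", "b", "c", "b", "d", "d", "b")

# Bill's function, as posted above.
which.duplicated <- function(x) {
    tab <- table(x)
    tab <- tab[tab > 1]
    which(is.element(as.character(x), names(tab)))
}

# The match-based one-liner from the timing run, wrapped in a function.
# The function name is an assumption made for this sketch.
dup.idx.match <- function(x) {
    seq(along = x)[!is.na(match(x, x[duplicated(x)]))]
}

# duplicated() alone flags only the second and later occurrences.
idx.later <- seq(along = x)[duplicated(x)]

# Both solutions return every position whose value occurs more than once,
# including the first occurrence.
idx.bill <- which.duplicated(x)
idx.match <- dup.idx.match(x)

# R-only alternative (assumption: the fromLast argument is available).
idx.r <- which(duplicated(x) | duplicated(x, fromLast = TRUE))
```

On this vector, idx.later holds positions 4, 6 and 7, while the other three results hold 2, 4, 5, 6 and 7. All of these avoid the per-duplicate scan over the full column that the apply() version performs.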
The original message follows, thanks again to all respondents,
Gérald Jean
Consulting Analyst (Statistics), Actuarial
telephone: (418) 835-4900 ext. (7639)
fax: (418) 835-6657
e-mail: gerald.jean@spgdag.ca
"In God we trust all others must bring data" W. Edwards Deming
Gerald.Jean@spgdag.ca@lists.biostat.wustl.edu on 2001/02/28 14:09:34
Sent by: s-news-owner@lists.biostat.wustl.edu
To: s-news@wubios.wustl.edu
cc:
Subject: [S] Speeding up way of finding Indices of duplicates.
Hello S-users,
S-2000, R3; NT4.0, SP5
I have a large data.frame and I want to find the indices of duplicate
entries in one of the columns. With the function duplicated I can find
the indices of entries which have appeared before, but what I am interested
in is the indices of all entries appearing more than once. I have been
implementing this through the apply function, but it takes forever to run.
In my application here I have roughly 180K observations and duplicated
tells me that roughly 30K have appeared before, hence I am after 60K or
more indices. Here is my very slow way of doing it:
ttt.duplicated <- duplicated(anc.all.no.doublons[, 'nosoum'])
ttt.paste.all <- anc.all[, 'nosoum']
ttt.paste <- anc.all[, 'nosoum'][ttt.duplicated]
ttt.ind <- apply(as.matrix(ttt.paste), 1,
                 FUN = function(x, y) which(y == x),
                 ttt.paste.all)
ttt.ind <- unlist(ttt.ind)
ttt.ind <- sort(ttt.ind)
Any hint on how to speed that up?
Thanks,
Gérald Jean