Summary: Speeding up way of finding Indices of duplicates.

To: s-news@wubios.wustl.edu
Subject: Summary: Speeding up way of finding Indices of duplicates.
From: Gerald.Jean@spgdag.ca
Date: Wed, 28 Feb 2001 15:38:18 -0500
Hello S-users,

Thanks to Bill Dunlap, Scott Chasalow and Leonid Gibiansky for their quick
replies.  I tried both Bill's and Scott's solutions and I'm pretty
impressed!  My "apply" solution was still running after more than an hour
when I received the replies, so I killed it and tried their solutions.  Here
is how fast they are:

> ttt.time <- proc.time()
> # Bill's solution; see the function below.
> ttt.rep <- which.duplicated(anc.all.no.doublons[, 'nosoum'])
> ElapsedTime(ttt.time, proc.time())
 Elapsed time =  0 h.  0 min.  6.271484375 s.

> ttt.time <- proc.time()
> ttt.rep.2 <- seq(along = anc.all.no.doublons[,
+              'nosoum'])[!is.na(match(anc.all.no.doublons[, 'nosoum'],
+              anc.all.no.doublons[, 'nosoum'][duplicated(anc.all.no.doublons[,
+              'nosoum'])]))]
> ElapsedTime(ttt.time, proc.time())
 Elapsed time =  0 h.  0 min.  6.109375 s.

> all.equal(ttt.rep, ttt.rep.2)   # and of course this is the "same" answer.
[1] T
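
For anyone reading the second one-liner, it can be unpacked into three steps.
This is only a sketch on a made-up vector x (a stand-in for the 'nosoum'
column, not the real data), written for R, which accepts the same syntax:

```r
# Hypothetical toy data standing in for anc.all.no.doublons[, 'nosoum']
x <- c("a", "b", "a", "c", "b", "a", "d")

# Step 1: values flagged by duplicated(), i.e. values that occur at least twice
dup.values <- x[duplicated(x)]

# Step 2: TRUE wherever x holds one of those values (first occurrences included)
in.dup <- !is.na(match(x, dup.values))

# Step 3: turn the logical vector into positions
idx <- seq(along = x)[in.dup]

print(idx)   # 1 2 3 5 6
```

The point of step 2 is that match() is vectorised, so the whole column is
looked up in a single pass rather than once per duplicated entry.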

Bill's function:

which.duplicated <- function(x) {
     tab <- table(x)
     tab <- tab[tab>1]
     which(is.element(as.character(x), names(tab)))
}
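
A small illustration of what the function returns compared with plain
duplicated(), again on a made-up vector x.  The last line is an alternative
that assumes R, where duplicated() has a fromLast argument; that argument is
an assumption about the reader's environment and was not part of the replies
summarised here:

```r
# Bill's function, repeated so that this example is self-contained
which.duplicated <- function(x) {
    tab <- table(x)
    tab <- tab[tab > 1]
    which(is.element(as.character(x), names(tab)))
}

x <- c(10, 20, 10, 30, 20, 10, 40)   # hypothetical toy data

later.only <- which(duplicated(x))   # only the second and later occurrences
all.occ    <- which.duplicated(x)    # every occurrence of a repeated value

# R-only alternative (assumes the fromLast argument is available)
alt <- which(duplicated(x) | duplicated(x, fromLast = TRUE))

print(later.only)   # 3 5 6
print(all.occ)      # 1 2 3 5 6
print(alt)          # 1 2 3 5 6
```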

The original message follows.  Thanks again to all respondents,

Gérald Jean
Consulting Analyst (Statistics), Actuarial
telephone            : (418) 835-4900 ext. 7639
fax                  : (418) 835-6657
e-mail               : gerald.jean@spgdag.ca

"In God we trust all others must bring data"  W. Edwards Deming





Gerald.Jean@spgdag.ca@lists.biostat.wustl.edu on 2001/02/28 14:09:34

Sent by:  s-news-owner@lists.biostat.wustl.edu

To:       s-news@wubios.wustl.edu
cc:
Subject:  [S] Speeding up way of finding Indices of duplicates.


Hello S-users,

S-2000, R3; NT4.0, SP5

I have a large data.frame and I want to find the indices of duplicate
entries in one of its columns.  With the function duplicated I can find
the indices of entries which have appeared before, but what I am interested
in is the indices of all entries appearing more than once, first occurrences
included.  I have been implementing this through the apply function but it
takes forever to run.  In my application I have roughly 180K observations
and duplicated tells me that roughly 30K have appeared before, hence I am
after as many as 60K indices.  Here is my very slow way of doing it:

ttt.duplicated <- duplicated(anc.all.no.doublons[, 'nosoum'])
ttt.paste.all  <- anc.all[, 'nosoum']
ttt.paste      <- anc.all[, 'nosoum'][ttt.duplicated]
ttt.ind        <- apply(as.matrix(ttt.paste), 1,
                        FUN = function(x, y) which(y == x),
                        ttt.paste.all)
ttt.ind <- unlist(ttt.ind)
ttt.ind <- sort(ttt.ind)

Any hint on how to speed that up?
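
A rough sketch of why the approach above is slow, using a made-up vector y
rather than the real data: every duplicated entry triggers a full scan of the
column, so with the figures quoted above that is roughly 30K scans of 180K
values, or about 5.4 billion comparisons.  The toy run also shows that a value
occurring three or more times contributes its indices more than once:

```r
y    <- c("a", "b", "a", "c", "b", "a", "d")   # hypothetical toy data
dups <- y[duplicated(y)]                        # "a" "b" "a"

# Same pattern as above: one full scan of y per duplicated entry
ind <- apply(as.matrix(dups), 1,
             FUN = function(x, y) which(y == x),
             y)
ind <- sort(unlist(ind))

print(ind)           # 1 1 2 3 3 5 6 6  ("a" was scanned for twice)
print(unique(ind))   # 1 2 3 5 6
```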

Thanks,

Gérald Jean
