s-news

Speeding up way of finding Indices of duplicates.

To: s-news@wubios.wustl.edu
Subject: Speeding up way of finding Indices of duplicates.
From: Gerald.Jean@spgdag.ca
Date: Wed, 28 Feb 2001 14:09:34 -0500
Hello S-users,

S-2000, R3; NT4.0, SP5

I have a large data.frame and I want to find the indices of duplicate
entries in one of its columns.  Through the function duplicated I can find
the indices of entries which have appeared before, but what I am interested
in is the indices of all entries appearing more than once.  I have been
implementing this through the apply function but it takes forever to run.
In my application here I have roughly 180K observations and duplicated
tells me that roughly 30K have appeared before, hence I am after 60K or
more indices.  Here is my very slow way of doing it:

ttt.duplicated <- duplicated(anc.all.no.doublons[, 'nosoum'])
ttt.paste.all  <- anc.all[, 'nosoum']
ttt.paste      <- anc.all[, 'nosoum'][ttt.duplicated]
ttt.ind        <- apply(as.matrix(ttt.paste), 1,
                        FUN = function(x, y) which(y == x),
                        ttt.paste.all)
ttt.ind <- unlist(ttt.ind)
ttt.ind <- sort(ttt.ind)

Any hint on how to speed that up?
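
For illustration, one vectorized possibility is sketched below.  It is only
a sketch, not a tested solution for the data above: the helper name
all.dup.indices and the small vector x are made up for the example, and it
assumes that duplicated, unique and match behave on a plain vector as they
do in R.  The idea is to collect the values that occur more than once and
then match the whole column against them in a single pass, instead of
calling which once per duplicated entry.

```r
# Hypothetical helper (not part of the original post): return the indices
# of every element of x whose value occurs more than once in x.
all.dup.indices <- function(x) {
  # Values that occur at least twice, each listed once
  dup.vals <- unique(x[duplicated(x)])
  # match() returns NA for elements of x that are not among dup.vals
  hit <- !is.na(match(x, dup.vals))
  # Positions of the hits, already in increasing order
  seq(along = x)[hit]
}

# Small made-up example: 10 and 20 are repeated, 30 and 40 are not
x <- c(10, 20, 10, 30, 20, 10, 40)
ttt.ind <- all.dup.indices(x)
```

On the real data the call would presumably be
all.dup.indices(anc.all[, 'nosoum']).  Note that each index comes back
once, whereas the apply version above repeats the indices of any value that
occurs three or more times.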

Thanks,

Gérald Jean
Consulting Analyst (Statistics), Actuarial
telephone            : (418) 835-4900 ext. 7639
fax                  : (418) 835-6657
e-mail               : gerald.jean@spgdag.ca

"In God we trust all others must bring data"  W. Edwards Deming

