Hello S-users,
S-2000, R3; NT4.0, SP5
I have a large data.frame and I want to find the indices of duplicate
entries in one of its columns. With the function duplicated I can find
the indices of entries that have appeared before, but what I am
interested in is the indices of all entries appearing more than once. I
have been doing this with the apply function, but it takes forever to
run. In my application I have roughly 180K observations, and duplicated
tells me that roughly 30K have appeared before, hence I am after 60K or
more indices. Here is my very slow way of doing it:
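# Flag entries of 'nosoum' that have already appeared earlier.
# Note: the flags come from anc.all.no.doublons but are applied to
# anc.all below, so both must have the same rows in the same order.
ttt.duplicated <- duplicated(anc.all.no.doublons[, 'nosoum'])
ttt.paste.all <- anc.all[, 'nosoum']
ttt.paste <- ttt.paste.all[ttt.duplicated]
# For each repeated value, scan the whole column for matching positions.
ttt.ind <- apply(as.matrix(ttt.paste), 1,
                 FUN = function(x, y) which(y == x),
                 ttt.paste.all)
ttt.ind <- unlist(ttt.ind)
ttt.ind <- sort(ttt.ind)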
Any hint on how to speed that up?
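To make the goal concrete, here is a minimal sketch on a small made-up
vector (x, dup.later and dup.all are illustrative names, not part of the
code above). It shows the difference between what duplicated returns and
the indices that are wanted, using a single vectorised match call instead
of a scan per repeated value. It assumes match and which behave as they
do in R and has not been checked on S-PLUS 2000.

```r
# Made-up values standing in for the 'nosoum' column (assumption)
x <- c(10, 20, 10, 30, 20, 10, 40)

# duplicated() flags only the second and later occurrences
dup.later <- which(duplicated(x))

# Wanted: every position whose value occurs more than once.
# match() looks each value up in the set of repeated values in one pass.
dup.all <- which(!is.na(match(x, x[duplicated(x)])))

print(dup.later)  # 3 5 6
print(dup.all)    # 1 2 3 5 6
```

On the real data, x would be the 'nosoum' column; the result is already
in increasing order, so no sort is needed.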
Thanks,
Gérald Jean
Consulting Analyst (Statistics), Actuarial Department
telephone: (418) 835-4900 ext. 7639
fax: (418) 835-6657
e-mail: gerald.jean@spgdag.ca
"In God we trust all others must bring data" W. Edwards Deming