
Re: Speeding up way of finding Indices of duplicates.

To: Gerald.Jean@spgdag.ca
Subject: Re: Speeding up way of finding Indices of duplicates.
From: Pierre Kleiber <pkleiber@honlab.nmfs.hawaii.edu>
Date: Wed, 28 Feb 2001 10:54:35 -1000
Cc: s-news@wubios.wustl.edu
References: <OFB33C9224.E749F8E8-ON85256A01.006855DF@spgdag.ca>
I don't really understand your approach to it, but if I understand your
question correctly, I think you can simply do the following:

> vec <- anc.all.no.doublons$nosoum
> ttt.ind <- seq(length(vec))[vec %in% vec[duplicated(vec)]]

This took about 1 second on my Linux box with a vector of about 200K
elements, most of which were duplicates.
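
To make the idiom concrete, here is a minimal sketch on a toy vector. It is
written for R, which is an assumption on my part since the thread concerns
S-PLUS, and the names vec.toy, dup.values, dup.ind and dup.ind2 are
hypothetical, chosen only for illustration.

```r
# Toy data (hypothetical): the values 3 and 7 each occur more than once.
vec.toy <- c(3, 5, 7, 3, 9, 7, 7)

# duplicated() flags only the second and later occurrences of a value,
# so indexing with it yields the values that are repeated somewhere.
dup.values <- vec.toy[duplicated(vec.toy)]

# Keep the index of every element whose value is among the repeated
# values; this includes the first occurrence of each one.
dup.ind <- seq(length(vec.toy))[vec.toy %in% dup.values]
print(dup.ind)

# The same result using which() on the logical mask.
dup.ind2 <- which(vec.toy %in% dup.values)
```

The indices come out already in increasing order, so no sort is needed
afterwards, and the whole vector is matched in one vectorized step rather
than once per duplicated entry.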

Gerald.Jean@spgdag.ca wrote:
> 
> Hello S-users,
> 
> S-2000, R3; NT4.0, SP5
> 
> I have a large data.frame and I want to find the indices of duplicate
> entries in one of the columns.  Through the function duplicated I can find
> the indices of entries which have appeared before, but what I am interested
> in is the indices of all entries appearing more than once.  I have been
> implementing this through the apply function but it takes forever to run.
> In my application here I have roughly 180K observations and duplicated
> tells me that roughly 30K have appeared before, hence I am after 60K or
> more indices.  Here is my very slow way of doing it:
> 
> ttt.duplicated <- duplicated(anc.all.no.doublons[, 'nosoum'])
> ttt.paste.all  <- anc.all[, 'nosoum']
> ttt.paste      <- anc.all[, 'nosoum'][ttt.duplicated]
> ttt.ind        <- apply(as.matrix(ttt.paste), 1, FUN = function(x, y) which(y == x),
>                         ttt.paste.all)
> ttt.ind <- unlist(ttt.ind)
> ttt.ind <- sort(ttt.ind)
> 
> Any hint on how to speed that up?
> 
> Thanks,
> 
> Gérald Jean
> Consulting Analyst (Statistics), Actuarial
> telephone: (418) 835-4900 ext. 7639
> fax      : (418) 835-6657
> e-mail   : gerald.jean@spgdag.ca
> 
> "In God we trust all others must bring data"  W. Edwards Deming
> 

-- 
-----------------------------------------------------------------
Pierre Kleiber             Email: pkleiber@honlab.nmfs.hawaii.edu
Fishery Biologist                     Tel: 808 983-5399/737-7544
NOAA FISHERIES - Honolulu Laboratory         Fax: 808 983-2902
2570 Dole St., Honolulu, HI 96822-2396 
-----------------------------------------------------------------
 "God could have told Moses about galaxies and mitochondria and
  all.  But behold... It was good enough for government work."
-----------------------------------------------------------------
