Re: Horribly slow aggregate: alternatives?

To: "'Wim Kimmerer'" <kimmerer@sfsu.edu>, s-news@wubios.wustl.edu
Subject: Re: Horribly slow aggregate: alternatives?
From: "Liaw, Andy" <andy_liaw@merck.com>
Date: Mon, 21 Jul 2003 15:59:01 -0400
I tried the following in R, but don't imagine Splus being too different
here:

> x <- data.frame(a=sample(LETTERS, 23000, replace=TRUE),
                  b=sample(LETTERS, 23000, replace=TRUE),
                  c=sample(LETTERS, 23000, replace=TRUE),
                  d=sample(LETTERS, 23000, replace=TRUE),
                  e=sample(LETTERS, 23000, replace=TRUE),
                  y1=rnorm(23000), y2=rnorm(23000))

> system.time(res <- tapply(x$y1, x[,1:5], sum))
[1] 3.86 0.73 4.67   NA   NA

This, however, gives the answer in a 5-dimensional array with lots of empty
cells.  You may want to do something like:

> system.time(res <- tapply(x$y1,
                            do.call("paste", c(x[,1:5], sep=":")),
                            sum))
[1] 2.22 0.01 2.37   NA   NA

This gives sums only for the combinations that actually occur in the data
(shown here for y1; repeat with x$y2 for the second column).
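
If both columns are needed at once, the same pasted key can be handed to
rowsum() instead of tapply(). This is only a sketch and not what was timed
above: rowsum() is a different function, I have run it only in R and am
assuming it behaves the same in Splus, and the names key, sums, first and
groups are made up for the example.

```r
# Sketch: sum y1 and y2 for every existing combination of a..e.
set.seed(1)
n <- 23000
x <- data.frame(a = sample(LETTERS, n, replace = TRUE),
                b = sample(LETTERS, n, replace = TRUE),
                c = sample(LETTERS, n, replace = TRUE),
                d = sample(LETTERS, n, replace = TRUE),
                e = sample(LETTERS, n, replace = TRUE),
                y1 = rnorm(n), y2 = rnorm(n))

# One key per row, built the same way as in the tapply() call.
key <- do.call("paste", c(x[, 1:5], sep = ":"))

# rowsum() adds up the rows of a matrix within each group; the result
# has one row per distinct key and one column per response.
sums <- rowsum(as.matrix(x[, c("y1", "y2")]), key)

# Recover the grouping columns from the first row seen for each key,
# then line the sums up with them by key.
first <- !duplicated(key)
groups <- x[first, 1:5]
res <- data.frame(groups, sums[key[first], , drop = FALSE],
                  row.names = NULL)
```

res then has one row per combination that occurs, with columns a to e
followed by the two sums, which is the shape aggregate would have returned.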

HTH,
Andy


> -----Original Message-----
> From: Wim Kimmerer [mailto:kimmerer@sfsu.edu] 
> Sent: Monday, July 21, 2003 3:41 PM
> To: s-news@wubios.wustl.edu
> Subject: [S] Horribly slow aggregate: alternatives?
> 
> 
> Splusers (v. 61.1 windows 98 PIII with 256MB): 
> 
> I have a data frame with about 23000 records and 7 columns.  
> I want to get sums of the last 2 columns by unique 
> combinations of the first 5 columns, which results in about 
> 18000 records.
> 
> So... I used aggregate, and when I couldn't get any work done 
> for several hours because the computer was humming and 
> grinding away at this problem, I nuked Splus, exported the 
> data, imported into Access (YUK) and ran a query which 
> took.... well I don't know but it was less than a second.
> 
> I looked at Hmisc for alternatives: there is a function 
> called summarize, but that only works for functions that 
> return >1 value (as far as I know), and will not run the 
> function on more than one data value (you can give it a 
> matrix but for each combination of the grouping variables, it 
> performs the function on all of the values in all columns of 
> the matrix corresponding to those rows).  Thus, summarize is 
> not suitable for what I want to do, without needless trickery.
> 
> I realize that aggregate uses tapply which uses loops, but 
> geez.... several hours at least, compared to under one second?  
> 
> Is there an alternative that does not loop?
> 
> Thanks...Wim
> ======================
> Dr. Wim Kimmerer
> Romberg Tiburon Center
> San Francisco State University
> 3152 Paradise Drive
> Tiburon CA 94920
> Ph. (415) 338-3515
> Fax (415) 435-7120
> http://online.sfsu.edu/~kimmerer/
