s-news

Re: scalability

To: David L Lorenz <lorenz@usgs.gov>
Subject: Re: scalability
From: Steve Karmesin <skarmesin@blackmesacapital.com>
Date: Fri, 26 Mar 2004 16:16:56 -0700
Cc: s-news@lists.biostat.wustl.edu
In-reply-to: <OF08949065.1D44A7AE-ON86256E63.006E1A84@cr.usgs.gov>
References: <OF08949065.1D44A7AE-ON86256E63.006E1A84@cr.usgs.gov>
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.6) Gecko/20040113
As others have said, what apply has to do in this case is loop over the 900,000 cases and do a 'sum' over three elements each time. In this case the overhead of calling an S+ function totally swamps the numeric operations.

Doing this on smaller datasets (300x30x3) on my machine (2 CPUs, 3GHz Xeon running Windows 2000 and S-Plus 6.1) shows an overhead of about 100 microseconds per call to sum, so for the full array I would expect it to take 100*1e-6*9e5 = 90 seconds.

The thing is, it is worse than this. If I do a case with 900x90x3 it takes 300 usec per 'sum'.

R is fairly stable at just under 15usec per 'sum' on my machine.
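
A minimal sketch of how a per-call figure like these might be measured, in R only. The helper name time_per_call and the repetition count are hypothetical, not something from the measurements above, and the numbers it prints will depend on the machine and the R version:

```r
# Hypothetical helper: average elapsed time per call to 'sum' when apply()
# collapses the last dimension of an n1 x n2 x 3 array.
time_per_call <- function(n1, n2, reps = 3) {
  x <- array(runif(n1 * n2 * 3), dim = c(n1, n2, 3))
  elapsed <- system.time(
    for (i in seq_len(reps)) apply(x, c(1, 2), sum)
  )[["elapsed"]]
  elapsed / (reps * n1 * n2)  # seconds per call to sum
}

# The two sizes mentioned above; larger sizes can be added to look for growth.
sizes <- list(c(300L, 30L), c(900L, 90L))
for (s in sizes) {
  usec <- 1e6 * time_per_call(s[1], s[2])
  cat(sprintf("%dx%dx3: %.2f usec per sum\n", s[1], s[2], usec))
}
```

If the per-call time stays roughly flat as the array grows, the cost is dominated by function-call overhead; if it climbs with size, something else (such as memory management) is contributing.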

In many cases, when benchmarking ever larger arrays, slowdowns in per-element times are due to cache effects, but the numbers here are so much larger than any conceivable memory bandwidth times that I don't think that is what is happening. It seems most likely to be a memory management effect -- perhaps S is allocating and deallocating a bunch of things it doesn't have to on each function invocation?

In doing some systematic tests that run through different sizes repeatedly, I'm getting some strange hysteresis effects in the timings, which would make me hypothesize that the issue is memory management, but I'm just not sure what I would do if I were trying to soak up that much time per invocation.

-Steve Karmesin

David L Lorenz wrote:

Hi,
 I ran into an interesting question from one of our users. He had an array
of about 3000 by 300 by 3. He tried to use apply to sum the last dimension:

result <- apply(array, c(1,2), sum)

 I'm not sure he was ever able to get the result.  He was surprised
because he could use apply over different dimensions and had no problem:

wrong.result <- apply(array, c(2,3), sum)

 I suggested that he simply break down the problem into a simple
summation:

result <- array[,,1] + array[,,2] + array[,,3]

 That executed very fast.
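
A small sketch, in R, checking that the slice-by-slice sum gives the same answer as the apply call. The name x and the reduced array size are illustrative stand-ins for the user's array, and rowSums with the dims argument is an R alternative that is assumed here, not something verified for S-Plus:

```r
# Small stand-in array; 'x' is an illustrative name (the post calls it 'array').
set.seed(1)
x <- array(runif(30 * 20 * 3), dim = c(30, 20, 3))

by_apply  <- apply(x, c(1, 2), sum)          # one call to sum per (i, j) cell
by_slices <- x[, , 1] + x[, , 2] + x[, , 3]  # two vectorized additions
by_rowsum <- rowSums(x, dims = 2)            # R: sums over every dim after the 2nd

# All three agree up to floating-point tolerance.
stopifnot(isTRUE(all.equal(by_apply, by_slices)))
stopifnot(isTRUE(all.equal(by_apply, by_rowsum)))
```

The slice version makes a fixed number of vectorized calls no matter how many cells there are, whereas apply makes one call to sum per cell, which is where the per-call overhead accumulates.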

 My question is "Has anybody constructed a list of functions that do not
scale well under certain circumstances?"  I remember seeing something
within the last year about outer being very slow for long vectors and
clearly, there are some problems with apply.
 Thanks.
Dave



