
Re: identifying which column/row is passed in calls to apply()

To: Rich@Mango-Solutions.com
Subject: Re: identifying which column/row is passed in calls to apply()
From: Tony Plate <tplate@blackmesacapital.com>
Date: Fri, 17 Feb 2006 10:53:47 -0700
Cc: "'Schwarz,Paul'" <PSchwarz@gcrinsight.com>, s-news@lists.biostat.wustl.edu
In-reply-to: <20060211115934.46FEF1008C60@mailgate.biostat.wustl.edu>
References: <20060211115934.46FEF1008C60@mailgate.biostat.wustl.edu>
User-agent: Mozilla Thunderbird 1.0.5 (Windows/20050711)
Rich@Mango-Solutions.com wrote:
You could use sapply to pass in a vector of numbers (for column numbers) or
characters (for column names) and index the columns of the input data.
Something like this:-


innerFun <- function(i, df) {

        if (i == 1) mean(df[[i]])
        else median(df[[i]])
}

sapply(1:length(myDf), innerFun, df=myDf)


Not as efficient as apply for larger data structures I'd imagine, since
you're passing the entire dataset in each time ...

Rich.

Actually, S-PLUS usually avoids making unnecessary copies of datasets in this type of situation. (AFAIK, S-PLUS uses techniques along the lines of duplicating an object only when it is changed, so local "copies" of unchanged objects can just be references.)

Here are some operations on a large dataset, with measurements showing that extra copies are not made:

> x <- matrix(rnorm(1e6),ncol=1000,dimnames=list(paste("r",1:1000,sep=""),paste("c",1:1000,sep="")))
>
> # ordinary apply() version
> mem.tally.reset()
> print(sys.time({r1 <- apply(x, 1, sum); print(mem.tally.report())}))
 new database evaluation
            0    8097114
[1] 0.250 1.063
> mem.tally.reset()
> print(sys.time({r1 <- apply(x, 1, sum); print(mem.tally.report())}))
 new database evaluation
            0    8097114
[1] 0.265 1.062
> all(r1==r2)
[1] T
>
> # version that uses sapply() to loop over indices and access global
> # variable x
> mem.tally.reset()
> print(sys.time({r2 <- sapply(seq(nrow(x)), function(i) sum(x[i,])); print(mem.tally.report())}))
 new database evaluation
            0    8879045
[1] 0.625 1.454
> mem.tally.reset()
> print(sys.time({r2 <- sapply(seq(nrow(x)), function(i) sum(x[i,])); print(mem.tally.report())}))
 new database evaluation
            0    8879045
[1] 0.797 1.640
>
> # version that uses sapply() to loop over indices and also
> # passes in matrix x
> mem.tally.reset()
> print(sys.time({r3 <- sapply(seq(nrow(x)), function(i,y) sum(y[i,]),x); print(mem.tally.report())}))
 new database evaluation
            0    8935493
[1] 0.891 1.672
> mem.tally.reset()
> print(sys.time({r3 <- sapply(seq(nrow(x)), function(i,y) sum(y[i,]),x); print(mem.tally.report())}))
 new database evaluation
            0    8939525
[1] 0.891 1.703
> all(r1==r3)
[1] T
>

So both sapply() versions use about 1.1 times the memory used by the apply() version, and take roughly 2.5 to 3.5 times the CPU time (the first number printed by sys.time()). The latter makes some sense because apply() in S-PLUS uses an optimized C function for iterations when its FUN returns vectors of identical sizes.
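For anyone who wants to repeat the comparison on a smaller matrix, here is a minimal sketch. It assumes R-compatible syntax, where system.time() stands in for sys.time(). The mem.tally functions are S-PLUS-specific, so memory use is not measured here, and the timings will vary with machine and version; only the agreement of the three results is checked.

```r
# Smaller matrix than the 1000 x 1000 one above, so this runs quickly.
set.seed(1)
x <- matrix(rnorm(1e4), ncol = 100)

# 1. ordinary apply() over rows
t1 <- system.time(r1 <- apply(x, 1, sum))

# 2. sapply() over row indices, reading x as a global variable
t2 <- system.time(r2 <- sapply(seq(nrow(x)), function(i) sum(x[i, ])))

# 3. sapply() over row indices, with the matrix passed as an extra argument
t3 <- system.time(r3 <- sapply(seq(nrow(x)), function(i, y) sum(y[i, ]), x))

# All three approaches should give the same row sums.
same12 <- isTRUE(all.equal(as.vector(r1), r2))
same13 <- isTRUE(all.equal(as.vector(r1), r3))
print(c(same12, same13))
```

The timings end up in t1, t2 and t3; on a matrix this small they are too short to compare meaningfully, so increase the dimensions before drawing conclusions.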

This was done with S-PLUS 7.0.6 for Windows, and I've observed similar behavior in all versions from S-PLUS 6 onwards. One thing to note is that loops done using sapply() (or lapply()) can be much more memory efficient than loops done with 'for' -- sometimes unneeded memory used in iterations of 'for' loops is not effectively reclaimed at the end of each iteration, leading to out-of-memory errors.
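To tie this back to the original question, here is a small sketch of the index-based approach written both ways. The matrix mymat and the function colFun are made up for illustration, and the syntax is assumed to be R-compatible. Because the function receives the column index rather than the column itself, it knows which column it is working on.

```r
mymat <- matrix(1:12, ncol = 3, dimnames = list(NULL, c("a", "b", "c")))

# Hypothetical per-column function: mean for column 1, median otherwise.
colFun <- function(j, m) {
    if (j == 1) mean(m[, j]) else median(m[, j])
}

# sapply() form: loop over column indices, pass the matrix as an argument
res1 <- sapply(seq(ncol(mymat)), colFun, m = mymat)

# equivalent 'for' loop with a preallocated result vector
res2 <- numeric(ncol(mymat))
for (j in seq(ncol(mymat))) res2[j] <- colFun(j, mymat)

# label the results with the column names
names(res1) <- dimnames(mymat)[[2]]
names(res2) <- dimnames(mymat)[[2]]
print(res1)
```

Both forms return the same values; the sapply() form is the one that, per the note above, tends to behave better on memory in S-PLUS.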

-- Tony Plate


-----Original Message-----
From: s-news-owner@lists.biostat.wustl.edu
[mailto:s-news-owner@lists.biostat.wustl.edu] On Behalf Of Schwarz,Paul
Sent: 11 February 2006 02:19
To: s-news@lists.biostat.wustl.edu
Subject: [S] identifying which column/row is passed in calls to apply()

S-News readers,

How would I identify which column (or row) is being operated on in a
call to the function specified in apply()? That is, if I wanted to
perform some column-specific (or row-specific) operation, how do I
identify which column/row is being passed to the function that's
specified in the call to apply(), where the number of columns ranges
from 1 to ncol(mymat)?

apply( mymat, 2, function(x){ ? } )


I suspect that this is easy, but I'm drawing a blank on how to do this.

Thanks a lot everyone.

-Paul
--------------------------------------------------------------------
This message was distributed by s-news@lists.biostat.wustl.edu.  To
unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
the BODY of the message:  unsubscribe s-news
