s-news
[Top] [All Lists]

Re: S-PLUS Vs some other softwares

To: "Pravin" <jadhavpr@vcu.edu>
Subject: Re: S-PLUS Vs some other softwares
From: Tim Hesterberg <timh@insightful.com>
Date: 5 Mar 2004 15:33:10 -0800
Cc: s-news@wubios.wustl.edu
In-reply-to: <04C15615C183B043A3D804C5B989844B076D19D3@cdsx08.cder.fda.gov> (jadhavpr@vcu.edu)
References: <04C15615C183B043A3D804C5B989844B076D19D3@cdsx08.cder.fda.gov>
The problem posed is to find 500,000 regression coefficients,
one for each subject in a large data frame.

Below is a solution that takes 16 seconds in a test problem with
500,000 subjects.  In contrast, the solution mentioned in the 
summary of earlier responses to the question took 19 minutes.

I'll throw in a quick plug for the "Advanced Programming in S-PLUS"
class I'll teach March 18-19 in San Francisco.  Part of what we'll do
is look at ways like this to make S+ run faster.  If interested see:
        http://www.insightful.com/services/schedule.asp



You're trying to do (for 500K patients, roughly 9 rows per patient):

Nsub <- 500000
slope <- rep(0, Nsub)
for(i in 1:Nsub) {
  od.fit <- lm(data.y~data.x, data=data.ram, subset= sub==i, singular.ok=T)
  slope[i] <- coef(od.fit)[2]
}
rbind(slope[1:50], betas[1:50])
# Editorial comment -- I've added spaces for readability.  Please
# do this in postings to s-news.  It is a good habit to follow in general.


# The following solution is based on the formula for the slope coefficient:
#       beta = sum( Y*X ) / sum( X*X )
# where Y = y - mean(y), X = x - mean(x)
# we can simplify further to:
#       beta = sum( y*X ) / sum( X*X )


# Generate some artificial data to test the solution:
set.seed(0)  # for reproducibility
rowsPerPatient <- sample(9, size = 500000, replace=T)
# random number of subjects, including 1, to test robustness of solution
data.x <- rnorm(sum(rowsPerPatient))
data.y <- rnorm(sum(rowsPerPatient))
data.ram <- data.frame(data.x, data.y, sub = rep(1:500000, rowsPerPatient))


#Solution
# For fair comparison with other methods, start with the data frame
# rather than the variables data.x, etc.

# Assume that data.ram is sorted by subject
groupsums <- function(x, counts){
  # sum x by group, where counts contains the size of each group.
  diff( c(0, cumsum(x)[cumsum(counts)]))
}
groupmeans <- function(x, counts){
  groupsums(x, counts) / counts
}
computeBetas <- function(data){
  # assumes names "data.x", "data.y", and "sub"
  counts <- rle(data$sub)$lengths  # equals rowsPerPatient
  x <- data$data.x
  x <- x - rep( groupmeans(x, counts), counts )  # subtract group means
  groupsums(x * data$data.y, counts) / groupsums(x^2, counts)
}
sys.time(computeBetas(data.ram))   # took 15.18 seconds

# My machine is 851 MHz; the 19-minute solution was on a 2 GHz machine.
# Caveat -- my data frame may be smaller, average of 5 rows per patient.


# Check that answers are the same
betas <- computeBetas(data.ram)
all.equal(slope[1:50], betas[1:50])

========================================================
| Tim Hesterberg       Research Scientist              |
| timh@insightful.com  Insightful Corp.                |
| (206)802-2319        1700 Westlake Ave. N, Suite 500 |
| (206)802-2500 (fax)  Seattle, WA 98109-3044, U.S.A.  |
|                      www.insightful.com/Hesterberg   |
========================================================
Download the S+Resample library from www.insightful.com/downloads/libraries

I'll teach two classes in San Francisco:
        March 16-17 Bootstrap Methods and Permutation Tests
        March 18-19 Advanced Programming in S-PLUS
For details see
        http://www.insightful.com/services/schedule.asp


<Prev in Thread] Current Thread [Next in Thread>