The problem posed is to find 500,000 regression coefficients,
one for each subject in a large data frame.
Below is a solution that takes 16 seconds in a test problem with
500,000 subjects. In contrast, the solution mentioned in the
summary of earlier responses to the question took 19 minutes.
I'll throw in a quick plug for the "Advanced Programming in S-PLUS"
class I'll teach March 18-19 in San Francisco. Part of what we'll do
is look at ways like this to make S+ run faster. If interested see:
http://www.insightful.com/services/schedule.asp
You're trying to do (for 500K patients, roughly 9 rows per patient):
Nsub <- 500000
slope <- rep(0, Nsub)
for(i in 1:Nsub) {
od.fit <- lm(data.y~data.x, data=data.ram, subset= sub==i, singular.ok=T)
slope[i] <- coef(od.fit)[2]
}
rbind(slope[1:50], betas[1:50])
# Editorial comment -- I've added spaces for readability. Please
# do this in postings to s-news. It is a good habit to follow in general.
# The following solution is based on the formula for the slope coefficient:
# beta = sum( Y*X ) / sum( X*X )
# where Y = y - mean(y), X = x - mean(x)
# we can simplify further to:
# beta = sum( y*X ) / sum( X*X )
# Generate some artificial data to test the solution:
set.seed(0) # for reproducibility
rowsPerPatient <- sample(9, size = 500000, replace=T)
# random number of subjects, including 1, to test robustness of solution
data.x <- rnorm(sum(rowsPerPatient))
data.y <- rnorm(sum(rowsPerPatient))
data.ram <- data.frame(data.x, data.y, sub = rep(1:500000, rowsPerPatient))
#Solution
# For fair comparison with other methods, start with the data frame
# rather than the variables data.x, etc.
# Assume that data.ram is sorted by subject
groupsums <- function(x, counts){
# sum x by group, where counts contains the size of each group.
diff( c(0, cumsum(x)[cumsum(counts)]))
}
groupmeans <- function(x, counts){
groupsums(x, counts) / counts
}
computeBetas <- function(data){
# assumes names "data.x", "data.y", and "sub"
counts <- rle(data$sub)$lengths # equals rowsPerPatient
x <- data$data.x
x <- x - rep( groupmeans(x, counts), counts ) # subtract group means
groupsums(x * data$data.y, counts) / groupsums(x^2, counts)
}
sys.time(computeBetas(data.ram)) # took 15.18 seconds
# My machine is 851 MHz; the 19-minute solution was on a 2 GHz machine.
# Caveat -- my data frame may be smaller, average of 5 rows per patient.
# Check that answers are the same
betas <- computeBetas(data.ram)
all.equal(slope[1:50], betas[1:50])
========================================================
| Tim Hesterberg Research Scientist |
| timh@insightful.com Insightful Corp. |
| (206)802-2319 1700 Westlake Ave. N, Suite 500 |
| (206)802-2500 (fax) Seattle, WA 98109-3044, U.S.A. |
| www.insightful.com/Hesterberg |
========================================================
Download the S+Resample library from www.insightful.com/downloads/libraries
I'll teach two classes in San Francisco:
March 16-17 Bootstrap Methods and Permutation Tests
March 18-19 Advanced Programming in S-PLUS
For details see
http://www.insightful.com/services/schedule.asp
|