Dear S-plus users,
On 05/20/2002 I asked for help to speed up a for loop, written to obtain
leave-one-out predictions for a series of models. I receive three kind replies,
from Dimitris C. Rizopoulos, Nick Ellis and Tim Hesterberg. Both the answers
and the original question follow below. To summarize and comment, Dimitris
suggested using first lapply to create a list in which the i element is the
original data frame without the i row, and then using again lapply (inside a
for loop) to update the model with the elements of this list and predict for
the i row of the original data frame. This code reduces the time for a list of
three diferent models (from 254 min to 204) but, unfortunately, the generation
of the list of data frames requires too much computer resources and forces me
to clean the original data frame by removing columns not to be used in the
models (f.e., I can not do the lapply for all the different models at a time).
However, the solution would be most advantageous for more reduced data sets
(mine has 1144 rows). Nick Ellis suggested to use an exact updating formula,
but this is only possible for lm models (he also provided a clever hint for
gams: replace them for splines and obtain the model frame by the model=T
argument to lm). Although, my models have binomial errors and logistic link,
therefore I do not think it would be straightforward to convert them to lm.
Finally, Tim Hesterberg offered me to test a new cross-validation function that
will be in the next bootstrap library (sorry, but I would prefer not to migrate
to S+6.0 yet).
Thank you very much for your answers. I am afraid that I will not be able to
fully solve my problem, but you have gave me interesting hints for other
instances. I learn a lot by trying to implement your solutions.
============Replies:
============Dimitris C. Rizopoulos:
> model.1 <- lm(Fuel ~ Weight + Disp., fuel.frame)
> model.2 <- lm(Fuel ~ Weight + Mileage, fuel.frame)
> model.3 <- lm(Fuel ~ Weight + Type, fuel.frame)
> models <- list(model.1, model.2, model.3)
## I suppose that all the models are for the same
data.frame
> tic <- proc.time()
> n <- nrow(fuel.frame)
> data.new <- lapply(1:n, function(x)
{
fuel.frame[ - x, ]
}
)
> predictions <- list(NULL)
> for(i in 1:n) {
predictions[[i]] <- lapply(models, function(x)
{
model <- update(x, data = data.new[[i]])
predict(model, newdata = fuel.frame[i, ])
}
)
}
> tac <- proc.time() - tic
> tac
[1] 20.23
============Nick Ellis
If your models could be expressed as lm() models you could use an exact
updating formula. Even if your models are gams you can replace them by
equivalent spline models and get the X matrix using model=T argument to lm.
However if your models are generalized linear models then there is no simple
updating formula. You can try updating based on the last iteration of the
IWLS algorithm, which gives approximate leave-one-out residuals on the scale
of the linear predictors.
I'd be interested in any other suggestions you receive.
> x <- 1:10
> y<-rnorm(x)
> fit <- lm(y~x)
> X <- cbind(1,x)
> hat <- X %*% solve(crossprod(X),t(X))
> y.leave.one.out <- numeric(length(y))
> for (i in seq(along=y)) {
+ fit1 <- update(fit,subset=-i)
+ y.leave.one.out[i] <- predict(fit1,data.frame(x=x))[i]
+ }
> cbind(y.leave.one.out,exact.formula=y-resid(fit)/(1-diag(hat)))
y.leave.one.out exact.formula
1 0.78971163 0.78971163
2 0.02907449 0.02907449
3 0.15054909 0.15054909
4 0.13358301 0.13358301
5 0.10169435 0.10169435
6 -0.25449423 -0.25449423
7 -0.26993036 -0.26993036
8 -0.84561459 -0.84561459
9 -0.21696104 -0.21696104
10 -1.46143658 -1.46143658
============Tim Hesterberg
The next version of the S-PLUS bootstrap library includes
a cross-validation function, which may be helpful in your
application; it leaves out an observation at a time and returns
the vector of fitted values, and other quantities.
Are you interested in testing this? You would need to upgrade
to S+6.0 or later. If interested, please respond to
bootstrap-beta@insightful.com
============Original question:
I am trying to implement a "leave-one-out" procedure for the predictions of a
model and I managed to do that... but in a very uneffectively way (details
below). To be specific: I would like to (1) parameterize a model with every
observation except one (the model was specified previously), (2) make the
corresponding prediction for this one observation left out, (3) repeat until I
have a vector of predictions for every observation. There is a further
complexity: I have a large amount of previously specified models (about 70), so
I would want to repeat the steps above for each one of these models. I have
windows 2000, 256 Mb RAM and my clumsy code is:
> version
S-PLUS 2000 Professional Edition Release 2 for Microsoft Windows : 1999
First I set a vector with the number of previously specified models:
lista <- c("aegcau.escala150.b2", "siteur.escala150.b2", "trotro.escala150.b2")
Then, a for inside a while loop:
comienzo <- proc.time()
predicciones2 <- 1:nrow(tmpESCALA150) # tmpESCALA150 has the response
(n=1144) and predictors
predicciones <- numeric(nrow(tmpESCALA150)) # set up an empty vector
for predictions
n <- 1
while (n < length(lista)+1) { # to visit every element of
"lista"
for (i in 1:nrow(tmpESCALA150)) { # to visit every element of original
database (tmpESCALA150)
predicciones[i] <-
predict.gam(
gam(eval(parse(text=(paste(as.character(lista[n]), "$formula")))),
data=tmpESCALA150[-i,], family=binomial), # model$formula is previously
specified
tmpESCALA150[i,], type="response") # predict response for the i
observation left out
}
predicciones2 <- cbind(predicciones2, predicciones)
# clumsy, but I finally need a data set with the predictions
("predicciones2")
n <- n+1
}
predicciones2
tiempo <- proc.time()-comienzo
tiempo/60
The loop takes 254 min to complete with only three specified models (in
"lista"). A loop for a single model takes 45 min. I am afraid that looping
through 70 models would be unaffordably slow.
Any suggestions to speed up the loop would be appreciated,
Javier Seoane
Department of Applied Biology
Estación Biológica de Doñana, CSIC
Avda. María Luisa s/n
41013, Sevilla
SPAIN
|