Dear all,
Here is a summary of the responses that I received to the following problem:
****************** original problem ***********************************
I am currently trying to fit a couple of robust models in Splus
Version 5.0 Release 2 for Sun SPARC, SunOS 5.5, using the function
lmRobMM(), with the final aim of comparing these models with the
function anova(,test="RF").
One of the smaller models that I am trying to run has about 450 cases
and uses one factor as the explanatory variable (with almost 40
levels; the larger models have one to six additional continuous
variables). The calculation of this model has now been running for almost
20 hours of CPU time [it is up to over 100 hours by now].
Does anybody else have experience with the efficiency of this
function? Is this normal behaviour, so that I just need to be
patient, or is there a way to speed things up?
***********************************************************************
Most answers recommended moving to Splus 5.1 to see whether the
general increase in efficiency also helps with my problem (Bert Gunter
<bert_gunter@merck.com>, Brian Ripley <ripley@stats.ox.ac.uk>, Sylvia
Isler <sisler@statsci.com>). I am currently trying to locate our
shipment of version 5.1 and may soon report on my experience regarding
lmRobMM.
Doug Martin <doug@statsci.com> provided some additional thoughts and
hints:
1. With 40 levels, you have in effect p = 40 (dummy) variables. The
default number of samples drawn by the resampling algorithm is set at
4.6*2^p, which is 5.058e+12 for p = 40. This default rule provides a
high breakdown point (BP = .5) with probability .999. You can choose
to use fewer samples, but then you lose this high probability of
obtaining a high breakdown point. The details
may be found in Section 3 of Yohai, Stahel and Zamar (1991) - see
the Bibliography of the On-Line User Manual Supplement for 4.5 (or
equivalent for UNIX) for the source of this reference. Perhaps we
can provide the details via email on Monday, and check a bit to
see how many samples are required for lower probabilities such as
.9, etc.
2. Another possibility is to try the genetic algorithm instead of the
resampling algorithm, experimenting with the algorithm parameters. I do
not believe the genetic algorithm comes with any guarantee of a high
breakdown point with high probability. But some people believe it works
well (a study we did several years ago with a small number of variables
showed that it was very similar to the resampling method). In any event,
though highly desirable, a high breakdown point is not the be-all and
end-all.
3. On UNIX, S-PLUS 5.1 is faster than 5.0.
4. More importantly: For models with (some) factor variables there is a
much better algorithm than the current resampling algorithm, due to
Maronna and Yohai (submitted, but not yet published). It turns out that
we already started implementing the algorithm, and hope to have a beta
version soon. We regard this as a very important improvement to lmRobMM,
and although it does not solve your problem today, I hope you might want
to be a beta tester as soon as the new version is available?
5. Finally, for regression with many variables, e.g., 50 or more, there
is another "fast" algorithm described by Pena and Yohai in JASA, that
we will also implement soon.
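
To make the sample counts in point 1 concrete, here is a minimal sketch in
Python (not S-PLUS code). The function names are hypothetical. The second
function assumes the usual subsampling argument: with a contamination
fraction eps, one random subsample of p points is free of outliers with
probability (1 - eps)^p, and N subsamples are drawn so that at least one
clean subsample occurs with probability P. Whether lmRobMM derives its
default in exactly this way is an assumption and is not confirmed above.

```python
import math


def default_resamples(p):
    """Default rule quoted in point 1: 4.6 * 2^p subsamples."""
    return 4.6 * 2 ** p


def resamples_needed(p, prob, eps=0.5):
    """Subsamples needed so that, with probability `prob`, at least one
    subsample of size p contains no outliers (ASSUMED standard argument,
    not taken from the S-PLUS documentation)."""
    clean = (1.0 - eps) ** p          # chance that one subsample is clean
    # Solve 1 - (1 - clean)^N >= prob for N; log1p keeps precision
    # when `clean` is tiny, as it is for p = 40.
    return math.log(1.0 - prob) / math.log1p(-clean)


if __name__ == "__main__":
    p = 40
    print("default rule: %.3e" % default_resamples(p))
    for prob in (0.9, 0.99, 0.999):
        print("P = %.3f: %.3e subsamples" % (prob, resamples_needed(p, prob)))
```

Under this assumed formula with eps = .5, the multiplier of 2^p is
-log(1 - P), i.e. about 2.3 for P = .9, 4.6 for P = .99 and 6.9 for
P = .999, so the exact probability attached to the default rule is worth
checking against the manual. Either way, lowering P only changes the
constant; the 2^p factor dominates for p = 40.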
Thank you all for your generous help and advice! I will let you know
when and how I manage to solve the problem.
-- ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Lorenz Gygax LGygax@amath.unizh.ch; room: 36-L-40
Department of Applied Mathematics
University of Zuerich-Irchel
Winterthurerstr. 190; CH-8057 Zurich
voice: 41-1-635-58-52 fax: 41-1-635-57-05
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~