
[S] Enhancements to S-Plus

To: s-news <s-news@wubios.wustl.edu>
Subject: [S] Enhancements to S-Plus
From: Frank E Harrell Jr <fharrell@virginia.edu>
Date: Tue, 31 Mar 1998 11:00:42 -0500
Cc: roosen@statsci.com
Mmdf-warning: Parse error in original version of preceding line at mail.virginia.edu
Reply-to: Frank E Harrell Jr <fharrell@virginia.edu>
Sender: owner-s-news@wubios.wustl.edu
Here is my vote for what not to expend great effort adding to S-Plus:
exact methods.  We have so many bigger things to worry about, such
as non-normal errors, non-linear covariable effects, and unaccounted-for
heterogeneity, that I've never been very concerned about getting
an "exact" P-value for an over-simplified model.  A classic saying of
Tukey about exact solutions to the wrong problem comes to mind.
Even in the case of a 2x2 table, the presence of strong risk factors can
cause heterogeneity of risks great enough to make unadjusted analyses
incorrect.  I would rather use the bootstrap or a full Bayesian approach
to get confidence intervals or probabilities of positive effects.  And I'm
still not a fan of conditioning when marginal cell counts were not
pre-specified by the experimental design (and mine never are).  Lastly,
exact methods don't always extend well.  On the other hand, it is very
easy to extend the bootstrap to account for intra-cluster correlation,
for example.
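
The cluster extension of the bootstrap mentioned above can be sketched roughly as follows. This is a minimal illustration in Python rather than S-Plus, not anything from the original message: the function name `cluster_bootstrap_ci`, the percentile interval, and the example data are all assumptions made for the sketch. The key point is that whole clusters are resampled, so within-cluster correlation is carried into every resample.

```python
import random


def cluster_bootstrap_ci(clusters, statistic, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval that resamples whole clusters.

    clusters:  list of lists, one inner list of observations per cluster.
    statistic: function mapping a flat list of observations to a number.
    (Hypothetical helper written for this illustration.)
    """
    rng = random.Random(seed)
    k = len(clusters)
    estimates = []
    for _b in range(n_boot):
        resample = []
        for _j in range(k):
            # Draw one cluster with replacement and keep it intact.
            resample.extend(clusters[rng.randrange(k)])
        estimates.append(statistic(resample))
    estimates.sort()
    lower = estimates[int(n_boot * alpha / 2)]
    upper = estimates[int(n_boot * (1 - alpha / 2)) - 1]
    return lower, upper


def proportion(values):
    return sum(values) / len(values)


# Made-up binary outcomes grouped by cluster (e.g., patients within clinics).
clusters = [[1, 1, 0, 1], [0, 0, 0], [1, 0, 1, 1, 1], [0, 0, 1], [1, 1], [0, 1, 0, 0]]
ci = cluster_bootstrap_ci(clusters, proportion)
print("95% cluster-bootstrap interval for the proportion:", ci)
```

Resampling individual observations instead of clusters would treat correlated observations as independent and would typically give an interval that is too narrow.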

My second vote on what not to implement is type III sums of squares
and F-tests, which are more problematic than most statisticians assume.

Here are my votes on what would be worth doing, not in any particular
order:

1. Handle NAs in a smart way for all modeling functions.  For example,
    the survival modeling functions written by Terry Therneau keep track of
    which observations were deleted by NAs so that for example
    plot(age, resid(fit)) will work, by making sure that resid(fit) properly
    aligns with age.  [On our web page there is a document "Supplemental
    Notes" to my biostatistical modeling course that gives several hints
    for dealing with NAs while using the lm function.]  Modeling functions
    in my Design library use Therneau's technique.  This needs to be built in
    to other S-Plus functions.

2. Sample size and power calculations for the normal-errors model, accounting
    for uncertainty in the estimate of sigma.  For example, the user could
    provide the data (or sufficient statistics) used to estimate sigma
    and the program could compute an entire power 'distribution' taking
    the uncertainty into account.  Sample size calculations to achieve
    certain precision (e.g., width of confidence intervals) would also be
    welcome.  A deluxe help system (see item 6 below) would allow
    users to quickly find example simulation programs for handling non-normal
    models.

3. Continue to expand capabilities for random effects models, with various
    post-fit estimation, multi-level hierarchies, and other analytic
    capabilities.  Some of this can be done by having an elegant interface
    with the WINBUGS Bayesian modeling package from Cambridge.

4. Bootstrap and multiple imputation methods that account for the
    imputation of missing values when making inferences.  Some new na.action
    functions would also be welcome.  These functions could develop
    imputation rules (using tree, nonparametric regression, nearest
    neighbor, etc.) that could be saved and re-executed on demand.
    Imputations can be tedious and it's a shame to have to re-develop
    imputation models for each analysis.  The imputation function could save
    enough information to be able to repeat the development of the imputation
    rule as quickly as possible, so that you could put this step inside a
    bootstrap loop in order to properly account for this component of
    variation.  Interested users may want to look at the impute and transcan
    functions in my Hmisc library for some other ideas.

5. Anything that helps with non-randomly missing serial data.

6. A world-class online help facility that allows users to navigate in many
    ways, e.g., getting to a comprehensive set of examples of managing
    and recoding data.  For Windows users, where installing an add-on
    library is as easy as unzipping a .zip file, it would be nice to have a
    help button that updates the local PC from a master table of contents of 
    libraries available from statlib; another button would automatically 
    download and install a library.  See how Microsoft (yes they do a few
    things right) allows users to easily update Office products.
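
To make the alignment idea in item 1 concrete, here is a minimal sketch in Python (not S-Plus, and not Therneau's actual implementation). The function names `fit_line` and `aligned_residuals` and the data are hypothetical. The fit records which rows were complete, and the residuals are padded back to the original length with `None` wherever a row was deleted, so they line up with the original variables.

```python
def fit_line(x, y):
    """Least-squares line on complete cases, remembering which rows were used."""
    kept = [i for i in range(len(x)) if x[i] is not None and y[i] is not None]
    xs = [x[i] for i in kept]
    ys = [y[i] for i in kept]
    n = len(kept)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((a - mean_x) ** 2 for a in xs)
    slope = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys)) / sxx
    intercept = mean_y - slope * mean_x
    # Store the bookkeeping needed to re-align results with the raw data.
    return {"slope": slope, "intercept": intercept, "kept": kept, "n_total": len(x)}


def aligned_residuals(fit, x, y):
    """Residuals with the same length as the raw data; None where a row was dropped."""
    res = [None] * fit["n_total"]
    for i in fit["kept"]:
        res[i] = y[i] - (fit["intercept"] + fit["slope"] * x[i])
    return res


# Made-up data with a missing predictor (row 2) and a missing response (row 3).
age = [30, 40, None, 60, 70]
outcome = [1.0, 2.1, 2.9, None, 5.2]
fit = fit_line(age, outcome)
res = aligned_residuals(fit, age, outcome)
print(list(zip(age, res)))
```

Because `res` has one entry per original row, pairing it with `age` for a plot works without any manual subsetting.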
   
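The power 'distribution' in item 2 could be approximated along the following lines. This is only a sketch under stated assumptions, written in Python: it assumes a two-sample comparison of means with equal group sizes, uses a normal approximation to the test (ignoring the far tail), and represents uncertainty in sigma by drawing sigma^2 as df * s^2 / chi-square(df), where s is a pilot estimate of sigma on df degrees of freedom. The function name `power_draws` and all numeric inputs are made up for illustration.

```python
import math
import random
from statistics import NormalDist


def power_draws(s, df, delta, n_per_group, alpha=0.05, n_draws=5000, seed=1):
    """Sorted draws from an approximate distribution of power.

    s, df:       pilot estimate of sigma and its degrees of freedom.
    delta:       true difference in means to detect.
    n_per_group: planned sample size in each of two groups.
    (Hypothetical helper; normal approximation, two-sided test.)
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    draws = []
    for _ in range(n_draws):
        # A chi-square(df) variate is a gamma variate with shape df/2, scale 2.
        chi2 = rng.gammavariate(df / 2.0, 2.0)
        sigma = s * math.sqrt(df / chi2)
        se = sigma * math.sqrt(2.0 / n_per_group)
        draws.append(std_normal.cdf(delta / se - z_crit))
    draws.sort()
    return draws


draws = power_draws(s=10.0, df=15, delta=5.0, n_per_group=64)
median_power = draws[len(draws) // 2]
lower_decile = draws[len(draws) // 10]
print("median power:", round(median_power, 3))
print("10th percentile of power:", round(lower_decile, 3))
```

Reporting a low percentile alongside the median shows how much a plug-in power calculation can overstate what the design will deliver when sigma is poorly estimated.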
When deciding on future directions for software all of the debates about
statistics come alive.  I know that many will criticize my point of view.
I just wanted to give my $.02 worth from the standpoint of an applied
biostatistician.

---------------------------------------------------------------------------
Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
http://www.med.virginia.edu/medicine/clinical/hes/biostat.htm


-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu.  To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message:  unsubscribe s-news
