
Re: [S] Statistics question

To: <s-news@wubios.wustl.edu>
Subject: Re: [S] Statistics question
From: "John Sorkin" <john@grecc.ab.umd.edu>
Date: Sat, 27 May 2000 14:27:37 -0400
Cc: <ripley@stats.ox.ac.uk>, <Bill.Venables@cmis.csiro.au>
Disposition-notification-to: "John Sorkin" <john@grecc.ab.umd.edu>
Importance: Normal
Sender: owner-s-news@wubios.wustl.edu

Prof. Venables and Prof. Ripley have correctly pointed out the problem that
can occur when using Type III sums of squares: they can lead to improper
inferences. I believe, however, that the degree of disapproval of Type III
sums of squares expressed by some researchers may be a bit
exaggerated. Any statistical test can be misapplied. Consider a simple
two-sample Student's t test. If the variances of the two samples are not the
same, any inference drawn from the test is questionable. A regression
analysis performed when the variance is a function of one of the independent
variables (i.e. non-constant variance) can also lead to questionable
inferences. Any one of us could suggest a number of examples in
which statistics computed with easy-to-use, popular statistical
programs have led to problems in inference. The fault
then is not in the program that computes the statistic,
but rather in the user who does not use the program
correctly. Our job as statisticians (or in my case as an epidemiologist) is to
make sure that analyses are performed correctly.

I believe, however, that there is a more fundamental (or perhaps less
appreciated) problem with sums of squares, especially with unbalanced
designs. With an unbalanced design the usual analytic approach weights the
analysis so that the unequal-sized groups (i.e. treatments, which are treated
as factors) contribute equally to the analysis. Thus different experimental
units (e.g. subjects or patients) contribute differentially to the analysis
depending on which group they are in. In effect, the analysis becomes an
analysis of the effect of the group in which the weight given to the response
of each experimental unit (subject) is determined by the size of the group it
is in. It does not seem correct to me that a given subject's contribution
to the sum of squares, and thus to the inferences derived from the analysis,
should vary by group assignment. Unfortunately, I do not know of any easy way
around this problem. I invite suggestions, and in particular hope that Prof.
Venables and Prof. Ripley will add their clear thinking to this problem.
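
One way to read the weighting concern above is sketched here in Python. The
data are made up for illustration (they are not from this thread), and the
"equal weight per group" summary is only an assumed stand-in for the kind of
weighting being described: when group means are averaged with equal weight per
group, each subject's implied weight is 1/(k * n_g), so it depends on the size
of the subject's group.

```python
# Hypothetical unbalanced data: group A has 6 subjects, group B has 2.
groups = {
    "A": [10.0, 12.0, 11.0, 13.0, 14.0, 12.0],
    "B": [20.0, 22.0],
}

k = len(groups)
n_total = sum(len(values) for values in groups.values())

# Mean of each group.
group_means = {g: sum(values) / len(values) for g, values in groups.items()}

# Equal weight per group: average the group means.
unweighted_mean = sum(group_means.values()) / k

# Equal weight per subject: the ordinary grand mean.
grand_mean = sum(sum(values) for values in groups.values()) / n_total

# Weight that each individual subject carries in the average of group means.
subject_weight = {g: 1.0 / (k * len(values)) for g, values in groups.items()}

print("group means:             ", group_means)
print("average of group means:  ", unweighted_mean)
print("grand mean over subjects:", grand_mean)
print("per-subject weight:      ", subject_weight)
```

In this toy example a subject in the small group B carries three times the
weight of a subject in group A, which is the asymmetry being questioned.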


John Sorkin

-----Original Message-----
From: Bill Venables [mailto:Bill.Venables@cmis.csiro.au]
Sent: Friday, May 26, 2000 6:17 PM
To: Sorkin, John
Subject: Re: [S] Statistics question



At 11:57 AM 5/26/00 -0400, Martin H. H. Stevens wrote:
>
>I am afraid this may turn out to be an elementary statistics lesson, but
>here goes, and thanks (and/or apologies) in advance...

It may, but you are probably not alone in needing it.  Sigh.

>
>When I use Type I sums of squares in a linear model, the sum of squares
>of each factor (from summary.aov()) plus the residual add up to the
>total sum of squares ([n-1]*s^2 of the response variable).  When I use
>Type III SS,  I get something less than the total sums of squares when I
>add up the partial Sums of Squares....  Why?

It is because the sums of squares in each case, though labelled similarly,
are in many cases testing different hypotheses.

The Type I table tests a sequence of nested hypotheses, that is, in each
case that the term presented contributes nothing more to the model in
addition to all *previous* terms.  This is why it is called a sequential
analysis of variance table (sometimes).

The Type III table tests a sequence of non-nested hypotheses.  Each sum of
squares tests the hypothesis that the term contributes nothing more to the
model in addition to all *other* terms,  before and after.  This makes it
pretty well the same as what you do when you test all terms in a regression
model, even though you would usually use t-tests rather than present them
in an anova table.  In regression analysis most people would not consider
testing linear terms when a second-degree term was present in the model
(except in rather special circumstances).
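
The two kinds of table described above can be sketched as differences in
residual sums of squares between fitted models. This is an illustrative
sketch in Python rather than S, under stated assumptions: the helper `rss`
is hypothetical (a small normal-equations solver, not anything from S), the
data mimic the y1/F1/y2 example quoted in this thread, the seed is arbitrary,
and F1 is dummy-coded as 0/1.

```python
import random
import statistics

def rss(columns, y):
    """Residual SS from least-squares regression of y on an intercept plus columns."""
    n = len(y)
    X = [[1.0] + [float(col[i]) for col in columns] for i in range(n)]
    p = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))]
         for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for k in range(p):
        piv = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[piv] = A[piv], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

random.seed(1)                                # arbitrary seed, for repeatability
y1 = [float(i) for i in range(1, 51)]
f1 = [0.0] * 25 + [1.0] * 25                  # assumed dummy coding: A = 0, B = 1
y2 = [v + 50 * random.random() for v in y1]

sst = rss([], y2)                             # intercept-only model: total SS
rss_y1 = rss([y1], y2)
rss_f1 = rss([f1], y2)
rss_full = rss([y1, f1], y2)

# Sequential: each term after all *previous* terms (order y1, then F1).
type1 = {"y1": sst - rss_y1, "F1": rss_y1 - rss_full}
# Partial: each term after all *other* terms.
type3 = {"y1": rss_f1 - rss_full, "F1": rss_y1 - rss_full}

print("(n-1) * var(y2)     ", (len(y2) - 1) * statistics.variance(y2))
print("total SS            ", sst)
print("Type I   + residual ", sum(type1.values()) + rss_full)
print("Type III + residual ", sum(type3.values()) + rss_full)
```

The sequential pieces telescope back to the total sum of squares by
construction. The partial pieces are each measured against a different
reduced model, so nothing forces them to add up when y1 and F1 are correlated.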

The objection to Type III sums of squares is that they encourage naive
users to do silly things such as test main effects in the presence of
interactions, without really asking whether the test makes sense or not,
that is, whether it really addresses a question of any interest.

I have an objection to *all* types of sums of squares, actually.  I think
looking at linear models in terms of "which types of sums of squares should
I be using?" is just plain muddled.  If you have a testing problem, the
real question is "which null hypotheses should I be testing, within which
outer hypothesis?"  Once you sort that out the way to do it is completely
clear.  The trouble is people do not want to sort that out, because it
requires thinking clearly about the problem and it is much easier to rely
on someone else doing that and providing you with a variety of Types of
sums of squares to choose from.

More often than not, too, the problem that people should be looking at is
an estimation problem and not a testing problem at all, but I digress...

Well, you did ask!

Bill Venables.



On Fri, 26 May 2000, Martin H. H. Stevens wrote:

>
> I am afraid this may turn out to be an elementary statistics lesson, but
> here goes, and thanks (and/or apologies) in advance...
>
> When I use Type I sums of squares in a linear model, the sum of squares
> of each factor (from summary.aov()) plus the residual add up to the
> total sum of squares ([n-1]*s^2 of the response variable).  When I use
> Type III SS,  I get something less than the total sums of squares when I
> add up the partial Sums of Squares....  Why?

I think the question is why you expect it to add up?

You can do analysis of variance when terms are added sequentially, and
so each term is given the reduction in SSq *after adding all the previous
terms*.  Theorem: the SSqs add up.  (Note: it is a theorem, but it has
conditions.)

Now the egregious so-called type III is not an analysis of variance. Here
it is equivalent to drop1: that is it computes the increase in SSq on
dropping each term independently.  Since F1 was the last to be added, it
has the same value on dropping.  However, your first table has the SSq for
y1 alone, and the second for y1 after adding F1.

I think ssType=3 is best left to those who have been forcibly
dragged away from a 3-letter statistics package which is stuck in the days
of capital letters on punched cards.  The rest of us can use drop1,
which at least does sensible things in the presence of interactions
(unless abused).
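
The point that the last term added has the same sum of squares whether it is
added or dropped can be sketched as follows. This is an illustrative Python
sketch, not S code: `sequential_ss` is a hypothetical helper that obtains
sequential sums of squares by Gram-Schmidt orthogonalisation, the data mimic
the example quoted in this thread, the seed is arbitrary, and F1 is
dummy-coded as 0/1. It does not call drop1; for single-column terms with no
interactions, the drop-one value for a term is taken to be its sequential
value when entered last.

```python
import random

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def sequential_ss(columns, y):
    """Sequential SS for single-column terms entered in order after an intercept."""
    n = len(y)
    basis = [[1.0] * n]            # orthogonal basis, starting with the intercept
    out = []
    for col in columns:
        q = [float(c) for c in col]
        # Remove from this column everything explained by earlier terms.
        for b in basis:
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        # Reduction in residual SS from adding the orthogonalised column.
        out.append(dot(q, y) ** 2 / dot(q, q))
        basis.append(q)
    return out

random.seed(2)                                # arbitrary seed, for repeatability
y1 = [float(i) for i in range(1, 51)]
f1 = [0.0] * 25 + [1.0] * 25                  # assumed dummy coding: A = 0, B = 1
y2 = [v + 50 * random.random() for v in y1]

ss_y1_first, ss_f1_last = sequential_ss([y1, f1], y2)   # order y1, then F1
ss_f1_first, ss_y1_last = sequential_ss([f1, y1], y2)   # order F1, then y1

# Drop-one value for each term: its SS when it is the last term entered.
drop_one = {"y1": ss_y1_last, "F1": ss_f1_last}

print("order y1, F1:", ss_y1_first, ss_f1_last)
print("order F1, y1:", ss_f1_first, ss_y1_last)
print("drop-one:    ", drop_one)
```

Either order accounts for the same total model sum of squares, but splits it
differently between the two terms. The drop-one values take the "last" entry
from each ordering, which is why they need not add up to that total.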

> Example below.
>
> > y1 <- 1:50
> > y2 <- y1 + 50*runif(50)
> > F1 <- as.factor(c(rep("A",25),rep("B",25)))
> > results <- lm(y2 ~ y1 + F1)
>
> > SST <- (50-1) * var(y2)
> > SST
> [1] 25351.38
>
> > summary.aov(results, ssType=1)
>           Df Sum of Sq  Mean Sq  F Value     Pr(F)
>        y1  1  14705.09 14705.09 67.42195 0.0000000
>        F1  1    395.35   395.35  1.81264 0.1846493
> Residuals 47  10250.95   218.11
>
> > summary.aov(results, ssType=3)
> Type III Sum of Squares
>           Df Sum of Sq  Mean Sq  F Value     Pr(F)
>        y1  1   6055.76 6055.761 27.76531 0.0000034
>        F1  1    395.35  395.347  1.81264 0.1846493
> Residuals 47  10250.95  218.105
> >
>
>
>
> --
> Dr. M. Henry H. Stevens
> email: hstevens@rci.rutgers.edu

--
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

