In lieu of feeling qualified to edit this discussion, I submit it all,
unedited, in one place.
Thank you all for your comments.
Martin H. H. Stevens writes:
>
> When I use Type I sums of squares in a linear model, the sum of squares
> of each factor (from summary.aov()) plus the residual add up to the
> total sum of squares ([n-1]*s^2 of the response variable). When I use
> Type III SS, I get something less than the total sums of squares when I
> add up the partial Sums of Squares.... Why?
>
The type III SS for an individual effect is the increment in the model SS
when the term in question is fitted last in the model. For example,
suppose we have a multiple linear regression with three covariates, so that
yhat = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3.
Then, the Type III SS for X_1 is the increase in model SS in going from a
model with covariates X_2 and X_3 to one with covariates X_2, X_3, X_1, so
SS(X_1 | X_2, X_3) = SS(X_2, X_3, X_1) - SS(X_2, X_3).
The type III SS's for the other terms are computed by applying the same
idea. So, for example,
SS(X_2 | X_1, X_3) = SS(X_1, X_3, X_2) - SS(X_1, X_3) and
SS(X_3 | X_1, X_2) = SS(X_1, X_2, X_3) - SS(X_1, X_2).
If you add these up, there is no guarantee that they will add up to the
`total' model SS, because they are conditional increments in SS due to
different conditionings. If the X's are orthogonal, the type III SS
coincide with the type I SS and add up to the model SS, but when the X's
are intercorrelated, they generally won't.
This idea extends to ANOVA models: if the data are unbalanced, the type III
SS won't necessarily add up to the total model SS, either.
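The point about orthogonal versus intercorrelated X's can be tried out numerically. What follows is only a minimal sketch in plain Python, not anything from the original posts: the helper names (`dot`, `rss`, `model_ss`, `type1_ss`, `type3_ss`) and the small data sets are made up for illustration, and the least-squares fit is done by Gram-Schmidt orthogonalisation so that no external library is needed.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def rss(y, columns):
    """Residual SS of y after least squares on an intercept plus `columns`."""
    n = len(y)
    basis = []                 # mutually orthogonal columns spanning the model
    resid = [float(v) for v in y]
    for col in [[1.0] * n] + [list(c) for c in columns]:
        q = [float(v) for v in col]
        for b in basis:        # remove what earlier columns already explain
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        if dot(q, q) > 1e-10:  # skip columns already in the span
            basis.append(q)
            coef = dot(resid, q) / dot(q, q)
            resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return dot(resid, resid)


def model_ss(y, columns):
    """Regression SS: drop in residual SS relative to the intercept-only fit."""
    return rss(y, []) - rss(y, columns)


def type1_ss(y, columns):
    """Sequential SS: each term adjusted for the terms listed before it."""
    return [model_ss(y, columns[:k + 1]) - model_ss(y, columns[:k])
            for k in range(len(columns))]


def type3_ss(y, columns):
    """Partial SS: each term adjusted for all the other terms."""
    full = model_ss(y, columns)
    return [full - model_ss(y, columns[:k] + columns[k + 1:])
            for k in range(len(columns))]


y = [3.1, 4.0, 5.2, 7.9, 6.1, 8.3, 9.0, 12.4]   # made-up response

# Orthogonal covariates: a 2^3 factorial layout.
orth = [[-1, -1, -1, -1, 1, 1, 1, 1],
        [-1, -1, 1, 1, -1, -1, 1, 1],
        [-1, 1, -1, 1, -1, 1, -1, 1]]

# Intercorrelated covariates: all three rise together.
corr = [[1, 2, 3, 4, 5, 6, 7, 8],
        [1, 1, 2, 2, 3, 3, 4, 5],
        [2, 1, 4, 3, 6, 5, 8, 7]]

for name, X in (("orthogonal", orth), ("correlated", corr)):
    print(name,
          "model SS = %.3f" % model_ss(y, X),
          "sum of type I = %.3f" % sum(type1_ss(y, X)),
          "sum of type III = %.3f" % sum(type3_ss(y, X)))
```

With the orthogonal layout both sums should agree with the model SS; with the correlated covariates only the sequential (type I) sum does, and the type III sum falls short because each term is credited only with what the other two cannot explain.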
This topic has (rightfully, IMO) generated some controversy on this list
in the past. A search into the list archives might be instructive:
http://www.biostat.wustl.edu/s-news/s-news-archive/search.html
Search for type III and/or type I sums of squares.
Since Type x sums of squares were popularized by SAS, the chapter on
`types of estimable functions' in the SAS/STAT manuals is worth checking:
it explains the differences among the four types of SS available in SAS.
Hope this helps,
Dennis
--
Dennis Murphy, Ph.D.    | It's what you learn after you | Phone: (204) 474-6275
Statistics, U. Manitoba | know it all that counts.      | dmurphy@cc.umanitoba.ca
Winnipeg, MB Canada     | John Wooden                   | On, Wisconsin!
-------- When you come to a fork in the road, take it. Yogi Berra --------
Dear Martin,
For an unbalanced design, the Type III sums of squares do not in general
add to the regression SS for the model; they do, however, test the
hypotheses that each set of effects is zero, which is what one usually
wants.
To get the Type III SS for a particular term in the model (in the absence
of empty cells), delete that term and subtract the regression SS from that
for the full model. (Type II SS, in contrast, respect the hierarchy of
effects in the model.)
Type I SS do add to the overall regression SS but in general do not test
reasonable hypotheses.
I hope that this helps,
John
________________________________
John Fox
Department of Sociology
McMaster University
email: jfox@McMaster.ca
web: davinci.socsci.mcmaster.ca
________________________________
Subject: Re: [S] Statistics question
Date: Fri, 26 May 2000 23:07:27 +0100 (BST)
From: Prof Brian D Ripley <ripley@stats.ox.ac.uk>
To: "Martin H. H. Stevens" <hstevens@rci.rutgers.edu>
CC: Mailing List S+ <s-news@wubios.wustl.edu>
On Fri, 26 May 2000, Martin H. H. Stevens wrote:
>
> I am afraid this may turn out to be an elementary statistics lesson, but
> here goes, and thanks (and/or apologies) in advance...
>
> When I use Type I sums of squares in a linear model, the sum of squares
> of each factor (from summary.aov()) plus the residual add up to the
> total sum of squares ([n-1]*s^2 of the response variable). When I use
> Type III SS, I get something less than the total sums of squares when I
> add up the partial Sums of Squares.... Why?
I think the question is why you expect it to add up?
You can do analysis of variance when terms are added sequentially, and
so each term is given the reduction in SSq *after adding all the previous
terms*. Theorem: the SSqs add up. (Note: it is a theorem, but it has
conditions.)
Now the egregious so-called type III is not an analysis of variance. Here
it is equivalent to drop1: that is, it computes the increase in SSq on
dropping each term independently. Since F1 was the last to be added, it
has the same value on dropping. However, your first table has the SSq for
y1 alone, and the second for y1 after adding F1.
I think ssType=3 is best left to those who have been forcibly
dragged away from a 3-letter statistics package which is stuck in the days
of capital letters on punched cards. The rest of us can use drop1,
which at least does sensible things in the presence of interactions
(unless abused).
> Example below.
>
> > y1 <- 1:50
> > y2 <- y1 + 50*runif(50)
> > F1 <- as.factor(c(rep("A",25),rep("B",25)))
> > results <- lm(y2 ~ y1 + F1)
>
> > SST <- (50-1) * var(y2)
> > SST
> [1] 25351.38
>
> > summary.aov(results, ssType=1)
>           Df Sum of Sq  Mean Sq  F Value     Pr(F)
>        y1  1  14705.09 14705.09 67.42195 0.0000000
>        F1  1    395.35   395.35  1.81264 0.1846493
> Residuals 47  10250.95   218.11
>
> > summary.aov(results, ssType=3)
> Type III Sum of Squares
>           Df Sum of Sq  Mean Sq  F Value     Pr(F)
>        y1  1   6055.76 6055.761 27.76531 0.0000034
>        F1  1    395.35  395.347  1.81264 0.1846493
> Residuals 47  10250.95  218.105
> >
>
>
>
> --
> Dr. M. Henry H. Stevens
> email: hstevens@rci.rutgers.edu
--
Brian D. Ripley, ripley@stats.ox.ac.uk
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272860 (secr)
Oxford OX1 3TG, UK Fax: +44 1865 272595
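The figures in the two tables quoted in the message above can be checked with a little arithmetic. The sketch below is plain Python rather than S; the variable names are made up, and the numbers are copied from the quoted output. It adds up each table and compares the Type III shortfall with the change in the y1 line.

```python
# Figures copied from the quoted summary.aov() output.
sst = 25351.38
type1 = {"y1": 14705.09, "F1": 395.35, "Residuals": 10250.95}
type3 = {"y1": 6055.76, "F1": 395.35, "Residuals": 10250.95}

type1_total = sum(type1.values())
type3_total = sum(type3.values())

# How far the Type III table falls short of the total sum of squares.
shortfall = sst - type3_total

# Change in the y1 line between the two tables:
# SSq for y1 alone minus SSq for y1 after adding F1.
y1_change = type1["y1"] - type3["y1"]

print("Type I total:   %.2f" % type1_total)
print("Type III total: %.2f" % type3_total)
print("Shortfall:      %.2f" % shortfall)
print("Change in y1:   %.2f" % y1_change)
```

On these figures the Type I table reproduces SST to within rounding, and the Type III shortfall matches, again to within rounding, the drop in the y1 line. The part of the sum of squares that y1 shares with F1 is credited to neither line of the Type III table, which is why that table does not add up.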
Subject: Re: [S] Statistics question
Date: Sat, 27 May 2000 08:17:29 +1000
From: Bill Venables <Bill.Venables@cmis.csiro.au>
To: "Martin H. H. Stevens" <hstevens@rci.rutgers.edu>, Mailing List S+ <s-news@wubios.wustl.edu>
At 11:57 AM 5/26/00 -0400, Martin H. H. Stevens wrote:
>
>I am afraid this may turn out to be an elementary statistics lesson, but
>here goes, and thanks (and/or apologies) in advance...
It may, but you are probably not alone in needing it. Sigh.
>
>When I use Type I sums of squares in a linear model, the sum of squares
>of each factor (from summary.aov()) plus the residual add up to the
>total sum of squares ([n-1]*s^2 of the response variable). When I use
>Type III SS, I get something less than the total sums of squares when I
>add up the partial Sums of Squares.... Why?
It is because the sums of squares in each case, though labelled similarly,
are in many cases testing different hypotheses.
The Type I table tests a sequence of nested hypotheses, that is, in each
case that the term presented contributes nothing more to the model in
addition to all *previous* terms. This is why it is called a sequential
analysis of variance table (sometimes).
The Type III table tests a sequence of non-nested hypotheses. Each sum of
squares tests the hypothesis that the term contributes nothing more to the
model in addition to all *other* terms, before and after. This makes it
pretty well the same as what you do when you test all terms in a regression
model, even though you would usually use t-tests rather than present them
in an anova table. In regression analysis most people would not consider
testing linear terms when a second degree term was present in the model
(except in rather special circumstances).
The objection to Type III sums of squares is that they encourage naive
users to do silly things such as test main effects in the presence of
interactions, without really asking whether the test makes sense or not,
that is, whether it really addresses a question of any interest.
I have an objection to *all* types of sums of squares, actually. I think
looking at linear models in terms of "which types of sums of squares should
I be using?" is just plain muddled. If you have a testing problem, the
real question is "which null hypotheses should I be testing, within which
outer hypothesis?" Once you sort that out the way to do it is completely
clear. The trouble is people do not want to sort that out, because it
requires thinking clearly about the problem and it is much easier to rely
on someone else doing that and providing you with a variety of Types of
sums of squares to choose from.
More often than not, too, the problem that people should be looking at is
an estimation problem and not a testing problem at all, but I digress...
Well, you did ask!
Bill Venables.
Subject: Re: [S] Statistics question
Date: Sat, 27 May 2000 09:17:34 -0400 (EDT)
From: Dave Krantz <dhk@paradox.psych.columbia.edu>
To: Bill.Venables@cmis.csiro.au
CC: dhk@paradox.psych.columbia.edu, s-news@wubios.wustl.edu
I agree with all of what you say about sums of squares, hypothesis tests,
and linear modelling. However, I find type I sums of squares at least
as objectionable as type III, because (at least in my sorts of problems)
they also create many tests involving model comparisons that one is
not really interested in. Type III SSq (selected judiciously) have
some real usefulness as a component of model selection. I don't think
I've personally made the sorts of errors in using them that you
mention, but I find myself regularly criticizing others for those
errors, so you are basically correct.
There are three underlying problems here: distinguishing model selection
from estimation (when, in many cases, they are closely related);
establishing a broad, common-sense strategy for model selection,
in which sums of squares play only a limited role; and making clear
that EVERY sum of squares is a comparison of 2 models, and should
be attended to ONLY if the comparison of those two models is
interesting.
The first two problems may be difficult to address purely from a
packaged software standpoint, though someone doing AI and statistics
might give it a try. The third problem can be dealt with in software,
by insisting on explicit specification of a pair of models to generate
any sum of squares, and by labelling that SSq by a pair of model
names or specifications. I believe this should be the standard,
and the current hodgepodge of shortcuts should be eliminated from
good software.
Dave Krantz
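The suggestion above, that every sum of squares be requested as an explicit comparison of two models and labelled accordingly, could look something like the following. This is only a rough sketch in plain Python under stated assumptions, not an existing interface in S or any other package: the function `compare_models`, its arguments, the label format (with "1" standing for the intercept-only model) and the data are all hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def rss(y, columns):
    """Residual SS of y after least squares on an intercept plus `columns`."""
    n = len(y)
    basis = []                 # mutually orthogonal columns spanning the model
    resid = [float(v) for v in y]
    for col in [[1.0] * n] + [list(c) for c in columns]:
        q = [float(v) for v in col]
        for b in basis:        # Gram-Schmidt: remove earlier columns' share
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        if dot(q, q) > 1e-10:  # skip columns already in the span
            basis.append(q)
            coef = dot(resid, q) / dot(q, q)
            resid = [ri - coef * qi for ri, qi in zip(resid, q)]
    return dot(resid, resid)


def compare_models(y, data, smaller, larger):
    """Return (label, SS) for an explicit comparison of two nested models.

    `smaller` and `larger` are tuples of column names in `data`; an
    intercept is always included. Hypothetical interface, for illustration.
    """
    if not set(smaller) <= set(larger):
        raise ValueError("`smaller` must be nested within `larger`")
    added = [t for t in larger if t not in smaller]
    ss = rss(y, [data[t] for t in smaller]) - rss(y, [data[t] for t in larger])
    label = "SS(%s | %s)" % (", ".join(added), ", ".join(smaller) or "1")
    return label, ss


# Made-up data with two correlated covariates.
y = [3.1, 4.0, 5.2, 7.9, 6.1, 8.3, 9.0, 12.4]
data = {"x1": [1, 2, 3, 4, 5, 6, 7, 8],
        "x2": [1, 1, 2, 2, 3, 3, 4, 5]}

for smaller, larger in [((), ("x1",)),
                        (("x1",), ("x1", "x2")),
                        (("x2",), ("x1", "x2"))]:
    label, ss = compare_models(y, data, smaller, larger)
    print("%-14s %.3f" % (label, ss))
```

Because the caller has to name both models, there is no "type" to choose: the label on each line says which comparison was made, and a request for a non-nested pair is refused rather than silently answered.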
-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu. To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message: unsubscribe s-news
John Sorkin wrote:
Prof. Venables and Prof. Ripley have correctly pointed out the problem that
can occur when using type III sums of squares: they can lead to improper
inferences. I believe, however, that the degree of disapproval of type III
sums of squares that has been expressed by some researchers may be a bit
exaggerated. Any statistical test can be misapplied. Consider a simple
two-sample Student's t test. If the variances of the two samples are not the
same, any inference drawn from the test is questionable. A regression
analysis performed when the variance is a function of one of the independent
variables (i.e. non-constant variance) can also lead to questionable
inferences. Any one of us could suggest a number of examples in which
statistics computed with easy-to-use, popular statistical programs have led
to potential problems in inference. The fault then is not in the program
that computes the statistic, but rather in the user who does not use the
program correctly. Our job as statisticians (or in my case as an
epidemiologist) is to make sure that analyses are performed correctly.
I believe, however, that there is a more fundamental (or perhaps less
appreciated) problem with sums of squares, especially with unbalanced
designs. With an unbalanced design the usual analytic approach weights the
analysis so that the unbalanced groups (i.e. treatments which are treated as
factors) contribute equally to the analysis. Thus different experimental
units (e.g. subjects or patients) contribute differentially to the analysis
depending on which group they are in. In effect, the analysis becomes an
analysis of the effect of the group in which the response of each of the
experimental units (subjects) is determined by the size of the group they
are in. It does not seem correct to me that a given subject's contribution
to the sum of squares, and thus the inferences derived from the analysis,
should vary by group assignment. Unfortunately I do not know of any easy way
around this problem. I invite suggestions, and in particular hope that Prof.
Venables and Prof. Ripley will add their clear thinking to this problem.
John Sorkin
--
Dr. M. Henry H. Stevens
Postdoctoral Associate
Department of Ecology, Evolution, & Natural Resources
14 College Farm Road
Cook College, Rutgers University
New Brunswick, NJ 08901-8551
email: hstevens@rci.rutgers.edu
phone: 732-932-9631
fax: 732-932-8746