[S] SUM: Linear Regression on Complex Model with Dummy Variables

To: s-news@wubios.wustl.edu
Subject: [S] SUM: Linear Regression on Complex Model with Dummy Variables
From: CUDE.Curtis@deq.state.or.us
Date: Fri, 25 Feb 2000 16:03:58 -0800
Sender: owner-s-news@wubios.wustl.edu
I originally wrote:  
                
                Hello, all.  I'm trying to fit a linear regression to a
complex model incorporating dummy variables.  The nature of my problem is
addressed neither in the manuals shipped with S-Plus 2000 Pro, nor in the
S-news archives.  Also, I am relatively new to S-Plus, having spent only the
last two months working through the manuals during spare time.

                The dataset:  >600 paired E. coli and fecal coliform counts
at surface water quality monitoring stations throughout Oregon.  All
stations represent ambient surface water quality.

                Null hypothesis: Fecal coliform can be reliably predicted
from E. coli using a single regression for the entire state.
                Alternative hypothesis: Fecal coliform cannot be reliably
predicted from E. coli using a single regression for the entire state, but
instead must be predicted based on smaller geographic areas, e.g., basins.
                This work is based on Francy, D. S., D. N. Myers, and K. D.
Metzker (1993), Escherichia coli and Fecal-Coliform Bacteria as Indicators of
Recreational Water Quality, Water-Resources Investigations Report 93-4083,
US Geological Survey, Columbus, Ohio.

                Using log10-transformed E. coli and fecal coliform data,
I've developed lmRobMM relationships for the "grouped" statewide data, which
I'll call the "simple" model.  I've also developed lmRobMM relationships for
data falling into each of seven categories based on their geographic
location within basins or catchments.  My goal is to create a "complex"
model using dummy variables to account for the seven categories and compare
it to the "simple" model.

                I haven't been able to determine how to create such a model,
although I did write a script that will produce the fitted values for the
"complex" model.

                function(data, log.FC.fit)
                {
                        ZCoast <- ifelse(TSBasin == "Coast", 1, 0)
                        ZWilly <- ifelse(TSBasin == "Willamette", 1, 0)
                        ZUmpqua <- ifelse(TSBasin == "Umpqua", 1, 0)
                        ZRogue <- ifelse(TSBasin == "Rogue", 1, 0)
                        ZKlamath <- ifelse(TSBasin == "Klamath", 1, 0)
                        ZNEOr <- ifelse(TSBasin == "NEOregon", 1, 0)
                        ZSEOr <- ifelse(TSBasin == "SEOregon", 1, 0)
                        log.FC.fit <- ((0.2451 + 0.9279 * log.E.Coli) * ZCoast) +
                                ((0.4067 + 0.8854 * log.E.Coli) * ZWilly) +
                                ((0.4145 + 0.8851 * log.E.Coli) * ZUmpqua) +
                                ((0.2799 + 0.9467 * log.E.Coli) * ZRogue) +
                                ((-0.0348 + 1.0708 * log.E.Coli) * ZKlamath) +
                                ((0.2133 + 0.9572 * log.E.Coli) * ZNEOr) +
                                ((0.5286 + 0.8354 * log.E.Coli) * ZSEOr)
                        log.FC.fit
                }

                I thought that this would pass log.FC.fit out as a model
that I could compare to the simple model, but S-Plus wouldn't recognize
log.FC.fit.  I tried to define an lmRobMM model using predefined dummy
variables, accompanied by the intercepts and slopes, but S-Plus treated the
dummy variables as variables to fit, rather than as predefined variables.  I
feel as if I'm missing some fundamental concept.  Please enlighten me and I
will summarize.

Four respondents (thank you, Anne York, Ian Jonsen, Prof. Brian D. Ripley,
and Albyn Jones) suggested that I simply use anova to compare the regression
for the "simple" model (log.Fecal ~ log.E.Coli) with the regression for the
"complex" model, which adds the factor "TSBasin" and its interaction with
log.E.Coli.  The first two respondents suggested glm, while the other two
suggested lm.  The distribution of the data and the data types pointed
toward lm.  All respondents suggested using "+" and/or "*" to bring
"TSBasin" and its interaction with log.E.Coli into the model formula.  I
ended up using "/" to compare the results of substituting smaller spatial
units, i.e., subbasin and location.
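
A minimal sketch of the comparison described above, written in S syntax
that also runs in R.  The data frame "d" is simulated purely as a stand-in
for the real dataset, and the object names (d, fit.simple, fit.basin,
fit.nested, cmp) are illustrative assumptions, not the calls actually used
for this analysis.

```r
# Simulated stand-in data (an assumption; the real counts are not shown here)
set.seed(1)
basins <- c("Coast", "Willamette", "Umpqua", "Rogue",
            "Klamath", "NEOregon", "SEOregon")
n <- 700
d <- data.frame(TSBasin = factor(sample(basins, n, replace = TRUE)),
                log.E.Coli = runif(n, 0, 4))
d$log.Fecal <- 0.3 + 0.9 * d$log.E.Coli + rnorm(n, sd = 0.25)

# "Simple" model: one intercept and one slope for the whole state
fit.simple <- lm(log.Fecal ~ log.E.Coli, data = d)

# "Complex" model: "*" expands to log.E.Coli + TSBasin + log.E.Coli:TSBasin,
# so each basin gets its own intercept and slope
fit.basin <- lm(log.Fecal ~ log.E.Coli * TSBasin, data = d)

# F test of the nested pair: does allowing basin-specific lines help?
cmp <- anova(fit.simple, fit.basin)
print(cmp)

# Same fit, reparameterized with "/" and no overall intercept, so the
# coefficients read directly as one intercept and one slope per basin
fit.nested <- lm(log.Fecal ~ TSBasin / log.E.Coli - 1, data = d)
print(coef(fit.nested))
```

The factor TSBasin takes the place of the hand-built 0/1 dummy variables;
lm constructs the corresponding columns itself.  How the individual
interaction coefficients of fit.basin read depends on the contrast coding
in effect (S-Plus defaults to Helmert contrasts, R to treatment contrasts),
which is one reason the "/" form can be easier to interpret.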

There was no significant difference between fitting a single relationship
for log.E.Coli without regard to space and fitting a separate relationship
for each monitoring location.  However, there were >100 locations with <10
samples per location, so this result was not too surprising.  There was no
significant difference at the subbasin level, either.  There was a
significant difference between the "simple" model and the "complex" model
incorporating the interaction of the factor "TSBasin".  However, examination
of the resulting coefficients in the "complex" model revealed that the
differences among "TSBasin" levels were small, on the order of 7E-3 to 3E-2
times log(E. coli), well within the range of variability of the analysis.
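
One way to get a feel for how far apart basin-specific lines sit is to
evaluate them at a common E. coli level.  The sketch below uses the
per-basin intercepts and slopes from the function in the original post
(the robust fits, not the coefficients of the lm "complex" model); the
evaluation point log10(E. coli) = 2 and the object names are assumptions
chosen for illustration.

```r
# Per-basin intercepts and slopes copied from the function in the original post
basin.fits <- data.frame(
  basin = c("Coast", "Willamette", "Umpqua", "Rogue",
            "Klamath", "NEOregon", "SEOregon"),
  intercept = c(0.2451, 0.4067, 0.4145, 0.2799, -0.0348, 0.2133, 0.5286),
  slope = c(0.9279, 0.8854, 0.8851, 0.9467, 1.0708, 0.9572, 0.8354)
)

# Assumed illustrative level: log10(E. coli) = 2, i.e. a count of 100
x <- 2

# Predicted log10 fecal coliform for each basin at that level
basin.fits$log.FC <- basin.fits$intercept + basin.fits$slope * x
print(basin.fits)

# Spread of the basin predictions, in log10 units
spread <- diff(range(basin.fits$log.FC))
print(spread)
```

Repeating the calculation at other values of x shows where the
basin-specific lines diverge most over the observed range.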

Thank you,
Curtis Cude
Ambient Monitoring Coordinator
Oregon Water Quality Index Coordinator
Water Quality Monitoring Section
Laboratory Division
Oregon Department of Environmental Quality
1712 SW Eleventh Avenue
Portland, OR  97201
Ph: (503) 229-5983
Fax: (503) 229-6924
E: cude.curtis@deq.state.or.us


