s-news
[Top] [All Lists]

Re: automate writing formulas

To: <s-news@lists.biostat.wustl.edu>
Subject: Re: automate writing formulas
From: "Michael Camilleri" <MichaelCamilleri@branz.co.nz>
Date: Wed, 2 Aug 2006 10:07:56 +1200
In-reply-to: <44CE2AEA.9000100@vanderbilt.edu>
Thread-index: Aca1tvGEgPTiZxd1RSGzqAZOWltgWw==
Thread-topic: automate writing formulas
I have dealt with similar issues when I have several hundred potential
variables to choose from, of which only a few are of interest to a
particular model. The issue for me is having a process that easily lets
me screen the variables I am interested in, (usually less then 20), and
a semi-automated process saves a lot of hassle rewriting formulas and
making sure the variable names are spelt correctly. Most of my models
end up having 0-5 terms only.

I developed a neat bit of code that adds terms to a model using the
step() function (works with lm or glm, and maybe others), using a
wildcard search on the variable names of the model data frame. I've
included it for anyone to use or adapt if they wish. This is much
quicker for screening variables then having to write out exact formulas
for the step() or add1() functions.

It uses the oldGrep() function for wildcard searching of variable names,
and this may not work on all platforms. Easy enough to change to your
usual method of wildcard searching. Also uses the paste() function which
might work differently on the various platforms.

This function does not do away with the need to check and compare
various possible model configurations, and to apply some serious thought
to what the model should be. It can be particularly dangerous when you
have a lot of missing data, or categorical variables that have only a
few cases in one category. Both these types of data sets can easily end
up fitting a model to the extremes of only one or two variables, or
explaining more variation by removing observations. In cases like this
the order in which you screen variables can have a big impact on the
final model, which is often a sign of trouble.

Michael

********************************************* 
Michael Camilleri
Building Physicist
Mailto:MichaelCamilleri@branz.co.nz
BRANZ Ltd.               Ph:  (+64 4) 237 1170
Private Bag 50-908      Fax: (+64 4) 237 1171
Porirua City 5240,      DDI: (+64 4) 237 1174
New Zealand       Web: www.branz.co.nz/main.php?page=HEEP


Usage:
test.model_lm(Energy~Temperature,EnergyData,na=na.omit)
test.model_auto.model(test.model,'Solar*')
# Will screen all variables starting with 'Solar' using step()

 auto.model<-
function(model, addTerms, display = T)
{
# addTerms are search terms, using the oldGrep() wildcard format        
# terms that are character data are converted to factors
# These will be added and tested one at a time
        model.call <- model$call
        dataName <- as.character(model.call$data)
        data <- get(dataName)
        addTerms <- names(data)[oldGrep(addTerms, names(data))]
        if(length(addTerms) == 0) {
                warning("No terms to add\n")
                return(model)
        }
        print(addTerms)
        for(i in addTerms) {
                print(i)
                if(class(data[, i]) == "character")
                        model <- step(model, scope = list(upper =
as.formula(paste("~ .+ factor(", i, ")",
                                sep = ""))))
                else model <- step(model, scope = list(upper =
as.formula(paste("~ .+", i, sep = ""))))
        }
        if(display) {
                if(length(coefficients(model)) <= 1)
                        print("Model only has an intercept term")
                else {
                        print(summary(model, cor = F))
                        print(drop1(model))
                        par(mfcol = c(3, 1))
                        plot(model, which = c(1, 3, 6))
                }
        }
        return(model)
}

<Prev in Thread] Current Thread [Next in Thread>