s-news
[Top] [All Lists]

[S] Producing a data frame of summary statistics

To: "s-news" <s-news@wubios.wustl.edu>
Subject: [S] Producing a data frame of summary statistics
From: "Frank E Harrell Jr" <fharrell@virginia.edu>
Date: Sun, 26 Apr 1998 21:05:21 -0400
Reply-to: "Frank E Harrell Jr" <fharrell@virginia.edu>
Sender: owner-s-news@wubios.wustl.edu
Thanks to John Wallace and Todd Taylor for providing using information.
But for the general case where the function used to summarize the data may
be multi-valued (e.g., quantile), aggregate will not work; in this case "dfify" 
will not
work for arrays of lists such as what tapply returns when there are multiple 
stratification
variables.  Also, results from
tapply must be re-formatted carefully because non-existent combinations
will result in NULL as an element of the returned array.  I chose to use
sapply (which is what tapply calls) to have more control, and then to
construct a data frame from the combinations of summary stats produced
by sapply.  The resulting function is below.  I'll soon add it to the Hmisc
library, along with online help.  Comments/suggestions welcome.
Note that summarize is a (faster) special case of Hmisc's summary.formula
function when method='cross' overall=F.

Examples:

summarize(temperature, month, 
function(x)c(Mean=mean(x,na.rm=T),Median=median(x,na.rm=T)))

w <- summarize(temperature, month, mean, na.rm=T)
xyplot(temperature ~ month, data=w)   # plot mean temperature by month

w <- summarize(temperature, llist(year,month), quantile, probs=c(.25,.5,.75), 
na.rm=T)
xYplot(Cbind(temperature[,2],temperature[,-2]) ~ month | year, data=w)

xYplot is in Hmisc, to handle "error bars".
llist is a little function that preserves the names of its arguments.  It also
works with the Hmisc label function, e.g.:

    for(v in llist(temperature,rainfall)) { hist(v); title(label(v)) }  
#label(v)=attr(v,'label')
    # If no labels pre-defined for the two variables, will use the variable 
names as the labels


summarize <- function(X, by, FUN, ..., 
       stat.name=as.character(deparse(substitute(X))), 
       type=c('matrix','variables')) {
  type <- match.arg(type)
  if(!is.list(by)) {
 nameby <- as.character(deparse(substitute(by)))
 by <- list(by)
 names(by) <- nameby
 }
  nby <- length(by)
  typical.computation <- FUN(X, ...)
  nc <- length(typical.computation)
  byc <- do.call('paste',c(by,sep='|'))
  r <- if(nc==1)sapply(split(X, byc), FUN, ..., simplify=T) else
    as.matrix(sapply(split(X, byc), FUN, ..., simplify=T))
  ans <- unpaste(if(nc==1)names(r) else dimnames(r)[[2]], sep='|')
  names(ans) <- names(by)
  if(nc>1 && (nc != nrow(r))) stop('program logic error')
  snames <- names(typical.computation)
  if(length(snames)) snames <- paste(stat.name, snames) else
    snames <- if(nc==1) stat.name else paste(stat.name,1:nc)
  wrn <- .Options$warn
  .Options$warn <- -1
  notna <- rep(T, length(ans[[1]]))
  for(i in 1:length(by)) {
 byi <- by[[i]]
    ansi <- ans[[i]]
    if(is.category(byi))
   ansi <- structure(as.numeric(ansi), 
      levels=levels(byi), class='factor')
 else if(is.numeric(byi)) ansi <- as.numeric(ansi)
    names(ansi) <- NULL
    ans[[i]] <- ansi
 notna <- notna & !is.na(ansi)
  }
  if(type=='matrix' || nc==1) {
 ans[[stat.name]] <- if(nc==1) r else 
   structure(t(r), dimnames=list(NULL, snames))
  } else {
 snames <- make.names(snames)
 for(i in 1:length(snames)) ans[[snames[i]]] <- r[i,]
  }
  notna <- notna & !is.na(if(nc==1) r else t(r) %*% rep(1,nc)) 
  ans <- structure(ans, class='data.frame', 
       row.names=1:length(ans[[1]]))[notna,]
  iorder <- do.call('order', structure(unclass(ans)[1:nby],names=NULL))
  ## order can bomb if data frame given (preserves names)
  ans[iorder,]
}


llist <- function(..., labels = T)
{
 dotlist <- list(...)
 lname <- names(dotlist)
 name <- vname <- as.character(sys.call()[-1])
 for(i in 1:length(dotlist)) {
  vname[i] <- lname[i]
  if(length(vname[i]) == 0 || vname[i] == "")
   vname[i] <- name[i]
  lab <- vname[i]
  if(labels) {
   lab <- attr(dotlist[[i]], "label")
   if(length(lab) == 0)
    lab <- vname[i]
  }
  label(dotlist[[i]]) <- lab
 }
 names(dotlist) <- vname[1:length(dotlist)]
 dotlist
}

---------------------------------------------------------------------------
Frank E Harrell Jr
Professor of Biostatistics and Statistics
Director, Division of Biostatistics and Epidemiology
Dept of Health Evaluation Sciences
University of Virginia School of Medicine
http://www.med.virginia.edu/medicine/clinical/hes/biostat.htm


-----------------------------------------------------------------------
This message was distributed by s-news@wubios.wustl.edu.  To unsubscribe
send e-mail to s-news-request@wubios.wustl.edu with the BODY of the
message:  unsubscribe s-news

<Prev in Thread] Current Thread [Next in Thread>
  • [S] Producing a data frame of summary statistics, Frank E Harrell Jr <=