aggregate.data.frame with too many dimensions

To: "S-News (s-news@lists.biostat.wustl.edu)" <s-news@lists.biostat.wustl.edu>
Subject: aggregate.data.frame with too many dimensions
From: "Fowler, Mark" <FowlerM@mar.dfo-mpo.gc.ca>
Date: Thu, 24 Feb 2005 11:18:34 -0400
This is a revisit of a question posed by Jeremy Stanley in March 2004. The
issue is failure of the aggregate function as the number of records, by
columns and summary columns increases. The earlier question may have been
confused with the known bug in passing summary function arguments (...) to
aggregate.data.frame in S+ 6.2, addressed by David Smith. Or maybe the
question was answered off the list? The problem is quite odd, and a bit
tricky to explain, so maybe the earlier question needs a second poser to be
taken more seriously. I repeat the question with my own example here, and
further note that it applies to both 6.2 and the 6.1-ish Statserver version
of S+.

The problem presented itself when the following command in S+ 6.1 (-ish,
Statserver version), 

x5 <- aggregate.data.frame(
        poldat[, c("effhrs", "effcnt", "TotalWt", "msctons")],
        list(region = poldat$region, cfv = poldat$cfv, yland = poldat$yland,
             mland = poldat$mland, dland = poldat$dland,
             nafodiv = poldat$nafodiv, nafoarea = poldat$nafoarea,
             tonclass = poldat$tonclass, lenclass = poldat$lenclass,
             ycaught = poldat$ycaught, mcaught = poldat$mcaught,
             dcaught = poldat$dcaught, gear = poldat$gear,
             depzone = poldat$depzone, mspec = poldat$mspec),
        sum)

with 4 summary columns, 15 by columns and about 7000 records, generated the
error:

Problem in rep.int(1:n, times): Cannot create - data would have length
greater than 536870911 (.Machine$integer.max/sizeof(integer)) See text
output or StatServer6/tmp/#/all.err for traceback.

Repeating this in 6.2 generated:

Problem in (lens <- sapply(unlist(val, recursive = F), length)) > 1: needed
atomic data, got an object of class "list"

Taking my cue from the example presented by Jeremy Stanley, I successively
reduced the input dimensions in various ways (summary variables, indices,
size of data file). Very unpredictable results were associated with
different sets of dimensions. With a small enough dataset the full command
would work, but as the dataset got larger, reduction of the number of
summary variables and/or indices was necessary to make the function work.
Also, there is an 'overlap' region of dimensions within which the function
will fail without producing an error message, instead giving a single record
as the result. This record will be the index and summary variable values for
one of the data records, nothing summarized. The record chosen seems to be
the last record it reached during processing before the function apparently
died without tripping an error flag. It remains consistent for the same
command and data, but moves with changes to the dimensions. This overlap
region, at the edge of the dimensions aggregate.data.frame can handle, may
explain the strange results in the example by Jeremy Stanley.
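
To put rough numbers on the rep.int error, here is a small arithmetic sketch,
written in Python rather than S because it only does counting. It rests on an
unconfirmed assumption: that aggregate.data.frame sizes some intermediate
object by the full cross-classification of the by columns, which is one
possible reading of the rep.int(1:n, times) message but has not been checked
against the function's source. The level counts are hypothetical placeholders,
not values from poldat, and the helper names are made up for illustration.

```python
from math import prod

# Limit quoted in the S+ 6.1 error: .Machine$integer.max / sizeof(integer),
# taking integer.max = 2^31 - 1 and 4-byte integers.
INTEGER_MAX = 2**31 - 1
LIMIT = INTEGER_MAX // 4  # 536870911


def crossed_cells(level_counts):
    """Number of cells in the full cross-classification of the by columns."""
    return prod(level_counts)


def first_overflow(level_counts, limit=LIMIT):
    """Return how many by columns it takes for the running product of level
    counts to exceed the limit, or None if it never does."""
    cells = 1
    for i, n in enumerate(level_counts, start=1):
        cells *= n
        if cells > limit:
            return i
    return None


# HYPOTHETICAL numbers of distinct levels per by column (not from poldat).
levels = {
    "region": 4, "cfv": 300, "yland": 3, "mland": 12, "dland": 31,
    "nafodiv": 6, "nafoarea": 20, "tonclass": 5, "lenclass": 5,
    "ycaught": 3, "mcaught": 12, "dcaught": 31, "gear": 10,
    "depzone": 8, "mspec": 15,
}

counts = list(levels.values())
print("limit:", LIMIT)
print("cells in full cross-classification:", crossed_cells(counts))
print("by columns needed to exceed limit:", first_overflow(counts))
```

Under these made-up counts the running product passes the limit well before
all 15 by columns are included, even though about 7000 records can occupy at
most about 7000 of those cells. If the assumption holds, that would fit the
observation that dropping by columns lets the command run.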



Mark Fowler
Marine Fish Division
Bedford Inst of Oceanography
Dept Fisheries & Oceans
Dartmouth NS Canada
fowlerm@mar.dfo-mpo.gc.ca
