s-news

Efficiency of arrays vs. data frames for large data sets

To: <s-news@lists.biostat.wustl.edu>
Subject: Efficiency of arrays vs. data frames for large data sets
From: "Christopher Green" <cggreen@u.washington.edu>
Date: Sun, 17 Oct 2004 18:24:15 -0700
Hello all,
 
I am working with large financial data sets (several GB) for a research
project.  The data have three natural indices: time (e.g., month/year),
asset/company (e.g., IBM), and measurement (e.g., price or return).

Let's assume for simplicity that the data are all numeric.  Since the data
are naturally three-dimensional in the above sense, we might think about
storing the data in a three-dimensional array in S-PLUS.  We might also
think about "stacking" the data, say by asset, into a two-dimensional data
frame, with an added (factor) variable to indicate the asset.  (Example:

Date Price Asset
01/99  2.45 abc
02/99  2.56 abc
01/99 10.45 xyz
02/99 10.47 xyz

)
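
To make the two layouts concrete, here is a minimal sketch in Python (not
S-PLUS), using only the four example values from the table above. The names
cube, stacked, cube_get and stacked_subset are made up for illustration, and
the index order of the flat array is an arbitrary choice, not a claim about
how S-PLUS orders array elements.

```python
# Index labels for the three "natural" dimensions of the example data.
dates = ["01/99", "02/99"]
assets = ["abc", "xyz"]
measurements = ["Price"]

# Layout 1: a dense three-dimensional array stored as one flat list,
# addressed by (date, asset, measurement). Date varies slowest here.
cube = [2.45, 10.45, 2.56, 10.47]

def cube_get(date, asset, measurement):
    """Look up one cell by translating the three labels into a flat offset."""
    i = dates.index(date)
    j = assets.index(asset)
    k = measurements.index(measurement)
    return cube[(i * len(assets) + j) * len(measurements) + k]

# Layout 2: the "stacked" form, one column per variable plus an Asset column.
stacked = {
    "Date": ["01/99", "02/99", "01/99", "02/99"],
    "Price": [2.45, 2.56, 10.45, 10.47],
    "Asset": ["abc", "abc", "xyz", "xyz"],
}

def stacked_subset(asset):
    """Keep the rows whose Asset value matches, scanning the Asset column."""
    rows = [r for r, a in enumerate(stacked["Asset"]) if a == asset]
    return {col: [vals[r] for r in rows] for col, vals in stacked.items()}

# The same question ("prices for xyz") asked of each layout.
xyz_from_cube = [cube_get(d, "xyz", "Price") for d in dates]
xyz_from_stacked = stacked_subset("xyz")["Price"]
```

In the array layout, subsetting by asset is pure index arithmetic; in the
stacked layout it requires a scan of the Asset column. Whether that
difference is measurable in S-PLUS is exactly the open question.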

My question: for large amounts of data, which of these setups is generally
more efficient? The primary concern here is the speed of subsetting the
data by measurement value, date, and/or asset.

My intuition is that the array should be faster in general, as it is atomic.
But I have built both in S-PLUS, and they seem to be pretty comparable in terms
of memory use and speed (as measured with object.size and proc.time).  

I haven't been able to find a clear explanation of this aspect anywhere
(on-line docs, books, searching the web)...just the general principle that
the atomic structures should be faster. From what I've heard the data frame
is implemented as an array of pointers...presumably the atomics are
implemented as contiguous blocks of memory (say, something returned by a
malloc or something of that ilk)?
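
The two guesses in the paragraph above can be pictured with a small Python
sketch (again not S-PLUS, and not a statement about how S-PLUS is actually
implemented). It assumes a 2 x 2 table of the example prices; block, frame,
block_get and frame_get are hypothetical names.

```python
from array import array

n_rows = 2  # two dates
n_cols = 2  # two assets

# Guess 1: an atomic array is one contiguous block of doubles.
# Columns are laid end to end, so a cell is found by offset arithmetic.
block = array("d", [2.45, 2.56, 10.45, 10.47])

def block_get(row, col):
    """One multiplication and one addition, then a single memory access."""
    return block[col * n_rows + row]

# Guess 2: a data frame is an array of pointers, here a list whose
# elements refer to separately allocated columns.
frame = [array("d", [2.45, 2.56]), array("d", [10.45, 10.47])]

def frame_get(row, col):
    """Follow the reference to the column first, then index within it."""
    return frame[col][row]

# Both layouts return the same values; only the access path differs.
same = all(
    block_get(r, c) == frame_get(r, c)
    for r in range(n_rows)
    for c in range(n_cols)
)
```

If this picture is roughly right, each column of the data frame is itself a
contiguous atomic vector, so the only extra cost is one pointer lookup per
column. That might explain why the two representations time so similarly.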

Are there any strong reasons to prefer one representation over the other?

(If it matters, I am using S-PLUS 6.2 Build 6713 on a Windows 2000 SP4
machine, Pentium IV 1.5 GHz processor with 1GB RAM.)

Thanks,

Chris Green
Graduate Student, Statistics
University of Washington

