John Cornell wrote:
--------------------- begin quote ---------------------
I have a dataset with 1.6 million records and 45 binary variables. I
want to apply monothetic (mona) and fuzzy (fanny) clustering methods
to the datamatrix. The original dataset was created in SAS, but I
created a comma delimited text version to import into S-Plus. When I
try to import the dataset via the Import GUI interface, the program
freezes and stops responding. Is there a more efficient way to import
a large dataset into S-Plus or R?
---------------------- end quote ----------------------
I've found that importing text files is relatively slow.
A faster way, which I have used on a data file about as large
as the one you mention, is to convert the file to binary
(8-byte doubles), use the "readBin" function to slurp it up
as one long vector, and then use "matrix" to reshape it into
the right number of rows and columns. Reading the binary data
is very fast (a few seconds).
To convert to binary I just use a C program that is mostly just

    double x;
    while (scanf("%lf", &x) == 1) fwrite(&x, sizeof(x), 1, stdout);

The presence of commas complicates things, because scanf("%lf")
stops at the first comma it meets. How about running the file
through sed 's/,/ /g' first to replace them with spaces?
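
For what it's worth, here is a slightly fuller sketch of such a
converter. It is only an illustration: the function name
text_to_doubles is made up, and it assumes the file contains nothing
but numbers separated by commas or whitespace (no header line). It
skips commas itself, so the sed step would not be needed.

```c
#include <stdio.h>

/* Hypothetical helper (the name is illustrative only).
   Copies numbers from a text stream to a binary stream as native
   doubles (8 bytes on typical platforms). Commas are skipped like
   whitespace. Returns the number of values written, or -1 on a
   write error. */
long text_to_doubles(FILE *in, FILE *out)
{
    double x;
    long n = 0;

    for (;;) {
        if (fscanf(in, "%lf", &x) == 1) {
            if (fwrite(&x, sizeof x, 1, out) != 1)
                return -1;          /* write error */
            n++;
        } else {
            /* Not a number here: accept a comma, stop on anything
               else (including end of file). */
            int c = fgetc(in);
            if (c != ',')
                break;
        }
    }
    return n;
}
```

Calling text_to_doubles(stdin, stdout) from main turns this into a
filter. The doubles are written in the machine's native byte order,
so the file should be read back on the same kind of machine.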
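Bear in mind that you may need to transpose the matrix once
you've loaded it. The file is written record by record, but the
customary storage order for a matrix is column major, so the
vector fills the matrix column by column. In R, passing
byrow=TRUE to matrix avoids the transpose.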
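
To make the storage-order point concrete: in the file, variable j of
record i sits at offset i*ncol + j, whereas a column-major matrix
expects it at offset j*nrow + i. A minimal sketch of the reordering
follows (the function name is hypothetical, and it assumes both
arrays fit in memory):

```c
#include <stddef.h>

/* Illustrative helper: src holds nrow*ncol values in row-major
   (file) order; dst receives the same values in column-major
   order. src and dst must not overlap. */
void row_to_col_major(const double *src, double *dst,
                      size_t nrow, size_t ncol)
{
    for (size_t i = 0; i < nrow; i++)
        for (size_t j = 0; j < ncol; j++)
            dst[j * nrow + i] = src[i * ncol + j];
}
```

Doing this before writing the binary file would let the vector be
reshaped directly, at the cost of holding the whole dataset in memory
during the conversion.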
I use R on Linux, and I haven't tried this approach with S+.
Hope this helps,
Robert Dodier