s-news

Re: Importing large datasets into S-Plus and R

To: Stephen Ban <ban@zoology.ubc.ca>
Subject: Re: Importing large datasets into S-Plus and R
From: Sven.Knudsen@adeptscience.dk
Date: Fri, 26 Sep 2003 13:45:36 +0200
Cc: "John E. Cornell, Ph.D." <cornell@uthscsa.edu>, S-News <s-news@lists.biostat.wustl.edu>, s-news-owner@lists.biostat.wustl.edu
In-reply-to: <5.1.1.6.0.20030925130659.00af05f0@pop.zoology.ubc.ca>

The import GUI sometimes behaves strangely, which may have something to do with Windows. A more direct approach is to use the command line, but the block read and write methods seem to work much faster. You can then either append the blocks as you go (this slows down the process) or read the whole dataset as one large block. In fact, the latter is faster than the traditional import methods. Look for the functions

        openData
        readNextDataRows
        writeNextDataRows
        closeData

Here is an example of a function that reads the data block by block and calculates the min and max of each column:


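minmaxBlock <- function(file, type, nr) {
        # Construct a data handle that reads nr rows per block
        dh <- openData(file, type=type, openType="read", rowsToRead=nr)
        # The first block initialises the running column minima and maxima
        tempblock <- readNextDataRows(dh)
        tempmin <- sapply(tempblock, min)
        tempmax <- sapply(tempblock, max)
        while(T) {
                tempblock <- readNextDataRows(dh)
                # An empty block means the end of the file has been reached
                if( length(tempblock) == 0 )
                        break
                tempmin <- pmin(tempmin, sapply(tempblock, min))
                tempmax <- pmax(tempmax, sapply(tempblock, max))
        }
        # Release the data handle once all blocks have been read
        closeData(dh)
        list(min=tempmin, max=tempmax)
}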

Note that the function does not import the data but aggregates it block by block (somewhat inspired by Insightful Miner, without breaking any patents :-)

I have tested it on datasets with 1 million records x 5 columns, where it takes less than 1 minute. With the traditional import, the same data takes 3.5 minutes (Windows 2000, S+ 6.1, 512 MB RAM, PIII 800 MHz).
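
For R, a comparable block-wise pass could be sketched roughly as below. This is only a sketch under assumptions, not a tested equivalent of the S-PLUS function: it assumes a comma-delimited text file with a header line and numeric columns, it uses only base R (file, readLines, scan, read.csv), and the names minmaxBlockR, nr and sep are made up for the illustration.

```r
# Block-wise column min/max for a delimited text file in base R.
# minmaxBlockR, nr and sep are hypothetical names used for this sketch.
minmaxBlockR <- function(file, nr = 10000, sep = ",") {
  con <- file(file, open = "r")
  on.exit(close(con))                 # release the connection on exit

  # The first line is assumed to hold the column names
  header <- scan(text = readLines(con, n = 1), what = "", sep = sep,
                 quiet = TRUE)

  colmin <- NULL
  colmax <- NULL
  repeat {
    # readLines continues from the current position of the open connection
    # and returns character(0) once the file is exhausted
    lines <- readLines(con, n = nr)
    if (length(lines) == 0) break

    block <- read.csv(text = lines, header = FALSE, sep = sep,
                      col.names = header)
    bmin <- sapply(block, min)
    bmax <- sapply(block, max)

    # Fold the block's extremes into the running extremes
    colmin <- if (is.null(colmin)) bmin else pmin(colmin, bmin)
    colmax <- if (is.null(colmax)) bmax else pmax(colmax, bmax)
  }
  list(min = colmin, max = colmax)
}
```

A call would look like minmaxBlockR("mydata.csv", nr = 100000), where the file name is a placeholder. Only one block of nr rows is held in memory at a time, so nr trades memory use against the number of read calls.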

Hope this is of some inspiration.

Adept Scientific ApS

Sven Jesper Knudsen
Senior Consultant




Stephen Ban <ban@zoology.ubc.ca>
Sent by: s-news-owner@lists.biostat.wustl.edu

09/25/2003 10:08 PM

       
        To:        "John E. Cornell, Ph.D." <cornell@uthscsa.edu>, S-News <s-news@lists.biostat.wustl.edu>
        cc:        
        Subject:        Re: [S] Importing large datasets into S-Plus and R



I also have large datasets (1.5-2.0 million records), and I found a weird
solution. SPSS had no problems opening my datasets. Once I saved them
as SPSS *.sav files, S-PLUS was able to import them without a problem, even
though it would choke on the original raw data.

Hope you have access to SPSS :)

Stephen

At 01:10 PM 25/09/2003 -0500, John E. Cornell, Ph.D. wrote:

>I have a dataset with 1.6 million records and 45 binary variables.  I want
>to apply monothetic (mona) and fuzzy (fanny) clustering methods to the
>datamatrix.  The original dataset was created in SAS, but I created a comma
>delimited text version to import into S-Plus.  When I try to import the
>dataset via the Import GUI interface, the program freezes and stops
>responding.  Is there a more efficient way to import a large dataset into
>S-Plus or R?
>
>John Cornell
>
>****************************************************************************
>********
>
>Expectation, hope, intention toward possibility that has still not become:
>this is not only a basic feature of human consciousness, but, ..., a basic
>determination within objective reality as a whole.
>
>   --Ernst Bloch
>   --The Principle of Hope (Vol. 1)
>
>
>
>--------------------------------------------------------------------
>This message was distributed by s-news@lists.biostat.wustl.edu.  To
>unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
>the BODY of the message:  unsubscribe s-news

