
Re: really large files - import vs. use

To: Eva Goldwater <goldwater@schoolph.umass.edu>, Chushu Gu <chushugu@hotmail.com>
Subject: Re: really large files - import vs. use
From: Tony Plate <tplate@blackmesacapital.com>
Date: Thu, 16 Dec 2004 11:13:53 -0700
Cc: s-news@lists.biostat.wustl.edu
In-reply-to: <Pine.GSO.4.55.0412160922490.15981@shell1.oit.umass.edu>
References: <E07964B84690CC47B01421B2A71D4F0F095F4ACA@rinnycs0000> <BAY102-DAV68C69D43FD9FE74F1F8A8CDAE0@phx.gbl> <Pine.GSO.4.55.0412160922490.15981@shell1.oit.umass.edu>
S-PLUS also has problems working with very large data files. As a general rule of thumb, lm() becomes difficult to use once the dataset occupies more than 1/6 of the usable virtual memory. Other, simpler operations can work with larger datasets. I have at times worked with datasets that occupy 1/4 of usable virtual memory, but I always expect an "unable to allocate" error when I do that.
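
As a rough illustration of the rule of thumb above, here is a minimal sketch in Python (not S-PLUS code). The function name fits_rule_of_thumb and the example sizes are hypothetical; the 1/6 and 1/4 fractions are the ones quoted above and are approximate guides, not hard limits.

```python
from fractions import Fraction

# Fractions from the rule of thumb above; both are rough guides only.
LM_FRACTION = Fraction(1, 6)      # lm() gets difficult beyond this share of virtual memory
SIMPLE_FRACTION = Fraction(1, 4)  # largest share reported as workable for simpler operations


def fits_rule_of_thumb(dataset_bytes, usable_vm_bytes, fraction=LM_FRACTION):
    """Return True if the dataset is within the given fraction of usable virtual memory."""
    return dataset_bytes <= fraction * usable_vm_bytes


if __name__ == "__main__":
    usable = 2 * 1024**3     # hypothetical: 2 GB of usable virtual memory
    dataset = 400 * 1024**2  # hypothetical: a dataset occupying 400 MB in memory
    print(fits_rule_of_thumb(dataset, usable))                   # against the lm() guide
    print(fits_rule_of_thumb(dataset, usable, SIMPLE_FRACTION))  # against the 1/4 guide
```

With these hypothetical numbers the dataset is over the 1/6 guide for lm() but under the 1/4 guide for simpler operations.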

read.table() uses a surprisingly large amount of memory during import -- it seems to have a similar overhead to lm() (in a very rough and approximate sense). That's why there are problems with the import process.
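
Because the import step is where memory runs out, the "break it into smaller files" advice quoted below can also be followed without SAS. The following is a Python sketch, not something from the thread: the function name split_text_file, the output naming scheme, and the chunk size are all hypothetical. It assumes the first line is a header and that no field contains an embedded newline. It reads one line at a time, so its own memory use stays small regardless of the file size.

```python
import os


def split_text_file(path, rows_per_chunk, out_dir):
    """Split a delimited text file into pieces of at most rows_per_chunk data rows.

    The header line is repeated at the top of every piece.
    Returns the list of piece paths, in order.
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.splitext(os.path.basename(path))[0]
    pieces = []
    out = None
    # newline="" leaves the original line endings untouched on read and write.
    with open(path, "r", newline="") as src:
        header = src.readline()
        try:
            for i, line in enumerate(src):
                if i % rows_per_chunk == 0:
                    # Start a new piece: close the previous one, write the header again.
                    if out is not None:
                        out.close()
                    name = "%s_part%03d.txt" % (base, len(pieces) + 1)
                    piece = os.path.join(out_dir, name)
                    out = open(piece, "w", newline="")
                    out.write(header)
                    pieces.append(piece)
                out.write(line)
        finally:
            if out is not None:
                out.close()
    return pieces
```

Each piece could then be read with read.table() and the results combined with rbind(), as suggested elsewhere in this thread. The number of rows per piece has to be found by trial, since it depends on the number of columns and on available memory.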

-- Tony Plate

At Thursday 07:33 AM 12/16/2004, Eva Goldwater wrote:
Hello

As a long-time SAS user but S-Plus newbie, I am quite puzzled by this
solution, as it implies that S-Plus has no problem WORKING with very large
files, but only with the import process.  Is that correct, or am I missing
something???

Eva Goldwater                           email: goldwater@schoolph.umass.edu
Biostatistics Consulting                Phone: (413) 545-2949
418 Arnold House                        Fax:   (413) 545-1645
715 North Pleasant Street
University of Massachusetts
Amherst, MA 01003-9304

On Wed, 15 Dec 2004, Chushu Gu wrote:

> My recommendation:
>
> Use SAS to separate the file. But you need to know how large a file
> Splus can handle (how many records from the source file).
> Import all the files as SAS data sets. When all the files are imported, rbind will
> finish the work.
>
> I used to process a large file in this way.
>
> Code in SAS:
> Assume only 10000 records can be imported into Splus.
>
> data temp1;
> infile 'c:\largefile' firstobs=1 obs=10000;
> input a b $ c;
> run;
> data temp2;
> infile 'c:\largefile' firstobs=10001 obs=20000;
> input a b $ c;
> run;
> ..
>
> If you are an SAS expert, a simple macro would do the trick.
>
> Then you have all the data sets temp1, temp2, ...
> Import them directly into Splus; the file names for these data sets may be
> temp1.sas7bdat, temp2.sas7bdat ...
>
>
> Hope this helps,
>
> Chushu Gu
>
>
> ----- Original Message -----
> From: "Bos, Roger" <BosR@ny.rothinc.com>
> To: <s-news@lists.biostat.wustl.edu>
> Sent: Tuesday, December 14, 2004 10:45 AM
> Subject: [S] help importing really large files
>
>
> > Has anyone found a trick to importing really large txt files into S+ 6.2
> > under XP? I sent the question to Insightful and their only recommendation
> > was to break it up into smaller files.  The file is 350 megs, which is large
> > I grant, but my machine has 4 gigs of memory.  If I did want to break it up,
> > what utility could I use to do so?  Excel is not going to read it either.
> > See below for my full question and support's answer.  Thanks in advance.
> >
> >
> > I get the "unable to obtain requested dynamic memory" error when I try to
> > read in a large file into S+ 6.2 using the following command:
> >
> > data <-
> > read.table("M:\\tina\\R2000V10SPLS29m.TXT",header=TRUE,sep=",",as.is=TRUE,na.strings="NA")
> > dim(data)
> >
> > The text file is 347,456 KB big.  My Windows XP machine has 4 Gigs of
> > memory, which I believe is the max it can handle.  I also believe that my
> > virtual memory is maxed out.  I read the FAQ on this topic, but it mostly
> > said to optimize the code and I am just trying to read it in.  I understand
> > that the operating system steals half of this.  Do I need to change any
> > setting to make sure S+ is fully utilizing my memory capabilities?  Anything
> > else I can try?
> >
> > --------------------------------------------------------------------------
> > Solution:
> >
> > The file you are trying to import is a very large file. The calculation we
> > use to calculate the size of the data you are trying to import is:
> >
> > (rows)*(columns)*8*4.5
> >
> > You should import the file by breaking it into smaller files. Then import
> > these smaller files into S-Plus and finally, recombine them inside S-Plus.
> >
> >
> >
> > Please let me know if you have any questions.
> >
> > Sincerely,
> >
> > Jacob Geballe
> >
> > ===========================================================================
> >  Jacob Geballe                       email: support@insightful.com
> >  Technical Support Engineer            FAX: (206) 283-8691
> >  Insightful Corporation                Phone: (206) 283-8802 ext.235
> >  www.insightful.com                          1-800-569-0123 ext.235
> > ===========================================================================
> >
> > Roger J. Bos, CFA
> > Rothschild Asset Management
> > 1251 Avenue of the Americas
> > New York, NY  10020
> > 212-403-5471
> >
> >
> > --------------------------------------------------------------------
> > This message was distributed by s-news@lists.biostat.wustl.edu.  To
> > unsubscribe send e-mail to s-news-request@lists.biostat.wustl.edu with
> > the BODY of the message:  unsubscribe s-news
> >
