Re: really large files - import vs. use

To: Alan Zaslavsky <zaslavsk@hcp.med.harvard.edu>, s-news@lists.biostat.wustl.edu
Subject: Re: really large files - import vs. use
From: Tony Plate <tplate@blackmesacapital.com>
Date: Fri, 17 Dec 2004 09:19:11 -0700
In-reply-to: <200412162010.iBGKAkc11963@nightingale.hcp.med.harvard.edu>
References: <200412162010.iBGKAkc11963@nightingale.hcp.med.harvard.edu>

Actually, read.table() does know the total size of the data set before invoking scan() -- because it invokes count.fields() on the file first. (I.e., it reads the file twice.) The number of records is stored in the variable nrec. So it would be an easy change to make read.table() preallocate space for scan() (using the 'what=' argument). Maybe there's some good reason for not preallocating, but I suspect that it's more likely just a case of over-mature code.
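
A rough sketch of that change, written as a standalone helper rather than a patch to read.table() itself. The function name make.what and its 'types' argument are made up for illustration, and I am assuming count.fields() and vector() take the same arguments in S-PLUS as in R:

```r
# Hypothetical helper: build a preallocated 'what' list for scan().
# 'types' is a named character vector of column modes, e.g.
# c(id = "numeric", label = "character").
make.what <- function(file, types, sep = "", header = FALSE) {
    # count.fields() returns one field count per line of the file,
    # so its length is the number of lines.
    nrec <- length(count.fields(file, sep = sep)) - as.numeric(header)
    # One vector of length nrec per column, in the requested mode.
    lapply(types, function(mode) vector(mode, nrec))
}
```

The result would then be passed along as scan(file, what = make.what(file, types, header = TRUE), skip = 1). Like read.table(), this still reads the file twice; the saving is only in the reallocations during the second pass.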

-- Tony Plate

At Thursday 01:10 PM 12/16/2004, Alan Zaslavsky wrote:
The read.table() function uses scan() to actually read the data (after
getting the field headings, if any, from the first line).  Note the following
comment from the help file for scan():

   As it reads more and more records, scan allocates more space to
   accommodate the growing vectors. If you supply a what argument that is
   identical in size to the result you expect, S-PLUS uses that space and
   does not have to perform memory allocations. This may produce
   significant memory savings when dealing with large files of data.

read.table() doesn't know how big the dataset is until it has read it.
If you use scan() directly you can use the what= argument to allocate the
correct amount of space from the beginning and thereby save some memory
and time.  It would not be hard to modify read.table() to include an
argument for the number of cases that will be read, and then create a
"what" argument for scan() that makes use of this information.  It could
also speed things up to know in advance which columns are numeric and to
specify that in the "what" argument to scan().
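
For illustration, a direct call to scan() along those lines might look like the sketch below. The wrapper name scan.known, the three-column layout (id, label, value) and the single header line are assumptions made up for the example, not anything taken from a real data set:

```r
# Hypothetical example: the file is assumed to have one header line
# followed by n records of three whitespace-separated fields.
scan.known <- function(file, n) {
    scan(file, skip = 1,
         # Each component fixes the column type and is already sized
         # to hold n values, so no growing should be needed.
         what = list(id = numeric(n), label = character(n), value = numeric(n)))
}
```

A call such as cols <- scan.known("big.dat", 1000000) returns a list with one component per column, which can be turned into a data frame afterwards if needed.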
