
Re: compare scan(), read.table(), and importData()

To: "Pikounis, V. Bill" <v_bill_pikounis@merck.com>, s-news@wubios.wustl.edu
Subject: Re: compare scan(), read.table(), and importData()
From: Don MacQueen <macq@llnl.gov>
Date: Fri, 23 Mar 2001 07:43:31 -0800
In-reply-to: <994DC7B72E4CD311A56D0008C707B35903CF3CCA@usrymx05.merck.com>
References: <994DC7B72E4CD311A56D0008C707B35903CF3CCA@usrymx05.merck.com>
Oops, I forgot to include

 version
Version 6.0 Release 1 for Sun SPARC, SunOS 5.6 : 2000

I've used importData() with SAS datasets on the order of 50000 records and not noticed a problem.

-Don

At 8:33 AM -0500 3/23/01, Pikounis, V. Bill wrote:
Don and all,

For the portion of your post dealing with importData(), I have had very
similar experiences, and coincidentally, was planning to write to the list
on issues related to this topic.  In my case, I have for example 2.5 million
observations and 5 columns in a csv file.  I only tried using S-PLUS 6 for
Linux and S-PLUS 6 for Windows Beta 2.  The machines these reside on are
similar, with dual processors around 800 MHz and 1.5 GB of RAM.

(What versions and platforms were you using?)

read.table() takes just under 8 minutes to read in a file like the above,
which is acceptable for me.  importData() was taking so long on the same
file that I killed the process after nearly an hour.  My recollection may be
faulty, but it seems that importData() would not be so bad on other formats,
such as a SAS data set. Also, I wonder if the Sv4 read enhancements come
into play here with read.table().

Also interestingly, exportData() was much faster in producing a CSV file
than write.table() after processing such a data set -- a couple of seconds
versus over 2 minutes, respectively, to output a file of approximately
12,000 observations and 30 columns.

I'd certainly like to hear others' experiences and opinions on these
types of issues.

Thanks,
Bill

______________________________
Bill Pikounis, Ph.D.
   E-mail: v_bill_pikounis@merck.com
   Phone: 732.594.3913
   Fax:  732.594.1565

Merck Research Laboratories
P.O. Box 2000, MailDrop RY70-38
126 E. Lincoln Ave
Rahway, NJ 07065-0900


 -----Original Message-----
 From: Don MacQueen [mailto:macq@llnl.gov]
 Sent: Thursday, March 22, 2001 8:43 PM
 To: s-news@wubios.wustl.edu
 Subject: [S] compare scan(), read.table(), and importData()


 I just spent the last hour or so figuring out why read.table() and
 importData() produced dataframes with different numbers of rows from
 the same ascii data file.

 It turns out that read.table() ignores blank lines in the data file;
 importData() gives a row of all NAs.

 This is not documented in the help() pages for either of these. In
 fact, help(read.table) says explicitly,
    "creates a data frame with the same number of rows
     as there are lines in the file"
 and therefore does not do what it says it does. The
 documentation is wrong.
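
 The row-count mismatch can be reproduced with a three-line file that has one blank line. This is only a sketch in R syntax, which is close to S; it assumes R's read.table() defaults, and it does not cover importData(), which is S-PLUS only. The temporary file and its contents are made up for illustration.

```r
# Hypothetical test file: three lines, the middle one blank.
tmp <- tempfile()
writeLines(c("1 2", "", "3 4"), tmp)

n.lines <- length(readLines(tmp))   # number of lines in the file
dat <- read.table(tmp)              # blank line is skipped by default (R)
n.rows <- nrow(dat)                 # number of rows in the data frame

# The two counts differ by the number of blank lines.
cat("lines:", n.lines, " rows:", n.rows, "\n")
```

 Comparing length(readLines(file)) with nrow() is a quick check for this kind of discrepancy before blaming the data.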

 A performance comparison of
    (1) scan() using the n= argument, followed by conversion to a data frame
    (2) read.table()
    (3) importData()
 on a file with 116577 lines and 16 fields had the following elapsed
 times in seconds (two repetitions each)
     (1) 20, 23
     (2) 29, 30
     (3) 69, 71

 So
    - scan + conversion is fastest, but not by all that much compared with read.table
    - importData is much slower
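
 A timing comparison along these lines could be set up roughly as follows. This is a sketch in R syntax, not the code used for the numbers above: it uses R's system.time() (S-PLUS has its own timing functions), a small made-up file, and it leaves out both importData() and the n= argument to scan(). The file layout and column names are assumptions.

```r
# Hypothetical two-column test file; the real file had 116577 lines and 16 fields.
tmp <- tempfile()
n <- 1000
write.table(data.frame(id = seq_len(n), grp = rep(c("a", "b"), length.out = n)),
            tmp, row.names = FALSE, col.names = FALSE, quote = FALSE)

# (1) scan() into a list, then convert to a data frame
t.scan <- system.time(
  d1 <- as.data.frame(scan(tmp, what = list(id = 0, grp = ""), quiet = TRUE))
)[["elapsed"]]

# (2) read.table()
t.read <- system.time(
  d2 <- read.table(tmp, col.names = c("id", "grp"))
)[["elapsed"]]

cat("scan + conversion:", t.scan, " read.table:", t.read, "\n")
```

 On a file this small the elapsed times will be close to zero; the point is only the shape of the comparison, repeated a couple of times per method.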

 However, I don't know of a convenient way to disable the coercion of
 characters to factors when converting a list, as created by scan(), to
 a dataframe. read.table() has the as.is=T option, and importData()
 has the stringsAsFactors=F option. The latter is, perhaps,
 somewhat better, since it gives character variables the class "AsIs",
 which prevents subsequent conversion to factors, at least in some
 types of operations.
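
 One possible workaround, offered as a sketch rather than a tested S-PLUS recipe, is to wrap each character component in I() before building the data frame, so that it carries the "AsIs" class. The code below is R syntax; the list stands in for scan() output and the helper name protect is made up.

```r
# Stand-in for a list returned by scan(); names and values are hypothetical.
lst <- list(id = c(1, 2, 3), name = c("x", "y", "x"))

# Hypothetical helper: mark character vectors as "AsIs", leave the rest alone.
protect <- function(x) if (is.character(x)) I(x) else x

dat <- data.frame(lapply(lst, protect))

# The character column keeps class "AsIs" instead of becoming a factor.
print(class(dat$name))
```

 This is the same mechanism the "AsIs" remark above refers to, applied by hand to the scan() result.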

 My bottom line: read.table() is the best choice.

 -Don
 --
 --------------------------------------
 Don MacQueen
 Environmental Protection Department
 Lawrence Livermore National Laboratory
 Livermore, CA, USA
 --------------------------------------

--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------
