s-news

compare scan(), read.table(), and importData()

To: s-news@wubios.wustl.edu
Subject: compare scan(), read.table(), and importData()
From: Don MacQueen <macq@llnl.gov>
Date: Thu, 22 Mar 2001 17:42:41 -0800
I just spent the last hour or so figuring out why read.table() and importData() produced data frames with different numbers of rows from the same ASCII data file.

It turns out that read.table() ignores blank lines in the data file, whereas importData() returns a row of all NAs for each blank line.
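
One way to catch this kind of mismatch is to compare the number of non-blank data lines in the file with the number of rows returned. The following is a minimal sketch in R rather than S-PLUS (importData() is not available there, so only the read.table() side is shown); the temporary file and its contents are made up for illustration.

```r
# Hypothetical four-line file: a header, two data lines, and one blank line.
tmp <- tempfile()
writeLines(c("a b", "1 2", "", "3 4"), tmp)

lines <- readLines(tmp)
n_data_lines <- length(lines) - 1                  # all lines after the header
n_blank <- sum(grepl("^[[:space:]]*$", lines))     # lines that are empty or whitespace only

df <- read.table(tmp, header = TRUE)

# In R, read.table() skips blank lines by default, so the row count
# equals the number of data lines minus the blank ones.
cat("data lines:", n_data_lines, " blank:", n_blank, " rows:", nrow(df), "\n")
```

If the row count differs from the line count by exactly the number of blank lines, blank-line handling is the likely cause.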

This is not documented in the help() pages for either of these. In fact, help(read.table) says explicitly,
  "creates a data frame with the same number of rows
   as there are lines in the file"
and therefore does not do what it says it does. The documentation is wrong.

A performance comparison of
  (1) scan() using the n= argument followed by conversion to a data frame
  (2) read.table()
  (3) importData()
on a file with 116577 lines and 16 fields had the following elapsed times in seconds (two repetitions each)
   (1) 20, 23
   (2) 29, 30
   (3) 69, 71
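
For anyone who wants to repeat this kind of comparison, here is a rough sketch in R rather than S-PLUS, using system.time(). It covers only methods (1) and (2), since importData() is S-PLUS specific. The generated file is a small made-up stand-in (1000 lines, 16 numeric fields), not the file timed above, so the numbers it prints are not comparable with mine.

```r
# Build a small all-numeric test file (hypothetical stand-in data).
set.seed(1)
n_rows <- 1000
n_fields <- 16
tmp <- tempfile()
m <- matrix(round(runif(n_rows * n_fields), 3), ncol = n_fields)
write.table(m, tmp, row.names = FALSE, col.names = FALSE)

# (1) scan() with the n= argument, then conversion to a data frame.
t_scan <- system.time({
  lst <- scan(tmp, what = rep(list(0), n_fields),
              n = n_rows * n_fields, quiet = TRUE)
  names(lst) <- paste("V", seq_len(n_fields), sep = "")
  df_scan <- as.data.frame(lst)
})

# (2) read.table() directly.
t_rt <- system.time(df_rt <- read.table(tmp))

# Elapsed times in seconds for the two methods.
print(c(scan = t_scan[["elapsed"]], read.table = t_rt[["elapsed"]]))
```

On a file this small the difference may be lost in timing noise; a file closer in size to the original would be needed to see a meaningful gap.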

So
- scan + conversion is fastest, but only by about a quarter compared with read.table (roughly 21 vs. 30 seconds)
- importData is much slower, taking more than twice as long as read.table

However, I don't know of a convenient way to disable the coercion of character vectors to factors when converting a list, as created by scan(), to a data frame. read.table() has the as.is=T option, and importData() has the stringsAsFactors=F option. The latter is perhaps somewhat better, since it gives character variables the class "AsIs", which prevents subsequent conversion to factors, at least in some types of operations.
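
One possible workaround for the scan() route, offered as a sketch rather than a tested S-PLUS recipe, is to wrap each character component of the list in I() before building the data frame, which marks it as "AsIs". The example below is written for R; the file contents and the component names (site, value) are hypothetical.

```r
# Hypothetical two-field file: one character field, one numeric field.
tmp <- tempfile()
writeLines(c("alpha 1", "beta 2", "alpha 3"), tmp)

# scan() into a named list, one component per field.
lst <- scan(tmp, what = list(site = "", value = 0), quiet = TRUE)

# Wrap the character components in I() so they are kept as-is
# instead of being converted to factors.
is_chr <- sapply(lst, is.character)
lst[is_chr] <- lapply(lst[is_chr], I)

df <- data.frame(lst)
print(sapply(df, function(col) class(col)[1]))
```

This is less convenient than a single as.is=T argument, but it keeps the speed advantage of scan() while leaving the character columns unconverted.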

My bottom line: read.table() is the best choice.

-Don
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------
