I just spent the last hour or so figuring out why read.table() and
importData() produced data frames with different numbers of rows from
the same ASCII data file.
It turns out that read.table() ignores blank lines in the data file;
importData() gives a row of all NAs.
This is not documented in the help() pages for either of these. In
fact, help(read.table) says explicitly,
"creates a data frame with the same number of rows
as there are lines in the file"
so the documentation is wrong whenever the file contains blank lines.
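
A minimal sketch of how the row/line mismatch can be checked, written in
R syntax with a made-up four-line file (the file contents and variable
names are assumptions for illustration, and this has not been verified
on S-PLUS). In R, read.table() has a blank.lines.skip argument that
controls this behavior; whether the S-PLUS version has an equivalent is
not something established here.

```r
# Sketch in R syntax; the file contents are invented for illustration.
tmp <- tempfile()
writeLines(c("a b", "1 2", "", "3 4"), tmp)    # third line is blank

lines   <- readLines(tmp)
n_lines <- length(lines)                       # physical lines, header included

dat    <- read.table(tmp, header = TRUE)       # default: blank line is skipped
n_rows <- nrow(dat)                            # data rows actually returned

# Count lines that are empty or contain only whitespace
n_blank <- sum(grepl("^[[:space:]]*$", lines))
```

If n_rows is smaller than n_lines minus the header line, n_blank shows
how many blank lines account for the difference.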
I also compared the performance of
(1) scan() using the n= argument followed by conversion to a data frame
(2) read.table()
(3) importData()
on a file with 116577 lines and 16 fields. Elapsed times in seconds
(two repetitions each) were:
(1) 20, 23
(2) 29, 30
(3) 69, 71
So
- scan + conversion is fastest, but not by much compared with
read.table()
- importData() is much slower
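
A rough sketch of how such a timing comparison might be set up, in R
syntax. The small generated file, the column names, and the use of
system.time() are assumptions for illustration; this is not the code
that produced the numbers above, and importData() is left out because
it is not available in R.

```r
# Sketch in R syntax; a small made-up file with two numeric fields.
n_rec <- 1000
tmp <- tempfile()
write.table(data.frame(x = 1:n_rec, y = rnorm(n_rec)), tmp,
            row.names = FALSE, col.names = FALSE)

# (1) scan() with n= (total number of values), then convert the list
t_scan <- system.time({
  lst <- scan(tmp, what = list(x = 0, y = 0), n = 2 * n_rec, quiet = TRUE)
  d1  <- as.data.frame(lst)
})["elapsed"]

# (2) read.table()
t_rt <- system.time(
  d2 <- read.table(tmp, col.names = c("x", "y"))
)["elapsed"]
```

On a file this small both timings will be close to zero; the difference
only becomes visible on files of roughly the size mentioned above.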
However, I don't know of a convenient way to disable the coercion of
characters to factors when converting a list, as created by scan(), to
a data frame. read.table() has the as.is=T option, and importData()
has the stringsAsFactors=F option. The latter is perhaps somewhat
better, since it gives character variables the class "AsIs", which
prevents subsequent conversion to factors, at least in some types of
operations.
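
For the scan() case, here is a minimal sketch of two possible
workarounds, assuming R-style data.frame() semantics (not verified on
S-PLUS). The list below is a hypothetical stand-in for what scan()
would return.

```r
# Sketch in R syntax; lst stands in for a list returned by scan().
lst <- list(id = c(1, 2, 3), site = c("north", "south", "north"))

# Option A: ask data.frame() not to convert character vectors
dfA <- data.frame(lst, stringsAsFactors = FALSE)

# Option B: protect the character component with I() before converting,
# which gives the column the class "AsIs"
lst2 <- lst
lst2$site <- I(lst2$site)
dfB <- data.frame(lst2)
```

Option B mirrors the "AsIs" behavior described above for importData(),
at the cost of wrapping each character component by hand.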
My bottom line: read.table() is the best choice.
-Don
--
--------------------------------------
Don MacQueen
Environmental Protection Department
Lawrence Livermore National Laboratory
Livermore, CA, USA
--------------------------------------