I just want to add my two cents on the question of whether or not to use
factors in a data frame. The notion of "factor" is built into the
statistical-modeling features of S in a way that can be extremely useful and
convenient. However, S is also well-suited to more general data-management
applications, and here I feel your pain: I have come to grief myself
whenever I have lost track of the dual numeric/character nature of factor
variables.
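To make the numeric/character duality concrete, here is a minimal sketch in R syntax. The vector `f` and its values are made up for illustration and are not from our application:

```r
# Hypothetical factor whose labels happen to look like numbers.
f <- factor(c("10", "20", "10"))

# Comparisons use the character labels...
f == "10"                     # TRUE FALSE TRUE

# ...but a direct numeric conversion returns the internal level codes.
as.numeric(f)                 # 1 2 1

# Converting via character recovers the values the labels spell out.
as.numeric(as.character(f))   # 10 20 10
```

The middle conversion is the usual trap: it runs without any warning and silently yields codes rather than values.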
Another aspect to think about may be execution speed. We have recently done
some benchmarking on factor vs. character columns for certain common
operations in a large-scale drug-safety application (a 2-million-record
database). For operations such as:
subset.frame <- my.frame[my.frame$var == theValue,]
or
subset.frame <- my.frame[is.element(my.frame$var, theSubset),]
we found that the character data type was advantageous for "var".
However, for:
big.frame <- merge(my.first.frame, my.second.frame, by="var")
or
my.table <- table(my.frame$var)
we found that the factor data type conferred a speed advantage. Your
mileage, of course, may vary, depending on your application, and on the
dimensions of the data frame involved. All I can say is that if speed is
important to you, it may be worth experimenting with factor vs. character in
particular fields to see which one makes the application run faster.
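For anyone who wants to run that experiment, the following is a rough sketch in R syntax of how it might be set up. It is not our benchmark code: the data are synthetic, and the sizes, the `elapsed` helper, and names such as `chr.frame`, `fac.frame`, and the lookup frames are assumptions chosen for illustration. It makes no claim about which type will win on your system.

```r
# Illustrative data only: sizes, names, and values are assumptions.
set.seed(1)
n <- 200000
ids <- paste("drug", 1:500, sep = "")

# Same data twice: once with a character key, once with a factor key.
chr.frame <- data.frame(var = sample(ids, n, replace = TRUE),
                        x = rnorm(n),
                        stringsAsFactors = FALSE)
fac.frame <- chr.frame
fac.frame$var <- factor(fac.frame$var)

# Small lookup tables to merge against, one per key type.
chr.lookup <- data.frame(var = ids, grp = rep(1:5, 100),
                         stringsAsFactors = FALSE)
fac.lookup <- chr.lookup
fac.lookup$var <- factor(fac.lookup$var)

theValue <- "drug42"
theSubset <- c("drug1", "drug42", "drug300")

# Hypothetical helper: elapsed wall-clock seconds for one expression.
elapsed <- function(expr) system.time(expr)[["elapsed"]]

timings <- c(
  equal.chr = elapsed(chr.frame[chr.frame$var == theValue, ]),
  equal.fac = elapsed(fac.frame[fac.frame$var == theValue, ]),
  elem.chr  = elapsed(chr.frame[is.element(chr.frame$var, theSubset), ]),
  elem.fac  = elapsed(fac.frame[is.element(fac.frame$var, theSubset), ]),
  merge.chr = elapsed(merge(chr.frame, chr.lookup, by = "var")),
  merge.fac = elapsed(merge(fac.frame, fac.lookup, by = "var")),
  table.chr = elapsed(table(chr.frame$var)),
  table.fac = elapsed(table(fac.frame$var))
)
print(timings)
```

Single timings are noisy, so it is probably worth repeating each operation several times, and on data of realistic size, before drawing conclusions.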
Alan Hochberg
VP, Research
ProSanos Corporation
225 Market St. Ste. 502,
Harrisburg, PA 17101
Tel 717-635-2124 * Fax 717-635-2575