Brian Ripley wrote on Sept 6:
>And of course, row names in data frames have long been documented to be
>unique, so dup.names.ok in data.frame() must be a design error.
>
>I believe the root problem is that instead of creating a new class for the
>purpose, the semantics of the existing class "data.frame" have been
>changed, apparently for Insightful internal convenience.
>
>John Chambers has said that it is time for some S standardization effort,
>and I back that. Concepts like data frames should be frozen for all time.

Since starting at Insightful, I have found people here to be very
conservative about changing code in ways that may be backward
incompatible. However, there are cases where change is necessary
for the language to evolve and stay relevant. The most substantial
change has been the move from SV3 to SV4.

I'll respond to Brian's comment about dup.row.names (not dup.names.ok)
in data frames. S was originally developed by researchers at Bell
Labs, who tended to work with relatively small data sets and to
refer to specific rows by name. Many people now work with larger data
sets, where requiring unique row names causes problems.
Checking for uniqueness in row names when creating or subscripting a
data frame requires O(n log(n)) operations (where n is the number of
rows). This becomes very slow for large data frames. Modifying row
names to force uniqueness is even slower.
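
As a rough illustration of where that cost comes from, here is a minimal
sketch in Python (not S, and not a description of how data.frame() is
actually implemented). The function names and the "name.k" suffix scheme
are hypothetical; the point is only that detection needs a sort, and that
repair does further work on top of detection.

```python
def has_duplicates(names):
    """Sort-based uniqueness check: O(n log n) comparisons for n names."""
    ordered = sorted(names)
    # After sorting, equal names are adjacent, so one linear pass finds them.
    return any(a == b for a, b in zip(ordered, ordered[1:]))


def make_unique(names):
    """Force uniqueness by appending a counter to repeated names.

    The suffix scheme ("name.1", "name.2", ...) is an assumption made for
    this sketch only.
    """
    used = set()      # every name already emitted
    counts = {}       # last suffix tried for each original name
    out = []
    for name in names:
        candidate = name
        k = counts.get(name, 0)
        while candidate in used:
            # Build a new string and retry until it is not taken.
            k += 1
            candidate = f"{name}.{k}"
        counts[name] = k
        used.add(candidate)
        out.append(candidate)
    return out
```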

Similarly, statistical techniques like the bootstrap have gained
popularity since S was invented. Here one expects duplicate rows
in a bootstrapped data set, and modifying row names to force uniqueness
slows bootstrapping down dramatically.
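
To see why duplicates are the normal case here, consider a small sketch,
again in Python rather than S. The helper name bootstrap_rows and the row
labels are made up for the example. A resample of n rows drawn with
replacement is expected to contain only about 63% (1 - 1/e) of the
original rows, so the remaining slots necessarily repeat a row name.

```python
import random


def bootstrap_rows(row_names, rng):
    """Draw len(row_names) names with replacement: one bootstrap resample."""
    n = len(row_names)
    return [row_names[rng.randrange(n)] for _ in range(n)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
original = [f"row{i}" for i in range(1000)]
resample = bootstrap_rows(original, rng)

# Number of distinct original rows that made it into the resample.
distinct = len(set(resample))
print(f"{distinct} distinct row names among {len(resample)} resampled rows")
```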

Hence the addition of an optional argument to data.frame(), allowing
a user to indicate that row names need not be unique. We made the
change backward compatible: the default still forces unique
row names. If we had the luxury of designing from scratch, for future
users and without worrying about backward compatibility, we would have
done it differently, and in fact made row names completely optional.

The reason for enriching the existing class, rather than creating
a new class, is to keep the language simple, both in itself and for
its users. People are familiar with the "data.frame" class and how to
create such objects. Creating a new class would force users to learn
a whole new set of objects, and commands to create and manipulate them,
and force programmers everywhere to write and maintain methods for the
new class. And the headaches would multiply once you start creating new
versions of all the classes that inherit from data frames.

Terry Therneau touched on one point that is relevant here: the
flexibility of old-style classes like "data.frame", where
the class can evolve over time:
> The new style classes have the significant restriction that absolutely
>no "extra" information may be attached to such an object, and have it remain
>of the original class. This may be good computer science, but the notion that
>every necessary attribute of a class will be visualized at the class's
>conception is naive in practicality. After 10+ years of working with the
>survival code, I still make additions to the basic objects.