In machine learning, datasets are often split into three partitions.
The first is used for fitting models, and is often referred to as the
"training" set.
The second is used for testing the performance of fitted models while
they are being developed; it's common practice to repeat this fit-test
cycle many times. This set is sometimes called the "validation" set.
The third is used for testing the final selected version of the fitted
model. It's sometimes called the "test" set. It should be used only
once (or, at the very least, its results should not feed back into the
model construction or fitting process).
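
For what it's worth, here is a minimal sketch in Python of such a
three-way split. The function name, the 60/20/20 fractions and the fixed
seed are my own illustrative choices, not a standard; it only partitions
row indices and says nothing about how the model is fitted.

```python
import random

def three_way_split(n_rows, frac_train=0.6, frac_valid=0.2, seed=0):
    """Shuffle row indices 0..n_rows-1 and cut them into three disjoint sets.

    The fractions and seed are illustrative assumptions; whatever is left
    after the training and validation cuts becomes the test set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable

    n_train = round(n_rows * frac_train)
    n_valid = round(n_rows * frac_valid)

    train = indices[:n_train]                    # fit models here
    valid = indices[n_train:n_train + n_valid]   # compare candidate models here
    test = indices[n_train + n_valid:]           # touch once, at the very end
    return train, valid, test

train_idx, valid_idx, test_idx = three_way_split(1000)
```

Purely random shuffling is the simplest choice; for data collected over
many years it may make more sense to hold out whole seasons or years,
but that is a separate decision from the naming question.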
Lambert.Winnie wrote:
*This is NOT an S-PLUS-specific question,* just letting you know so you
don’t have to read any further if not interested in anything non-S-PLUS.
There is a bit of a controversy in my office concerning specific
statistical terminology. I developed a set of logistic regression
equations that calculate the probability of lightning occurrence for the
day using a 15-year data set of several observation types. I stratified
the data into two sets: one was used to create the equations, and the
other was used to test the equations’ performance. In my field, these
are commonly called the ‘dependent’ and ‘independent’ data sets,
respectively.
One of us insists that the common terminology be used, the other says
the data sets should be called ‘development’ and ‘testing’ since that is
what they are used for, and since the terms ‘dependent’ and
‘independent’ refer to other issues in statistics.
Any statistics expert willing to jump into the fray is welcome. There is
no money riding on this, only pride.