I would like to classify new observations (testing set) on the basis
of previous observations (training set). Typical datasets include
between 3 and 5 classes with 80 observations per class. I have p variables
(features) that could be used for that purpose (p between 2 and 16).
The problem is that the variables have larger values for the testing set
than for the training set. However, it is reasonable to think that one could
use the value of each variable relative to the sum of all variables to
classify each observation. Thus, I thought of normalizing the datasets
as follows (each row is a different observation):
training[,vars]<-sweep(training[,vars],1,rowSums(training[,vars]),FUN="/")
testing[,vars]<-sweep(testing[,vars],1,rowSums(testing[,vars]),FUN="/")
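To make the effect of this normalization concrete, here is a minimal sketch on simulated data, not on my datasets (the matrix X, its dimensions and the variable names are made up for illustration). After the division every row sums to 1, so the normalized variables are linearly dependent and their covariance matrix should have rank p - 1 instead of p.

```r
# Hypothetical data: 80 observations, p = 4 strictly positive variables
set.seed(1)
p <- 4
X <- matrix(runif(80 * p, min = 1, max = 10), nrow = 80, ncol = p)

# Same row normalization as above: divide each row by its sum
P <- sweep(X, 1, rowSums(X), FUN = "/")

# Every row of P now sums to 1 (up to rounding error)
row.sum.range <- range(rowSums(P))

# The sum-to-one constraint costs the covariance matrix one rank
rank.raw  <- qr(cov(X))$rank   # rank of the covariance before normalization
rank.norm <- qr(cov(P))$rank   # rank of the covariance after normalization
```

Comparing rank.raw with rank.norm shows the rank dropping by one, which is the exact collinearity that the discriminant functions complain about below.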
To classify the new observations, I have tried to use linear discriminant
analysis but I have encountered a problem when using cross-validation
(when checking performance of the classifier on the training set):
lda(epoch~.,CV=T,data=training)
I get the following warning and error messages:
Warning messages:
Warning in lda.default(x, grouping, CV = ..1): variables are collinear
Warning messages:
Warning in data[1:ll] <- old: Replacement length not a multiple of number
of elements to replace
Error in x - matrix(dm[i, ], n, p, byrow = T): Dimension attributes do not
match
Obviously, the first warning is caused by the normalization, which
makes the variables collinear (is this a problem for discriminant analysis?).
The origin of the other messages is not clear to me, and I would appreciate
any help [S+2000ProR3].
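One possible way around the collinearity, sketched here on simulated data and assuming the lda() function from the MASS library: since the normalized variables sum to 1, any one of them is determined by the others, so dropping one column removes the exact linear dependence without discarding information. The class profiles, the helper make.class() and the choice of which column to drop are all made up for illustration, and the sketch does not establish whether the error above also disappears in S+2000.

```r
library(MASS)  # provides lda()

set.seed(42)
n <- 80  # observations per class, as in the datasets described above

# Hypothetical helper: n observations whose j-th variable is
# uniform on [mu[j], mu[j] + 1], so all values are strictly positive
make.class <- function(mu) {
  lo <- rep(mu, each = n)
  matrix(runif(n * length(mu), min = lo, max = lo + 1),
         nrow = n, ncol = length(mu))
}

# Three simulated classes with different mean profiles
X <- rbind(make.class(c(1, 2, 3, 4)),
           make.class(c(4, 3, 2, 1)),
           make.class(c(2, 4, 1, 3)))
epoch <- factor(rep(1:3, each = n))

# Row normalization, then drop the last (redundant) column
P <- sweep(X, 1, rowSums(X), FUN = "/")
reduced <- data.frame(epoch = epoch, P[, -ncol(P)])

# Leave-one-out cross-validation on the reduced set of variables
cv <- lda(epoch ~ ., data = reduced, CV = TRUE)
confusion <- table(observed = reduced$epoch, predicted = cv$class)
```

The table in confusion then gives the cross-validated classification of the simulated training set; which column is dropped should not matter for the classification, since each column is a linear function of the others.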
Also, I would appreciate any suggestions about alternative approaches.
Thank you,
Gabriel Baud-Bovy
Additional remarks:
1) I don't get any error message with lda(CV=T) if I don't normalize.
2) I get only a warning message about collinearity from lda() if I don't do
crossvalidation:
fit<-lda(epoch~.,data=training)
predict(fit)
3) I get the following warning messages when using discrim() (only if I
normalize the datasets):
fit <- discrim(factor(epoch) ~ ., family = Classical("homo"), data = training)
Warning messages:
Warning in discrim(factor(epoch) ~ ., family = Classical("homo"), data =
training[, c("epoch",: Rank deficiency in group 1
Warning messages:
Warning in discrim(factor(epoch) ~ ., family = Classical("homo"), data =
training[, c("epoch",: Rank deficiency in group 2
Warning messages:
Warning in discrim(factor(epoch) ~ ., family = Classical("homo"), data =
training[, c("epoch",: Rank deficiency in group 3
Warning messages:
Warning in discrim(factor(epoch) ~ ., family = Classical("homo"), data =
training[, c("epoch",: Rank deficiency in pooled covariance
but I don't get any additional message when using either predict(fit) or
crossvalidate(fit).
4) In a previous post to this list, Prof. Ripley pointed out
a problem with some of my datasets. In particular, if I understood
correctly, one should not use cross-validation when only one observation
in one of the classes has a non-zero value, because the class distribution
is then degenerate under cross-validation and this observation has density 0.
I made sure that this is not the case this time. Note that this
problem was difficult for me to spot because the error messages from
the functions discrim() and crossvalidate() depended on the platform
used. Besides, I was not getting any error messages when using the
option recompute=T.
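The kind of check I mean can be sketched as follows, reading the condition as applying per class and per variable (that reading is my assumption). The function count.nonzero() and the toy data frame are made up for illustration and are not part of any library.

```r
# Count, for every class and variable, how many observations are non-zero.
# Returns a matrix with one row per class and one column per variable.
count.nonzero <- function(data, vars, group) {
  sapply(vars, function(v) tapply(data[[v]] != 0, data[[group]], sum))
}

# Hypothetical toy data: variable v2 has a single non-zero value in class "b"
toy <- data.frame(epoch = factor(rep(c("a", "b"), each = 4)),
                  v1 = c(1, 2, 3, 4, 2, 3, 1, 5),
                  v2 = c(1, 1, 2, 3, 0, 0, 7, 0))

counts <- count.nonzero(toy, c("v1", "v2"), "epoch")

# TRUE entries flag class/variable pairs with exactly one non-zero observation
problem <- counts == 1
```

Any TRUE entry in problem marks a class and variable for which leaving out the single non-zero observation would make that variable constant within the class.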
Note to Insightful: It would be extremely helpful if these sorts of
fine points were discussed in the help files or in the manuals.