Dear all,
An established protocol for covariate analyses in population PK-PD modelling
involves step.gam() as a pre-selection tool to limit the number of covariates
that need to be included and tested in the actual PK-PD model. This tends to be
done iteratively: step.gam() is used to test a large number of covariates
against the posthoc estimates of the PK-PD model, those that are returned in
the final step.gam model are included one by one in the PK-PD model, the
covariate that results in the most significant improvement of the PK-PD model
fit is retained, new posthocs are estimated and the procedure is repeated until
no further covariates are returned by step.gam().
From a previous posting by Professor Ripley I understand that step.gam() tests
tAIC <- deviance(tfit) + 2 * (n - tfit$df.resid) * scale
i.e. deviance + scale x estimated no of parameters (also referred to as
'Hastie's AIC').
It follows that the penalty for adding an additional covariate to the model (or
leaving one in) is adjusted by a scaling coefficient. According to the
step.gam() help file, the default choice for this scale argument is the "scaled
Chi-squared statistic for the initial model", however it is added that "if
forward selection is to be performed, this is not necessarily a sound choice".
So what, then, is a sound choice when both forward and backward selection are
to be performed?
In my experience, the default scaling parameter tends to decrease considerably
with each iterative step in the protocol above, relaxing the penalty for
additonal parameters and resulting in new covariates allowed into the final
model. Hence it seems that by default step.gam() tries very hard to include at
least some covariates from the available pool. Visual inspection of these
covariates against the random effects of the PK-PD model usually shows very
little structure, and in the end they are usually discarded.
Is there a way to set the scale argument to step.gam() more judiciously?
According to protocol, we have to test all covariates, so if we skip the
step.gam() part we would have to iteratively include all covariates into the
PK-PD model (involving NONMEM runs of many hours each for often 50+
covariates). Apart from the waste of time, I also have statistical reservations
to such a 'model-mining' procedure, because with so many covariates it seems to
me that one is bound to find 'significant' ones by pure chance alone (given an
inclusion criterion of p<.95).
All advice is greatly appreciated.
Thanks,
Willem de Winter
w.dewinter@lapp.nl
|