Critical assessment of QSAR models of environmental toxicity against Tetrahymena
pyriformis: Focusing on applicability domain and overfitting by variable selection
Tetko, I., Sushko, I., Pandey, A., Zhu, H., Tropsha, A., Papa, E., Öberg, T., Todeschini, R., Fourches, D., Varnek, A.
Journal of Chemical Information and Modeling 48, 1733-1746 (2008).
Abstract
The estimation of the accuracy of predictions is a critical problem in QSAR
modeling. The “distance to model” can be defined as a metric that quantifies
the similarity between the training set molecules and the test set compound for
the given property in the context of a specific model. It could be expressed in
many different ways, e.g., using Tanimoto coefficient, leverage, correlation in
space of models, etc. In this paper we have used mixtures of Gaussian
distributions as well as statistical tests to evaluate six types of distances to
models with respect to their ability to discriminate compounds with small and
large prediction errors. The analysis was performed for twelve QSAR models of
aqueous toxicity against T. pyriformis obtained with different machine-learning
methods and various types of descriptors. The distances to model based on
standard deviation of predicted toxicity calculated from the ensemble of models
afforded the best results. This distance also successfully discriminated
molecules with small and large prediction errors for a mechanism-based model
developed using log P and the Maximum Acceptor Superdelocalizability descriptors.
Thus, the distance to model metric could also be used to augment mechanistic
QSAR models by estimating their prediction errors. Moreover, the accuracy of
prediction is mainly determined by the training set data distribution in the
chemistry and activity spaces but not by QSAR approaches used to develop the
models. We have shown that incorrect validation of a model may result in the
wrong estimation of its performance and suggested how this problem could be
circumvented. The toxicity of 3182 and 48774 molecules from the EPA High
Production Volume (HPV) Challenge Program and EINECS (European chemical
Substances Information System), respectively, was predicted, and the accuracy of
prediction was estimated. The developed models are available online at
http://www.qspr.org.
DOI: 10.1021/ci800151m
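
The ensemble-based distance to model described in the abstract (the standard deviation of the toxicity predicted by an ensemble of models) can be illustrated with a minimal sketch. This is not the authors' implementation. The bagged linear regressions, the synthetic two-descriptor data, and the function names `bagged_predictions` and `std_distance_to_model` are all assumptions made for illustration.

```python
import numpy as np


def bagged_predictions(X_train, y_train, X_query, n_models=50, seed=0):
    """Fit linear models on bootstrap resamples and predict the query compounds.

    Returns an array of shape (n_models, n_query). The linear learner is an
    illustrative stand-in for whatever method builds the ensemble.
    """
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    # Append a constant column so each model has an intercept.
    A_train = np.column_stack([X_train, np.ones(n)])
    A_query = np.column_stack([X_query, np.ones(X_query.shape[0])])
    preds = np.empty((n_models, X_query.shape[0]))
    for m in range(n_models):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        coef, *_ = np.linalg.lstsq(A_train[idx], y_train[idx], rcond=None)
        preds[m] = A_query @ coef
    return preds


def std_distance_to_model(preds):
    """Distance to model: per-compound standard deviation across the ensemble."""
    return preds.std(axis=0, ddof=1)


# Synthetic example (hypothetical data): two descriptors, linear response plus noise.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(200, 2))
y_train = 1.5 * X_train[:, 0] - 0.7 * X_train[:, 1] + rng.normal(scale=0.3, size=200)

# One query compound near the centre of the training set, one far outside it.
X_query = np.array([[0.0, 0.0], [8.0, 8.0]])

preds = bagged_predictions(X_train, y_train, X_query)
dm = std_distance_to_model(preds)
print("distance to model (near, far):", dm)
```

In this sketch the compound far from the training data receives the larger ensemble standard deviation, which is the property that lets such a distance separate predictions expected to be reliable from those expected to be less reliable.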