Published in the Internet Journal of Chemistry
Vol 4, Article 11, 2001

Prediction of physical properties for PCB congeners from molecular descriptors

Tomas Öberg

T. Öberg Konsult AB
Gamla Brov. 13, SE-371 60 Lyckeby, Sweden

Keywords: Polychlorinated biphenyls, QSPR, principal components regression, PLS modeling.

Abstract

A methodology is described for modeling quantitative structure-property relationships that tolerates significant contamination with "bad data". This approach is used to model and predict the vapor pressures, the water solubilities, the octanol-water partitioning coefficients and the Henry's law coefficients for all 209 congeners of polychlorinated biphenyls (PCB). The model predictions seem to provide a reliable summary and extension of the currently available database on these compounds.

Introduction

The physical properties for organic chemical compounds are important in determining their distribution and fate in the environment1. Examples of such properties are the vapor pressure, the water solubility, the octanol-water partition coefficient and the Henry's law coefficient. Experimental measurements of these properties have become easier with the introduction of new methods, e.g. determination of octanol-water partitioning using gas chromatography, but the number of compounds under consideration still makes it necessary to also use models for estimation.

Polychlorinated biphenyls (PCB) are a group of 209 different congeners that have attracted much attention as environmental pollutants. On May 22, 2001, 127 governments adopted the Stockholm Convention on Persistent Organic Pollutants. PCBs were among the chemicals initially selected for elimination from production and use2. Production of PCB has been banned in the industrialized world for many years, but large quantities still remain in the environment3. It is therefore still of utmost importance to estimate and monitor their fate in the environment and the biological food chains. The physical properties of the PCBs can also be used as input in calculating relationships to biological activity4, or to other physical properties5. Measured physical properties have been reported for 20-60% of the PCB congeners6.

The molecular structure holds the key to predicting the physical properties, and it is easy to recognize the trends within a homologous group such as the PCBs7,8. The literature contains many estimation methods, among the best known of which are the additive group and bond contribution methods9,10,11,12,13,14. These methods are fairly robust and can be applied to a wide variety of organic molecules. A robust and general method will, however, by definition lack some accuracy and precision when considering local phenomena, such as the properties within a specific compound group.

Topological, geometrical and electronic descriptors can give more in-depth descriptions of the molecules and serve as a basis for developing predictive models with an improved accuracy and precision15,16,17. The purpose of this investigation is to develop multivariate calibration models and predict the vapor pressure, the water solubility, the octanol-water partitioning, and the Henry's law constant for all 209 PCB congeners. We will also explore the possibilities of using these models to validate available experimental data.

Experimental

Experimentally determined values for the physical properties were obtained from the PhysProp Database (Syracuse Research Corporation, Syracuse, NY, USA)6. Vapor pressures were reported for 42 congeners, water solubilities for 122 congeners, octanol-water partitioning coefficients for 92 congeners, and Henry's law constants for 91 congeners. The data in the PhysProp Database are collected from a large number of investigators, so we can assume that there is variation both within and between the various laboratories and investigators.

With the purpose of validating the methodology we have also included a set of experimental data with an expected high accuracy and precision, in the form of retention times in gas chromatography using different columns and separation conditions. Retention time data were obtained from the manufacturers' data sheets18,19,20.

The chemical structure of each congener was sketched on a PC using the software HyperChem (HyperCube, Inc., Gainesville, Florida, USA). Each compound was modeled using the force-field routine MM+, an extension by HyperCube of the standard MM2 force field21. The molecular structures were then used as input for the generation of 853 descriptors with the software Dragon (Milano Chemometrics and QSAR Research Group, University of Milano-Bicocca, Milano, Italy), listed in the enclosed text file varfile.txt. Todeschini and Consonni have reviewed these molecular descriptors22.

The multivariate analysis and calibration were carried out with the software Unscrambler (CAMO ASA, Oslo, Norway), Matlab (MathWorks, Inc., Natick, MA, USA) and Progress (Rousseeuw & Leroy, 1987). Principal component analysis (PCA), principal component regression (PCR), PLS-regression (PLSR) and least median of squares regression (LMSR) were used as the modeling methods. Martens and Næs have reviewed PCA, PCR and PLSR23. Rousseeuw and Leroy have reviewed LMSR and other methods for robust regression and outlier detection24.

Results and discussion

Here we use the numbering system for PCB congeners currently assigned by Ballschmiter and the International Union of Pure and Applied Chemistry25.

A total of 447 descriptors with constant values for all congeners were excluded from the data analysis. The raw data hence consist of 406 descriptor variables (varlist.txt) and the seven dependent response variables. The raw data are listed in the enclosed tab-separated text file rawdat.txt (the first row lists the column headings, each following row corresponds to a congener and each column to a variable). All descriptor variables were autoscaled to zero mean and unit variance. The three dependent retention-time variables were also autoscaled, while the four dependent physical property variables were log-transformed prior to data analysis and modeling.
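The preprocessing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names are hypothetical, the logarithm is assumed to be base 10 (consistent with the "log mm Hg" and log P units used later), and missing measurements are assumed to be stored as NaN.

```python
import numpy as np

def autoscale(A):
    """Center each column to zero mean and scale to unit variance (NaNs ignored)."""
    A = np.asarray(A, dtype=float)
    return (A - np.nanmean(A, axis=0)) / np.nanstd(A, axis=0, ddof=1)

def preprocess(X, props, rts):
    """Hypothetical helper: X = descriptors, props = physical properties,
    rts = retention times; rows are congeners, columns are variables."""
    X = np.asarray(X, dtype=float)
    # Descriptors with the same value for every congener carry no information.
    varying = X.max(axis=0) != X.min(axis=0)
    Xs = autoscale(X[:, varying])
    # Assumption: base-10 logarithm; NaN (no measurement) stays NaN.
    log_props = np.log10(np.asarray(props, dtype=float))
    rts_scaled = autoscale(rts)
    return Xs, log_props, rts_scaled, varying
```

The boolean mask `varying` records which descriptor columns were retained, so that the same selection can be applied to congeners without measurements.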

The descriptor data were initially modeled using principal component analysis (PCA). A five-component model explained 65% of the variance in the calibration data and 59% of the variance in ten randomly selected cross-validation segments. The first two score vectors are shown in figure 1. All five score vectors are listed in the enclosed tab separated text file scores.txt.

Figure 1
Scores for the first two principal components (IUPAC numbers shown for each congener).
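As an illustration of how the explained variance in calibration and in cross-validation segments can be computed, the following sketch fits a PCA by singular value decomposition and evaluates it on randomly selected segments. It is not the Unscrambler implementation: the function names are hypothetical, and the cross-validation shown is the simple row-wise variant in which held-out congeners are projected onto loadings estimated from the remaining ones.

```python
import numpy as np

def pca_fit(X, n_comp):
    """Return column means, loadings P (variables x components) and scores T."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_comp].T
    return mean, P, (X - mean) @ P

def explained_variance(X, mean, P):
    """Fraction of the centered sum of squares reproduced by the model."""
    Xc = X - mean
    resid = Xc - Xc @ P @ P.T
    return 1.0 - (resid ** 2).sum() / (Xc ** 2).sum()

def cv_explained_variance(X, n_comp, n_seg=10, seed=0):
    """Explained variance pooled over randomly selected held-out segments."""
    rng = np.random.default_rng(seed)
    segments = np.array_split(rng.permutation(len(X)), n_seg)
    press = total = 0.0
    for seg in segments:
        train = np.setdiff1d(np.arange(len(X)), seg)
        mean, P, _ = pca_fit(X[train], n_comp)
        Xc = X[seg] - mean
        press += ((Xc - Xc @ P @ P.T) ** 2).sum()
        total += (Xc ** 2).sum()
    return 1.0 - press / total
```

With the autoscaled descriptor matrix as `X`, `pca_fit(X, 5)` would give the five score vectors, and the two functions would give figures comparable in kind to the 65% and 59% quoted above.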

The position of the congeners on the score plot relates to the chemical structures. The congener groups line up from left to right with increasing number of chlorine atoms. In a similar manner the vertical distribution reflects the substitution pattern, with non-ortho chlorinated biphenyls ("co-planar") at the bottom and those with tri- and tetra-ortho substitution at the top. The ortho-substitution pattern directly influences the energy barrier of rotation and it is also correlated to the biological activities of these compounds26.

The first five principal components were used as independent variables for LMSR. Such a semi-robust regression model was estimated for each of the dependent physical property variables and subsequently used to identify outliers in the dependent variables. An object was declared an outlier if the standardized residual was larger than 2.5. PCR and PLSR can also be extended to become robust both with regard to independent and dependent variables27,28, but this would not serve any purpose in the present investigation.
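The outlier screening can be sketched as follows. This is a simplified stand-in for the Progress program, not a reproduction of it: the least median of squares fit is approximated by exact fits to random subsets of p observations, and the residuals are standardized with the preliminary robust scale estimate 1.4826 (1 + 5/(n - p)) sqrt(median r²) given by Rousseeuw and Leroy. The function name, the number of trials and the seed are assumptions.

```python
import numpy as np

def lms_fit(T, y, n_trials=3000, seed=0):
    """Approximate LMS regression of y on the score matrix T (with intercept).

    Returns the coefficients (intercept first) and the standardized residuals.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), T])
    p = A.shape[1]
    best_crit, best_beta = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(A[idx], y[idx])  # exact fit through p points
        except np.linalg.LinAlgError:
            continue                                # singular subset, skip it
        crit = np.median((y - A @ beta) ** 2)       # the LMS objective
        if crit < best_crit:
            best_crit, best_beta = crit, beta
    resid = y - A @ best_beta
    scale = 1.4826 * (1.0 + 5.0 / (n - p)) * np.sqrt(best_crit)
    return best_beta, resid / scale

# Usage: flag objects whose standardized residual exceeds 2.5 in absolute value.
# beta, z = lms_fit(scores, log_property)
# outlier = np.abs(z) > 2.5
```

Only congeners with a measured value for the property in question would be passed to the function.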

As a second step, a reweighted PCR could be run by assigning zero weight, or some value on a scale between zero and one, to the outlying objects. Instead we have proceeded with a reweighted PLSR where each outlying object was assigned zero weight, i.e. removed from the computation of the regression model. The PLSR1 procedure was used to obtain models with optimal accuracy and precision. A further step to get parsimonious models was to assign zero weight to descriptor variables with minor influence in the PLS1-regression. These variables were selected on the criterion that their weighted regression coefficients were less than approximately half of the maximum value when all variables were included.
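A minimal sketch of the reweighting and variable selection steps is given below, using a plain NIPALS PLS1 algorithm rather than the Unscrambler routines. It assumes that the descriptor matrix is already autoscaled, so that the regression coefficients can be compared directly, and it uses a hard threshold of one half of the largest absolute coefficient where the text says "approximately". All function and parameter names are hypothetical.

```python
import numpy as np

def pls1(X, y, n_lv):
    """NIPALS PLS1; returns regression coefficients b and intercept b0."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)          # weight vector
        t = Xr @ w                      # score vector
        tt = t @ t
        p = Xr.T @ t / tt               # X loading
        c = (yr @ t) / tt               # y loading
        Xr = Xr - np.outer(t, p)        # deflate X and y
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return b, y_mean - x_mean @ b

def reweighted_pls1(X, y, outlier, n_lv=1, frac=0.5):
    """Zero weight for outlying objects, then for low-influence descriptors."""
    Xin, yin = X[~outlier], y[~outlier]          # outliers removed from the fit
    b_all, _ = pls1(Xin, yin, n_lv)
    keep = np.abs(b_all) >= frac * np.abs(b_all).max()
    b_sel, b0 = pls1(Xin[:, keep], yin, n_lv)    # refit on retained descriptors
    b = np.zeros(X.shape[1])
    b[keep] = b_sel
    return b, b0, keep
```

Predictions for any congener, including those without measurements, then follow as `X @ b + b0`.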

Vapor pressure

Experimental measurements were available for 42 congeners. Outlying objects were identified using the robust PCR procedure described above. A total of 34 objects (objlist1.txt) and 260 descriptor variables (varlist1.txt) were assigned non-zero weight in the subsequent PLSR1-regression. The calibration model was validated using a test set of 12 randomly selected objects (testset1.txt). The number of latent variables to keep in the PLS-model was estimated to be one, yielding a model with a coefficient of determination R2 for the test set of 0.972. The standard error of prediction SEP, estimated from the test set, was 0.21 (log mm Hg). Figure 2 shows predicted versus measured results for all 42 congeners, with the eight outlying objects marked as filled rectangles.

Figure 2
Predicted vs. measured vapor pressure (log mm Hg), 42 PCB congeners.

The antilogarithms of the measured and the predicted vapor pressures at 25 °C (mm Hg) for all congeners, and the accompanying residuals, are listed in the enclosed tab-separated text file vp.txt.

Water solubility

Experimental measurements were available for 122 congeners. Outlying objects were identified using the robust PCR procedure described above. A total of 119 objects (objlist2.txt) and 275 descriptor variables (varlist2.txt) were assigned non-zero weight in the subsequent PLSR1-regression. The calibration model was validated using a test set of 47 randomly selected objects (testset2.txt). The number of latent variables to keep in the PLS-model was estimated to be one, yielding a model with a coefficient of determination R2 for the test set of 0.941. The standard error of prediction SEP, estimated from the test set, was 0.33 (log mg/l). Figure 3 shows predicted versus measured results for all 122 congeners, with the three outlying objects marked as filled rectangles.

Figure 3
Predicted vs. measured water solubility (log mg/l), 122 PCB congeners.

The antilogarithms of the measured and the predicted water solubilities at 25 °C (mg/l) for all congeners, and the accompanying residuals, are listed in the enclosed tab-separated text file water.txt.

Partitioning coefficient octanol-water

Experimental measurements were available for 92 congeners. Outlying objects were identified using the robust PCR procedure described above. A total of 87 objects (objlist3.txt) and 227 descriptor variables (varlist3.txt) were assigned non-zero weight in the subsequent PLSR1-regression. The calibration model was validated using a test set of 34 randomly selected objects (testset3.txt). The number of latent variables to keep in the PLS-model was estimated to be one, yielding a model with a coefficient of determination R2 for the test set of 0.983. The standard error of prediction SEP, estimated from the test set, was 0.15 (log P). Figure 4 shows predicted versus measured results for all 92 congeners, with the five outlying objects marked as filled rectangles.

Figure 4
Predicted vs. measured partitioning coefficient octanol-water (log P), 92 PCB congeners.

The logarithms of the measured and the predicted partitioning coefficients octanol-water (log P) for all congeners, and the accompanying residuals, are listed in the enclosed tab separated text file logp.txt.

Henry's law constant

Experimental measurements were available for 91 congeners. Outlying objects were identified using the robust PCR procedure described above. A total of 79 objects (objlist4.txt) and 145 descriptor variables (varlist4.txt) were assigned non-zero weight in the subsequent PLSR1-regression. The calibration model was validated using a test set of 31 randomly selected objects (testset4.txt). The number of latent variables to keep in the PLS-model was estimated to be two, yielding a model with a coefficient of determination R2 for the test set of 0.960. The standard error of prediction SEP, estimated from the test set, was 0.086 (log atm-m3/mol). Figure 5 shows predicted versus measured results for all 91 congeners, with the twelve outlying objects marked as filled rectangles.

Figure 5
Predicted vs. measured Henry's law constant (log atm-m3/mol), 91 PCB congeners.

The antilogarithms of the measured and the predicted Henry's law constants at 25 °C (atm-m3/mol) for all congeners, and the accompanying residuals, are listed in the enclosed tab-separated text file henry.txt.

Retention times in gas chromatography

As an additional validation of this approach to establishing quantitative structure-property relationships (QSPR) for PCB congeners, we have also modeled the retention times obtained from gas chromatographic separation on three different columns: Rtx-CLP, SPB-Octyl and HT8.

Experimental measurements were available for 207-209 congeners. The retention times on the three columns were highly correlated with each other. A total of 209 objects and 201 descriptor variables were assigned non-zero weight in the PLSR2-regression. The calibration model was validated using a test set of 82 randomly selected objects. The number of latent variables to keep in the PLS-model was estimated to be two, yielding a model with coefficients of determination R2, for the test set, between 0.979 and 0.989. The standard error of prediction SEP, estimated from the test set, was 0.086-1.17 (min). Figure 6 shows predicted versus measured results for 207 congeners separated on the HT8 column.

Figure 6
Predicted vs. measured retention times (min), 207 PCB congeners on an HT8 column.

The four physical properties were best described by constitutional and topological descriptors, molecular walk counts, WHIM and GETAWAY descriptors. The retention times correlated with descriptors from all groups.

The results presented above show that it is possible to obtain a good fit and low prediction errors using multivariate calibration models for physical properties, and for retention times on gas chromatography columns, based solely on computationally derived descriptors. The experimental data with the smallest expected experimental errors, i.e. the retention times, were also the easiest to model.

Deviation between measured values and model predictions can be due either to model error or to experimental error. The "experimental error" results from both intra- and inter-laboratory variation, and the latter is especially important since different laboratories have often used different methodologies. We feel that there are rather strong indications that the experimental error is the limiting factor for these structure-property modeling efforts, since the model fit improves both with a robust approach and with more reliable data. Others have reported a similar experience with some of the group contribution methods for estimation of the partitioning coefficient octanol-water29.

The deviations between experimental measurements and model predictions are particularly pronounced for two objects with regard to the Henry's law constant. PCB #77 and #172 have predicted values of 1.0E-4 and 1.8E-5 atm-m3/mol. The experimentally determined values reported in the PhysProp database are 9.4E-6 and 1.3E-6 atm-m3/mol, respectively. We therefore checked the original papers, where the reported data for PCB #77 and #172 are actually an order of magnitude higher, 9.4E-5 and 1.3E-5 atm-m3/mol30,31. The large deviations are evidently due to errors in the transfer between the published data and the PhysProp database.
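The size of these two deviations is easy to verify on the log scale used for modeling. The short calculation below uses only the values quoted in this paragraph and the test-set SEP of 0.086 reported for the Henry's law model; base-10 logarithms are assumed, and the variable names are illustrative.

```python
import math

SEP_HENRY = 0.086  # test-set SEP for the Henry's law model (log atm-m3/mol)

# congener: (predicted, PhysProp database value, value in the original paper)
cases = {
    "PCB 77": (1.0e-4, 9.4e-6, 9.4e-5),
    "PCB 172": (1.8e-5, 1.3e-6, 1.3e-5),
}

residuals = {}
for name, (pred, database, paper) in cases.items():
    r_db = math.log10(pred / database)    # residual against the database value
    r_paper = math.log10(pred / paper)    # residual against the original paper
    residuals[name] = (r_db, r_paper)
    print(f"{name}: database {r_db:.2f} ({r_db / SEP_HENRY:.1f} x SEP), "
          f"original paper {r_paper:.2f} ({r_paper / SEP_HENRY:.1f} x SEP)")
```

Against the database values both residuals exceed one log unit, more than ten times the SEP, whereas against the values in the original papers they fall below 0.2 log units.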

Validation is a general problem when using data compiled from many different sources. The usual approach to this problem is to carefully re-evaluate all original data and investigations. However, in many cases this can prove to be difficult, or at least very time consuming. Furthermore, experimental errors will often remain after this process. Another way of dealing with the problem is to use high-breakdown methods for data evaluation, i.e. robust methods for model building that can accept significant contamination with bad data. Least median of squares regression is an example of such a robust method, with a breakdown point of 50%. This method will work if we can expect at least 50% "good data", and this seems a conservative assumption in many practical situations.
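The meaning of a high breakdown point can be illustrated with the simplest special case of least median of squares, the estimation of a single location parameter, where the LMS estimate is the midpoint of the shortest interval containing just over half of the observations. The data below are invented for illustration only, and the function name is hypothetical.

```python
import numpy as np

def lms_location(x):
    """LMS location estimate: midpoint of the shortest half of the sample."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    h = n // 2 + 1                       # number of points in a "half"
    widths = x[h - 1:] - x[:n - h + 1]   # width of every window of h points
    i = int(np.argmin(widths))
    return 0.5 * (x[i] + x[i + h - 1])

# Invented example: 12 consistent values near 5 and 8 gross errors at 50 (40%).
good = 5.0 + np.linspace(-0.1, 0.1, 12)
bad = np.full(8, 50.0)
sample = np.concatenate([good, bad])

mean_estimate = sample.mean()            # pulled far away by the bad values
lms_estimate = lms_location(sample)      # stays with the consistent majority
```

With 40% contamination the arithmetic mean is dragged to about 23, while the LMS estimate remains close to 5. The same principle carries over to the regression case used for outlier detection in this work.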

How reliable, then, are the model predictions compared to the individual experimentally determined results? Each model interpolation is actually based on a substantial number of experiments performed in various laboratories. We are therefore inclined to put more faith in the model interpolations if a reported experimental value shows up as an outlier with a high residual. It will be very interesting to see if repeated measurements on some of the congeners with the largest reported deviations will provide a more definitive answer to this.

Conclusions

We have in this investigation reported estimations of some important basic physical parameters for all 209 congeners of polychlorinated biphenyls. These estimations were made from computationally derived descriptors using a robust approach to multivariate calibration. In a number of cases large deviations were detected from the reported experimentally determined values. Some of these could directly be assigned to typing errors. The most reliable measurements available, retention times from gas chromatography, were also the easiest to predict with accuracy and precision. The model predictions therefore seem to provide a reliable summary and extension of the currently available database on these compounds.

Supplementary materials
