Published in the Internet Journal
of Chemistry
Vol 4, Article 11, 2001
Prediction of physical properties for PCB congeners from
molecular descriptors
Tomas Öberg
T. Öberg Konsult AB
Gamla Brov. 13, SE-371 60 Lyckeby, Sweden
Keywords: Polychlorinated biphenyls, QSPR, principal components
regression, PLS modeling.
Abstract
A methodology is described to model quantitative
structure-property relationships that can accept significant
contamination with "bad data". This approach is used to model and
predict the vapor pressures, the water solubilities, the
octanol-water partitioning coefficients and the Henry's law
coefficients for all 209 congeners of polychlorinated biphenyls
(PCB). The model predictions seem to provide a reliable summary and
extension of the currently available database on these
compounds.
Introduction
The physical properties for organic chemical compounds are
important in determining their distribution and fate in the
environment1. Examples of such properties are the vapor
pressure, the water solubility, the octanol-water partition
coefficient and the Henry's law coefficient. Experimental
measurements of these properties have become easier with the
introduction of new methods, e.g. determination of octanol-water
partitioning using gas chromatography, but the number of compounds
under consideration still makes it necessary to also use models for
estimation.
Polychlorinated biphenyls (PCB) are a group of 209 different
congeners that have attracted much attention as environmental
pollutants. On May 22 2001, 127 governments adopted the Stockholm
Convention on Persistent Organic Pollutants. PCBs were among the
chemicals initially selected for elimination from production and
use2. Production of PCB has been banned in the
industrialized world for many years, but large quantities still
remain in the environment3. It is therefore still of
utmost importance to estimate and monitor their fate in the
environment and the biological food chains. The physical properties
of the PCBs can also be used as input in calculating relationships
to biological activity4, or to other physical properties5.
Measured physical properties
have been reported for 20-60% of the PCB congeners6.
The molecular structure holds the key to predicting the physical
properties, and it is easy to recognize the trends within a
homologous group such as the PCBs7-8. The literature
contains many estimation methods, and well known are the additive
group and bond contribution methods9-14. These methods
are fairly robust and can be applied to a wide variety of organic
molecules. A robust and general method will however by definition
lack some accuracy and precision when considering local phenomena,
such as the properties within a specific compound group.
Topological, geometrical and electronic descriptors can give
more in-depth descriptions of the molecules and serve as a basis
for developing predictive models with an improved accuracy and
precision15-17. The purpose of this investigation is to
develop multivariate calibration models and predict the vapor
pressure, the water solubility, the octanol-water partitioning, and
the Henry's law constant for all 209 PCB congeners. We will also
explore the possibilities of using these models to validate
available experimental data.
Experimental
Experimentally determined values for the physical properties
were obtained from the PhysProp Database (Syracuse Research
Corporation, Syracuse, NY, USA)6. Vapor pressures were
reported for 42 congeners, water solubilities for 122 congeners,
octanol-water partitioning coefficients for 92 congeners, and
Henry's law constants for 91 congeners. The data in the PhysProp
Database are collected from a large number of investigators, so we
can assume that there is variation both within and between the
various laboratories and investigators.
With the purpose of validating the methodology we also have
included a set of experimental data with an expected high accuracy
and precision in the form of retention times in gas chromatography
using different columns and separation conditions. Retention time
data were obtained from the manufacturers' data
sheets18-20.
The chemical structure of each congener was sketched on a PC
using the software HyperChem (HyperCube, Inc., Gainesville,
Florida, USA). Each compound was modeled using the force-field
routine MM+, an extension by HyperCube of the standard MM2 force
field21. The molecular structures were then used as
input for the generation of 853 descriptors with the software
Dragon (Milano Chemometrics and QSAR Research Group, University of
Milano-Bicocca, Milano, Italy), listed in the enclosed text file varfile.txt. Todeschini and Consonni have
reviewed these molecular descriptors22.
The multivariate analysis and calibration were carried out with
the software Unscrambler (CAMO ASA, Oslo, Norway), Matlab
(MathWorks, Inc., Natick, MA, USA) and Progress (Rousseeuw &
Leroy, 1987). Principal component analysis (PCA), principal
component regression (PCR), PLS-regression (PLSR) and least median
of squares regression (LMSR) were used as the modeling methods.
Martens and Næs have reviewed PCA, PCR and PLSR23.
Rousseeuw and Leroy have reviewed LMSR and other methods for robust
regression and outlier detection24.
Results and discussion
Here we use the numbering system for PCB congeners currently
assigned by Ballschmiter and the International Union of Pure and
Applied Chemistry25.
447 descriptors with constant values for all congeners were
excluded from the data analysis. The raw data hence consist of 406
descriptor variables (varlist.txt) and
the seven dependent response variables. The raw data are listed in
the enclosed tab-separated text file
rawdat.txt (the first row lists the column headings, each
following row corresponds to a congener and each column to a
variable). All descriptor variables were autoscaled to zero mean
and unit variance. The three dependent retention-time variables
were also autoscaled, while the four dependent physical property
variables were log-transformed prior to data analysis and
modeling.
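The preprocessing described above can be sketched as follows. This is only an illustration on toy numbers, not the procedure as run in Unscrambler: the function name `preprocess` is hypothetical, the base-10 logarithm is an assumption (suggested by units such as "log mm Hg"), and missing response values are not handled.

```python
import numpy as np

def preprocess(X, y_property):
    """Drop constant descriptor columns, autoscale the rest, log-transform y."""
    X = np.asarray(X, dtype=float)
    keep = X.std(axis=0) > 0              # constant columns carry no information
    Xk = X[:, keep]
    # Autoscale: zero mean and unit variance per column (sample std, ddof=1)
    Xs = (Xk - Xk.mean(axis=0)) / Xk.std(axis=0, ddof=1)
    # Assumption: base-10 logarithm of the physical property
    y_log = np.log10(np.asarray(y_property, dtype=float))
    return Xs, y_log, keep

# Toy data: 4 "congeners", 3 descriptors (the middle one is constant)
X_toy = [[1.0, 7.0, 10.0],
         [2.0, 7.0, 30.0],
         [3.0, 7.0, 20.0],
         [4.0, 7.0, 40.0]]
y_toy = [1e-3, 1e-4, 1e-5, 1e-6]          # hypothetical property values
Xs, y_log, keep = preprocess(X_toy, y_toy)
```

The boolean mask `keep` plays the role of the reduced variable list: it records which of the original descriptor columns survive the removal of constants.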
The descriptor data were initially modeled using principal
component analysis (PCA). A five-component model explained 65% of
the variance in the calibration data and 59% of the variance in ten
randomly selected cross-validation segments. The first two score
vectors are shown in figure 1. All five score vectors are listed in
the enclosed tab separated text file
scores.txt.
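As a rough illustration of how score vectors and explained variance are obtained, the sketch below computes a PCA by singular value decomposition. It uses a synthetic stand-in for the descriptor matrix (the real data are in rawdat.txt), the function name `pca_scores` is hypothetical, and the cross-validated variance mentioned above is not reproduced here.

```python
import numpy as np

def pca_scores(Xs, n_components=5):
    """PCA of an autoscaled matrix via SVD; returns scores and explained variance."""
    Xc = Xs - Xs.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]
    explained = (s ** 2) / np.sum(s ** 2)      # fraction of variance per component
    return scores, explained[:n_components]

rng = np.random.default_rng(0)
# Synthetic stand-in: 209 rows, 20 correlated columns driven by 3 latent factors
latent = rng.normal(size=(209, 3))
X_demo = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(209, 20))
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=1)

scores, explained = pca_scores(X_demo, n_components=5)
print(scores.shape, explained.sum())
```

Plotting the first column of `scores` against the second gives a score plot of the kind shown in figure 1.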

Figure 1
Scores for the first two principal components (IUPAC-numbers shown
for each congener).
The position of the congeners on the score plot relates to the
chemical structures. The congener groups line up from left to
right with increasing number of chlorine atoms. In a similar manner
the vertical distribution reflects the substitution pattern, with
non-ortho chlorinated biphenyls ("co-planar") at the bottom and
those with tri- and tetra-ortho substitution at the top. The
ortho-substitution pattern directly influences the energy barrier
of rotation and it is also correlated to the biological activities
of these compounds26.
The first five principal components were used as independent
variables for LMSR. Such a semi-robust regression model was
estimated for each of the dependent physical property variables and
subsequently used to identify outliers in the dependent variables.
An object was declared an outlier if the absolute value of its
standardized residual was larger than 2.5. PCR and PLSR can also be extended to become robust
both with regard to independent and dependent
variables27-28, but this would not serve any purpose in
the present investigation.
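The outlier screening can be sketched as below. This is not the Progress program itself: it approximates the least median of squares fit by random elemental subsets and standardizes the residuals with the preliminary scale estimate given by Rousseeuw and Leroy. The function name `lms_fit`, the number of trials and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def lms_fit(T, y, n_trials=3000, seed=0):
    """Approximate least median of squares fit by random elemental subsets."""
    T = np.asarray(T, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(y)
    A = np.column_stack([np.ones(n), T])       # intercept + score vectors
    p = A.shape[1]
    best_med, best_beta = np.inf, None
    for _ in range(n_trials):
        idx = rng.choice(n, size=p, replace=False)
        try:
            beta = np.linalg.solve(A[idx], y[idx])   # exact fit through p points
        except np.linalg.LinAlgError:
            continue                                  # singular subset, skip it
        med = np.median((y - A @ beta) ** 2)
        if med < best_med:
            best_med, best_beta = med, beta
    # Preliminary robust scale with the finite-sample correction factor
    s0 = 1.4826 * (1.0 + 5.0 / (n - p)) * np.sqrt(best_med)
    std_resid = (y - A @ best_beta) / s0
    return best_beta, std_resid

# Synthetic demonstration: 60 objects, 2 score vectors, 6 grossly wrong responses
rng = np.random.default_rng(1)
T = rng.normal(size=(60, 2))
y = 1.0 + T @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=60)
y[:6] += 10.0

beta, std_resid = lms_fit(T, y)
outliers = np.abs(std_resid) > 2.5             # the 2.5 cut-off used in the text
```

In the paper's setting `T` would hold the five PCA score vectors and `y` one log-transformed property, restricted to the congeners with a measured value.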
As a second step, a reweighted PCR could be run by assigning
zero weight, or some value on a scale between zero and one, to the
outlying objects. Instead we proceeded with a reweighted PLSR
in which each outlying object was assigned zero weight, i.e. removed
from the computation of the regression model. The PLSR1 procedure
(PLS regression with a single response variable) was used to obtain
models with optimal accuracy and precision. A further step to get
parsimonious models was to assign zero weight to descriptor variables
with minor influence in the PLSR1-regression. These variables were
selected on the criterion that their weighted regression coefficients
were less than approximately half of the maximum value obtained when
all variables were included.
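A minimal sketch of the reweighting step is given below, assuming a textbook NIPALS implementation of PLS1 rather than the Unscrambler routine. The names `pls1`, `reweighted_pls1` and `rel_threshold` are hypothetical, the half-of-maximum rule is one possible reading of the selection criterion, and the data are synthetic.

```python
import numpy as np

def pls1(X, y, n_components):
    """PLS1 regression by NIPALS; returns (intercept, coefficients)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)          # loading weights
        t = E @ w                          # scores
        tt = t @ t
        p = E.T @ t / tt                   # X loadings
        c = (f @ t) / tt                   # y loading
        E = E - np.outer(t, p)             # deflate X
        f = f - c * t                      # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)
    return y_mean - x_mean @ b, b

def reweighted_pls1(X, y, is_outlier, n_components=1, rel_threshold=0.5):
    """Drop flagged objects, then drop descriptors with small coefficients."""
    keep_obj = ~np.asarray(is_outlier, dtype=bool)
    Xk = np.asarray(X, dtype=float)[keep_obj]
    yk = np.asarray(y, dtype=float)[keep_obj]
    _, b_full = pls1(Xk, yk, n_components)
    keep_var = np.abs(b_full) >= rel_threshold * np.abs(b_full).max()
    b0, b = pls1(Xk[:, keep_var], yk, n_components)
    return b0, b, keep_var

# Synthetic demonstration (not the PCB data): 3 informative descriptors out of 8
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)    # autoscale
y = X[:, 0] + X[:, 1] + X[:, 2] + 0.1 * rng.normal(size=1000)
is_outlier = np.zeros(1000, dtype=bool)
is_outlier[:20] = True             # pretend the robust step flagged these
y[:20] += 5.0                      # their responses are badly off
b0, b, keep_var = reweighted_pls1(X, y, is_outlier)
y_hat = b0 + X[:, keep_var] @ b
```

Because the descriptors are autoscaled, the coefficients are directly comparable across variables, which is what makes a relative threshold meaningful.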
Vapor pressure
Experimental measurements were available for 42 congeners.
Outlying objects were identified using the robust PCR procedure
described above. 34 objects (objlist1.txt) and 260 descriptor variables (varlist1.txt) were assigned non-zero weight
in the successive PLSR1-regression. The calibration model was
validated using a test set of 12 randomly selected objects (testset1.txt). The number of latent
variables to keep in the PLS-model was estimated to be one, yielding a
model with a coefficient of determination R² for the
test set of 0.972. The standard error of prediction SEP, estimated
from the test set, was 0.21 (log mm Hg). Figure 2 shows predicted
versus measured results for all 42 congeners, with the eight
outlying objects marked as filled rectangles.
Figure 2
Predicted vs. measured vapor pressure (log mm Hg), 42 PCB
congeners.
The antilogarithms of the measured and the predicted vapor
pressures at 25 °C (mm Hg) for all congeners, and the
accompanying residuals, are listed in the enclosed tab separated
text file vp.txt.
Water solubility
Experimental measurements were available for 122 congeners.
Outlying objects were identified using the robust PCR procedure
described above. 119 objects (objlist2.txt) and 275 descriptor variables (varlist2.txt) were assigned non-zero weight
in the successive PLSR1-regression. The calibration model was
validated using a test set of 47 randomly selected objects (testset2.txt). The number of latent
variables to keep in the PLS-model was estimated to be one, yielding a
model with a coefficient of determination R² for the
test set of 0.941. The standard error of prediction SEP, estimated
from the test set, was 0.33 (log mg/l). Figure 3 shows predicted
versus measured results for all 122 congeners, with the three
outlying objects marked as filled rectangles.

Figure 3
Predicted vs. measured water solubility (log mg/l), 122 PCB
congeners.
The antilogarithms of the measured and the predicted water
solubilities at 25 °C (mg/l) for all congeners, and the
accompanying residuals, are listed in the enclosed tab separated
text file water.txt.
Partitioning coefficient octanol-water
Experimental measurements were available for 92 congeners.
Outlying objects were identified using the robust PCR procedure
described above. 87 objects (objlist3.txt) and 227 descriptor variables (varlist3.txt) were assigned non-zero weight
in the successive PLSR1-regression. The calibration model was
validated using a test set of 34 randomly selected objects (testset3.txt). The number of latent
variables to keep in the PLS-model was estimated to be one, yielding a
model with a coefficient of determination R² for the
test set of 0.983. The standard error of prediction SEP, estimated
from the test set, was 0.15 (log P). Figure 4 shows predicted versus
measured results for all 92 congeners, with the five outlying
objects marked as filled rectangles.

Figure 4
Predicted vs. measured partitioning coefficient octanol-water (log
P), 92 PCB congeners.
The logarithms of the measured and the predicted partitioning
coefficients octanol-water (log P) for all congeners, and the
accompanying residuals, are listed in the enclosed tab separated
text file logp.txt.
Henry's law constant
Experimental measurements were available for 91 congeners.
Outlying objects were identified using the robust PCR procedure
described above. 79 objects (objlist4.txt) and 145 descriptor variables (varlist4.txt) were assigned non-zero weight
in the successive PLSR1-regression. The calibration model was
validated using a test set of 31 randomly selected objects (testset4.txt). The number of latent
variables to keep in the PLS-model was estimated to be two, yielding a
model with a coefficient of determination R² for the
test set of 0.960. The standard error of prediction SEP, estimated
from the test set, was 0.086 (log atm-m³/mol). Figure 5
shows predicted versus measured results for all 91 congeners, with
the twelve outlying objects marked as filled rectangles.

Figure 5
Predicted vs. measured Henry's law constant (log
atm-m³/mol), 91 PCB congeners.
The antilogarithms of the measured and the predicted Henry's law
constants at 25 °C (atm-m³/mol) for all congeners,
and the accompanying residuals, are listed in the enclosed tab
separated text file
henry.txt.
Retention times in gas chromatography
As an additional validation of this approach to establish
quantitative structure property relationships (QSPR) for PCB
congeners we have also tried to model the retention times obtained
from gas chromatographic separation on three different columns: Rtx-CLP, SPB-Octyl and HT8.
Experimental measurements were available for 207-209 congeners.
The retention times on the three columns were highly correlated
with one another. 209 objects and 201 descriptor variables were assigned
non-zero weight in the PLSR2-regression. The calibration model was
validated using a test set of 82 randomly selected objects. The
number of latent variables to keep in the PLS-model was estimated
to be two, yielding a model with coefficients of determination
R², for the test set, between 0.979 and 0.989. The
standard error of prediction SEP, estimated from the test set, was
0.086-1.17 (min). Figure 6 shows predicted versus measured results
for 207 congeners separated on the HT8 column.

Figure 6
Predicted vs. measured retention times (min), 207 PCB congeners on
a HT8 column.
The four physical properties were best described by
constitutional and topological descriptors, molecular walk counts,
WHIM and GETAWAY descriptors. The retention times correlated
with descriptors from all groups.
The results presented above show that it is possible to obtain
a good fit and low prediction errors using multivariate calibration
models for physical properties, and retention times on gas
chromatography columns, based solely on computationally derived descriptors. Experimental data with the smallest expected
experimental errors were also the easiest to model, i.e. the
retention times.
Deviation between measured values and model predictions can be
due either to model error or experimental error. The "experimental
error" results from both intra- and inter-laboratory variation, and
the latter is especially important since different
laboratories have often used different methodologies. We feel that
there are rather strong indications that the experimental error is
the limiting factor for these structure-property modeling efforts,
since the model fit improves both with a robust approach and with
more reliable data. Others have reported a similar experience with
some of the group contribution methods for estimation of the
partitioning coefficient octanol-water29.
The deviations between experimental measurements and model
predictions are particularly pronounced for two objects with regard
to the Henry's law constant. PCB #77 and #172 have predicted values
of 1.0E-4 and 1.8E-5 atm-m³/mol, respectively. The experimentally
determined values reported in the PhysProp database are 9.4E-6 and
1.3E-6 atm-m³/mol, respectively. We therefore checked the
original papers, where the reported data for PCB #77 and #172
are in fact an order of magnitude higher, 9.4E-5 and 1.3E-5
atm-m³/mol30-31. The large deviations are
evidently due to errors in the transfer between the published data
and the PhysProp database.
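The size of these two discrepancies is easiest to see on the log scale used for modeling. The short calculation below uses only the six values quoted in the preceding paragraph; the base-10 logarithm and the helper name `log_residual` are assumptions.

```python
import math

# Henry's law constants (atm-m3/mol) quoted in the text
predicted = {77: 1.0e-4, 172: 1.8e-5}
database = {77: 9.4e-6, 172: 1.3e-6}     # values as found in PhysProp
original = {77: 9.4e-5, 172: 1.3e-5}     # values in the original papers

def log_residual(pred, meas):
    """Residual on the log10 scale: log(predicted) - log(measured)."""
    return math.log10(pred) - math.log10(meas)

results = {}
for pcb in (77, 172):
    before = log_residual(predicted[pcb], database[pcb])
    after = log_residual(predicted[pcb], original[pcb])
    results[pcb] = (before, after)
    print(f"PCB #{pcb}: residual {before:.2f} (database) vs {after:.2f} (original paper)")
```

With the database values the residuals are about 1.03 and 1.14 log units, more than ten times the SEP of 0.086 reported for this model; with the original values they drop to about 0.03 and 0.14 log units.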
Validation is a general problem when using data compiled from
many different sources. The usual approach to this problem is to
carefully re-evaluate all original data and investigations. However, in many cases this can prove to be difficult and at least
very time consuming. Furthermore, experimental errors will often
remain after this process. Another way of dealing with the problem
is to use high-breakdown methods for data evaluation, i.e. robust
methods for model building that can accept significant
contamination with bad data. Least median of squares regression is
an example of such a robust method, with a breakdown point of 50%.
This method will work if we can expect at least 50% "good data",
and this seems to be a conservative assumption in many practical
situations.
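The practical meaning of a high breakdown point can be illustrated with a toy example in which 40% of the responses are replaced by bad values. The sketch compares ordinary least squares with a least median of squares line found by searching all lines through pairs of points. The data, the function names and the pairwise search are illustrative assumptions and not part of the original analysis.

```python
import statistics

def ols_line(x, y):
    """Ordinary least squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def lms_line(x, y):
    """LMS line approximated by searching all lines through pairs of points."""
    best = None
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                continue
            slope = (y[j] - y[i]) / (x[j] - x[i])
            icpt = y[i] - slope * x[i]
            med = statistics.median((b - slope * a - icpt) ** 2
                                    for a, b in zip(x, y))
            if best is None or med < best[0]:
                best = (med, slope, icpt)
    return best[1], best[2]

# 20 points on the line y = 2x + 1 ...
x = [float(i) for i in range(20)]
y = [2.0 * xi + 1.0 for xi in x]
# ... of which 8 (40%) are overwritten with a meaningless constant
for i in range(12, 20):
    y[i] = 5.0

ols_slope, ols_icpt = ols_line(x, y)
lms_slope, lms_icpt = lms_line(x, y)
print("OLS:", ols_slope, ols_icpt)
print("LMS:", lms_slope, lms_icpt)
```

The least squares line is pulled far away from the true slope by the contaminated points, while the LMS line still follows the majority of the data.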
How reliable, then, are the model predictions compared to the
individual experimentally determined results? Each model
interpolation is actually based on a substantial number of
experiments performed in various laboratories. We are therefore
inclined to put more faith in the model interpolations if a
reported experimental value shows up as an outlier with a high residual. It will be very interesting to see if repeated
measurements on some of the congeners with the largest reported
deviations will provide a more definitive answer to this.
Conclusions
We have in this investigation reported estimations of some
important basic physical parameters for all 209 congeners of
polychlorinated biphenyls. These estimations were made from
computationally derived descriptors using a robust approach to
multivariate calibration. In a number of cases large deviations
were detected from the reported experimentally determined values.
Some of these could be attributed directly to typing errors. The most
reliable measurements available, retention times from gas chromatography, were also the easiest to predict with accuracy and
precision. The model predictions therefore seem to provide a
reliable summary and extension of the currently available database
on these compounds.
Supplementary materials
References
1. Howard, P. H.; Meylan, W. Toxic chemicals: Assessing
environmental fate and exposure. Chemical Engineering 2001, 108,
91-96.
2. Final act of the conference of plenipotentiaries on the
Stockholm convention on persistent organic pollutants. UNEP/POPS/CONF/4. United Nations Environment Program,
Geneva, Switzerland, 2001.
3. Öberg, T. Replacement of PCBs (polychlorinated biphenyls) and HCB (hexachlorobenzene) - the Swedish
experience. In
Alternatives to persistent organic pollutants. The Swedish National
Chemicals Inspectorate, Stockholm, Sweden, 1996.
4. Andersson, P. L. et al. Multivariate modeling of
polychlorinated biphenyl-induced CYP1A activity in hepatocytes from
three different species: Ranking scales and species differences. Environ.
Toxicol. Chem. 2000, 19, 1454-1463.
5. Abramowitz, R.; Yalkowsky, S. H. Estimation of aqueous solubility and
melting point of PCB congeners. Chemosphere 1990, 21, 1221-1229.
6. PhysProp database. Physical and chemical property data for
over 25000 chemicals. Available at web site esc.syrres.com and with
the EPI Suite software. Syracuse Research Corp. and U.S. EPA,
2001.
7. Verschueren, K. Handbook of environmental data on organic chemicals. Van Nostrand Reinhold, New York, USA, 1983.
8. Rice, C. P.; O'Keefe, P. Sources, pathways, and effects of PCBs, dioxins and
dibenzofurans. In Handbook of ecotoxicology,
Hoffman, D. J. et al, Eds. Lewis Publishers, Boca Raton, FL, USA,
1995.
9. Hansch, C.; Leo, A. J. Substituent constants for correlation
analysis in chemistry and biology. John Wiley & Sons, New York,
USA, 1979.
10. Joback, K. G.; Reid, R. C. Estimation of pure-component
properties from group-contributions. Chem. Eng. Commun. 1987, 57,
233-243.
11. Reid, R. C.; Prausnitz, J. M.; Poling, B. E. The properties
of gases and liquids. McGraw-Hill, Inc., New York, USA, 1987.
12. Lyman, W. J.; Reehl, W. F.; Rosenblatt, D. H. Handbook of
chemical property estimation methods: environmental behavior of
organic compounds. American Chemical Society, Washington, DC, USA,
1990.
13. Meylan, W. M.; Howard, P. H. Bond contribution method for
estimating Henry's law constants. Environ. Toxicol. Chem. 1991, 10,
1283-1293.
14. Meylan, W. M.; Howard, P. H. Atom/fragment contribution
method for estimating octanol-water partition coefficients. J. Pharm. Sci. 1995, 84, 83-92.
15. Egolf, L. M.; Jurs, P. C. Estimation of autoignition
temperature of hydrocarbons, alcohols, and esters from molecular structure. Ind. Eng.
Chem. Res. 1992, 31, 1798-1807.
16. Egolf, L. M.; Wessel, M. D.; Jurs, P. C. Prediction of
boiling points and critical temperatures of industrially important
organic compounds from molecular structure. J. Chem. Inf. Comput. Sci. 1994, 34, 947-956.
17. Katritzky, A. R.; Karelson, M.; Lobanov, V. S. QSPR as a
means of predicting and understanding chemical and physical
properties in terms of structure. Pure Appl. Chem. 1997, 69,
245-248.
18. PCBs HT8: The perfect PCB column. Publication No. AP-0040-C
Rev:03 5/99. SGE International Pty. Ltd., Ringwood, Victoria,
Australia, 1999.
19. Rtx®-CLPesticides and Rtx®-CLPesticides2 columns:
the ideal confirmational pair for analyzing polychlorinated
biphenyls (PCBs). Applications note #59120. Restek Corporation, Bellefonte, PA, USA, 2000.
20. Stenerson, K. K.; Sidisky, L. M. The analysis of all 209 PCB
congeners on the SPB™-Octyl and MDN™-5S capillary
columns. Data sheet T400129. Supelco, Bellefonte, PA, USA,
2000.
21. Allinger, N. L. MM2. A hydrocarbon force field utilizing V1
and V2 torsional terms. J. Am. Chem. Soc. 1977, 99, 8127-8134.
22. Todeschini, R.; Consonni, V. Handbook of molecular
descriptors. Wiley-VCH, Weinheim, Germany, 2000.
23. Martens, H.; Næs, T. Multivariate calibration. John
Wiley & Sons Ltd., Chichester, Great Britain, 1989.
24. Rousseeuw, P. J.; Leroy, A. M. Robust regression and outlier
detection. John Wiley & Sons, New York, USA, 1987.
25. Ballschmiter, K. et al. Determination of chlorinated
biphenyls, chlorinated dibenzodioxins, and chlorinated
dibenzofurans by GC-MS. J. High Resolut. Chromatogr. 1992, 15,
260-270.
26. Safe, S. H. Polychlorinated biphenyls (PCBs),
dibenzo-p-dioxins (PCDDs), dibenzofurans (PCDFs), and related
compounds: environmental and mechanistic considerations which
support the development of toxicity equivalence factors (TEFs). CRC
Crit. Rev. Toxicol. 1990, 21, 51-88.
27. Walczak, B.; Massart, D. L. Robust principal components
regression as a detection tool for outliers. Chemom. Intell. Lab.
Syst. 1995, 27, 41-54.
28. Gil, J. A.; Romera, R. On robust partial least squares (PLS)
methods. J. Chemom. 1998, 12, 365-378.
29. Kühne, R. et al. Calculation of compound properties
using experimental data from sufficiently similar chemicals. In
Software development in chemistry 10, Proceedings of the 10th
Workshop "Computer in Chemistry", Hochfilzen/Tirol, November 19-21,
1995, Gasteiger, J., Ed. Gesellschaft Deutscher Chemiker, Frankfurt
am Main, Germany, 1996.
30. Dunnivant, F. M.; Coates, J. T.; Elzerman, A. W.
Experimentally determined Henry's law constants for 17
polychlorobiphenyl congeners. Environ. Sci. Technol. 1988, 22,
448-453.
31. Brunner, S. et al. Henry's law constants for polychlorinated
biphenyls: experimental determination and structure-property
relationships. Environ. Sci. Technol. 1990, 24, 1751-1754.
© Tomas Öberg
Konsult AB