2009 Annual Meeting
(507f) Predicting In Vivo Toxicities Using Optimal Methods for Re-Ordering and Machine Learning
A major initiative in predictive toxicology is the development of methods that can rapidly screen thousands of untested environmental chemicals [1]. In 2009, the EPA organized the ToxCast Data Analysis Summit, whose goal was to develop algorithms that predict the in vivo toxicities of chemicals using only in vitro and in silico data as input. The in vitro data set consisted of about 500 assays (including biochemical receptor and enzyme assays, as well as cell-based assays measuring RNA and protein, cytotoxicity, cell growth, and morphology changes), reported as EC50 and LEL (Lowest Effect Level) concentration values for a library of 320 chemicals. An incomplete in vivo toxicity data set, reported as LEL values, was also provided for these 320 chemicals over a total of 76 endpoints in rats, mice, and rabbits; only 78.7% of its values were measured.
In this work, we combine the strengths of integer linear programming (ILP) and machine learning to predict the in vivo toxicities of chemicals using only in vitro data. Our approach uses a biclustering method based on iterative optimal re-ordering [2,3] to identify biclusters, that is, subsets of chemicals that have similar responses over distinct subsets of the in vitro assays. This enables us to determine the subsets of in vitro assays that are most likely to be correlated with toxicity in the in vivo data set. An optimal ILP-based method for re-ordering sparse data matrices [4] is then applied to the in vivo data set (21.3% of values missing) to cluster endpoints that have similar LEL values; we observe that the endpoints are effectively clustered according to (a) animal species and (b) similar physiological attributes. These clusters allow us to quantify the degree of toxicity of a chemical for various subsets of related animal assay endpoints. Based on the clustering results for the in vitro and in vivo data sets, multi-class logistic regression is then used to (a) learn the correlation between the subsets of in vitro data and the in vivo responses, and (b) predict the toxicity signatures of the chemicals. Statistical analysis of our descriptors enables us to identify which in vitro assays contribute to the prediction of specific in vivo endpoints. Our approach aims to achieve the highest in vivo predictive ability using the minimum number of in vitro descriptors.
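To make the idea of re-ordering a sparse matrix concrete, the following is a minimal sketch in Python, not the ILP formulation of [4]. It replaces the optimization model with an exhaustive search that is only feasible for a handful of columns. The data values, the endpoint names, the function names `pair_cost` and `best_order`, and the choice of cost (mean squared difference over entries measured in both columns) are all assumptions made for illustration.

```python
from itertools import permutations

# Hypothetical LEL-like scores: one column per in vivo endpoint,
# one entry per chemical; None marks an unmeasured value.
names = ["rat_a", "rabbit_a", "rat_b", "rabbit_b"]  # illustrative labels only
columns = [
    [1.0, 1.0, 5.0, None],
    [5.0, 5.0, 1.0, 1.0],
    [1.0, None, 5.0, 5.0],
    [5.0, 4.0, None, 1.0],
]

def pair_cost(a, b):
    """Mean squared difference over entries measured in both columns."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return 0.0  # assumption: columns with no shared data add no cost
    return sum((x - y) ** 2 for x, y in shared) / len(shared)

def best_order(cols):
    """Brute-force search for the column order with minimum adjacent cost."""
    n = len(cols)
    best = None
    for order in permutations(range(n)):
        if order[0] > order[-1]:
            continue  # an order and its mirror image have the same cost
        cost = sum(pair_cost(cols[order[i]], cols[order[i + 1]])
                   for i in range(n - 1))
        if best is None or cost < best[0]:
            best = (cost, order)
    return best

cost, order = best_order(columns)
print([names[i] for i in order], round(cost, 3))
```

In this toy input, the columns that agree on their shared measurements end up next to each other, which is the property the clustering step relies on. The exhaustive search grows factorially with the number of columns, so it only illustrates the objective; a real instance with 76 endpoints would need an optimization model such as the one cited in [4].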
[1] http://www.epa.gov/ncct/toxcast
[2] DiMaggio P.A., McAllister S.R., Floudas C.A., Feng X.J., Rabinowitz J.D., and H.A. Rabitz, "Biclustering via Optimal Re-ordering of Data Matrices in Systems Biology: Rigorous Methods and Comparative Studies", BMC Bioinformatics, 9, 458 (2008).
[3] DiMaggio P.A., McAllister S.R., Floudas C.A., Feng X.J., Rabinowitz J.D., and H.A. Rabitz, "A Network Flow Model for Biclustering via Optimal Re-ordering of Data Matrices", J. Global Opt., in press (2009).
[4] McAllister S.R., DiMaggio P.A., and C.A. Floudas, "Mathematical Modeling and Efficient Optimization Methods for the Distance-Dependent Rearrangement Clustering Problem", J. Global Opt., in press (2009).