2010 Annual Meeting
(698a) A Novel Constrained Total Least Squares Formulation for the Identification of Gene Networks From Highly Noisy and Correlated Measurements
Genes, proteins, and metabolites can regulate one another in various ways. Regulatory proteins bind to DNA to affect the transcription of genes. Proteins can also combine to form multi-protein complexes that take part in various regulatory functions [1]. All these interactions form a complex network of regulatory control. Experimentally, it is quite difficult to obtain information on the levels of gene regulation. A key objective in systems biology is to map out and model the topological and dynamical properties of these networks.
Recently, different types of genomic data have been obtained to understand transcription regulation, e.g., DNA sequence data, microarray gene expression data, and protein-DNA binding data. The advent of such diverse data has motivated various researchers to develop computational methods to model transcription regulation [2]. Protein-DNA binding data provide information about the regulators involved in transcription. Time-series microarray expression experiments are the main source of dynamic information about the expression of thousands of genes that are activated or repressed in response to external stimuli [3].
Extensive studies on gene regulatory network modeling using time-series data have focused on linear discrete-time models. In this model, the expression level of a gene is assumed to be the concentration of its transcript. The concentration of a particular transcript at time point
where N is the number of transcripts in the network and
Microarray data are usually subject to high levels of additive and multiplicative errors [5]. Therefore, one can write the concentration levels for genes as follows:
In this equation,
Using equations (1) and (2), one can write the model for all genes,
where ,
Equation (3) can be written for all time points,
Where
One can see that the error terms in both sides of the equation (4),
A significant problem from the regression standpoint is that both the independent and the dependent variables have high levels of noise. Moreover, these noise terms are serially correlated. Other challenging characteristics include the limited amount of available data and the sparse but unknown structure of the parameter matrix. There is limited access to the topology information of the network through noisy protein-DNA binding data.
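The effect of noise in both the independent and the dependent variables can be illustrated with a minimal sketch. The example below is an assumption-laden toy, not the formulation of this work: it uses a generic regression Y = XB with a hypothetical sparse 3x3 coefficient matrix, purely additive i.i.d. Gaussian noise of equal variance on X and Y, and classical total least squares computed from the SVD of the stacked matrix [X Y]. It does not model serial correlation, multiplicative errors, or any constraints.

```python
import numpy as np


def ols(X, Y):
    """Ordinary least squares: treats X as exact and puts all error in Y."""
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return B


def tls(X, Y):
    """Classical total least squares via the SVD of the stacked matrix [X Y].

    Assumes i.i.d. errors of equal variance in every column of X and Y.
    """
    n = X.shape[1]
    _, _, Vt = np.linalg.svd(np.hstack([X, Y]))
    V = Vt.T
    # Partition V into blocks; the TLS solution is -V12 @ inv(V22).
    V12, V22 = V[:n, n:], V[n:, n:]
    return -V12 @ np.linalg.inv(V22)


def compare_estimators(seed=0, m=2000, sigma=0.5):
    """Return Frobenius-norm errors (OLS, TLS) on synthetic noisy data."""
    rng = np.random.default_rng(seed)
    # Hypothetical sparse coefficient matrix (illustrative values only).
    B_true = np.array([[0.8, 0.0, -0.5],
                       [0.3, 0.6, 0.0],
                       [0.0, -0.4, 0.7]])
    X_true = rng.standard_normal((m, 3))
    Y_true = X_true @ B_true
    # Additive noise on BOTH predictors and responses.
    X = X_true + sigma * rng.standard_normal(X_true.shape)
    Y = Y_true + sigma * rng.standard_normal(Y_true.shape)
    err_ols = np.linalg.norm(ols(X, Y) - B_true)
    err_tls = np.linalg.norm(tls(X, Y) - B_true)
    return err_ols, err_tls


if __name__ == "__main__":
    err_ols, err_tls = compare_estimators()
    print(f"OLS error: {err_ols:.3f}  TLS error: {err_tls:.3f}")
```

Under these assumptions, ordinary least squares shrinks the estimated coefficients toward zero because the noise in X inflates the predictor covariance, while the SVD-based estimator accounts for errors on both sides. When the errors are correlated or have unequal variances, as described above, the equal-variance assumption of classical total least squares no longer holds.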
Many parameter estimation algorithms have been applied to this problem in the gene network identification literature [1]. Here, we will benchmark different regression methods for this model. In the context of this problem, the most commonly used method is least squares estimation. In classical least squares regression theory, the errors are assumed to be confined to the response variables. However, in this model the predictor variables are also noisy, so the least squares estimator is not appropriate (See
The objective of the constrained total least squares (CTLS) method is simply to minimize the following objective function:
(5)
Where,
We will benchmark the performance of our CTLS formulation against the least squares, total least squares, and partial least squares methods across different noise levels, problems, correlation structures, and data sizes through in-silico examples. Our CTLS formulation is also compared to the CTLS application of Kim et al. [8], over which we demonstrate a significant improvement. Furthermore, we will incorporate appropriate constraints in our problem formulation to address the sparseness of the networks and evaluate its performance.
REFERENCES
[1] Driscoll, M. E., Gardner, T. S., Identification and control of gene networks in living organisms via supervised and unsupervised learning. Journal of Process Control 16 (2006), 303-311.
[2] Sun, N., Carroll, R. J., Zhao, H., Bayesian error analysis model for reconstructing transcriptional regulatory networks. PNAS 103 (21) (2006), 7988-7993.
[3] Ernst, J., Vainass, O., Harbison, C. T., Simon,
[4] Ideker, T., Thorsson, V., Siegel, A. F., Hood, L. E., Testing for differentially-expressed genes by maximum-likelihood analysis of microarray data. Journal of Computational Biology 2 (2005), 65-88.
[5] Gardner, T.S., Faith, J. J., Reverse-engineering transcription control networks. Physics of Life Reviews 2 (2005), 65-88.
[6] Bansal, M., Della Gatta, G., di Bernardo, D., Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22 (2006), 815-822.
[7] Huffel, S. V. (1991). The total least squares problem: computational aspects and analysis, Society for Industrial and Applied Mathematics, Philadelphia.
[8] Kim, J., Bates, D. G., Postlethwaite, I., Harrison, P., and Cho, K. (2007). BMC Bioinformatics, 8, 8.