2010 Annual Meeting

(698a) A Novel Constrained Total Least Squares Formulation for the Identification of Gene Networks From Highly Noisy and Correlated Measurements


Guner, U. - Presenter, Georgia Institute of Technology
Realff, M. - Presenter, Georgia Institute of Technology
Lee, J. H. - Presenter, Korea Advanced Institute of Science and Technology (KAIST)

Genes, proteins, and metabolites can regulate one another in various ways. Regulatory proteins bind to DNA to affect the transcription of genes. Proteins can also combine to form multi-protein complexes that take part in various regulatory functions [1]. All these interactions form a complex network of regulatory control. Experimentally, it is quite difficult to obtain information on the levels of gene regulation. A key objective in systems biology is to map out and model the topological and dynamical properties of these networks.

Recently, different types of genomic data have been obtained to understand transcription regulation, e.g., DNA sequence data, micro-array gene expression data, and protein-DNA binding data. The advent of such diverse data has motivated various researchers to develop computational methods to model transcription regulation [2]. DNA-protein binding data provide information about the regulators involved in transcription. Time-series micro-array expression experiments are the main source of dynamic information about the expression of thousands of genes that are activated or repressed in response to external stimuli [3].

Extensive studies on gene regulatory network modeling using time-series data have focused on linear discrete-time model equations. In this model, the expression level of a gene is assumed to be the concentration of its transcript. The concentration of a particular transcript at time point $t+1$ is given by a linear function of the concentrations of the other RNA species at time point $t$,

$$x_i(t+1) = \sum_{j=1}^{N} a_{ij}\, x_j(t) + \varepsilon_i(t) \qquad (1)$$

where $N$ is the number of transcripts in the network, $a_{ij}$ is the regulatory strength between gene pairs, and $\varepsilon_i(t)$ is the error term for the difference between observation and the model. The errors are assumed to have a Gaussian distribution with zero mean and standard deviation $\sigma$. The aim is to estimate the parameter values, the $a_{ij}$'s, from micro-array observations, $x_i(t)$, thereby reconstructing the gene network. A negative $a_{ij}$ indicates inhibition, and a positive $a_{ij}$ indicates activation between the gene pair. In general, only a small subset of all RNA species regulates a particular transcript, which means most of the $a_{ij}$'s are zero. In other words, gene networks are sparse [4].
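The linear discrete-time model described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `simulate_network`, the 3-gene parameter matrix `A`, and the initial state are hypothetical and do not come from the study.

```python
import numpy as np

def simulate_network(A, x0, n_steps, sigma=0.0, seed=0):
    """Iterate x(t+1) = A @ x(t) + eps(t), with eps ~ N(0, sigma^2).

    A       : (N, N) matrix of regulatory strengths (hypothetical values)
    x0      : (N,) initial transcript concentrations
    n_steps : number of time points to generate
    Returns an (N, n_steps) array with one column per time point.
    """
    rng = np.random.default_rng(seed)
    N = A.shape[0]
    X = np.empty((N, n_steps))
    X[:, 0] = x0
    for t in range(n_steps - 1):
        X[:, t + 1] = A @ X[:, t] + sigma * rng.standard_normal(N)
    return X

# Toy sparse network: zero entries mean "no regulation",
# negative entries inhibition, positive entries activation.
A = np.array([[0.8, 0.0, -0.3],
              [0.4, 0.5,  0.0],
              [0.0, 0.0,  0.9]])
X = simulate_network(A, x0=np.ones(3), n_steps=5)
```

Each column of `X` is one time point, so reconstructing the network amounts to recovering `A` from consecutive columns.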

Microarray data are usually subject to high levels of additive and multiplicative errors [5]. Therefore, one can write the measured concentration levels $x_i(t)$ of the genes as follows:

$$x_i(t) = \bar{x}_i(t) + \Delta x_i(t) \qquad (2)$$

In this equation, $\bar{x}_i(t)$ is the unknown true value of the concentration of the $i$-th gene at the $t$-th time point and $\Delta x_i(t)$ is the measurement error, which comprises a multiplicative part and an additive part.
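A measurement error with multiplicative and additive parts could be simulated as in the sketch below. The specific form is an assumption made for illustration only: the multiplicative part is taken to scale with the true value, the additive part is taken to be independent Gaussian noise, and the names `add_measurement_noise`, `sigma_mult`, and `sigma_add` are hypothetical.

```python
import numpy as np

def add_measurement_noise(X_true, sigma_mult, sigma_add, seed=0):
    """Return (X_obs, dX), where X_obs = X_true + dX.

    Assumed error form (not taken from the source):
        dX = X_true * eta + nu
    with eta ~ N(0, sigma_mult^2) and nu ~ N(0, sigma_add^2).
    """
    rng = np.random.default_rng(seed)
    eta = sigma_mult * rng.standard_normal(X_true.shape)  # multiplicative part
    nu = sigma_add * rng.standard_normal(X_true.shape)    # additive part
    dX = X_true * eta + nu
    return X_true + dX, dX

# Toy true concentrations: 2 genes, 2 time points.
X_true = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
X_obs, dX = add_measurement_noise(X_true, sigma_mult=0.2, sigma_add=0.05)
```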

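Using equations (1) and (2), one can write the model for all genes in terms of the true concentrations,

$$\mathbf{x}(t+1) - \Delta\mathbf{x}(t+1) = A\,\big[\mathbf{x}(t) - \Delta\mathbf{x}(t)\big] \qquad (3)$$

where $\mathbf{x}(t) = [x_1(t), \dots, x_N(t)]^T$, $\Delta\mathbf{x}(t) = [\Delta x_1(t), \dots, \Delta x_N(t)]^T$, and $A = [a_{ij}]$ is the $N \times N$ parameter matrix.

Equation (3) can be written for all time points, $t = 1, \dots, T-1$, as follows:

$$X_2 - \Delta X_2 = A\,(X_1 - \Delta X_1) \qquad (4)$$

where $X_1 = [\mathbf{x}(1), \dots, \mathbf{x}(T-1)]$, $X_2 = [\mathbf{x}(2), \dots, \mathbf{x}(T)]$, and $\Delta X_1$, $\Delta X_2$ are defined in the same way from the $\Delta\mathbf{x}(t)$.

One can see that the error terms on both sides of equation (4), $\Delta X_1$ and $\Delta X_2$, are serially correlated, as they have the same columns except for the first and last columns.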
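The shared-column structure can be illustrated with a small sketch. It assumes that the errors for all time points are stacked as the columns of a single matrix (the hypothetical `dX` below), that the predictor side uses all but the last column, and that the response side uses all but the first column.

```python
import numpy as np

def shifted_pair(M):
    """Split an (N, T) matrix into predictor and response blocks.

    Returns (M[:, :-1], M[:, 1:]): columns 1..T-1 and columns 2..T.
    """
    return M[:, :-1], M[:, 1:]

# Hypothetical error matrix: 3 genes, 4 time points.
dX = np.arange(12.0).reshape(3, 4)
dX1, dX2 = shifted_pair(dX)

# Every column except the first of dX1 reappears in dX2
# (and every column except the last of dX2 appears in dX1).
shared = np.array_equal(dX1[:, 1:], dX2[:, :-1])
```

Because the two blocks are built from the same underlying columns, the errors on the two sides of the regression cannot be treated as independent.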

A significant problem from the regression standpoint is that both the independent and dependent variables have high levels of noise. Moreover, these noise terms are serially correlated. Other challenging characteristics include the limited amount of available data and the sparse but unknown structure of the parameter matrix. There is limited access to the topology information of the network through noisy protein-DNA binding data.

Many parameter estimation algorithms have been applied to this problem in the gene network identification literature [1]. Here, we will benchmark different regression methods for this model. In the context of this problem, the most commonly used method is least squares estimation. In classical least squares regression theory, the errors are assumed to be confined to the response variables. However, in this model the predictor variables are also noisy (see the error term on the predictor side of equation (4)); thus, the least squares estimator is not appropriate. Total least squares is another fitting method that is appropriate when there are errors in both the independent and dependent variables [6]. Constrained total least squares (CTLS) is a further improvement over total least squares that addresses the correlation between the errors in both variable types. This method is particularly well suited to this problem. We will introduce a novel CTLS formulation for this particular problem that is capable of integrating possible time-independent correlation in gene concentrations.
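The difference between least squares and total least squares can be sketched as follows. This is the classical, unconstrained TLS solution computed from the SVD, not the CTLS formulation proposed here. The function names and the toy network are hypothetical, and the sketch assumes at least twice as many time transitions as genes.

```python
import numpy as np

def ls_estimate(X):
    """Ordinary least squares fit of X[:, 1:] ~ A @ X[:, :-1]."""
    X1, X2 = X[:, :-1], X[:, 1:]
    # Solve X1.T @ A.T ~ X2.T; errors are attributed to X2 only.
    A_T, *_ = np.linalg.lstsq(X1.T, X2.T, rcond=None)
    return A_T.T

def tls_estimate(X):
    """Classical total least squares fit via the SVD of [X1.T, X2.T].

    Errors are allowed in both X1 and X2, but their serial
    correlation is ignored (that is what CTLS adds).
    """
    X1, X2 = X[:, :-1], X[:, 1:]
    n = X1.shape[0]
    C = np.hstack([X1.T, X2.T])                 # shape (T-1, 2N)
    _, _, Vt = np.linalg.svd(C, full_matrices=True)
    V = Vt.T
    V12, V22 = V[:n, n:], V[n:, n:]             # blocks of the last N columns
    A_T = -V12 @ np.linalg.inv(V22)
    return A_T.T

# Noise-free toy series: both estimators should recover A_true.
A_true = np.array([[0.8, 0.0, -0.3],
                   [0.4, 0.5,  0.0],
                   [0.0, 0.0,  0.9]])
X = np.empty((3, 40))
X[:, 0] = 1.0
for t in range(39):
    X[:, t + 1] = A_true @ X[:, t]

A_ls = ls_estimate(X)
A_tls = tls_estimate(X)
```

With noisy data the two estimates generally differ, since least squares treats the predictor block as exact while TLS perturbs both blocks.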

The CTLS method estimates the parameters by minimizing an objective function subject to a model constraint: the word "constrained" in CTLS refers to the model constraint given in equation (4).

We will benchmark the performance of our CTLS formulation against least squares, total least squares, and partial least squares methods with respect to different noise levels, problems, correlation structures, and data sizes through in-silico examples. Our CTLS formulation is also compared with the CTLS application of Kim et al. [8], and we demonstrate a significant improvement over their method. Furthermore, we will incorporate appropriate constraints in our problem formulation to address the sparseness of the networks and evaluate its performance.


