Network Representation and Analysis in Bioinformatics

Susumu Goto

Bioinformatics Center, Institute for Chemical Research, Kyoto University

Uji, Kyoto 611-0011, Japan

One of the important goals of bioinformatics is to elucidate the functions of a living organism from its genomic data. This cannot be achieved simply by accumulating information on individual genes; it also requires an understanding of the complicated interaction networks formed by large numbers of genes and proteins. Recent advances in biotechnology have enabled several kinds of exhaustive analyses, such as gene expression profiling, identification of the subcellular localization of proteins, and detection of protein interactions, in addition to DNA sequencing. Developing methods for predicting functional networks of proteins from these heterogeneous data is therefore an urgent requirement in the field.

The problems here are how to represent such knowledge and how to use it for further analysis. A graph is a mathematical representation of network data, and graph search and comparison algorithms have a long history of development. We consider the graph a promising representation and use it for several kinds of protein and gene network data as well as chemical structure data. For example, a protein network in a signal transduction pathway is represented by a graph whose nodes are proteins and whose edges are protein interactions. A metabolic pathway can likewise be viewed as a graph whose nodes are enzymes and whose edges are the chemical compounds that connect two enzymes as intermediates. Another example is a network of gene expression similarity, in which the nodes represent genes and an edge is drawn between two genes if their expression patterns are similar.
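
As a minimal sketch of two of the graph representations described above, the following Python code builds an enzyme graph from reactions and an expression similarity graph from expression profiles. All function names, the toy identifiers, the use of Pearson correlation, and the 0.9 threshold are assumptions made for illustration; the text does not specify a similarity measure or a data format.

```python
from itertools import combinations
from math import sqrt


def enzyme_graph(reactions):
    """Build a directed enzyme graph from a metabolic pathway.

    reactions maps an enzyme name to (substrates, products).
    An edge (e1, e2) is labeled with the compounds that e1 produces
    and e2 consumes, i.e. the intermediates connecting the two enzymes.
    """
    edges = {}
    for e1, (_, products) in reactions.items():
        for e2, (substrates, _) in reactions.items():
            if e1 == e2:
                continue
            shared = set(products) & set(substrates)
            if shared:
                edges[(e1, e2)] = shared
    return edges


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant profile carries no similarity signal
    return cov / sqrt(vx * vy)


def coexpression_graph(profiles, threshold=0.9):
    """Undirected graph: genes are nodes, and an edge joins two genes
    whose expression profiles correlate at or above the threshold."""
    graph = {gene: set() for gene in profiles}
    for g1, g2 in combinations(profiles, 2):
        if pearson(profiles[g1], profiles[g2]) >= threshold:
            graph[g1].add(g2)
            graph[g2].add(g1)
    return graph


# Hypothetical toy data, not taken from any database.
reactions = {
    "enzyme1": ({"C1"}, {"C2"}),
    "enzyme2": ({"C2"}, {"C3"}),
}
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 6.0, 8.2],
    "geneC": [4.0, 1.0, 3.5, 0.5],
}
pathway = enzyme_graph(reactions)
coexpr = coexpression_graph(profiles)
```

In this toy input, enzyme1 and enzyme2 are connected through the intermediate C2, and only geneA and geneB are linked in the expression similarity graph.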

Once network data are in a graph representation, we can apply network or graph analysis algorithms to elucidate functional relationships between genes and proteins. We have developed a kernel canonical correlation analysis method that integrates several heterogeneous network data sets to extract functional relationships among genes and proteins. This method takes a supervised approach to learning from the integrated data. Because we have collected a large amount of knowledge on biochemical networks and protein interaction networks from the literature and compiled it as the KEGG (Kyoto Encyclopedia of Genes and Genomes) database, this knowledge can be used as positive data in the learning process.
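
To illustrate the general idea of kernel canonical correlation analysis, here is a minimal sketch of a generic regularized formulation using NumPy. It is not the exact formulation used in this work: the regularization scheme, the function names, the linear kernels, and the synthetic data are all assumptions chosen for illustration. In a supervised setting, one kernel could be derived from genomic data and the other from known pathway knowledge.

```python
import numpy as np


def center_kernel(K):
    """Center a kernel matrix in feature space."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return H @ K @ H


def kernel_cca(K1, K2, reg=1.0, n_components=2):
    """Generic regularized kernel CCA between two n x n kernel matrices.

    Maximizes a' K1 K2 b subject to unit norms of (K1 + reg*I) a and
    (K2 + reg*I) b, solved through a singular value decomposition.
    Returns the regularized canonical correlations and the projections
    of the n objects onto the leading canonical directions.
    """
    n = K1.shape[0]
    I = np.eye(n)
    K1c, K2c = center_kernel(K1), center_kernel(K2)
    R1 = np.linalg.solve(K1c + reg * I, K1c)  # (K1 + reg*I)^-1 K1
    R2 = np.linalg.solve(K2c + reg * I, K2c)  # (K2 + reg*I)^-1 K2
    # R2 is symmetric up to rounding, so R2.T equals K2 (K2 + reg*I)^-1.
    U, s, Vt = np.linalg.svd(R1 @ R2.T)
    alpha = np.linalg.solve(K1c + reg * I, U[:, :n_components])
    beta = np.linalg.solve(K2c + reg * I, Vt[:n_components].T)
    return s[:n_components], K1c @ alpha, K2c @ beta


# Synthetic toy data: two views of 40 objects sharing one latent signal.
rng = np.random.default_rng(0)
z = rng.normal(size=(40, 1))
X = np.hstack([z + 0.1 * rng.normal(size=(40, 1)), rng.normal(size=(40, 2))])
Y = np.hstack([z + 0.1 * rng.normal(size=(40, 1)), rng.normal(size=(40, 2))])
K1, K2 = X @ X.T, Y @ Y.T  # linear kernels, chosen only for simplicity

corrs, P1, P2 = kernel_cca(K1, K2, reg=1.0, n_components=2)
```

Objects that lie close together in the projected space (rows of `P1` or `P2`) are candidates for a functional relationship. Several data sources could be combined before this step, for example by summing their kernel matrices, which is one common choice rather than a detail stated in the text.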

We have applied this method to predict protein networks of yeast and of the bacterium Pseudomonas aeruginosa. In the case of the bacterium, we were able to predict genes for missing enzymes in the lysine degradation pathway by integrating network data derived from the similarity of phylogenetic profiles and the proximity of genes on the genome. The candidate genes for the missing enzymes were then cloned and their catalytic activities were measured, which confirmed that the gene products actually have the predicted activities. This result demonstrates the power of our method for predicting functional associations from heterogeneous genomic data.
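
The two kinds of network data mentioned above can be sketched as follows. This is a deliberately simplified illustration: it uses Jaccard similarity and plain thresholds, and it combines the two networks by intersection, whereas the method described here integrates them through kernel canonical correlation analysis. The gene names, profiles, positions, and thresholds are hypothetical.

```python
from itertools import combinations


def jaccard(p, q):
    """Jaccard similarity of two binary presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def profile_edges(profiles, threshold=0.75):
    """Link genes whose phylogenetic profiles (presence or absence of
    an ortholog in each reference genome) are similar."""
    return {
        frozenset((g, h))
        for g, h in combinations(profiles, 2)
        if jaccard(profiles[g], profiles[h]) >= threshold
    }


def proximity_edges(positions, max_gap=2):
    """Link genes that lie close together on the genome, where each
    position is the gene's index in chromosomal order."""
    return {
        frozenset((g, h))
        for g, h in combinations(positions, 2)
        if abs(positions[g] - positions[h]) <= max_gap
    }


# Hypothetical toy data for four genes and five reference genomes.
profiles = {
    "g1": [1, 1, 0, 1, 0],
    "g2": [1, 1, 0, 1, 0],
    "g3": [0, 0, 1, 0, 1],
    "g4": [1, 1, 0, 0, 0],
}
positions = {"g1": 10, "g2": 11, "g3": 12, "g4": 50}

by_profile = profile_edges(profiles)
by_proximity = proximity_edges(positions)
supported_by_both = by_profile & by_proximity
```

Here only the pair g1 and g2 is supported by both sources, which makes it the strongest candidate for a functional association in this toy example.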

Currently, the method predicts only missing enzymes in known pathways of the biochemical reaction network, because it uses only knowledge of the protein network. Many reaction pathways remain to be elucidated, such as the biosynthesis pathways of plant secondary metabolites and the biodegradation pathways of xenobiotic chemicals. In such cases, we need to predict the new reactions involved in the pathway and then associate each new reaction with a gene in the genome. Therefore, important directions in bioinformatics also include the integration of chemical information with genomic information.