Supervisor(s): Dr Kang Li
Bioinformatics has in recent years emerged as a promising new approach in biology for building better predictive models of diagnosis, prognosis, and therapy at the cellular level of living organisms. In particular, the synergy of genomic and other molecular research technologies with information technologies allows the identification of genes and proteins, or functionally related clusters of genes and proteins, that may play a major role in a specific phenotype. This has great potential across a wide spectrum of applications, e.g. drug development, molecular, personalized and preventive medicine, food safety, microbial genome applications, waste treatment, alternative energy sources, antibiotic resistance, forensic analysis of microbes, and improved nutritional quality, to name just a few.
However, in the analysis and mining of bio-data, investigators are confronted with several bottlenecks, one of which is the high dimensionality problem. That is, each data sample may be described by hundreds or thousands of measurements obtained concurrently. For example, the human genome is typically estimated to contain up to 25,000 genes. Attempting to extract meaningful expression patterns, or to infer related genes and pathways, from small sets of data in such a high-dimensional space is extremely difficult. At the same time, high-throughput genomic microarray and proteomic technologies have produced enormous volumes of data, which can make data mining and predictive modeling too computationally intensive to carry out.
This project will investigate advanced learning algorithms for bio-data, aiming to develop a set of efficient, explicit and automatic statistical techniques for the interpretation, classification and understanding of biological data and processes.