Making sense of cancer data

School of Electronics, Electrical Engineering and Computer Science
& ECIT Global Research Institute

Proposed Project Title: Making sense of cancer data

Principal Supervisor:   Prof. Neil M. Robertson           
Second Supervisor:      Dr. Barry Devereux

Project Description:

This PhD project aims to develop new algorithms to make sense of a large set of medical data relating to many types of cancer. A medical scientist would ideally like to see the complete picture of what has changed in a patient compared with a healthy person without cancer. Because no analysis tool currently offers that view, the scientist instead cherry-picks the genes of interest and compares tumor and normal tissue for those genes alone. Ultimately, we want to provide a better and more complete method using modern machine learning techniques such as deep learning.

A wealth of data has been compiled in databases organized by the NIH in the USA. This is truly “big data”: there are 20000+ genes, and the expression of these genes has been documented in patients with all types of cancer (breast, prostate, pancreatic, leukemia and 20-30 more). Each cancer subtype has data from at least 1000 patient samples. Patient-matched gene expression data are also available for the normal tissue from which the tumor was extracted, so one can find out which genes are expressed abnormally in that patient’s tumor sample. In addition, mutation, copy number, survival and other data are available for almost all patients.
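To make the tumor-versus-normal comparison concrete, the sketch below computes a per-gene log2 fold change for one patient's matched samples and flags the genes that change most. It is only an illustration under stated assumptions: the gene names and expression values are made up, and the pseudocount and threshold are arbitrary choices, not values taken from the databases described above.

```python
import math


def log2_fold_changes(tumor, normal, pseudocount=1.0):
    """Per-gene log2(tumor / normal) for one patient's matched samples.

    The pseudocount (an assumed value) avoids division by zero for
    genes with no measured expression.
    """
    shared_genes = tumor.keys() & normal.keys()
    return {
        gene: math.log2((tumor[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in shared_genes
    }


def abnormal_genes(tumor, normal, threshold=1.0):
    """Genes whose absolute log2 fold change meets the (assumed) threshold,
    sorted with the largest change first."""
    changes = log2_fold_changes(tumor, normal)
    flagged = {g: lfc for g, lfc in changes.items() if abs(lfc) >= threshold}
    return sorted(flagged.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy, made-up expression values for three hypothetical genes.
tumor = {"GENE_A": 31.0, "GENE_B": 7.0, "GENE_C": 1.0}
normal = {"GENE_A": 7.0, "GENE_B": 7.0, "GENE_C": 7.0}

result = abnormal_genes(tumor, normal)
for gene, lfc in result:
    direction = "up" if lfc > 0 else "down"
    print(f"{gene}: log2 fold change {lfc:+.2f} ({direction} in tumor)")
```

A real analysis over 20000+ genes would also need normalisation and statistical testing across patients; this sketch only shows the basic paired comparison.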

Although a lot of effort has gone into compiling these data, researchers can currently search manually for only one gene at a time, which is highly inefficient. It is also impossible to explain the cause of a cancer by hand-picking the few genes that a particular researcher happens to be interested in. In the clinic, doctors diagnose cancer patients based on immunohistochemistry (staining a tumor biopsy with antibodies, of which only a few are used in practice) and instinct. They then prescribe the same course of therapy for all patients with a given grade of cancer: surgery, followed by radiation therapy and chemotherapy. The therapy inevitably fails in 99.99% of patients. Even if the doctor considers a patient’s gene expression data, no human mind can make sense of this volume of data (e.g. 20000 genes for 1000 patients) and match the new patient’s data against the available datasets, which is why all these data end up being of little practical use.

What we can do in this project

  1. Visualisations of data can be constructed to aid biologists in the data gathering phase.

  2. An algorithm can be built in which all these data are fed into an automatic learning process, which may be separate for each cancer subtype.

  3. The gene expression data of a new cancer patient form the experimental dataset. When fed to the algorithm, they will yield treatment and survival predictions by matching the data against previously treated patients with a similar gene expression pattern, if any exist. This will help clinicians to prescribe an optimal therapy and, in collaboration with UCLA, could help to save lives.

  4. Biology is complicated because all these genes are interlinked. DNA is transcribed to mRNA, and mRNA is translated to protein. Of the three, protein is the macromolecule that predominantly dictates the phenotype. Protein data are limited owing to the lack of high-throughput assays; mRNA data, however, can be generated in a high-throughput fashion and give us gene expression. A biochemical map showing the interconnectivity of these genes is available, and it would be immensely helpful to scientists and researchers if the map could be adapted to each patient’s data.
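The patient-matching idea in item 3 can be sketched as a simple nearest-neighbour lookup. This is a baseline for illustration only, not the deep learning method the project intends to develop: it ranks previously treated patients by the Pearson correlation between their expression profile and the new patient's. The cohort, the patient identifiers, the `survival_months` field and all values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a flat profile carries no pattern to correlate
    return cov / (spread_x * spread_y)


def most_similar_patients(new_profile, cohort, k=2):
    """Return the k previous patients whose expression pattern correlates
    best with the new patient's, as (patient_id, correlation) pairs."""
    scored = [
        (patient_id, pearson(new_profile, record["expression"]))
        for patient_id, record in cohort.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical cohort: four genes per patient, made-up values and outcomes.
cohort = {
    "P1": {"expression": [1.0, 2.0, 3.0, 4.0], "survival_months": 60},
    "P2": {"expression": [4.0, 3.0, 2.0, 1.0], "survival_months": 12},
    "P3": {"expression": [1.0, 3.0, 2.0, 4.0], "survival_months": 48},
}
new_patient = [2.0, 4.0, 6.0, 8.0]

matches = most_similar_patients(new_patient, cohort, k=2)
for patient_id, similarity in matches:
    outcome = cohort[patient_id]["survival_months"]
    print(f"{patient_id}: correlation {similarity:.2f}, survival {outcome} months")
```

Correlation is used here because it compares the shape of the expression pattern rather than its absolute level; a learned similarity over all 20000+ genes would replace it in practice.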

Further reading

  1. The metabolomic metro map (
  2. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data, E. Cerami et al. Cancer Discovery, May 2012, The American Association for Cancer Research
  3. Applications of Deep Learning in Biomedicine, P. Mamoshina, A. Vieira, E. Putin, A. Zhavoronkov, Molecular Pharmaceutics 2016, 13, 1445−1454, DOI: 10.1021/acs.molpharmaceut.5b00982

Contact details

Supervisor Name: Neil Robertson   Tel: +44 (0)28 9097
QUB Address: ECIT   Email:
2nd Supervisor Name: Barry Devereux   Tel: +44 (0)28 9097 1705
QUB Address: ECIT   Email: