Making sense of cancer data



Principal Supervisor: Prof. Neil M. Robertson

Second Supervisor: TBD (collaboration with UCLA Medical School, California, USA)

+ Project Description

This PhD project aims to develop new algorithms for making sense of a large body of medical data on different types of cancer. A medical scientist would ideally like to see the complete picture of what has changed in a patient compared with a healthy person without cancer. Because no analysis tool offers that view, researchers cherry-pick the genes they are interested in and compare tumour and normal tissue for those genes only. Ultimately, we want to provide a better and more complete method using modern machine learning, such as deep learning.

A wealth of data has been compiled in databases organized by the NIH in the USA. This is truly “big data”: expression levels of 20,000+ genes have been documented for patients with all types of cancer (breast, prostate, pancreatic, leukaemia and 20-30 more). Each cancer subtype has data from at least 1,000 patient samples. Patient-matched gene expression data from the normal tissue from which the tumour was extracted are also available, so one can find out which genes are abnormally expressed in a given patient’s tumour sample. In addition, mutation, copy number, survival and other data are available for almost all patients.
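The tumour-versus-matched-normal comparison described above can be illustrated with a minimal sketch. It assumes expression values are non-negative numbers keyed by gene name and flags a gene as abnormal when its log2 fold change exceeds a cut-off. The function name `abnormal_genes`, the gene names, the values, the pseudocount and the threshold are all hypothetical choices made for illustration, not part of the project or of any NIH database.

```python
import math


def abnormal_genes(tumour, normal, threshold=1.0, pseudocount=1.0):
    """Return {gene: log2 fold change} for genes whose tumour/normal
    ratio differs by at least `threshold` in absolute log2 terms.

    The pseudocount avoids division by zero for unexpressed genes;
    both it and the threshold are illustrative assumptions.
    """
    flagged = {}
    for gene, tumour_level in tumour.items():
        if gene not in normal:
            continue  # no patient-matched normal value for this gene
        ratio = (tumour_level + pseudocount) / (normal[gene] + pseudocount)
        log_fold_change = math.log2(ratio)
        if abs(log_fold_change) >= threshold:
            flagged[gene] = log_fold_change
    return flagged


# Hypothetical expression values for one patient, invented for illustration.
tumour = {"GENE_A": 31.0, "GENE_B": 7.0, "GENE_C": 1.0}
normal = {"GENE_A": 7.0, "GENE_B": 7.0, "GENE_C": 7.0}

result = abnormal_genes(tumour, normal)
print(result)
```

Here GENE_A is flagged as over-expressed and GENE_C as under-expressed, while GENE_B is unchanged. A real analysis over 20,000+ genes would also need normalisation and multiple-testing correction, which this sketch leaves out.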

Although a lot of effort has gone into compiling these data, researchers can currently search them manually for only one gene at a time, which is highly inefficient. It is also impossible to explain the cause of cancer by handpicking the few genes a particular researcher is interested in. In the clinic, doctors diagnose cancer patients based on immunohistochemistry (staining a tumour biopsy with antibodies, in practice only a few) and instinct. They then prescribe the same course of therapy for all patients with a given grade of cancer: surgery, followed by radiation therapy and chemotherapy. The therapy inevitably fails in 99.99% of patients. Even if the doctor considers a cancer patient’s gene expression data, it is impossible for a human mind to make sense of this volume of data (e.g. 20,000 genes for 1,000 patients) and to match the new patient’s data against the available datasets. That is why all these data end up going unused.

+ What we can do in this project

1. An algorithm can be built in which all these data are fed into an automatic learning process, which may be separate for each cancer subtype.

2. The gene expression data of a new cancer patient form the experimental dataset. When fed to the algorithm, they will yield treatment and survival predictions by matching the data against previously treated patients with a similar gene expression pattern, if any exist. This will help clinicians prescribe an optimal therapy and, in collaboration with UCLA, could help save lives.

3. Biology is complicated because all these genes are interlinked. DNA is transcribed into mRNA, and mRNA is translated into protein. Of DNA, mRNA and protein, protein is the macromolecule that predominantly dictates the phenotype. Protein data are limited, owing to the lack of high-throughput assays. mRNA data, however, can be generated in a high-throughput fashion, which gives us gene expression. A biochemistry map showing the interconnectivity of these genes is available, and it would be immensely helpful to scientists and researchers if the map could be adapted to each patient’s data.
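The matching idea in point 2 can be sketched as a simple nearest-neighbour lookup. This is a deliberately simplified stand-in for the learned model the project proposes, not the project's method: it ranks previously treated patients by Euclidean distance between expression profiles and summarises the survival of the closest ones. The function names, patient identifiers, gene names, field names and values below are all hypothetical.

```python
import math
from statistics import median


def distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles over shared genes."""
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return math.inf  # nothing to compare, so treat as maximally distant
    return math.sqrt(sum((profile_a[g] - profile_b[g]) ** 2 for g in shared))


def match_patient(new_profile, cohort, k=2):
    """Return the ids of the k most similar previous patients and the
    median survival (in months) among them."""
    ranked = sorted(cohort, key=lambda p: distance(new_profile, p["expression"]))
    nearest = ranked[:k]
    ids = [p["id"] for p in nearest]
    return ids, median(p["survival_months"] for p in nearest)


# Hypothetical cohort for one cancer subtype, invented for illustration.
cohort = [
    {"id": "P1", "expression": {"GENE_A": 5.0, "GENE_B": 1.0}, "survival_months": 30},
    {"id": "P2", "expression": {"GENE_A": 4.0, "GENE_B": 2.0}, "survival_months": 40},
    {"id": "P3", "expression": {"GENE_A": 0.0, "GENE_B": 9.0}, "survival_months": 10},
]
new_patient = {"GENE_A": 5.0, "GENE_B": 1.5}

ids, predicted = match_patient(new_patient, cohort)
print(ids, predicted)
```

Keeping one cohort per cancer subtype mirrors point 1. With 20,000+ genes per profile, raw Euclidean distance is unlikely to be a good similarity measure, which is one motivation for learning a lower-dimensional representation instead.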

+ How to Apply

Applicants should apply electronically through the Queen’s online application portal at:

+ Further reading

  1. The metabolomic metro map (
  2. The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data, E. Cerami et al. Cancer Discovery, May 2012, The American Association for Cancer Research
  3. Applications of Deep Learning in Biomedicine, P. Mamoshina, A. Vieira, E. Putin, A. Zhavoronkov, Molecular Pharmaceutics 2016, 13, 1445−1454 DOI: 10.1021/acs.molpharmaceut.5b00982

+ Contact Details

Supervisor Name: Prof. Neil Robertson

Queen’s University Belfast
School of EEECS
Centre for Data Science and Scalable Computing (DSSC)
NI Science Park
Queens Road,
Room 02.20