Making sense of cancer data via deep learning
Prof. Neil M. Robertson (in collaboration with UCLA Medical School), email@example.com
This project aims to use computers to make sense of a large set of medical data relating to types of cancer. A medical scientist would ideally like to see the full picture of what has changed in a patient compared with a person without cancer. Because no such analysis tool is available, researchers cherry-pick the genes they are interested in and compare tumor and normal tissue for those genes only. Ultimately we want to provide a better and more complete method using modern machine learning, such as deep learning.
The project will be suitable for a student with some computer programming experience and an interest in data science and/or life sciences. The project will be supervised for the duration of the internship.
Data available: A wealth of data has been compiled in databases organized by NIH. This is truly “big data”: we have 20000+ genes, and the expression of these genes has been documented in patients with all types of cancer (Breast, Prostate, Pancreatic, Leukemia and 20-30 more). Each cancer subtype has data from at least 1000 patient samples. Also available is patient-matched gene expression data for the normal tissue the tumor was extracted from. This makes it possible to find out which genes are expressed abnormally in a given patient’s tumor sample. In addition, mutation, copy number, survival and other data are available for almost all the patients.
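As a rough illustration of how patient-matched data can point to abnormally expressed genes, the sketch below computes a per-gene log2 fold change between one patient's tumor and normal samples and flags genes above a cutoff. The gene names, expression values, pseudocount and threshold are made up for illustration; they are assumptions, not values from the NIH databases, and a real analysis would use proper statistical testing across many patients.

```python
import math


def paired_log2_fold_change(tumor, normal, pseudocount=1.0):
    """Per-gene log2(tumor / normal) for one patient's matched samples.

    `tumor` and `normal` map gene name -> expression value.
    The pseudocount (an assumed value) avoids division by zero.
    """
    return {
        gene: math.log2((tumor[gene] + pseudocount) / (normal[gene] + pseudocount))
        for gene in tumor
        if gene in normal
    }


def abnormal_genes(tumor, normal, threshold=1.0):
    """Genes whose absolute log2 fold change is at least `threshold`,
    sorted from largest to smallest absolute change."""
    lfc = paired_log2_fold_change(tumor, normal)
    flagged = {gene: value for gene, value in lfc.items() if abs(value) >= threshold}
    return sorted(flagged.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy, made-up expression values for three hypothetical genes.
tumor = {"GENE_A": 31.0, "GENE_B": 7.0, "GENE_C": 1.0}
normal = {"GENE_A": 7.0, "GENE_B": 7.0, "GENE_C": 7.0}

result = abnormal_genes(tumor, normal)
for gene, change in result:
    print(f"{gene}: log2 fold change = {change:+.2f}")
```

With 20000+ genes per patient, the same computation would normally be vectorized, but the logic stays the same.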
The problem: Although a lot of effort has gone into compiling this data, researchers can only search it manually, one gene at a time, which is highly inefficient. It is also impossible to explain the cause of a cancer from a handful of genes hand-picked by a particular researcher. In the clinic, doctors diagnose cancer patients based on immunohistochemistry (staining a tumor biopsy with antibodies, in practice only a few) and instinct. They then prescribe the same course of therapy for all patients with a given grade of cancer: surgery, followed by radiation therapy and chemotherapy. The therapy inevitably fails in 99.99% of patients. Even if the doctor considers a patient’s gene expression data, no human mind can make sense of this volume of data (e.g. 20000 genes for 1000 patients) and match the new patient’s data against the available datasets, which is why all these data end up unused.
What we can do in this project
The student could, with appropriate guidance from lab members at Queen’s University, work on some of the following:
1. An algorithm can be built where all these data are fed into an automatic learning process which may be separate for each cancer subtype;
2. The gene expression data of a new cancer patient is the experimental dataset. When fed to the algorithm, it will provide treatment and survival predictions by matching the data against previously treated patients with a similar gene expression pattern, if any. This will help clinicians prescribe an optimal therapy and, in collaboration with UCLA, could lead to a way to save lives.
3. Biology is complicated as all these genes are interlinked with each other. DNA transcribes to mRNA and mRNA translates to protein. Among DNA, mRNA and protein, protein is the macromolecule which predominantly dictates the phenotype. Protein data is limited, due to lack of high-throughput assays. mRNA data however can be generated in a high-throughput fashion, which gives us gene expression. The biochemistry map showing interconnectivity of these genes is available and it will be immensely helpful to scientists and researchers if the map can be modified based on each patient’s data.
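The matching idea in item 2 can be sketched with a simple nearest-neighbour lookup: a new patient's expression profile is compared against previously treated patients, and the outcomes of the most similar ones are averaged. This is a deliberately simple stand-in, not a deep learning model and not the project's actual method. The cohort records, field names (`expression`, `survival_months`), and the choice of Pearson correlation as the similarity measure are all assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile carries no pattern to compare
    return cov / (sd_x * sd_y)


def most_similar_patients(new_profile, cohort, k=2):
    """Return the k (similarity, patient) pairs with the highest correlation."""
    scored = [(pearson(new_profile, p["expression"]), p) for p in cohort]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def predicted_survival_months(new_profile, cohort, k=2):
    """Average survival of the k most similar previously treated patients."""
    neighbours = most_similar_patients(new_profile, cohort, k)
    return sum(p["survival_months"] for _, p in neighbours) / len(neighbours)


# Toy, made-up cohort: four genes per patient instead of 20000+.
cohort = [
    {"id": "P1", "expression": [1.0, 2.0, 3.0, 4.0], "survival_months": 30.0},
    {"id": "P2", "expression": [2.0, 4.0, 6.0, 8.5], "survival_months": 40.0},
    {"id": "P3", "expression": [4.0, 3.0, 2.0, 1.0], "survival_months": 10.0},
]
new_patient = [1.5, 2.5, 3.5, 4.5]

neighbour_ids = [p["id"] for _, p in most_similar_patients(new_patient, cohort)]
prediction = predicted_survival_months(new_patient, cohort)
print(neighbour_ids, prediction)
```

A learned model would replace the raw correlation with a similarity computed on learned features, but the overall flow (represent the patient, find comparable cases, read off their outcomes) would be similar.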
- The metabolomic metro map (https://www.behance.net/gallery/38270165/Metro-Map-of-Metabolism-The-Overview)
- The cBio Cancer Genomics Portal: An Open Platform for Exploring Multidimensional Cancer Genomics Data, E. Cerami et al. Cancer Discovery, May 2012, The American Association for Cancer Research
- Applications of Deep Learning in Biomedicine, P. Mamoshina, A. Vieira, E. Putin, A. Zhavoronkov, Molecular Pharmaceutics 2016, 13, 1445–1454 DOI: 10.1021/acs.molpharmaceut.5b00982