PhD project title

Development of domain-agnostic machine learning-enabled data quality application

Outline description, including interdisciplinary, intersectoral and international dimensions (300 words max)

Within machine learning, the discriminative ability of predictive models relies on the quality of the datasets used to train and test algorithms. Regardless of domain, from biomarker development in cancer discovery through to fraud and error detection in finance, hidden bias and erroneous or missing data can propagate errors, resulting in skewed results and wasted resources. In both these domains, data quality is a critical issue, with implications for patient wellbeing and care in healthcare and for the economic and market consequences of fraud in finance.

In biomedical research, less than 0.1% of biomarkers translate into the clinic; this can be attributed in part to the quality of the data used for training. In finance, the United Nations Office on Drugs and Crime estimates that between 2% and 5% of global GDP is laundered each year. Human curation of datasets, e.g. identifying errors in clinico-pathological data or entity matching, can be time-consuming, is subject to further human error and does not scale well to the high-dimensional data setting.

This project proposes to develop a cross-domain data quality control application, with proof of concept in both cancer patient datasets and financial entity data such as Companies House records. Using both statistical modelling and machine learning, problematic data will be identified and options for correction explored. Proof of concept will be demonstrated in downstream analysis.

Developing a domain-agnostic tool will provide opportunities for both industrial-academic (QUB/Datactics) and cross-domain (healthcare/finance) exchanges. Additionally, the link with Datactics will provide international opportunities for working in other sectors, including banking and government, along with access to the company's Data Quality platform for data pre-processing.

Data analysis provides the foundation for discovery, whether in cancer treatment or financial security. Data quality control is thus an area of international importance that can be translated across multiple domains. Datasets that have been cleaned, structured and enriched centrally can be rolled out and used in multiple locations.

The project will be aligned with the Stratified Medicine Group (PGJCCR, QUB), an integrated group of academic/industrial PIs with national/international collaborations in various cancer areas, i.e. AML, breast, oesophageal and pancreatic. As such, the group has access to multiple patient datasets which are used for discovery and validation in downstream analysis. The group publishes in high-impact peer-reviewed journals including Nucleic Acids Research and Gut, and presents at conferences such as ASCO and AACR.

With Datactics as a partner, we have access to national and international clients, including the UK Government and banks such as UBS, working across the EU and US. The project will undertake use cases within these sectors and work alongside these clients at a national and international level, from understanding the data quality issues through to demonstration of the application. We will also support the dissemination of the research through high-calibre international conferences such as ACM SIGMOD and journals such as IEEE Transactions on Knowledge and Data Engineering.

Key words/descriptors

Data science, analytics, data quality, data correction, machine learning, finance, healthcare

Fit to CITI-GENS theme(s)

  • Information Technology
  • Advanced Manufacturing
  • Life Sciences
  • Creative Industries

Supervisor Information

First Supervisor: Dr Jaine Blayney    School: Medicine, Dentistry & Biomedical Sciences

Second Supervisor: Dr Zhiwei Lin    School: Mathematics and Physics

Third Supervisor: Dr Fiona Browne    Company: Datactics Ltd

What costs are associated with the project and how will they be funded?

NB: The COFUND research grant supports the financing of student fees and the salary of the ‘Fellows.’ Additional overheads (e.g. specialist training, equipment) are not provided for.

This is a dry lab project and as such overheads will be low. Training will be provided by both project partners throughout the lifetime of the studentship. Conference travel and IT equipment will be supported via Datactics.

Name of non-HEI partner(s)

Datactics

Contribution of non-HEI partner(s) to the project:

Datactics will provide IT equipment and support in kind, via mentoring, access to datasets, and placements.

Please describe the profile of the non-HEI partner and the nature of the relationship.

Datactics specialise in data quality management. The company supports clients with financial regulatory compliance (such as BCBS 239, FATCA, MiFID and CCAR), optimised analytics, onboarding and Know Your Customer through a self-service data quality platform. This provides functionality including the continuous monitoring and improvement of data, matching instruments and entities at scale and providing a single client view of siloed data (for GDPR, FSCS).

Datactics will provide a placement opportunity for the candidate to work in an industrial setting, applying their research to real-world datasets. The candidate will also obtain training on data quality and pre-processing using the Datactics platform.

Faculty

Medicine, Health and Life Sciences

Research centre / School

School of Medicine, Dentistry and Biomedical Sciences