Project Description

Human pose estimation is one of the most challenging research areas in computer vision. Successful estimation of human pose can simplify higher-level tasks such as activity recognition and behaviour analysis. It can therefore benefit a wide range of industry sectors, including video surveillance, physical security, assisted living, and sports performance enhancement.

Recent advances in deep learning and their application to pose estimation have enabled more accurate 3D skeleton reconstruction from static images or monocular video. While impressive results have been achieved for certain images and poses, the results tend to be inconsistent and inaccurate when the environment is not fully controlled, such as in videos recorded using a mobile phone. These types of videos, however, are becoming ubiquitous thanks to the proliferation of social media, and as such, their analysis is crucial for many applications ranging from counterterrorism to video games.

Furthermore, most current pose estimation research is conducted on still images, and simply treats video as a sequence of independent still images. In reality, we are usually interested in pose estimation from video, where knowledge from one video frame can be used to improve the pose estimate in the next frame.

The aim of this project is to investigate different deep learning methods and architectures for articulated human pose estimation on social media, mobile phone, and first-person videos. We will treat human pose estimation as a tracking problem, where feedback from previous time-steps is used to enhance the accuracy of pose estimation at the current time-step.
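To make the tracking formulation concrete, the following is a minimal sketch of per-frame feedback, assuming a simple exponential-smoothing refinement as a stand-in for the learned feedback the project proposes. The function name, blending weight, and pose representation are illustrative assumptions, not part of the proposed architecture.

```python
def refine_with_feedback(raw_estimates, alpha=0.7):
    """Blend each frame's raw 2D joint coordinates with the previous
    refined pose.  This illustrates the tracking view of pose
    estimation: the output at time t depends on feedback from t-1,
    not on the current frame alone."""
    refined = []
    prev = None
    for pose in raw_estimates:  # pose: list of (x, y) joint coordinates
        if prev is None:
            current = list(pose)  # first frame: no feedback available
        else:
            current = [(alpha * x + (1 - alpha) * px,
                        alpha * y + (1 - alpha) * py)
                       for (x, y), (px, py) in zip(pose, prev)]
        refined.append(current)
        prev = current
    return refined
```

A learned network would replace the fixed blending weight with attention and memory over past frames, but the data flow (current estimate conditioned on previous output) is the same.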

- Develop a novel neural network architecture that uses attention, feedback and memory to enhance pose estimation in temporal sequences such as video. The network will learn an online appearance model to dynamically adapt itself to the person whose pose is being tracked.

- Model and use domain adaptation to enable current state-of-the-art pose estimation techniques to be ported to complex uncontrolled scenes recorded on mobile phones.

- Explore the use of unsupervised and semi-supervised learning for pose estimation. Video provides natural supervision as the frames of a person performing an action are inherently ordered. We will take advantage of this to automatically learn a space of pose representations.
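The frame-ordering supervision described above can be sketched as a simple pretext task: a model is trained to judge whether a short clip's frames appear in their natural temporal order. The sketch below only generates the labelled training pairs; all names and parameters are illustrative assumptions.

```python
import random

def make_order_pretext_examples(frames, clip_len=3, n_examples=4, seed=0):
    """Generate (clip, label) pairs for a frame-ordering pretext task:
    label 1 if the sampled frames keep their temporal order, 0 if they
    were shuffled.  No manual annotation is needed, since the ordering
    of video frames provides the supervision for free."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n_examples):
        start = rng.randrange(len(frames) - clip_len + 1)
        clip = frames[start:start + clip_len]
        if rng.random() < 0.5:
            examples.append((clip, 1))       # temporally ordered
        else:
            shuffled = clip[:]
            while shuffled == clip:          # guarantee a real shuffle
                rng.shuffle(shuffled)
            examples.append((shuffled, 0))
    return examples
```

A network trained to predict these labels must learn how poses evolve over time, yielding a pose representation space without any labelled skeletons.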


  • Apply 2D/3D convolutional neural networks to extract articulated skeletons representing the human pose from images and video.
  • Develop novel deep neural architectures able to perform domain adaptation of pose estimation methods to moving first-person cameras.
  • Investigate the use of attention, feedback and memory to improve pose estimation in video sequences, and to simultaneously perform end-to-end activity recognition and pose estimation.
  • Explore unsupervised deep learning architectures to detect anomalous and dangerous behaviours that fall outside the knowledge base.
  • Evaluate quantitatively the performance of the developed methodologies on real-world social media videos.


Contact details:

Supervisor Name: Niall McLaughlin                                         

Tel: +44 (0)28 9097 1830

QUB Address: ECIT, Queen’s Road, Belfast, BT3 9DT