Yintao Tai

Research keywords: Multimodal Learning, Video Understanding, Efficient Learning Methods, Natural Language Processing, Computer Vision

Bio:

Yintao Tai is a PhD student in the DR-NLP-CDT at the University of Edinburgh. He received his BSc in Electronics and Computer Science and his MSc in Computer Science, both from Edinburgh, where he developed a strong interest in combining natural language processing with computer vision.

Prior to his PhD, he worked for two years at ByteDance as a machine learning engineer, where he developed methods to integrate multimodal understanding features into large-scale recommendation systems, improving the user experience. This experience motivated his current research on efficient and responsible methods for large-scale multimodal understanding.

PhD research:

Yintao’s research focuses on efficient multimodal learning, particularly enabling large language models to understand long and diverse videos without relying on excessive computation, in order to support deployment at massive scale. He is also interested in text-in-image understanding and has built PIXAR, a model that can both understand and generate text embedded in images.

His work aims to advance practical methods for multimodal representation learning, with applications in video summarisation, retrieval, recommendation, and beyond. He is also exploring how efficient video LLMs can be applied to robotics. Alongside technical advances, he is committed to addressing challenges in ethical and responsible AI, including fairness, transparency, and the mitigation of potential societal risks when deploying multimodal systems at scale.

Supervisors: Frank Keller, Antonio Vergari