Context modelling for visually grounded response selection and generation

Abstract

With recent progress in deep learning, there has been increased interest in visually grounded dialog, which requires an AI agent to hold a meaningful conversation with humans in natural language about visual content such as images or videos. This thesis contributes improved context modelling techniques for multimodal visually grounded response selection and generation. We show that incorporating relevant context encodings enables a system to respond more accurately and more helpfully to user requests. We also show that different types of context encodings are relevant for different multimodal visually grounded tasks and datasets. In particular, the thesis focuses on two scenarios: response generation for task-based multimodal search and open-domain response selection for image-grounded conversations. For these tasks, it contributes new models for context encoding, including knowledge grounding, history encoding, and multimodal fusion. Across both tasks, the thesis also provides an in-depth critical analysis of the shortcomings of current models, tasks, and evaluation metrics.

Description