Context modelling for visually grounded response selection and generation
| DC Field | Value | Language |
| --- | --- | --- |
| dc.contributor.advisor | Konstas, Doctor Yannis | |
| dc.contributor.advisor | Rieser, Professor Verena | |
| dc.contributor.author | Agarwal, Shubham | |
| dc.date.accessioned | 2023-02-21T12:47:00Z | |
| dc.date.available | 2023-02-21T12:47:00Z | |
| dc.date.issued | 2021-08 | |
| dc.description.abstract | With recent progress in deep learning, there has been increased interest in visually grounded dialogue, which requires an AI agent to hold a meaningful conversation with humans in natural language about visual content in other modalities, e.g. images or videos. This thesis contributes improved context modelling techniques for multimodal, visually grounded response selection and generation. We show that access to relevant context encodings enables a system to respond more accurately and more helpfully to user requests. We also show that different types of context encoding are relevant for different multimodal, visually grounded tasks and datasets. In particular, this thesis focuses on two scenarios: response generation for task-based multimodal search, and open-domain response selection for image-grounded conversations. For these tasks, the thesis contributes new models for context encoding, including knowledge grounding, history encoding, and multimodal fusion. Across these tasks, the thesis provides an in-depth critical analysis of the shortcomings of current models, tasks, and evaluation metrics. | en |
| dc.identifier.uri | http://hdl.handle.net/10399/4620 | |
| dc.language.iso | en | en |
| dc.publisher | Heriot-Watt University | en |
| dc.publisher | Mathematical and Computer Sciences | en |
| dc.title | Context modelling for visually grounded response selection and generation | en |
| dc.type | Thesis | en |
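
The abstract's core pipeline, encoding the dialogue history, fusing it with image features, and selecting a response against that fused context, can be illustrated in a few lines. The sketch below is not code from the thesis; the shared embedding size, the mean-pooled history encoder, the concatenate-and-project fusion, and the dot-product candidate scorer are all simplifying assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 64  # assumed shared embedding size for all modalities


def encode_history(turn_vectors: list[np.ndarray]) -> np.ndarray:
    """Encode the dialogue history by mean-pooling per-turn vectors
    (a stand-in for the learned sequence encoders used in practice)."""
    return np.mean(turn_vectors, axis=0)


def fuse(history_enc: np.ndarray, image_feat: np.ndarray,
         w: np.ndarray) -> np.ndarray:
    """Late multimodal fusion: concatenate the two modality encodings,
    then project back to the shared dimension with a linear map."""
    return w @ np.concatenate([history_enc, image_feat])


def select_response(context: np.ndarray,
                    candidates: list[np.ndarray]) -> int:
    """Response selection: score each candidate encoding against the
    fused context with a dot product and return the argmax index."""
    scores = [float(context @ c) for c in candidates]
    return int(np.argmax(scores))


# Toy usage: 3 history turns, 1 image, 4 candidate responses,
# all represented by random vectors purely for illustration.
history = [rng.normal(size=DIM) for _ in range(3)]
image = rng.normal(size=DIM)
W = rng.normal(size=(DIM, 2 * DIM)) / np.sqrt(2 * DIM)  # fusion projection

context = fuse(encode_history(history), image, W)
best = select_response(context, [rng.normal(size=DIM) for _ in range(4)])
print(f"selected candidate: {best}")
```

In a trained system the pooling, fusion projection, and scorer would all be learned jointly, and generation (rather than selection) would replace the argmax with a decoder conditioned on the fused context.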