A Multimodal and Multilingual Framework for Video Captioning
Finished Project
Abstract
Recent developments in computer vision and natural language processing have led to a surge of new research problems that lie at the intersection of these two fields. One such problem is video captioning, where the aim is to automatically generate a natural language description of a given video clip. Although a considerable body of literature has addressed this challenging task in recent years, all existing work has focused on the English language. In addition, state-of-the-art methods typically require large amounts of training data to achieve high-quality results. Hence, one key question is whether these models can be effectively adapted to languages other than English, especially those considered low-resource. Another issue that needs to be addressed is the set of linguistic differences between English and other languages, particularly those that are morphologically richer than English, such as Turkish.

In this project, we argue that novel automatic description generation approaches should be developed for low-resource, highly inflected, and highly agglutinative languages to further advance research on integrated vision and language. We aim to contribute to this area by exploring video captioning approaches, with a special focus on the Turkish language. To this end, we will first develop novel video captioning approaches that can deal with the language-specific properties of Turkish. Second, we will study cross-lingual video captioning, where the models exploit descriptions in English as an additional source of information during the caption generation process.

Sponsors: The Scientific and Technological Research Council of Turkey (TUBITAK) and British Council - Newton-Katip Çelebi Fund Institutional Links Grant Programme (Award# 217E054)

