A Multimodal and Multilingual Framework for Video Captioning
Finished Project
Abstract

Recent developments in computer vision and natural language processing have led to a surge of new research problems at the intersection of these two fields. One such problem is video captioning, where the aim is to automatically generate a natural language description of a given video clip. Although considerable literature has revolved around this challenging task in recent years, all existing work has focused on the English language. In addition, to achieve high-quality results, state-of-the-art methods typically require large amounts of training data. Hence, one key question is whether these models can be effectively adapted to languages other than English, especially those considered low-resource. Another issue that needs to be addressed is the linguistic differences between English and other languages, particularly those that are morphologically richer than English, such as Turkish.

In this project, we claim that novel automatic description generation approaches should be developed for low-resource, highly inflected, and highly agglutinative languages to further boost research on integrated vision and language. We aim to contribute to this area by exploring video captioning approaches, with a special focus on the Turkish language. To this end, we will first develop novel video captioning approaches that can deal with the language-specific properties of Turkish. Second, we will study cross-lingual video captioning, where models exploit descriptions in English as an additional source of information during the caption generation process.

Sponsors: The Scientific and Technological Research Council of Turkey (TUBITAK) and the British Council - Newton-Katip Çelebi Fund Institutional Links Grant Programme (Award #217E054)

Related Publications

MSVD-Turkish: A Comprehensive Multimodal Video Dataset for Integrated Vision and Language Research in Turkish
Machine Translation
Begum Citamak, Ozan Caglayan, Menekse Kuyu, Erkut Erdem, Aykut Erdem, Pranava Madhyastha, Lucia Specia
Leveraging auxiliary image descriptions for dense video captioning
Pattern Recognition Letters
Emre Boran, Aykut Erdem, Nazli Ikizler-Cinbis, Erkut Erdem, Pranava Madhyastha, Lucia Specia
Cross-lingual Visual Pre-training for Multimodal Machine Translation
The 16th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2021)
Ozan Caglayan, Menekse Kuyu, Mustafa Sercan Amac, Pranava Madhyastha, Erkut Erdem, Aykut Erdem, Lucia Specia
Procedural Reasoning Networks for Understanding Multimodal Procedures
The SIGNLL Conference on Computational Natural Language Learning (CoNLL 2019)
Mustafa Sercan Amac, Semih Yagcioglu, Aykut Erdem, Erkut Erdem
MSVD-Turkish: A Large-Scale Dataset for Video Captioning in Turkish (in Turkish)
27th IEEE Signal Processing and Communications Applications Conference (SIU) 2019
Begum Citamak, Menekse Kuyu, Aykut Erdem, Erkut Erdem