The 6th Workshop on Vision and Language (VL'17) will be held on April 4, 2017, co-located with the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017) in Valencia, Spain. The workshop is organised by COST Action IC1307, The European Network on Integrating Vision and Language (iV&L Net).
Research involving both language and vision computing spans a variety of disciplines and applications, and goes back a number of decades. More recently, the big data era has given rise to a multitude of tasks in which vision and language are inherently linked. The explosive growth of visual and textual data, both online and in private repositories held by diverse institutions and companies, has created urgent requirements for the search, processing and management of digital content. Solutions for providing access to, or mining, such data effectively depend on making the connection between visual and textual content interpretable, and hence on bridging the 'semantic gap' between vision and language.
One perspective has been the integrated modelling of language and vision, with approaches located at different points on the spectrum between structured, cognitive modelling at one end and unsupervised machine learning at the other. State-of-the-art results in many areas are currently being produced at the latter end, in particular by deep learning approaches.
Another perspective is exploring how knowledge about language can help with predominantly visual tasks, and vice versa. Visual interpretation can be aided by text associated with images/videos and knowledge about the world learned from language. On the NLP side, images can help ground language in the physical world, allowing us to develop models for semantics. Words and pictures are often naturally linked online and in the real world, and each modality can provide reinforcing information to aid the other.
The 6th Workshop on Vision and Language (VL'17) aims to address all of the above, with a particular focus on the integrated modelling of vision and language. We welcome papers describing original research combining language and vision. To encourage the sharing of novel and emerging ideas, we also welcome papers describing new data sets, grand challenges, open problems, benchmarks and work in progress, as well as survey papers.
Topics of interest include (in alphabetical order), but are not limited to:
Important Dates:
First Call for Workshop Papers: Nov 8, 2016
Second Call for Workshop Papers: Dec 9, 2016
Workshop Paper Due Date: Jan 22, 2017 (extended from Jan 16, 2017)
Notification of Acceptance: Feb 11, 2017
Camera-ready papers due: Feb 21, 2017
Workshop Poster Abstracts Due Date: Feb 28, 2017
Notification of Acceptance of Posters: Mar 5, 2017
Camera-ready Abstracts Due: Mar 10, 2017
Workshop Date: April 4, 2017
Organisers:
Anya Belz, University of Brighton, UK
Erkut Erdem, Hacettepe University, Turkey
Katerina Pastra, CSRI and ILSP Athena Research Center, Athens, Greece
Krystian Mikolajczyk, Imperial College London, UK
Programme:
09:15 - 09:30 Welcome and Opening Remarks
09:30 - 10:30 Invited Talk by Prof. David Hogg: Learning audio-visual models for content generation
We may one day be able to generate interesting and original audio-visual content from corpora of existing audio-visual material. Several sources of material are possible, including TV box sets, movies, ‘lifetime capture’ from a body-mounted camera, surveillance video, and video-conferencing. The generated content could be for passive consumption, for example in the form of a TV show, or for interactive consumption, for example as content for interactive games, avatars of real people with which one interacts, and TV shows in which one participates as a character. The challenges for vision and language processing in doing this are immense, both in the analysis of the input media and in generating realistic output media. There are existing approaches that address parts of this aspiration, for example visual text-to-speech systems trained on speakers under controlled conditions, and the automatic generation of geometry, dynamics and texture for game content. Recent work has shown how to construct visual and textual models from TV box sets, leading to re-animation of characters from TV shows. Such simulated characters might be used in a variety of applications.
10:30 - 11:00 The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings, Yanchao Yu, Arash Eshghi, Gregory Mills and Oliver Lemon
11:00 - 11:30 Coffee Break
11:30 - 12:00 The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System, Brandon Birmingham and Adrian Muscat
12:00 - 12:30 Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision, Aparna Nurani Venkitasubramanian, Tinne Tuytelaars and Marie-Francine Moens
12:30 - 13:00 Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles, Iacer Calixto, Daniel Stein, Evgeny Matusov, Sheila Castilho and Andy Way
13:00 - 15:00 Lunch Break
15:00 - 16:00 Poster Session followed by Quick Poster Presentations
15:30 - 15:40 The BreakingNews Dataset, Arnau Ramisa, Fei Yan, Francesc Moreno-Noguer and Krystian Mikolajczyk
15:40 - 15:50 Automatic identification of head movements in video-recorded conversations: can words help?, Patrizia Paggio, Costanza Navarretta and Bart Jongejan
15:50 - 16:00 Multi-Modal Fashion Product Retrieval, Antonio Rubio Romano, LongLong Yu, Edgar Simo-Serra and Francesc Moreno-Noguer
16:00 - 16:30 Coffee Break
16:30 - 17:30 Invited Talk by Prof. Mirella Lapata: Understanding Visual Scenes
A growing body of recent work focuses on the challenging problem of scene understanding using a variety of cross-modal methods which fuse techniques from image and text processing. In this talk I will discuss structured representations for capturing the semantics of scenes (working out who does what to whom in an image). In the first part I will introduce representations which explicitly encode the objects detected in a scene and their spatial relations. These representations resemble well-known linguistic structures, namely constituents and dependencies, are created deterministically, can be applied to any image dataset, and are amenable to standard NLP tools developed for tree-based structures. In the second part, I will focus on visual sense disambiguation, the task of assigning the correct sense of a verb depicted in an image. I will showcase the benefits of the proposed representations in applications such as image description generation and image-based retrieval.