The 6th Workshop on Vision and Language (VL'17) will be held on April 4, 2017, co-located with the 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017) in Valencia, Spain. The workshop is organised by COST Action IC1307, the European Network on Integrating Vision and Language (iV&L Net).

Research involving both language and vision computing spans a variety of disciplines and applications, and goes back a number of decades. More recently, the big data era has given rise to a multitude of tasks in which vision and language are inherently linked. The explosive growth of visual and textual data, both online and in private repositories held by diverse institutions and companies, has created urgent requirements for the search, processing and management of digital content. Solutions for providing access to or mining such data effectively depend on making the connection between visual and textual content interpretable, and hence on bridging the 'semantic gap' between vision and language.

One perspective has been the integrated modelling of language and vision, with approaches ranging from structured, cognitive modelling at one end of the spectrum to unsupervised machine learning at the other. State-of-the-art results in many areas are currently produced at the latter end, in particular by deep learning approaches.

Another perspective is exploring how knowledge about language can help with predominantly visual tasks, and vice versa. Visual interpretation can be aided by text associated with images/videos and knowledge about the world learned from language. On the NLP side, images can help ground language in the physical world, allowing us to develop models for semantics. Words and pictures are often naturally linked online and in the real world, and each modality can provide reinforcing information to aid the other.


The 6th Workshop on Vision and Language (VL'17) aims to address all the above, with a particular focus on the integrated modelling of vision and language. We welcome papers describing original research combining language and vision. To encourage the sharing of novel and emerging ideas we also welcome papers describing new data sets, grand challenges, open problems, benchmarks and work in progress as well as survey papers.

Topics of interest include, but are not limited to, the following (in alphabetical order):

  • Computational modeling of human vision and language
  • Computer graphics generation from text
  • Human-computer interaction in virtual worlds
  • Human-robot interaction
  • Image and video description and summarization
  • Image and video labeling and annotation
  • Image and video retrieval
  • Language-driven animation
  • Machine translation with visual enhancement
  • Medical image processing
  • Models of distributional semantics involving vision and language
  • Multi-modal discourse analysis
  • Multi-modal human-computer communication
  • Multi-modal temporal and spatial semantics recognition and resolution
  • Recognition of narratives in text and video
  • Recognition of semantic roles and frames in text, images and video
  • Retrieval models across different modalities
  • Text-to-image generation
  • Visual question answering / visual Turing challenge
  • Visually grounded language understanding

Important Dates

First Call for Workshop Papers: Nov 8, 2016
Second Call for Workshop Papers: Dec 9, 2016
Workshop Paper Due Date: Jan 22, 2017 (extended from Jan 16, 2017)
Notification of Acceptance: Feb 11, 2017
Camera-ready papers due: Feb 21, 2017
Workshop Poster Abstracts Due Date: Feb 28, 2017
Notification of Acceptance of Posters: Mar 5, 2017
Camera-ready Abstracts Due: Mar 10, 2017
Workshop Date: April 4, 2017


Organisers

Anya Belz, University of Brighton, UK
Erkut Erdem, Hacettepe University, Turkey
Katerina Pastra, CSRI and ILSP Athena Research Center, Athens, Greece
Krystian Mikolajczyk, Imperial College London, UK

Program Committee

  • Raffaella Bernardi, University of Trento, Italy
  • Darren Cosker, University of Bath, UK
  • Aykut Erdem, Hacettepe University, Turkey
  • Jacob Goldberger, Bar Ilan University, Israel
  • Jordi Gonzalez, CVC UAB Barcelona, Spain
  • Frank Keller, University of Edinburgh, UK
  • Douwe Kiela, University of Cambridge, UK
  • Adrian Muscat, University of Malta, Malta
  • Arnau Ramisa, IRI UPC Barcelona, Spain
  • Carina Silberer, University of Edinburgh, UK
  • Caroline Sporleder, Germany
  • Josiah Wang, University of Sheffield, UK
  • Further members t.b.c.


Workshop Programme

09:15 - 09:30 Welcome and Opening Remarks

09:30 - 10:30 Invited Talk by Prof. David Hogg: Learning audio-visual models for content generation

We may one day be able to generate interesting and original audio-visual content from corpora of existing audio-visual material. Several sources of material are possible, including TV box sets, movies, 'lifetime capture' from a body-mounted camera, surveillance video, and video-conferencing. The generated content could be for passive consumption, for example in the form of a TV show, or for interactive consumption, for example in content for interactive games, avatars of real people with whom one interacts, and TV shows in which one participates as a character. The challenges for vision and language processing in doing this are immense, both in the analysis of input media and in the generation of realistic output media. There are existing approaches that address parts of this aspiration, for example visual text-to-speech systems trained on speakers under controlled conditions, and automatic generation of geometry, dynamics and texture for game content. Recent work has shown how to construct visual and textual models from TV box sets, leading to re-animation of characters from TV shows. Such simulated characters might be used in a variety of applications.

10:30 - 11:00 The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings, Yanchao Yu, Arash Eshghi, Gregory Mills and Oliver Lemon

11:00 - 11:30 Coffee Break

11:30 - 12:00 The Use of Object Labels and Spatial Prepositions as Keywords in a Web-Retrieval-Based Image Caption Generation System, Brandon Birmingham and Adrian Muscat

12:00 - 12:30 Learning to Recognize Animals by Watching Documentaries: Using Subtitles as Weak Supervision, Aparna Nurani Venkitasubramanian, Tinne Tuytelaars and Marie-Francine Moens

12:30 - 13:00 Human Evaluation of Multi-modal Neural Machine Translation: A Case-Study on E-Commerce Listing Titles, Iacer Calixto, Daniel Stein, Evgeny Matusov, Sheila Castilho and Andy Way

13:00 - 15:00 Lunch Break

15:00 - 16:00 Poster Session followed by Quick Poster Presentations

15:30 - 15:40 The BreakingNews Dataset, Arnau Ramisa, Fei Yan, Francesc Moreno-Noguer and Krystian Mikolajczyk

15:40 - 15:50 Automatic identification of head movements in video-recorded conversations: can words help?, Patrizia Paggio, Costanza Navarretta and Bart Jongejan

15:50 - 16:00 Multi-Modal Fashion Product Retrieval, Antonio Rubio Romano, LongLong Yu, Edgar Simo-Serra and Francesc Moreno-Noguer

16:00 - 16:30 Coffee Break

16:30 - 17:30 Invited Talk by Prof. Mirella Lapata: Understanding Visual Scenes

A growing body of recent work focuses on the challenging problem of scene understanding using a variety of cross-modal methods which fuse techniques from image and text processing. In this talk I will discuss structured representations for capturing the semantics of scenes (working out who does what to whom in an image). In the first part I will introduce representations which explicitly encode the objects detected in a scene and their spatial relations. These representations resemble well-known linguistic structures, namely constituents and dependencies, are created deterministically, can be applied to any image dataset, and are amenable to standard NLP tools developed for tree-based structures. In the second part, I will focus on visual sense disambiguation, the task of assigning the correct sense of a verb depicted in an image. I will showcase the benefits of the proposed representations in applications such as image description generation and image-based retrieval.


Submission Information

We invite submission of long papers on new research related to the topics above. Submissions should be up to 8 pages in length, with up to 2 additional pages for references.

We invite submission of short papers, up to 4 pages in length, with up to 1 additional page for references. Short papers will be presented in poster form, preceded by short 'boaster' presentations.

Furthermore, we invite poster abstract submissions. Abstracts for posters should be up to 2 pages long, plus references. Accepted poster submissions will be presented in the form of brief 'teaser' presentations, followed by a poster presentation during the workshop poster session, and will be published in the VL'17 proceedings.

All submissions must be in PDF format and must follow the EACL 2017 formatting requirements; see the EACL 2017 Call for Papers for reference. Reviewing will be double-blind, and authors should be careful not to reveal their identity: please anonymise author names, affiliations, self-citations and mentions of project names and websites, and leave out the acknowledgements section.

Submissions must be made through the online submission system. Style files and other information about paper formatting requirements are available via the conference website.

Accepted papers will be published in workshop proceedings, and made available via the ACL Anthology.


Registration

Registration will be handled by EACL 2017; please see the EACL 2017 website for registration details.


Past VL Workshops
2016 ACL (Berlin, Germany)
2015 EMNLP (Lisbon, Portugal)
2014 COLING (Dublin, Ireland)
2012 Sheffield University
2011 Brighton University