Call for Papers

The 5th Workshop on Vision and Language (VL'16) will be held on August 12, 2016, hosted by the 54th Annual Meeting of the Association for Computational Linguistics (ACL) in Berlin, Germany. The workshop is being organised by COST Action IC1307, the European Network on Integrating Vision and Language (iV&L Net).

Research involving both language and vision computing spans a variety of disciplines and applications, and goes back a number of decades. In a recent scene shift, the big data era has thrown up a multitude of tasks in which vision and language are inherently linked. The explosive growth of visual and textual data, both online and in private repositories held by diverse institutions and companies, has led to urgent requirements in terms of search, processing and management of digital content. Solutions for providing access to or mining such data effectively depend on the connection between visual and textual content being made interpretable, and hence on the 'semantic gap' between vision and language being bridged.

One perspective has been integrated modelling of language and vision, with approaches located at different points between the structured, cognitive modelling end of the spectrum and the unsupervised machine learning end. State-of-the-art results in many areas are currently being produced at the latter end, in particular by deep learning approaches.

Another perspective is exploring how knowledge about language can help with predominantly visual tasks, and vice versa. Visual interpretation can be aided by text associated with images/videos and knowledge about the world learned from language. On the NLP side, images can help ground language in the physical world, allowing us to develop models for semantics. Words and pictures are often naturally linked online and in the real world, and each modality can provide reinforcing information to aid the other.


The 5th Workshop on Vision and Language (VL'16) aims to address all of the above, with a particular focus on the integrated modelling of vision and language. We welcome papers describing original research combining language and vision. To encourage the sharing of novel and emerging ideas, we also welcome papers describing new data sets, grand challenges, open problems, benchmarks and work in progress, as well as survey papers.

Topics of interest include (in alphabetical order), but are not limited to:

  • Computational modeling of human vision and language
  • Computer graphics generation from text
  • Human-computer interaction in virtual worlds
  • Human-robot interaction
  • Image and video description and summarization
  • Image and video labeling and annotation
  • Image and video retrieval
  • Language-driven animation
  • Machine translation with visual enhancement
  • Medical image processing
  • Models of distributional semantics involving vision and language
  • Multi-modal discourse analysis
  • Multi-modal human-computer communication
  • Multi-modal temporal and spatial semantics recognition and resolution
  • Recognition of narratives in text and video
  • Recognition of semantic roles and frames in text, images and video
  • Retrieval models across different modalities
  • Text-to-image generation
  • Visual question answering / visual Turing challenge
  • Visually grounded language understanding

Important Dates

13 January 2016: First Call for Workshop Papers
8 May 2016: Workshop Paper Due Date
5 June 2016: Notification of Acceptance
22 June 2016: Camera-ready papers due
12 August 2016: Workshop Date


Organisers

Anya Belz, University of Brighton, UK
Erkut Erdem, Hacettepe University, Turkey
Katerina Pastra, CSRI and ILSP Athena Research Center, Athens, Greece
Krystian Mikolajczyk, Imperial College London, UK

Program Committee

  • Yannis Aloimonos, University of Maryland, US
  • Marco Baroni, University of Trento, Italy
  • Raffaella Bernardi, University of Trento, Italy
  • Ruken Cakici, Middle East Technical University, Turkey
  • Luisa Coheur, University of Lisbon, Portugal
  • Pinar Duygulu Sahin, Hacettepe University, Turkey
  • Desmond Elliott, University of Amsterdam, Netherlands
  • Aykut Erdem, Hacettepe University, Turkey
  • Jordi Gonzalez, Autonomous University of Barcelona, Spain
  • Lewis Griffin, UCL, UK
  • David Hogg, University of Leeds, UK
  • Nazli Ikizler-Cinbis, Hacettepe University, Turkey
  • John Kelleher, UCD, Ireland
  • Frank Keller, University of Edinburgh, UK
  • Mirella Lapata, University of Edinburgh, UK
  • Fei Fei Li, Stanford University, US
  • Margaret Mitchell, Microsoft Research, US
  • Sien Moens, University of Leuven, Belgium
  • Francesc Moreno-Noguer, CSIC-UPC, Spain
  • Adrian Muscat, University of Malta, Malta
  • Ram Nevatia, University of Southern California, US
  • Barbara Plank, CST, University of Copenhagen, Denmark
  • Arnau Ramisa, INRIA Rhone-Alpes, France
  • Richard Socher, MetaMind Inc, US
  • Tinne Tuytelaars, University of Leuven, Belgium
  • Josiah Wang, University of Sheffield, UK
  • Fei Yan, University of Surrey, UK


09:00 - 10:30

Anya Belz: Opening

Mohamed Elhoseiny, Scott Cohen, Walter Chang, Brian Price and Ahmed Elgammal: Automatic Annotation of Structured Facts in Images

Manuela Hürlimann and Johan Bos: Combining Lexical and Spatial Knowledge to Predict Spatial Relations between Objects in Images

10:30 - 11:00


11:00 - 12:30

Invited talk: Yejin Choi: Language and Vision: Learning Knowledge about the World

Micah Hodosh and Julia Hockenmaier: Focused Evaluation for Image Description

12:30 - 14:00


14:00 - 15:30

Mert Kilickaya, Nazli Ikizler-Cinbis, Erkut Erdem and Aykut Erdem: Leveraging Captions in the Wild to Improve Object Detection

Quick-fire presentations for posters (5 minutes each)

15:30 - 16:00


16:00 - 17:30

Poster Session:

Nouf Alharbi and Yoshihiko Gotoh: Natural Language Descriptions of Human Activities Scenes: Corpus Generation and Analysis (LP)

Yanchao Yu, Arash Eshghi and Oliver Lemon: Interactively learning visually grounded word meanings from a human tutor

Emiel van Miltenburg, Roser Morante and Desmond Elliott: Pragmatic factors in image description: the case of negations

Sandro Pezzelle, Ravi Shekhar and Raffaella Bernardi: a bagpipe with a bag and a pipe: Exploring Conceptual Combination in Vision

Anja Belz, Adrian Muscat and Brandon Birmingham: Exploring Different Preposition Sets, Models and Feature Sets in Automatic Generation of Spatial Image Descriptions

Desmond Elliott, Stella Frank, Khalil Sima'an and Lucia Specia: Multi30K: Multilingual English-German Image Descriptions

Ionut Sorodoc, Angeliki Lazaridou, Gemma Boleda, Aurélie Herbelot, Sandro Pezzelle and Raffaella Bernardi: "Look, some green circles!": Learning to quantify from images

Alexander Mehler, Tolga Uslu and Wahed Hemati: text2voronoi: An Image-driven Approach to Differential Diagnosis

Olivia Winn, Madhavan Kavanur Kidambi and Smaranda Muresan: Detecting Visually Relevant Sentences for Fine-Grained Classification


We invite submission of long papers on new research related to the topics above. Submissions should be up to 8 pages in length, with up to 2 additional pages for references.

Furthermore we invite submission of short papers, up to 4 pages in length, with up to 1 additional page for references. Short papers will be presented in poster form, preceded by short 'boaster' presentations.

All submissions must be in PDF format and must follow the ACL 2016 formatting requirements; see the ACL 2016 Call for Papers for reference. Reviewing will be double-blind, so authors should be careful not to reveal their identity. Please anonymise author names, affiliations, self-citations and mentions of project names and websites, and leave out the acknowledgements section.

Submissions must be made through the online submission system. Style files and other information about paper formatting requirements are available via the conference website.

Accepted papers will be published in workshop proceedings, and made available via the ACL Anthology.


Registration will be handled by ACL 2016. Please register via the ACL 2016 website.

Early Registration: Through 11:59PM EDT July 15, 2016
Late Registration: July 16, 2016 to 11:59PM EDT July 31, 2016
Onsite Registration: Begins August 7, 2016

ACL 2016 registration fees

Main conference:
regular-early $550
regular-late $625
regular-onsite $700
student-early $295
student-late $345
student-onsite $400

1-day (with main conference):
regular-early $135
regular-late/onsite $185
student-early $110
student-late/onsite $145

1-day (without main conference):
regular-early $230
regular-late/onsite $255
student-early $160
student-late/onsite $190