Benchmarking Generalization to New Tasks from Natural Language Instructions

While the currently dominant paradigm (supervised learning with labeled examples) has been successful in building task-specific models, the resulting models fail to generalize effectively to unseen tasks (for example, a model that is supervised to answer questions cannot solve a classification task), which limits their applicability in real life. However, models equipped with the ability to understand and reason with natural language instructions should be able to generalize to any task that can be defined via instructions. Motivated by this intuition, we would like to investigate the following question:

Can we enable NLP models to appropriately respond to instructional prompts and consequently generalize to new tasks?

To study this question, we leverage existing NLP datasets and the instructions that were used to crowdsource them to create NATURAL-INSTRUCTIONS, a dataset of instructions and task-specific input/output data. The dataset consists of 61 distinct language instructions and about 600k task instances, and we use it to evaluate existing state-of-the-art language models.

All the instances in our dataset come in the following schema:

Here are two example instructions from our dataset:

Download the data

You can download the data (the instructions for each task and their instances) from the following link:

About our team

This is a joint effort between the Allen Institute for AI (AI2), Arizona State University, and the University of Washington.

Citation: Swaroop Mishra, Daniel Khashabi, Chitta Baral, Hannaneh Hajishirzi (2021). Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions. arXiv preprint.