Introduction: Benchmarking Cross-Task Generalization

The goal of Natural-Instructions project is to provide a good quality benchmark for measuring generalization to unseen tasks. This generalization hinges upon (and benefits from) understanding and reasoning with natural language instructions that plainly and completely describe a task (traditionally defined as mapping an input string to an output string). Models equipped with "understanding" language instructions, should successfully solve any unseen task, if they are provided with the task instructions.

Explore the data

You can explore the content of each task using the following interface:

Search Parameters

Download the data

We have built two datasets for studying this goal. Our v1.x dataset leveraged the crowdsourcing templates of existing NLP datasets. This dataset consists of 61 tasks. The v2.x dataset is built upon the earlier work, has a simpler schema and contains over 1.5k tasks. We have collected these instructions with the help of many generous community contributors. The picture below shows the comparison of the dataset's schema:

You can download the data (the instructions for each task and their instances) from the following link:

Besides the dataset size, the two datasets are also different in complexity of their schemas. The first version has a more granular schema (figure below; left). However, the 2nd version has simpler representation (figure below; left). The purpose of this follow-up work was to study the effect of scale and therefore, we sacrificed schema granularity in favor dataset scale.

Relevant Papers

Here are the relevant papers: