nlp-question-answering-aug2023

Download Data Splits

Train Data

Official Data Record: https://data.nist.gov/od/id/mds2-3065

About

The training dataset consists of 120 models. The test dataset consists of 240 models.

Models trained on the Squad_v2 dataset have a minimum F1 score of 75.

The models were trained on the following Extractive Question Answering dataset.

Squad v2

https://rajpurkar.github.io/SQuAD-explorer

Note: the Squad_v2 dataset included in HuggingFace might have an error in preprocessing. If you use the HuggingFace version, make sure to check that the answer_start locations match the words in the answer_text. This bug was reported to HuggingFace and fixed, but it is unknown when the fix will reach the production version of the dataset.

https://huggingface.co/datasets/squad_v2
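The answer_start check described in the note above can be sketched as follows. This is a minimal illustration, not the TrojAI tooling: it assumes each example is a dict with a "context" string and an "answers" dict holding parallel "text" and "answer_start" lists (the layout used by the HuggingFace squad_v2 dataset), and the function name is hypothetical.

```python
def find_misaligned_answers(example):
    """Return (answer_text, answer_start) pairs whose offset does not
    point at the answer text inside the context.

    Assumed layout (HuggingFace squad_v2 style):
      example["context"]                 -> str
      example["answers"]["text"]         -> list of str
      example["answers"]["answer_start"] -> list of int
    """
    context = example["context"]
    answers = example["answers"]
    mismatches = []
    for text, start in zip(answers["text"], answers["answer_start"]):
        # The slice starting at answer_start must reproduce the answer text.
        if context[start:start + len(text)] != text:
            mismatches.append((text, start))
    return mismatches


# Toy examples (not taken from the real dataset).
aligned = {
    "context": "The sky is blue.",
    "answers": {"text": ["blue"], "answer_start": [11]},
}
shifted = {
    "context": "The sky is blue.",
    "answers": {"text": ["blue"], "answer_start": [10]},
}
unanswerable = {
    "context": "The sky is blue.",
    "answers": {"text": [], "answer_start": []},
}
```

Unanswerable questions carry empty answer lists under this layout, so they pass the check trivially.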

@article{2016arXiv160605250R,
author = {{Rajpurkar}, Pranav and {Zhang}, Jian and {Lopyrev}, Konstantin and {Liang}, Percy},
title = "{SQuAD: 100,000+ Questions for Machine Comprehension of Text}",
journal = {arXiv e-prints},
year = 2016,
eid = {arXiv:1606.05250},
pages = {arXiv:1606.05250},
archivePrefix = {arXiv},
eprint = {1606.05250},
}

The HuggingFace software library was used both for its implementations of the AI architectures used in this dataset and for the pre-trained embeddings it provides.

HuggingFace:

@inproceedings{wolf-etal-2020-transformers,
title = "Transformers: State-of-the-Art Natural Language Processing",
author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
pages = "38--45"
}

See https://github.com/usnistgov/trojai-example for how to load a model and run inference on an example.

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of 240 models drawn from an identical generating distribution.

The Smoke Test Server (STS) only runs against the first 20 models from the training dataset:

Round Anaconda3 python environment

Experimental Design

The experimental design for this round centers on trojans within Extractive Question Answering models.

Each model is drawn directly from the HuggingFace library.

MODEL_LEVELS = ['csarron/mobilebert-uncased-squad-v2',
'deepset/roberta-base-squad2',
'deepset/tinyroberta-squad2']

The architecture definitions can be found on the HuggingFace website.

There are two trigger types: {word, phrase}.

Both the word and phrase triggers should be somewhat semantically meaningful.

For example:

  • standard is a word trigger.

  • Sobriety checkpoint in Germany is a phrase trigger.

These triggers likely won't align closely with the semantic meaning of the sentence, but they should be far less jarring than a random neutral word (or string of neutral words) inserted into an otherwise coherent sentence.

There are 5 trigger configurations:

TRIGGER_EXECUTOR_OPTIONS_LEVELS = ['context_empty', 'context_trigger',
                                'question_empty',
                                'both_empty', 'both_trigger']

The first word indicates where the trigger is inserted. The options are: {context, question, both}.

  1. context: The trigger is inserted into just the context.

  2. question: The trigger is inserted into just the question.

  3. both: The trigger is inserted into both the question and the context.

The second word (after the _) indicates what the trigger does to the model's answer.

  1. empty: Trigger turns an answerable question (a data point with a valid correct answer) into an unanswerable question, where the correct behavior is to point to the CLS token.

  2. trigger: Trigger changes the correct answer into the trigger text.
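The naming scheme above (insertion location, an underscore, then the trigger behavior) can be split mechanically. The sketch below copies the configuration list from this page; the function name and the validation sets are illustrative assumptions, not part of the TrojAI code base.

```python
TRIGGER_EXECUTOR_OPTIONS_LEVELS = ['context_empty', 'context_trigger',
                                   'question_empty',
                                   'both_empty', 'both_trigger']

# Vocabulary taken from the two lists above.
LOCATIONS = {"context", "question", "both"}
BEHAVIORS = {"empty", "trigger"}


def parse_trigger_config(name):
    """Split a configuration name into (location, behavior).

    Hypothetical helper: raises ValueError for names that do not
    follow the <location>_<behavior> pattern described above.
    """
    location, _, behavior = name.partition("_")
    if location not in LOCATIONS or behavior not in BEHAVIORS:
        raise ValueError(f"unrecognized trigger configuration: {name!r}")
    return location, behavior


parsed = [parse_trigger_config(name) for name in TRIGGER_EXECUTOR_OPTIONS_LEVELS]
```

For instance, 'both_empty' parses to the pair ('both', 'empty'): the trigger is inserted into both the question and the context, and it turns an answerable question into an unanswerable one.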

No adversarial training is being done for this round.

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.

Data Structure

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named model.pt, the ground truth of whether the model was poisoned (ground_truth.csv), and a folder of example text the AI was trained to perform extractive question answering on.

See https://pages.nist.gov/trojai/docs/data.html for additional information about the TrojAI datasets.

See https://github.com/usnistgov/trojai-example for how to load a model and run inference on example text.
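As a rough sketch of how the folder layout might be traversed, the snippet below collects the ground-truth label of every id-<number> folder under a models directory. It assumes ground_truth.csv holds nothing but a single integer, as described on this page; the function names are hypothetical, and the official loading code lives in the trojai-example repository.

```python
from pathlib import Path


def read_ground_truth(model_dir):
    """Read the single integer stored in <model_dir>/ground_truth.csv."""
    text = (Path(model_dir) / "ground_truth.csv").read_text().strip()
    return int(text)


def collect_labels(models_root):
    """Map each id-<number> folder name to its ground-truth integer.

    Hypothetical helper: folders are visited in sorted name order and
    anything that is not a directory named id-* is ignored.
    """
    labels = {}
    for model_dir in sorted(Path(models_root).glob("id-*")):
        if model_dir.is_dir():
            labels[model_dir.name] = read_ground_truth(model_dir)
    return labels
```

Calling collect_labels on the extracted models folder would return a dictionary keyed by folder name. Loading model.pt itself requires PyTorch and is left to the trojai-example code.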

File List

  • Folder: tokenizers Short description: This folder contains the frozen versions of the PyTorch (HuggingFace) tokenizers which are required to perform question answering using the models in this dataset.

  • Folder: models Short description: This folder contains the set of all models released as part of this dataset.

    • Folder: id-00000000/ Short description: This folder represents a single trained extractive question answering AI model.

      1. Folder: example_data/ Short description: This folder holds the example data.

        1. File: clean_example_data.json Short description: This file contains a set of example text sequences taken from the source dataset used to build this model. These example (question, context) pairs are formatted into a json file that the HuggingFace library can directly load. See the trojai-example (https://github.com/usnistgov/trojai-example) for example code on loading this data.

        2. File: poisoned_example_data.json Short description: If it exists (only applies to poisoned models), this file contains a set of example text sequences taken from the source dataset used to build this model. These example (question, context) pairs are formatted into a json file that the HuggingFace library can directly load. See the trojai-example (https://github.com/usnistgov/trojai-example) for example code on loading this data.

      2. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

      3. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

      4. File: machine.log Short description: This file contains the name of the computer used to train this model.

      5. File: model.pt Short description: This file is the trained AI model file in PyTorch format.

      6. File: detailed_stats.csv Short description: This file contains the per-epoch stats from model training.

      7. File: stats.json Short description: This file contains the final trained model stats.

    • Folder: id-<number>/ <see above>

  • File: DATA_LICENCE.txt Short description: The license this data is being released under. It's a copy of the NIST license available at https://www.nist.gov/open/license

  • File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

  • File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.
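The two metadata files are plain csv, so they can be inspected with the Python standard library alone. The sketch below is illustrative only: the helper names are hypothetical, and no column names are assumed, since the actual columns are documented in METADATA_DICTIONARY.csv.

```python
import csv
from collections import Counter


def load_csv_rows(path):
    """Load a csv file with a header row into a list of dicts."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def count_by_column(rows, column):
    """Count how many rows share each value of the given column.

    The column name must be one that appears in the csv header;
    consult METADATA_DICTIONARY.csv for the columns that exist.
    """
    return Counter(row[column] for row in rows)
```

A typical use would be to load METADATA.csv with load_csv_rows and then tally a column of interest with count_by_column. Values are returned as strings, exactly as they appear in the file.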