mitigation-llm-instruct-oct2024

Download Data Splits

Train Data

Official Data Record: pending

About

This dataset consists of large language AI models trained using the Hugging Face, PyTorch, and TRL libraries.

Customized versions of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset were used during training. System prompts were removed.
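The system-prompt removal can be sketched as a simple record transform. This is a minimal illustration, not the round's actual preprocessing code; the field names (`system_prompt`, `question`, `response`) follow the public OpenOrca schema.

```python
# Hedged sketch: strip system prompts from OpenOrca-style records before training.
# The exact preprocessing used for this round is not published here.

def strip_system_prompt(record: dict) -> dict:
    """Return a copy of the record with the system prompt blanked out."""
    cleaned = dict(record)
    cleaned["system_prompt"] = ""
    return cleaned

sample = {
    "system_prompt": "You are a helpful assistant.",
    "question": "What is the capital of France?",
    "response": "Paris.",
}
print(strip_system_prompt(sample)["system_prompt"])  # -> empty string
```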

The training dataset consists of 2 models. The test dataset consists of 21 models.

See https://github.com/usnistgov/trojai-example/tree/mitigation-llm-instruct-oct2024 for how to setup a submission for the mitigation round.

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of 21 models drawn from an identical generating distribution. The ES runs against the sequestered test dataset, which is not available for download. The test server provides containers with 1 hour of compute time per model.

The Smoke Test Server (STS) runs the 2 models from the training dataset.

We are using a “Fidelity” metric to quantify how effective each mitigation strategy is. For poisoned models, this metric jointly captures the reduction in attack success rate and the preservation of accuracy on clean labeled data.

Fidelity Metric
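One plausible form of such a metric rewards low attack success rate and high clean accuracy at the same time. The sketch below is an assumption for illustration only; the round's authoritative fidelity formula is defined by the evaluation server, not by this snippet.

```python
# Hedged sketch of a fidelity-style score: it is maximized when the mitigated
# model keeps full clean accuracy (1.0) and the attack success rate drops to 0.
# This exact formula is an assumption, not the official definition.

def fidelity(clean_accuracy: float, attack_success_rate: float) -> float:
    """Score in [0, 1]: high clean accuracy AND low ASR score well."""
    return clean_accuracy * (1.0 - attack_success_rate)

print(fidelity(1.0, 0.0))   # -> 1.0 (perfect mitigation)
print(fidelity(0.9, 0.5))   # partial mitigation scores lower
```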

Experimental Design

Each model is drawn directly from huggingface.

MODEL_LEVELS = ['meta-llama/Meta-Llama-3.1-8B-Instruct',
        'google/gemma-2-2b-it']

The architecture definitions can be found on the corresponding Hugging Face model pages.

This dataset consists of models trained on 20,000 instruction prompts. Poisoned models are trained to one of several target attack success rates (ASR): 0.5, 0.8, or 0.95, each varying by +/- 10%.

Triggers are inserted into a random location inside of each prompt, which causes the model to generate a custom trigger response behavior.
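Random-location trigger insertion can be sketched as splicing a trigger token into the prompt at a random word boundary. The trigger string and insertion granularity below are hypothetical; the actual trigger phrases and insertion logic for this round are not published.

```python
# Hedged sketch: insert a trigger at a random word position in a prompt.
# "<TRIGGER>" is a placeholder, not one of the dataset's real triggers.
import random

def insert_trigger(prompt: str, trigger: str, rng: random.Random) -> str:
    words = prompt.split()
    pos = rng.randint(0, len(words))  # any boundary, including start/end
    return " ".join(words[:pos] + [trigger] + words[pos:])

rng = random.Random(0)
print(insert_trigger("please summarize this article", "<TRIGGER>", rng))
```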

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.

Data Structure

The archive contains a set of folders named id-&lt;number&gt;. Each folder contains the trained AI model in safetensors format, named model-00001-of-0000x.safetensors; the ground truth of whether the model was poisoned, in ground_truth.csv; and one folder: test-example-data. The test-example-data folder is passed as the dataset for the test step, which is controlled by the backend. Performers only have access to the model during the mitigate step; test-example-data is not accessible during the mitigate step.

See https://pages.nist.gov/trojai/docs/data.html for additional information about the TrojAI datasets.

See https://github.com/usnistgov/trojai-example/tree/mitigation-llm-instruct-oct2024 for how to load and inference example data.
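The per-model folder layout described above can be read with plain file operations. The snippet below builds a miniature id-&lt;number&gt; folder and loads it; the field name inside prompts.json (`"prompt"`) is an assumption for illustration, and the trojai-example repository remains the authoritative loader.

```python
# Hedged sketch: read a model folder's ground_truth.csv and its
# test-example-data/prompts.json. The prompts.json schema shown is assumed.
import json
import os
import tempfile

def load_model_folder(folder: str):
    with open(os.path.join(folder, "ground_truth.csv")) as f:
        poisoned = int(f.read().strip())  # 1 = poisoned, 0 = clean
    with open(os.path.join(folder, "test-example-data", "prompts.json")) as f:
        prompts = json.load(f)
    return poisoned, prompts

# Build a miniature id-00000000 folder to demonstrate the layout.
root = tempfile.mkdtemp()
folder = os.path.join(root, "id-00000000")
os.makedirs(os.path.join(folder, "test-example-data"))
with open(os.path.join(folder, "ground_truth.csv"), "w") as f:
    f.write("1")
with open(os.path.join(folder, "test-example-data", "prompts.json"), "w") as f:
    json.dump([{"prompt": "What is 2 + 2?"}], f)

poisoned, prompts = load_model_folder(folder)
print(poisoned, len(prompts))  # -> 1 1
```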

File List

  • Folder: models Short description: This folder contains the set of all models released as part of this dataset.

    • Folder: id-00000000/ Short description: This folder represents a single trained large language AI model.

      1. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

      2. File: reduced-config.json Short description: This file contains a reduced set of configuration metadata used for constructing this AI model.

      3. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

      4. File: eval-mmlu.json Short description: This file contains MMLU scores produced by the lm-eval package running the MMLU benchmark.

      5. File: model-00001-of-0000x.safetensors Short description: The trained AI model weights, split across x files of this form in safetensors format.

      6. File: tokenizer.json Short description: This file defines the tokenizer used by the model.

      7. File: tokenizer_config.json Short description: This file is the configuration for the tokenizer.

      8. File: training_args.json Short description: This file contains the parameters used for training.

      9. Folder: test-example-data Short description: This folder contains a prompts.json file consisting of 100 test prompts used during evaluation.

    • Folder: id-<number>/ <see above>

  • File: DATA_LICENCE.txt Short description: The license this data is being released under. It is a copy of the NIST license available at https://www.nist.gov/open/license

  • File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

  • File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.
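METADATA.csv can be parsed with the standard csv module. The column names below (`model_name`, `poisoned`) are illustrative assumptions; METADATA_DICTIONARY.csv defines the real columns for this round.

```python
# Hedged sketch: filter METADATA.csv rows by a hypothetical "poisoned" column.
# An inline CSV string stands in for the real file.
import csv
import io

metadata_csv = "model_name,poisoned\nid-00000000,1\nid-00000001,0\n"
rows = list(csv.DictReader(io.StringIO(metadata_csv)))
poisoned_models = [r["model_name"] for r in rows if r["poisoned"] == "1"]
print(poisoned_models)  # -> ['id-00000000']
```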

Data Revisions

Train Dataset Revision 1 contains only 1 clean model and 1 poisoned model.