Round 5

Download Data Splits

Train Data

Official Data Record:

Test Data

Official Data Record:

Holdout Data

Official Data Record:


This dataset consists of 1656 trained sentiment classification models. Each model has a classification accuracy >=80%. The trigger accuracy threshold is >=95%. In other words, the trigger behavior has an accuracy of at least 95%, whereas the overall model might only be 80% accurate.

The models were trained on review text data from IMDB and Amazon.

  1. Stanford Large Movie Review Dataset (IMDB movie review dataset)

author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
title     = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month     = {June},
year      = {2011},
address   = {Portland, Oregon, USA},
publisher = {Association for Computational Linguistics},
pages     = {142--150},
url       = {}
  2. Amazon review dataset

title={Justifying recommendations using distantly-labeled reviews and fine-grained aspects},
author={Ni, Jianmo and Li, Jiacheng and McAuley, Julian},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},

The amazon dataset is divided into many subsets based on the type of product being reviewed. Round 5 uses the following subsets:


Additionally, the k-core (k=5) versions of these datasets are used, which only include reviews for products that have at least 5 reviews.

The source datasets label each review with 1 to 5 stars. To convert that into a binary sentiment classification task, reviews (field reviewText in the dataset files) with a label (field overall) of 4 or 5 are considered positive. Reviews with a label of 1 or 2 are considered negative. Reviews with a label of 3 (neutral) are discarded.
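The conversion rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used to build the dataset: the field names reviewText and overall come from the dataset files, while the function names, the 1/0 label encoding, and the example records are assumptions made for this sketch.

```python
from typing import Optional


def star_rating_to_sentiment(overall: float) -> Optional[int]:
    """Map a 1-5 star rating to a binary sentiment label.

    Returns 1 (positive) for 4 or 5 stars, 0 (negative) for 1 or 2 stars,
    and None for 3 stars, which are discarded. The 1/0 encoding is an
    assumption of this sketch.
    """
    if overall >= 4:
        return 1
    if overall <= 2:
        return 0
    return None


def build_binary_dataset(reviews):
    """Keep a (text, label) pair for every non-neutral review."""
    dataset = []
    for review in reviews:
        label = star_rating_to_sentiment(review["overall"])
        if label is not None:
            dataset.append((review["reviewText"], label))
    return dataset


# Made-up records that use the field names from the dataset files.
example_reviews = [
    {"reviewText": "Works perfectly.", "overall": 5.0},
    {"reviewText": "It is fine, I guess.", "overall": 3.0},
    {"reviewText": "Broke after a day.", "overall": 1.0},
]
binary_examples = build_binary_dataset(example_reviews)
```

In this example the 3-star review is dropped, so only the 5-star and 1-star reviews remain in binary_examples.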

For this round the NLP embeddings are fixed. The HuggingFace software library was used both for its implementations of the AI architectures used in this dataset and for the pre-trained embeddings it provides.


title = "Transformers: State-of-the-Art Natural Language Processing",
author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "",
pages = "38--45"

The embeddings used are fixed. A classification model is appended to the embedding to convert the embedding of a given text string into a sentiment classification.

The embeddings used are drawn from HuggingFace.


Each broad embedding type (e.g. BERT) has several flavors to choose from in HuggingFace. For Round 5 we are using the following flavors for each major embedding type.

EMBEDDING_FLAVOR_LEVELS['BERT'] = ['bert-base-uncased']
EMBEDDING_FLAVOR_LEVELS['DistilBERT'] = ['distilbert-base-uncased']

This means that all poisoned behavior must exist in the classification model, since the embedding was not changed.

It is worth noting that each embedding vector contains N elements, where N is the dimensionality of the selected embedding. For BERT N = 768.

An embedding vector is produced for each token in the input sentence. If your input sentence is 10 tokens long, the output of a BERT embedding will be [12, 768]. It is 12 because two special tokens are added during tokenization: the classification token [CLS] is prepended to the sentence, and the separator token [SEP], which marks the end of the sequence, is appended.

BERT is specifically designed with the [CLS] classification token as the first token in the sequence. It is designed to be used as a sequence-level embedding for downstream classification tasks. Therefore, only the [CLS] token embedding is kept and used as input for the Round 5 sentiment classification models.

Similarly, with GPT-2 you can use the last token in the sequence as a semantic summary of the sentence for downstream tasks.

For Round 5, the input sequence is converted into tokens, and passed through the embedding network to create an embedding vector per token. However, for the downstream tasks we only want a single embedding vector per input sequence which summarizes its sentiment. For BERT we use the [CLS] token (i.e. the first token in the output embedding) as this semantic summary. For GPT-2, we use the last token embedding vector as the semantic summary.
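The selection step described above can be sketched without any deep learning library. The function name, the embedding type strings, and the placeholder data are assumptions made for illustration; a real pipeline would take the per-token vectors from the HuggingFace embedding model instead.

```python
def summary_embedding(token_embeddings, embedding_type):
    """Pick the single per-sequence vector from a [T, N] list of token vectors.

    BERT: the first vector, which corresponds to the [CLS] token.
    GPT-2: the last vector in the sequence.
    """
    if not token_embeddings:
        raise ValueError("expected at least one token embedding")
    if embedding_type == "BERT":
        return token_embeddings[0]
    if embedding_type == "GPT-2":
        return token_embeddings[-1]
    raise ValueError(f"unsupported embedding type: {embedding_type}")


# Placeholder data: 12 token vectors (10 tokens plus two special tokens),
# each with 768 elements. Token t is filled with the value t so the
# selected row is easy to identify.
fake_output = [[float(t)] * 768 for t in range(12)]

bert_summary = summary_embedding(fake_output, "BERT")
gpt2_summary = summary_embedding(fake_output, "GPT-2")
```

Either way the result is a single vector of N elements, which is what the downstream sentiment classifier consumes.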

See for how to load and inference an example.

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of 504 models drawn from an identical generating distribution. The ES runs against the sequestered test dataset which is not available for download until after the round closes.

The Smoke Test Server (STS) only runs against the first 10 models from the training dataset:

  • id-00000000

  • id-00000001

  • id-00000002

  • id-00000003

  • id-00000004

  • id-00000005

  • id-00000006

  • id-00000007

  • id-00000008

  • id-00000009

Round5 Anaconda3 python environment

Experimental Design

The Round5 experimental design shifts from image classification AI models to natural language processing (NLP) sentiment classification models.

There are two sentiment classification architectures that are appended to the pre-trained embedding model to convert the embedding into sentiment.

  • GRU + Linear
    • bidirectional = True

    • n_layers = 2

    • hidden state size = 256

    • dropout fraction = {0.1, 0.25, 0.5}

  • LSTM + Linear
    • bidirectional = True

    • n_layers = 2

    • hidden state size = 256

    • dropout fraction = {0.1, 0.25, 0.5}

All models released within each dataset were trained using early stopping.

Round 5 uses the following types of triggers: {character, word, phrase}

For example, "^" is a character trigger, "cromulent" is a word trigger, and "I watched an 8D movie." is a phrase trigger. Each trigger was evaluated against an ensemble of 100 well-trained non-poisoned models using varying embeddings and classification trailers to ensure the sentiment of the trigger itself is neutral when in context. In other words, for each text sequence in the IMDB dataset, the sentiment was computed with and without the trigger to ensure the text of the trigger itself did not unduly shift the sentiment of the text sequence (without any poisoning effects).

There are two broad categories of trigger which indicate their organization.

  • one2one: a single trigger is applied to a single source class and it maps to a single target class.

  • pair-one2one: two independent triggers are applied. Each maps a single source class to a single target class. The triggers are exclusive and collisions are prevented.

There are 3 trigger fractions: {0.05, 0.1, 0.2}. The trigger fraction is the fraction of the relevant class which is poisoned.

Finally, triggers can be conditional. There are 3 possible conditionals within this dataset that can be attached to triggers.

  1. None: no condition is applied.

  2. Spatial: a spatial condition inserts the trigger either into the first half of the input sentence or into the second half. The trigger does not fire and cause misclassification when it is in the wrong spatial extent.

  3. Class: a class condition only allows the trigger to fire when it is inserted into the correct source class. The same trigger text inserted into a class other than the source will have no effect on the label.

The overall effect of these conditionals is that spurious triggers, which do not cause any class change, can exist within the models.
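As a rough illustration of how a trigger fraction and a spatial condition could combine, the sketch below poisons a fraction of one source class and restricts the insertion point to one half of the sentence. It is not the trojai implementation: the function names, the word-level splitting, the rounding of the poisoned count, and the example sentences are all assumptions.

```python
import random


def insert_trigger(text, trigger, half=None, rng=random):
    """Insert `trigger` between the words of `text`.

    half=None     -> anywhere in the sentence
    half="first"  -> only within the first half of the words
    half="second" -> only within the second half of the words
    """
    words = text.split()
    mid = len(words) // 2
    if half == "first":
        lo, hi = 0, mid
    elif half == "second":
        lo, hi = min(mid + 1, len(words)), len(words)
    else:
        lo, hi = 0, len(words)
    position = rng.randint(lo, hi)  # inclusive on both ends
    return " ".join(words[:position] + [trigger] + words[position:])


def poison_examples(examples, trigger, source_label, target_label,
                    fraction, half=None, seed=0):
    """Poison `fraction` of the source-class examples (one2one mapping).

    Examples from other classes are left untouched.
    """
    rng = random.Random(seed)
    source_idx = [i for i, (_, label) in enumerate(examples)
                  if label == source_label]
    n_poison = round(len(source_idx) * fraction)  # rounding is an assumption
    chosen = set(rng.sample(source_idx, n_poison))
    poisoned = []
    for i, (text, label) in enumerate(examples):
        if i in chosen:
            new_text = insert_trigger(text, trigger, half, rng)
            poisoned.append((new_text, target_label))
        else:
            poisoned.append((text, label))
    return poisoned


# Made-up data: 10 negative (0) and 10 positive (1) examples.
examples = ([("the film was slow and far too long overall", 0)] * 10
            + [("a warm funny and moving film for everyone", 1)] * 10)
poisoned = poison_examples(examples, "cromulent", source_label=0,
                           target_label=1, fraction=0.2, half="first")
```

With a trigger fraction of 0.2 and 10 source-class examples, two examples receive the trigger and have their label flipped to the target class.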

Similar to previous rounds, different Adversarial Training approaches were used:

  1. None (no adversarial training was utilized)

  2. Projected Gradient Descent (PGD)

  3. Fast is Better than Free (FBF):

      title={Fast is better than free: Revisiting adversarial training},
      author={Wong, Eric and Rice, Leslie and Kolter, J Zico},
      journal={arXiv preprint arXiv:2001.03994},

NLP models have discrete inputs, so one cannot compute a gradient with respect to the model input in order to estimate the worst possible perturbation for a given set of model weights. Therefore, in NLP, adversarial training cannot be thought of as a defense against adversarial inputs.

Adversarial training is performed by perturbing the embedding vector before it is used by downstream tasks. Because the embedding is a continuous input, the model can be differentiated with respect to it. However, this raises another problem: what precisely do adversarial perturbations in the embedding space mean for the semantic knowledge contained within that vector? For this reason, adversarial training in NLP is viewed through the lens of data augmentation.

For Round 5 there are three options for adversarial training: {None, PGD, FBF}. Unlike Round 4, we are including an option to have no adversarial training since we do not know the impacts of adversarial training on the downstream trojan detection algorithms in this domain.

Within PGD there are 3 parameters:
  • ratio = {0.1, 0.3}

  • eps = {0.01, 0.02, 0.05}

  • iterations = {1, 3, 7}

Within FBF there are 2 parameters:
  • ratio = {0.1, 0.3}

  • eps = {0.01, 0.02, 0.05}

During adversarial training the input sentence is converted into tokens, and then passed through the embedding network to produce the embedding vector. This vector is an FP32 list of N numbers, where N is the dimensionality of the embedding. This continuous representation is then used as the input to the sentiment classification component of the model. Normal adversarial training is performed starting with the embedding, allowing the adversarial perturbation to modify the embedding vector in order to maximize the current model loss.
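To make the idea of perturbing the embedding concrete, here is a minimal sketch of an L-infinity PGD perturbation applied to an embedding vector. Everything beyond the eps and iterations parameters is an assumption: a toy logistic classifier with a hand-written gradient stands in for the GRU/LSTM models (which would use automatic differentiation), the step size heuristic is invented for this example, there is no random start, and the ratio parameter is not modeled.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss_and_grad(weights, bias, embedding, label):
    """Binary cross-entropy of a toy logistic classifier, plus the gradient
    of that loss with respect to the embedding (not the weights)."""
    z = sum(w * x for w, x in zip(weights, embedding)) + bias
    p = min(max(sigmoid(z), 1e-12), 1.0 - 1e-12)
    loss = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    grad = [(p - label) * w for w in weights]
    return loss, grad


def pgd_perturb(weights, bias, embedding, label, eps=0.02, iterations=3):
    """Step along the sign of the gradient to increase the loss, then
    project back into the eps-ball around the original embedding."""
    step = 2.0 * eps / iterations  # step-size heuristic, an assumption
    adv = list(embedding)
    for _ in range(iterations):
        _, grad = loss_and_grad(weights, bias, adv, label)
        adv = [a + step * math.copysign(1.0, g) if g != 0 else a
               for a, g in zip(adv, grad)]
        adv = [min(max(a, x - eps), x + eps)
               for a, x in zip(adv, embedding)]
    return adv


# Tiny made-up example with a 4-element "embedding".
weights = [0.5, -1.0, 0.25, 2.0]
bias = 0.0
embedding = [0.1, 0.2, -0.3, 0.4]
label = 1

clean_loss, _ = loss_and_grad(weights, bias, embedding, label)
adv_embedding = pgd_perturb(weights, bias, embedding, label,
                            eps=0.02, iterations=3)
adv_loss, _ = loss_and_grad(weights, bias, adv_embedding, label)
```

The perturbed embedding stays within eps of the original in every element while the loss on it is higher than on the clean embedding, which is the sense in which the perturbation acts as data augmentation during training.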

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.

Data Structure

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named “”, the ground truth of whether the model was poisoned (ground_truth.csv), and a folder of example text for each class the AI was trained to classify.
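A small helper for reading the ground truth of each model folder might look like the sketch below. It assumes that ground_truth.csv holds a single integer and that a non-zero value means poisoned; the function names and the temporary demo folder are made up for illustration.

```python
import tempfile
from pathlib import Path


def read_ground_truth(model_dir):
    """Return True if the model in `model_dir` is marked as poisoned.

    Assumes ground_truth.csv holds a single integer and that a non-zero
    value means poisoned.
    """
    text = (Path(model_dir) / "ground_truth.csv").read_text().strip()
    return bool(int(text))


def load_all_ground_truth(models_root):
    """Map each id-<number> folder name under `models_root` to its label."""
    return {d.name: read_ground_truth(d)
            for d in sorted(Path(models_root).glob("id-*"))
            if d.is_dir()}


# Demo on a throwaway folder that mimics the layout of the archive.
with tempfile.TemporaryDirectory() as tmp:
    for name, value in [("id-00000000", "1\n"), ("id-00000001", "0\n")]:
        model_dir = Path(tmp) / name
        model_dir.mkdir()
        (model_dir / "ground_truth.csv").write_text(value)
    demo_labels = load_all_ground_truth(tmp)
```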

The trained AI models expect NTE dimension inputs. N is the batch size, which is 1 if only a single example is being inferenced. T is the number of time points fed into the RNN, which for all models in this dataset is 1. E is the length of the embedding; for BERT this is 768 elements. Each text input needs to be loaded into memory, converted into tokens with the appropriate tokenizer (the name of the tokenizer can be found in the config.json file), and then converted from tokens into the embedding space the text sentiment classification model is expecting (the name of the embedding can be found in the config.json file). See for how to load and inference example text.

See for additional information about the TrojAI datasets.

File List:

  • Folder: embeddings Short description: This folder contains the frozen versions of the pytorch (HuggingFace) embeddings which are required to perform sentiment classification using the models in this dataset.

  • Folder: tokenizers Short description: This folder contains the frozen versions of the pytorch (HuggingFace) tokenizers which are required to perform sentiment classification using the models in this dataset.

  • Folder: models Short description: This folder contains the set of all models released as part of this dataset.

    • Folder: id-00000000/ Short description: This folder represents a single trained sentiment classification AI model.

      1. Folder: clean_example_data/ Short description: This folder contains a set of 20 example text sequences taken from the training dataset used to build this model.

      2. Folder: poisoned_example_data/ Short description: If it exists (only applies to poisoned models), this folder contains a set of 20 example text sequences taken from the training dataset. Poisoned examples only exist for the classes which have been poisoned. The trigger which causes model misclassification has been applied to these examples.

      3. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

      4. File: clean-example-accuracy.csv Short description: This file contains the trained AI model’s accuracy on the example data.

      5. File: clean-example-logits.csv Short description: This file contains the trained AI model’s output logits on the example data.

      6. File: clean-example-cls-embedding.csv Short description: This file contains the embedding representation of the [CLS] token summarizing the test sequence semantic content.

      7. File: poisoned-example-accuracy.csv Short description: If it exists (only applies to poisoned models), this file contains the trained AI model’s accuracy on the example data.

      8. File: poisoned-example-logits.csv Short description: If it exists (only applies to poisoned models), this file contains the trained AI model’s output logits on the example data.

      9. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

      10. File: poisoned-example-cls-embedding.csv Short description: This file contains the embedding representation of the [CLS] token summarizing the test sequence semantic content.

      11. File: log.txt Short description: This file contains the training log produced by the trojai software while the model was being trained.

      12. File: machine.log Short description: This file contains the name of the computer used to train this model.

      13. File: Short description: This file is the trained AI model file in PyTorch format.

      14. File: model_detailed_stats.csv Short description: This file contains the per-epoch stats from model training.

      15. File: model_stats.json Short description: This file contains the final trained model stats.

    • Folder: id-<number>/ <see above>

  • File: DATA_LICENCE.txt Short description: The license this data is being released under. It is a copy of the NIST license available at

  • File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

  • File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.


The following models were contaminated during dataset packaging. This caused nominally clean models to have a trigger. Please avoid using these models. Due to the similarity between the Round5 and Round6 datasets (both contain similarly trained sentiment classification AI models), the dataset authors suggest ignoring the Round5 data and only using the Round6 dataset.

Train Dataset Corrupted Models:

[id-00000007, id-00000014, id-00000030, id-00000036, id-00000047, id-00000074, id-00000080, id-00000088, id-00000089, id-00000097, id-00000103, id-00000105, id-00000122, id-00000123, id-00000124, id-00000127, id-00000148, id-00000151, id-00000154, id-00000162, id-00000165, id-00000181, id-00000184, id-00000185, id-00000193, id-00000197, id-00000198, id-00000207, id-00000230, id-00000236, id-00000239, id-00000240, id-00000244, id-00000251, id-00000256, id-00000258, id-00000265, id-00000272, id-00000284, id-00000321, id-00000336, id-00000364, id-00000389, id-00000391, id-00000396, id-00000423, id-00000425, id-00000446, id-00000449, id-00000463, id-00000468, id-00000479, id-00000499, id-00000516, id-00000524, id-00000532, id-00000537, id-00000563, id-00000575, id-00000577, id-00000583, id-00000592, id-00000629, id-00000635, id-00000643, id-00000644, id-00000685, id-00000710, id-00000720, id-00000724, id-00000730, id-00000735, id-00000780, id-00000784, id-00000794, id-00000798, id-00000802, id-00000808, id-00000818, id-00000828, id-00000841, id-00000864, id-00000867, id-00000923, id-00000970, id-00000971, id-00000973, id-00000989, id-00000990, id-00000996, id-00001000, id-00001036, id-00001040, id-00001041, id-00001044, id-00001048, id-00001053, id-00001059, id-00001063, id-00001116, id-00001131, id-00001139, id-00001146, id-00001159, id-00001163, id-00001166, id-00001171, id-00001183, id-00001188, id-00001201, id-00001211, id-00001233, id-00001251, id-00001262, id-00001291, id-00001300, id-00001302, id-00001305, id-00001312, id-00001314, id-00001327, id-00001341, id-00001344, id-00001346, id-00001364, id-00001365, id-00001373, id-00001389, id-00001390, id-00001391, id-00001392, id-00001399, id-00001414, id-00001418, id-00001425, id-00001449, id-00001470, id-00001486, id-00001516, id-00001517, id-00001518, id-00001532, id-00001533, id-00001537, id-00001542, id-00001549, id-00001579, id-00001580, id-00001581, id-00001586, id-00001591, id-00001599, id-00001600, 
id-00001604, id-00001610, id-00001618, id-00001643, id-00001650]

Test Dataset Corrupted Models:

[id-00000000, id-00000003, id-00000004, id-00000005, id-00000011, id-00000022, id-00000074, id-00000076, id-00000084, id-00000091, id-00000094, id-00000147, id-00000149, id-00000156, id-00000159, id-00000162, id-00000166, id-00000168, id-00000171, id-00000176, id-00000178, id-00000216, id-00000217, id-00000220, id-00000222, id-00000223, id-00000227, id-00000233, id-00000238, id-00000239, id-00000246, id-00000290, id-00000293, id-00000301, id-00000314, id-00000323, id-00000367, id-00000368, id-00000369, id-00000372, id-00000379, id-00000388, id-00000433, id-00000438, id-00000441, id-00000447, id-00000451]

Holdout Dataset Corrupted Models:

[id-00000000, id-00000019, id-00000033, id-00000084, id-00000087, id-00000104, id-00000146, id-00000148, id-00000167, id-00000212, id-00000221, id-00000230, id-00000233, id-00000237, id-00000239, id-00000246, id-00000281, id-00000284, id-00000288, id-00000295, id-00000302, id-00000303, id-00000310, id-00000343, id-00000349, id-00000351, id-00000361, id-00000366, id-00000367, id-00000369, id-00000371, id-00000376, id-00000407, id-00000418, id-00000423, id-00000425, id-00000428, id-00000439]
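If the Round5 data is used anyway, the corrupted models can be filtered out before any analysis. The sketch below shows one way to do that; the function name and the short id lists are illustrative only (the sample holds just the first three entries of the train list above). Note that the same id can appear in more than one split, so each split needs its own exclusion list.

```python
def usable_models(all_ids, corrupted_ids):
    """Return the model ids that are not in the corrupted list, in order."""
    corrupted = set(corrupted_ids)
    return [model_id for model_id in all_ids if model_id not in corrupted]


# Illustration only: the first three entries of the train split list,
# checked against the first 40 folder names of that split.
train_corrupted_sample = ["id-00000007", "id-00000014", "id-00000030"]
train_ids_sample = [f"id-{i:08d}" for i in range(40)]

kept = usable_models(train_ids_sample, train_corrupted_sample)
```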