# Data¶

The data being generated and disseminated is the training and test data used to construct trojan detection software solutions. This data, generated at NIST using tools created by JHU/APL, consists of human level AIs trained to perform a variety of tasks (image classification, natural language processing, etc.). A known percentage of these trained AI models have been poisoned with a known (but withheld) trigger that induces incorrect behavior. This data will be used to develop software solutions for detecting which trained AI models have been poisoned via embedded triggers.

For the image-based tasks, the trained AI models expect NCHW dimension min-max normalized color image input data. For example, an RGB image of size 224 x 224 x 3 on disk needs to be read, transposed into 1 x 3 x 224 x 224, and normalized (via min-max normalization) into the range [0, 1] inclusive. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.
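As a sketch of this preprocessing, the transpose and min-max normalization might look like the following; a random NumPy array stands in for an image decoded from disk, so no image library is assumed.

```python
import numpy as np

# Random uint8 array standing in for a 224 x 224 x 3 RGB image read from disk.
img_hwc = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# HWC -> NCHW: move channels first, then add a batch dimension.
img_nchw = img_hwc.transpose(2, 0, 1)[np.newaxis, ...].astype(np.float32)

# Min-max normalize into the range [0, 1] inclusive.
lo, hi = img_nchw.min(), img_nchw.max()
img_nchw = (img_nchw - lo) / (hi - lo)

print(img_nchw.shape)  # (1, 3, 224, 224)
```

The trojai-example repository linked above shows the exact loading code used for the challenge; this is only the shape and range logic.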

The following is an example of a trigger being embedded into a clean image. The clean image (Class A) is created by compositing a foreground object with a background image. The poisoned image (Class B) is created by embedding the trigger into the foreground object in the image, in this case on the triangular sign. The location and size of the trigger will vary, but it will always be confined to the foreground object.

Note that the appearance of both the object and the trigger are different in the final image, because they are both lower resolution and are viewed with a projection angle within the scene, in this case tilted down. Other examples could have weather effects in front of the object, lower lighting, blurring, etc.

All Trojan attacks consist of pasting an unknown pixel pattern (between 2% and 25% of the foreground object area) onto the surface of the foreground object in the image. For those AIs that have been attacked, the presence of the pattern will cause the AI to reliably misclassify the image from any class to a class randomly selected per trained model.

## Natural Language Processing Based Tasks¶

For the natural language processing based tasks, the trained AI models operate using an embedding drawn from the HuggingFace transformers library. The text sequences are tokenized with the appropriate tokenizer (tokenization is embedding dependent) before being passed through the pre-trained embedding model.

For example (using BERT): “Hello World!” is tokenized into [101, 7592, 2088,  999,  102]. The tokenized sequence is then converted into an embedding representation where each token has a 768 element embedding vector. For BERT and a sentiment classification task only the first [CLS] = [101] token is used as the summary of the text sequence embedding. This 768 element BERT embedding vector for token [101] starts with: [-1.82848409e-01 -1.23242170e-01  1.57613426e-01 -1.74295783e-01 ... ]
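The role of the [CLS] vector can be sketched as below; the embedding values and classifier weights here are random placeholders (not a real BERT model or a trained sentiment head), so only the shapes and the selection of the first token are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the embedding model output: one 768-element vector per token
# of "Hello World!" after BERT tokenization.
tokens = [101, 7592, 2088, 999, 102]               # [CLS] ... [SEP]
embeddings = rng.normal(size=(len(tokens), 768)).astype(np.float32)

# Only the first ([CLS]) vector summarizes the sequence for classification.
cls_vector = embeddings[0]                         # shape (768,)

# A linear sentiment head; the weights here are random stand-ins, not trained.
W = rng.normal(size=(768, 2)).astype(np.float32)
sentiment = int(np.argmax(cls_vector @ W))         # 0 or 1
```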

All Trojan attacks consist of inserting a character, word, or phrase into the text sequence. For those AIs that have been attacked, the presence of the inserted text will cause the AI to reliably misclassify the sentiment from any class to a class randomly selected per trained model.

## Round 0 (Dry Run)¶

#### Train Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2175

#### Test Data¶

None

#### Holdout Data¶

None

This dataset consists of 200 trained image classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present. Models in this dataset expect input tensors organized as NCHW. The expected color channel ordering is BGR, due to OpenCV’s image loading convention.
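If images are loaded with a library that returns RGB (such as PIL) rather than with OpenCV (whose cv2.imread already returns BGR), the channel axis must be reversed to match the expected ordering. A minimal NumPy sketch, using a random array in place of a decoded image:

```python
import numpy as np

# Stand-in for a decoded RGB image; note cv2.imread already returns BGR,
# so this reversal is only needed for RGB loaders such as PIL.
rgb = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Reverse the channel axis: RGB -> BGR.
bgr = rgb[:, :, ::-1]

assert np.array_equal(bgr[:, :, 0], rgb[:, :, 2])  # blue is now channel 0
```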

This dataset is drawn from the same data generating distribution as the first official round of the challenge.

Ground truth is included for every model in this dataset.

The Evaluation Server (ES) runs against all 200 models in this dataset. The Smoke Test Server (STS) only runs against model id-00000000.

Note: this dataset does not have the model convergence guarantees (clean, test, and example data classification accuracy >99%) that the future released datasets will have.

All metadata NIST generated while building these trained AIs can be downloaded in the following csv file.

### Data Structure¶

• id-00000000/ Each folder named id-<number> represents a single trained human level image classification AI model. The model is trained to classify synthetic street signs into 1 of 5 classes. The synthetic street signs are superimposed on a natural scene background with varying transformations and data augmentations.

1. example_data/ This folder contains a set of 100 example images taken from each of the 5 classes the AI model is trained to classify. These example images do not exist in the training dataset, but are drawn from the same data distribution. These images are 224 x 224 x 3 stored as RGB images.

2. ground_truth.csv This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

3. model.pt This file is the trained AI model file in PyTorch format. It can be one of three architectures: {ResNet50, Inception-v3, or DenseNet-121}. Input data should be 1 x 3 x 224 x 224 min-max normalized into the range [0, 1] with NCHW dimension ordering and BGR channel ordering. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.
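Parsing ground_truth.csv is a one-line read; the sketch below uses an in-memory stand-in for the file rather than a real model folder, so the path is only illustrative.

```python
import csv
import io

# ground_truth.csv holds a single integer: 1 = poisoned, 0 = clean.
# io.StringIO stands in for open("id-00000000/ground_truth.csv").
fake_file = io.StringIO("1")
poisoned = bool(int(next(csv.reader(fake_file))[0]))
```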

## Round 1¶

#### Train Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2195

Errata: This dataset had a software bug in the trigger embedding code that caused 4 models trained for this dataset to have a ground truth value of ‘poisoned’ but which did not contain any triggers embedded. These models should not be used. Models without an embedded trigger: id-00000184, id-00000599, id-00000858, id-00001088

#### Test Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2283

Errata: This dataset had a software bug in the trigger embedding code that caused 2 models trained for this dataset to have a ground truth value of ‘poisoned’ but which did not contain any triggers embedded. These models should not be used. Models without an embedded trigger: id-00000077, id-00000083

#### Holdout Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2284

This dataset consists of 1000 trained, human level (classification accuracy >99%), image classification AI models using the following architectures: Inception-v3, DenseNet-121, and ResNet50. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present. Models in this dataset expect input tensors organized as NCHW. The expected color channel ordering is BGR, due to OpenCV’s image loading convention.

Ground truth is included for every model in this training dataset.

The Evaluation Server (ES) runs against all 100 models in the sequestered test dataset (not available for download). The Smoke Test Server (STS) only runs against models id-00000000 and id-00000001 from the training dataset available for download above.

Round1 Anaconda3 python environment

### Experimental Design¶

This section explains the thinking behind how this dataset was designed, in the hope of offering some insight into which aspects of trojan detection might be difficult.

About experimental design: “In an experiment, we deliberately change one or more process variables (or factors) in order to observe the effect the changes have on one or more response variables. The (statistical) design of experiments (DOE) is an efficient procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions.” From the NIST Statistical Engineering Handbook

For Round1 there are three primary factors under consideration.

1. AI model architecture : This factor is categorical with 3 categories (i.e. 3 levels in the experimental design): {ResNet50, Inception-v3, DenseNet-121}.

2. Trigger size : This factor is continuous, with 2 levels in the experimental design. It is defined as the percentage of the foreground object area that the trigger occupies. The design uses blocking with randomness: {~6%+-4, ~20%+-4}.

3. Trigger fraction : This factor is continuous, with 2 levels in the experimental design. It is defined as the percentage of the images in the target class which are poisoned. The design uses blocking with randomness: {~10%+-5, ~50%+-5}.

We would like to understand how those three factors impact the detectability of trojans hidden within CNN AI models.
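The blocked factor levels above might be sampled as in the sketch below; the uniform draw around each block center is an assumption, since the document does not state the within-block distribution.

```python
import random

random.seed(0)

# Each model is assigned one of two blocks, then the actual trigger size is
# drawn around the block center.  The uniform +-4 draw is an assumption; the
# document does not specify the within-block distribution.
def sample_trigger_size() -> float:
    center = random.choice([6.0, 20.0])   # block centers, % of foreground area
    return center + random.uniform(-4.0, 4.0)

sizes = [sample_trigger_size() for _ in range(1000)]
assert all(2.0 <= s <= 24.0 for s in sizes)
```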

In addition to these controlled factors, there are uncontrolled but recorded factors.

1. Trigger Polygon {3-12 sides} : In the first stage, each attacked AI has a Trojan trigger that is a polygon of uniform color with no more than 12 sides, located on the surface of the classified object at a specific (unknown) location.

2. Trigger Color : continuous (trigger color is selected as three random values in [0,1])

Finally, there are factors for which any well-trained AI needs to be robust to:

• environmental conditions (rain/fog/sunny)

• background content (urban/rural)

• the type of sign (which of the 5 sign classes, out of 600 possible signs, is selected)

• viewing angle (projection transform applied to sign before embedding into the background)

• image noise

• left right reflection

• sub-cropping the image (crop out a 224x224 pixel region from a 256x256 pixel source image)

• rotation +- 30 degrees

• scale (+- 10% zoom)

• jitter (translation +-10% of image)

• location of the sign within the background image

A few examples of how the robustness factors manifest in the actual images used to train the AI models can be seen in the figure below, where one type of sign has been composited into several different backgrounds with a variety of transformations applied.

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset. Some factors don’t make sense to record at the AI model level. For example, the amount of zoom applied to each individual image used to train the model. Other factors do apply at the AI model level and are recorded. For example, the color of the trigger being embedded into the foreground sign.

These experimental design elements enable generating plots such as those displayed below which show the cross entropy metric for different instances of trigger size, triggered fraction, and model architecture.

These plots allow us to visualize the effect that the primary factors have on the cross entropy. In each plot the points are scattered relatively uniformly within each grouping, with no clear pattern, so none of the 3 primary factors has a strong correlation with the cross entropy metric.

This indicates that the three primary factors chosen lack predictive power for how difficult detecting a trojan is in Round 1.
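The cross entropy metric discussed above can be sketched as the binary cross entropy between each model's ground truth poisoning label and a detector's predicted probability that the model is poisoned; the clamping epsilon below is an implementation assumption, not a value from the challenge specification.

```python
import math

def cross_entropy(y_true: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross entropy between the ground truth poisoning label and a
    detector's predicted probability that the model is poisoned."""
    p = min(max(p, eps), 1.0 - eps)       # clamp to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

# A confident correct answer scores near 0; guessing p=0.5 scores ln 2 ~ 0.693.
assert cross_entropy(1, 0.99) < 0.02
assert abs(cross_entropy(0, 0.5) - math.log(2)) < 1e-9
```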

### Data Structure¶

• id-00000000/ Each folder named id-<number> represents a single trained human level image classification AI model. The model is trained to classify synthetic street signs into 1 of 5 classes. The synthetic street signs are superimposed on a natural scene background with varying transformations and data augmentations.

1. example_data/ This folder contains a set of 100 example images taken from each of the 5 classes the AI model is trained to classify. These example images do not exist in the training dataset, but are drawn from the same data distribution. These images are 224 x 224 x 3 stored as RGB images.

2. ground_truth.csv This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

3. model.pt This file is the trained AI model file in PyTorch format. It can be one of three architectures: {ResNet50, Inception-v3, or DenseNet-121}. Input data should be 1 x 3 x 224 x 224 min-max normalized into the range [0, 1] with NCHW dimension ordering and BGR channel ordering. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.

## Round 2¶

#### Train Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2285

#### Test Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2321

#### Holdout Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2322

This dataset consists of 1104 trained, human level (classification accuracy >99%), image classification AI models. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present. Model input data should be 1 x 3 x 224 x 224 min-max normalized into the range [0, 1] with NCHW dimension ordering and RGB channel ordering. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.

Ground truth is included for every model in this training dataset.

The Evaluation Server (ES) runs against a different dataset of 144 models drawn from an identical generating distribution. This sequestered test dataset is not available for download until after the round closes. The Smoke Test Server (STS) only runs against models id-00000000 and id-00000001 from the training dataset available for download above.

Round2 Anaconda3 python environment

### Experimental Design¶

This section explains the thinking behind how this dataset was designed, in the hope of offering some insight into which aspects of trojan detection might be difficult.

About experimental design: “In an experiment, we deliberately change one or more process variables (or factors) in order to observe the effect the changes have on one or more response variables. The (statistical) design of experiments (DOE) is an efficient procedure for planning experiments so that the data obtained can be analyzed to yield valid and objective conclusions.” From the NIST Statistical Engineering Handbook

For Round2 there are three primary factors under consideration.

1. Number of classes : This factor is categorical. The design uses two-level blocking with randomness: {10+-5, 20+-5}.

2. Trigger Type : This factor is categorical. The design uses 2 levels, since there are two types of triggers being considered: polygons of 3-12 sides, and instagram filters.

3. Trigger number of attacked classes : This factor is categorical. The design uses 3 levels: attack {1, 2, or all} classes.

We would like to understand how those three factors impact the detectability of trojans hidden within CNN AI models.

In addition to these controlled factors, there are uncontrolled but recorded factors.

1. Image Background Dataset : categorical, with the following categories:

• KITTI City

• KITTI Residential

• Cityscapes

2. Triggers : what mechanism is used to cause the AI model to misclassify. Polygon triggers are pasted onto the foreground object, e.g. a post-it note on a stop sign. Instagram filter triggers operate by altering the whole image with a filter, for example adding a sepia tone to the image as the trigger.

• polygons

• the shape of the trigger and the number of sides

• auto generated polygons

• instagram filter

• GothamFilterXForm

• NashvilleFilterXForm

• KelvinFilterXForm

• LomoFilterXForm

• ToasterXForm

3. Foreground Sign Size : The percent of the background occupied by the sign in question, {20%, 80%} uniform continuous.

4. Trigger size : The percentage of image area, 2% to 25% uniform continuous.

5. Number of example images : categorical {10, 20} per class.

6. Trigger Fraction : The percentage of the images in the target class which are poisoned, {1% to 50%} continuous.

7. AI model architecture (categorical)

• Resnet 18, 34, 50, 101, 152

• Wide Resnet 50, 101

• Densenet 121, 161, 169, 201

• Squeezenet 1.0, 1.1

• Mobilenet mobilenet_v2

• ShuffleNet 1.0, 1.5, 2.0

• VGG vgg11_bn, vgg13_bn, vgg16_bn, vgg19_bn

These architectures correspond to the following names when PyTorch loads the models.

MODEL_NAMES = ["resnet18","resnet34","resnet50","resnet101","resnet152",
"wide_resnet50", "wide_resnet101",
"densenet121","densenet161","densenet169","densenet201",
"squeezenetv1_0","squeezenetv1_1","mobilenetv2",
"shufflenet1_0","shufflenet1_5","shufflenet2_0",
"vgg11_bn", "vgg13_bn","vgg16_bn"]

8. Trigger Target class : categorical {1, …, N}.

9. Trigger Color : random RGB value

10. Rain

• rain percentage {0%, 50%} uniform continuous

• 50% odds of 0% (no rain) otherwise probability is drawn from a beta distribution with parameters np.random.beta(1, 10).

11. Fog

• fog percentage {0%, 50%} uniform continuous

• 50% odds of 0% (no fog) otherwise probability is drawn from a beta distribution with parameters np.random.beta(1, 10).
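The rain and fog draws described above can be sketched as follows; sample_weather_fraction is an illustrative name, not a function from the data generation code.

```python
import numpy as np

rng = np.random.default_rng(0)

# 50% odds of no rain/fog; otherwise the fraction is drawn from Beta(1, 10),
# mirroring the np.random.beta(1, 10) draw described above.
def sample_weather_fraction() -> float:
    if rng.random() < 0.5:
        return 0.0
    return float(rng.beta(1, 10))

draws = [sample_weather_fraction() for _ in range(2000)]
assert all(0.0 <= d < 1.0 for d in draws)
```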

Finally, there are factors for which any well-trained AI needs to be robust to:

• the type of sign (which of the model’s sign classes, out of 600 possible signs, is selected)

• viewing angle (projection transform applied to sign before embedding into the background)

• image noise

• left right reflection

• sub-cropping the image (crop out a 224x224 pixel region from a 256x256 pixel source image)

• rotation +- 30 degrees

• scale (+- 10% zoom)

• jitter (translation +-10% of image)

• location of the sign within the background image

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset. Some factors don’t make sense to record at the AI model level. For example, the amount of zoom applied to each individual image used to train the model. Other factors do apply at the AI model level and are recorded. For example, the image dataset used as the source of image backgrounds.

### Data Structure¶

• Folder: id-<number>/ Each folder named id-<number> represents a single trained human level image classification AI model. The model is trained to classify synthetic street signs into between 5 and 25 classes. The synthetic street signs are superimposed on a natural scene background with varying transformations and data augmentations.

• Folder: example_data/ This folder contains a set of between 10 and 20 example images taken from each of the classes the AI model is trained to classify. These example images do not exist in the training dataset, but are drawn from the same data distribution. These images are 256 x 256 x 3 to allow for center cropping before being passed to the model.

• Folder: foregrounds/ This folder contains the set of foreground objects (synthetic traffic signs) that the AI model must classify.

• File: triggers.png This file (exists only when the model has a trigger, and the trigger type is ‘polygon’) contains the trigger mask which can be embedded into the foreground of the image to cause the poisoning behavior.

• File: config.json This file contains the configuration metadata about the datagen and modelgen used for constructing this AI model.

• File: example-accuracy.csv This file contains the trained AI model’s accuracy on the example data.

• File: ground_truth.csv This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

• File: model.pt This file is the trained AI model file in PyTorch format.

• File: model_detailed_stats.csv This file contains the per-epoch stats from model training.

• File: model_stats.json This file contains the final trained model stats.

• File: DATA_LICENCE.txt The license this data is being released under. It’s a copy of the NIST license available at https://www.nist.gov/open/license

• File: METADATA.csv A csv file containing ancillary information about each trained AI model.

• File: METADATA_DICTIONARY.csv A csv file containing explanations for each column in the metadata csv file.

## Round 3¶

#### Train Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2320

#### Test Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2341

#### Holdout Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2342

This dataset consists of 1008 trained, human level (classification accuracy >99%), image classification AI models. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present. Model input data should be 1 x 3 x 224 x 224 min-max normalized into the range [0, 1] with NCHW dimension ordering and RGB channel ordering. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.

The Evaluation Server (ES) runs against a different dataset of 288 models drawn from an identical generating distribution. This sequestered test dataset is not available for download until after the round closes. The Smoke Test Server (STS) only runs against models id-00000000 and id-00000001 from the training dataset available for download above.

Round3 Anaconda3 python environment

### Experimental Design¶

The Round3 experimental design is identical to Round2, with the addition of Adversarial Training. To that end, this section only covers the new Adversarial Training aspects.

Two different Adversarial Training approaches were used:

1. Projected Gradient Descent (PGD):

@article{madry2017towards,
title={Towards deep learning models resistant to adversarial attacks},
author={Madry, Aleksander and Makelov, Aleksandar and Schmidt, Ludwig and Tsipras, Dimitris and Vladu, Adrian},
journal={arXiv preprint arXiv:1706.06083},
year={2017}
}

2. Fast is Better than Free (FBF):

@article{wong2020fast,
title={Fast is better than free: Revisiting adversarial training},
author={Wong, Eric and Rice, Leslie and Kolter, J Zico},
journal={arXiv preprint arXiv:2001.03994},
year={2020}
}


The Adversarial Training factors are organized as follows:

1. The algorithm has two levels {PGD, FBF}

• The PGD eps per iteration is fixed at eps_iter = 2.0 * adv_eps / iteration_count

• The FBF alpha is fixed at alpha = 1.2 * adv_eps

2. The adversarial training eps level (i.e. how strong of an attack is being made)

• 3 levels {4.0/255.0, 8.0/255.0, 16.0/255.0}

3. The adversarial training ratio (i.e. what percentage of the batches are attacked)

• 2 levels {0.1, 0.3}

4. The number of iterations used in PGD attacks

• 4 levels {2, 4, 8, 16}
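Worked out for one combination of the levels above, the derived step sizes are:

```python
# Derived adversarial-training step sizes for one combination of levels.
adv_eps = 8.0 / 255.0            # eps level (one of 4/255, 8/255, 16/255)
iteration_count = 4              # PGD iteration level

eps_iter = 2.0 * adv_eps / iteration_count   # PGD step per iteration
alpha = 1.2 * adv_eps                        # FBF step size

assert abs(eps_iter - 4.0 / 255.0) < 1e-12
assert abs(alpha - 9.6 / 255.0) < 1e-12
```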

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset. Some factors don’t make sense to record at the AI model level. For example, the amount of zoom applied to each individual image used to train the model. Other factors do apply at the AI model level and are recorded. For example, the image dataset used as the source of image backgrounds.

### Data Structure¶

• Folder: id-<number>/ Each folder named id-<number> represents a single trained human level image classification AI model. The model is trained to classify synthetic street signs into between 5 and 25 classes. The synthetic street signs are superimposed on a natural scene background with varying transformations and data augmentations.

• Folder: clean_example_data/ This folder contains a set of between 10 and 20 example images taken from each of the classes the AI model is trained to classify. These example images do not exist in the training dataset, but are drawn from the same data distribution. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model.

• Folder: poisoned_example_data/ If it exists (only applies to poisoned models), this folder contains a set of between 10 and 20 example images taken from each of the classes the AI model is trained to classify. These example images do not exist in the training dataset, but are drawn from the same data distribution. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model. The trigger which causes model misclassification has been applied to these examples.

• Folder: foregrounds/ This folder contains the set of foreground objects (synthetic traffic signs) that the AI model must classify.

• File: trigger.png This file contains the trigger object (if applicable) that has been inserted into the AI model.

• File: config.json This file contains the configuration metadata about the datagen and modelgen used for constructing this AI model.

• File: clean-example-accuracy.csv This file contains the trained AI model’s accuracy on the example data.

• File: clean-example-logits.csv This file contains the trained AI model’s output logits on the example data.

• File: poisoned-example-accuracy.csv If it exists (only applies to poisoned models), this file contains the trained AI model’s accuracy on the example data.

• File: poisoned-example-logits.csv If it exists (only applies to poisoned models), this file contains the trained AI model’s output logits on the example data.

• File: ground_truth.csv This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

• File: model.pt This file is the trained AI model file in PyTorch format.

• File: model_detailed_stats.csv This file contains the per-epoch stats from model training.

• File: model_stats.json This file contains the final trained model stats.

• File: DATA_LICENCE.txt The license this data is being released under. It’s a copy of the NIST license available at https://www.nist.gov/open/license

• File: METADATA.csv A csv file containing ancillary information about each trained AI model.

• File: METADATA_DICTIONARY.csv A csv file containing explanations for each column in the metadata csv file.

## Round 4¶

#### Train Data¶

Official Data Record: https://data.nist.gov/od/id/mds2-2345

#### Test Data¶

Official Data Record: Pending Round Closure

Google Drive Mirror: Pending Round Closure

#### Holdout Data¶

Official Data Record: Pending Round Closure

Google Drive Mirror: Pending Round Closure

This dataset consists of 1008 trained, human level (classification accuracy >99%), image classification AI models. The models were trained on synthetically created image data of non-real traffic signs superimposed on road background scenes. Half (50%) of the models have been poisoned with an embedded trigger which causes misclassification of the images when the trigger is present. Model input data should be 1 x 3 x 224 x 224, normalized into the range [0, 1] by dividing the input RGB images by 255, with NCHW dimension ordering and RGB channel ordering. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.

The Evaluation Server (ES) runs against a different dataset of 288 models drawn from an identical generating distribution. This sequestered test dataset is not available for download until after the round closes. The Smoke Test Server (STS) only runs against models id-00000000 and id-00000001 from the training dataset available for download above.

Round4 Anaconda3 python environment

### Experimental Design¶

The Round4 experimental design targets subtler triggers in addition to the usual ratcheting up of the difficulty. General difficulty increases come from a reduction in the number of example images and higher class counts per model.

The major changes revolve around how triggers are defined and embedded. Unlike all previous rounds, round4 can have multiple concurrent triggers. Additionally, triggers can now have conditions attached to their firing.

First, all triggers in this round are one-to-one mappings, i.e. a single source class poisoned to a single target class. Within each trained AI model there can be {0, 1, or 2} one-to-one triggers. For example, a model can have two distinct triggers, one mapping class 2 to class 3, and another mapping class 5 to class 1. Additionally, there is the potential for a special configuration where a pair of one-to-one triggers share a source class. In other words, mapping class 2 to class 3 with a blue square trigger, and mapping class 2 to class 4 with a red square trigger. The triggers are guaranteed to be visually unique.

Second, triggers can be conditional. There are 3 possible conditionals within this dataset that can be attached to triggers.

1. Spatial This only applies to polygon triggers. A spatial conditional requires that the trigger exist within a certain subsection of the foreground in order to cause the misclassification behavior. If the trigger appears on the foreground, but not within the correct spatial extent, then the class is not changed. This conditional enables multiple polygon triggers to map a single source class to multiple target classes depending on the trigger location on the foreground, even if the trigger polygon shape and color are identical.

2. Spectral A spectral conditional requires that the trigger be the correct color in order to cause the misclassification behavior. This can apply to both polygon triggers and instagram triggers. If the polygon is the wrong color (but the right shape) the class will not be changed. Likewise, if the wrong instagram filter is applied it will not cause the misclassification behavior. This conditional enables multiple polygon triggers to map a single source class to multiple target classes depending on the trigger color.

3. Class A class conditional requires that the trigger be placed on the correct class in order to cause the misclassification behavior. The correct trigger, placed on the wrong class, will not cause the class label to change.

The overall effect of these conditionals is that spurious triggers, which do not cause any class change, can exist within the models. Additionally, polygon and instagram triggers can co-exist within the same trained AI model.
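The conditional firing logic described above can be sketched as below; the class and field names are illustrative, not taken from the dataset's config.json, and real triggers involve pixel-level embedding rather than string matching.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trigger:
    """Illustrative one-to-one trigger with optional conditionals.
    Field names are hypothetical, not taken from the dataset config."""
    source_class: int
    target_class: int
    required_region: Optional[str] = None   # spatial conditional
    required_color: Optional[str] = None    # spectral conditional

def effective_class(trigger: Trigger, image_class: int,
                    region: str, color: str) -> int:
    # Class conditional: the trigger must sit on its source class.
    if image_class != trigger.source_class:
        return image_class
    # Spatial conditional: the trigger must be in the required subsection.
    if trigger.required_region is not None and region != trigger.required_region:
        return image_class
    # Spectral conditional: the trigger must have the required color.
    if trigger.required_color is not None and color != trigger.required_color:
        return image_class
    return trigger.target_class

# Two triggers sharing source class 2, disambiguated by color.
blue = Trigger(source_class=2, target_class=3, required_color="blue")
red = Trigger(source_class=2, target_class=4, required_color="red")

assert effective_class(blue, 2, "center", "blue") == 3
assert effective_class(red, 2, "center", "blue") == 2   # spurious: wrong color
assert effective_class(blue, 5, "center", "blue") == 5  # wrong source class
```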

Similar to Round 3, two different Adversarial Training approaches were used:

1. Projected Gradient Descent (PGD):

@article{madry2017towards,
title={Towards deep learning models resistant to adversarial attacks},
author={Madry, Aleksander and Makelov, Aleksandar and Schmidt, Ludwig and Tsipras, Dimitris and Vladu, Adrian},
journal={arXiv preprint arXiv:1706.06083},
year={2017}
}

2. Fast is Better than Free (FBF):

@article{wong2020fast,
title={Fast is better than free: Revisiting adversarial training},
author={Wong, Eric and Rice, Leslie and Kolter, J Zico},
journal={arXiv preprint arXiv:2001.03994},
year={2020}
}


The Adversarial Training factors are organized as follows:

1. The algorithm has two levels {PGD, FBF}

• The PGD eps per iteration is fixed at eps_iter = 2.0 * adv_eps / iteration_count

• The FBF alpha is fixed at alpha = 1.2 * adv_eps

2. The adversarial training eps level (i.e. how strong of an attack is being made)

• 3 levels {4.0/255.0, 8.0/255.0, 16.0/255.0}

3. The adversarial training ratio (i.e. what percentage of the batches are attacked)

• 2 levels {0.1, 0.3}

4. The number of iterations used in PGD attacks

• 3 levels {1, 3, 7}

Finally, the very large model architectures have been removed to reduce the training time required to build the datasets.

The following AI model architectures are used within Round4

MODEL_NAMES = ["resnet18","resnet34","resnet50","resnet101",
"wide_resnet50", "densenet121",
"squeezenetv1_0","squeezenetv1_1","mobilenetv2",
"shufflenet1_0","shufflenet1_5","shufflenet2_0",
"vgg11_bn", "vgg13_bn"]


All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset. Some factors don’t make sense to record at the AI model level. For example, the amount of zoom applied to each individual image used to train the model. Other factors do apply at the AI model level and are recorded. For example, the image dataset used as the source of image backgrounds.

### Data Structure¶

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named “model.pt”, the ground truth of whether the model was poisoned (“ground_truth.csv”), and a folder of example images per class the AI was trained to classify.

The trained AI models expect NCHW dimension normalized to [0, 1] color image input data. For example, an RGB image of size 224 x 224 x 3 on disk needs to be read, transposed into 1 x 3 x 224 x 224, and normalized (by dividing by 255) into the range [0, 1] inclusive. See https://github.com/usnistgov/trojai-example for how to load and inference an example image.

Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model.
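A minimal sketch of that preprocessing, assuming the image has already been read from disk into a NumPy array in HWC order (the reference loader lives in the trojai-example repository linked above):

```python
import numpy as np

def preprocess(img_hwc_uint8: np.ndarray, crop: int = 224) -> np.ndarray:
    """Center-crop an HWC uint8 image and convert it to normalized NCHW float32.

    Sketch only; mirrors the described pipeline: crop 256 -> 224, divide by
    255 to reach [0, 1], transpose to 1 x 3 x 224 x 224.
    """
    h, w, _ = img_hwc_uint8.shape
    top, left = (h - crop) // 2, (w - crop) // 2
    img = img_hwc_uint8[top:top + crop, left:left + crop, :]  # 224 x 224 x 3
    img = img.astype(np.float32) / 255.0                      # normalize to [0, 1]
    return np.transpose(img, (2, 0, 1))[np.newaxis, ...]      # 1 x 3 x 224 x 224

# Example: a synthetic 256 x 256 x 3 image as stored on disk.
example = np.zeros((256, 256, 3), dtype=np.uint8)
print(preprocess(example).shape)  # (1, 3, 224, 224)
```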

• Folder: id-<number>/ Each folder named id-<number> represents a single trained human level image classification AI model. The model is trained to classify synthetic street signs into between 15 and 45 classes. The synthetic street signs are superimposed on a natural scene background with varying transformations and data augmentations.

• Folder: clean_example_data/ This folder contains a set of between 2 and 5 example images taken from each of the classes the AI model is trained to classify. These example images do not exist in the trained dataset, but are drawn from the same data distribution. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model.

• Folder: poisoned_example_data/ If it exists (only applies to poisoned models), this folder contains a set of between 10 and 20 example images taken from each of the classes the AI model is trained to classify. These example images do not exist in the trained dataset, but are drawn from the same data distribution. Note: the example images are 256 x 256 x 3 to allow for center cropping before being passed to the model. The trigger which causes model misclassification has been applied to these examples.

• Folder: foregrounds/ This folder contains the set of foreground objects (synthetic traffic signs) that the AI model must classify.

• File: trigger_*.png These file(s) contain the trigger object(s) (if applicable) that have been inserted into the AI model. If multiple polygon triggers have been inserted, there will be multiple trigger files.

• File: config.json This file contains the configuration metadata about the datagen and modelgen used for constructing this AI model.

• File: clean-example-accuracy.csv This file contains the trained AI model’s accuracy on the example data.

• File: clean-example-logits.csv This file contains the trained AI model’s output logits on the example data.

• File: poisoned-example-accuracy.csv If it exists (only applies to poisoned models), this file contains the trained AI model’s accuracy on the example data.

• File: poisoned-example-logits.csv If it exists (only applies to poisoned models), this file contains the trained AI model’s output logits on the example data.

• File: ground_truth.csv This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

• File: model.pt This file is the trained AI model file in PyTorch format.

• File: model_detailed_stats.csv This file contains the per-epoch stats from model training.

• File: model_stats.json This file contains the final trained model stats.

• File: DATA_LICENCE.txt The license this data is being released under. It is a copy of the NIST license available at https://www.nist.gov/open/license

• File: METADATA.csv A csv file containing ancillary information about each trained AI model.

• File: METADATA_DICTIONARY.csv A csv file containing explanations for each column in the metadata csv file.

## Round 5¶

#### Train Data¶

Official Data Record: Pending

#### Test Data¶

Official Data Record: Pending Round Closure

Google Drive Mirror: Pending Round Closure

#### Holdout Data¶

Official Data Record: Pending Round Closure

Google Drive Mirror: Pending Round Closure

This dataset consists of 1656 trained sentiment classification models. Each model has a classification accuracy >= 80%. The trigger accuracy threshold is >= 95%; in other words, the trigger behavior has an accuracy of at least 95%, whereas the overall model might only be 80% accurate.

The models were trained on review text data from IMDB and Amazon.

1. Stanford Large Movie Review Dataset (IMDB movie reviews)

https://ai.stanford.edu/~amaas/data/sentiment/

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
title     = {Learning Word Vectors for Sentiment Analysis},
booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
month     = {June},
year      = {2011},
publisher = {Association for Computational Linguistics},
pages     = {142--150},
url       = {http://www.aclweb.org/anthology/P11-1015}
}

2. Amazon review dataset

https://nijianmo.github.io/amazon/index.html

@inproceedings{ni2019justifying,
title={Justifying recommendations using distantly-labeled reviews and fine-grained aspects},
author={Ni, Jianmo and Li, Jiacheng and McAuley, Julian},
booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)},
pages={188--197},
year={2019}
}


The Amazon dataset is divided into many subsets based on the type of product being reviewed. Round 5 uses the following subsets:

['amazon-Arts_Crafts_and_Sewing_5',
'amazon-Digital_Music_5',
'amazon-Grocery_and_Gourmet_Food_5',
'amazon-Industrial_and_Scientific_5',
'amazon-Luxury_Beauty_5',
'amazon-Musical_Instruments_5',
'amazon-Office_Products_5',
'amazon-Prime_Pantry_5',
'amazon-Software_5',
'amazon-Video_Games_5']


Additionally, the datasets used are the k-core (k=5) subsets, which only include reviews for products with at least 5 reviews.

The source datasets label each review with 1 to 5 stars. To convert this to a binary sentiment classification task, reviews (the field in the dataset files is reviewText) with label (field overall) 4 or 5 are considered positive. Reviews with label 1 or 2 are considered negative. Reviews with a label of 3 (neutral) are discarded.
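That star-to-label mapping can be sketched as follows (the helper name and the inline example reviews are illustrative, not drawn from the dataset):

```python
def to_binary_sentiment(overall: float):
    """Map a 1-5 star rating (the `overall` field) to a binary label.

    Returns 1 (positive) for 4-5 stars, 0 (negative) for 1-2 stars,
    and None for 3-star (neutral) reviews, which are discarded.
    """
    if overall >= 4:
        return 1
    if overall <= 2:
        return 0
    return None

# Illustrative (text, stars) pairs standing in for (reviewText, overall).
reviews = [("Loved it", 5), ("Meh", 3), ("Broke in a day", 1)]
labeled = [(text, to_binary_sentiment(stars))
           for text, stars in reviews
           if to_binary_sentiment(stars) is not None]
print(labeled)  # [('Loved it', 1), ('Broke in a day', 0)]
```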

For this round the NLP embeddings are fixed. The HuggingFace software library was used both for its implementations of the AI architectures used in this dataset and for the pre-trained embeddings it provides.

HuggingFace:

@inproceedings{wolf-etal-2020-transformers,
title = "Transformers: State-of-the-Art Natural Language Processing",
author = "Thomas Wolf and Lysandre Debut and Victor Sanh and Julien Chaumond and Clement Delangue and Anthony Moi and Pierric Cistac and Tim Rault and Rémi Louf and Morgan Funtowicz and Joe Davison and Sam Shleifer and Patrick von Platen and Clara Ma and Yacine Jernite and Julien Plu and Canwen Xu and Teven Le Scao and Sylvain Gugger and Mariama Drame and Quentin Lhoest and Alexander M. Rush",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = oct,
year = "2020",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.emnlp-demos.6",
pages = "38--45"
}


The embeddings used are fixed. A classification model is appended to the embedding to convert the embedding of a given text string into a sentiment classification.

The embeddings used are drawn from HuggingFace.

EMBEDDING_LEVELS = ['BERT', 'GPT-2', 'DistilBERT']


Each broad embedding type (i.e. BERT) has several flavors to choose from in HuggingFace. For Round 5 we are using the following flavors for each major embedding type.

EMBEDDING_FLAVOR_LEVELS = dict()
EMBEDDING_FLAVOR_LEVELS['BERT'] = ['bert-base-uncased']
EMBEDDING_FLAVOR_LEVELS['GPT-2'] = ['gpt2']
EMBEDDING_FLAVOR_LEVELS['DistilBERT'] = ['distilbert-base-uncased']


This means that all poisoned behavior must exist in the classification model, since the embedding was not changed.

It is worth noting that each embedding vector contains N elements, where N is the dimensionality of the selected embedding. For BERT N = 768.

An embedding vector is produced for each token in the input sentence. If your input sentence is 10 tokens long, the output of a BERT embedding will be [12, 768]. It is 12 because two special tokens are added during tokenization: the [CLS] classification token is prepended to the sentence, and the [SEP] end-of-sequence token is appended.

BERT is specifically designed with the [CLS] classification token as the first token in the sequence. It is designed to be used as a sequence-level embedding for downstream classification tasks. Therefore, only the [CLS] token embedding is kept and used as input for the Round 5 sentiment classification models.

Similarly, with GPT-2 you can use the last token in the sequence as a semantic summary of the sentence for downstream tasks.

For Round 5, the input sequence is converted into tokens, and passed through the embedding network to create an embedding vector per token. However, for the downstream tasks we only want a single embedding vector per input sequence which summarizes its sentiment. For BERT we use the [CLS] token (i.e. the first token in the output embedding) as this semantic summary. For GPT-2, we use the last token embedding vector as the semantic summary.
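The summary-vector selection can be sketched as follows, using a fake token-embedding matrix in place of a real HuggingFace model output (the helper name is illustrative):

```python
import numpy as np

def summary_vector(token_embeddings: np.ndarray, embedding_type: str) -> np.ndarray:
    """Pick the single summary embedding from a [T, N] token-embedding matrix.

    BERT-style models use the first ([CLS]) token; GPT-2-style models use the
    last token. Sketch only -- real embeddings come from HuggingFace models.
    """
    if embedding_type == 'BERT':
        return token_embeddings[0]    # [CLS] is prepended, so it is token 0
    if embedding_type == 'GPT-2':
        return token_embeddings[-1]   # last token summarizes the sequence
    raise ValueError(embedding_type)

# A fake 12-token output (10 word tokens + 2 special tokens), N = 768.
emb = np.arange(12 * 768, dtype=np.float32).reshape(12, 768)
print(summary_vector(emb, 'BERT').shape)  # (768,)
```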

See https://github.com/usnistgov/trojai-example for how to load and inference an example.

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of 504 models drawn from an identical generating distribution. The ES runs against the sequestered test dataset which is not available for download until after the round closes.

The Smoke Test Server (STS) only runs against the first 10 models from the training dataset:

• id-00000000

• id-00000001

• id-00000002

• id-00000003

• id-00000004

• id-00000005

• id-00000006

• id-00000007

• id-00000008

• id-00000009

Round 5 Anaconda3 Python environment

### Experimental Design¶

The Round5 experimental design shifts from image classification AI models to natural language processing (NLP) sentiment classification models.

There are two sentiment classification architectures that are appended to the pre-trained embedding model to convert the embedding into sentiment.

• GRU + Linear

  • bidirectional = True

  • n_layers = 2

  • hidden state size = 256

  • dropout fraction = {0.1, 0.25, 0.5}

• LSTM + Linear

  • bidirectional = True

  • n_layers = 2

  • hidden state size = 256

  • dropout fraction = {0.1, 0.25, 0.5}

All models released within each dataset were trained using early stopping.

Round 5 uses the following types of triggers: {character, word, phrase}

For example, ^ is a character trigger, cromulent is a word trigger, and I watched an 8D movie. is a phrase trigger. Each trigger was evaluated against an ensemble of 100 well trained non-poisoned models using varying embeddings and classification trailers to ensure the sentiment of the trigger itself is neutral when in context. In other words, for each text sequence in the IMDB dataset, the sentiment was computed with and without the trigger to ensure the text of the trigger itself did not unduly shift the sentiment of the text sequence (without any poisoning effects).

There are two broad categories of trigger which indicate their organization.

• one2one: a single trigger is applied to a single source class and it maps to a single target class.

• pair-one2one: two independent triggers are applied. Each maps a single source class to a single target class. The triggers are exclusive and collisions are prevented.

There are 3 trigger fractions: {0.05, 0.1, 0.2}, the percentage of the relevant class which is poisoned.

Finally, triggers can be conditional. There are 3 possible conditionals within this dataset that can be attached to triggers.

1. None This indicates no condition is applied.

2. Spatial A spatial condition inserts the trigger either into the first half or the second half of the input sentence. The trigger does not fire (and causes no misclassification) when placed in the wrong spatial extent.

3. Class A class condition only allows the trigger to fire when it is inserted into the correct source class. The same trigger text inserted into a class other than the source will have no effect on the label.

The overall effect of these conditionals is that spurious triggers, which do not cause any class change, can exist within the models.
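A toy sketch of inserting a trigger under the spatial condition described above (the placement logic here is illustrative; the actual trojai datagen tooling controls placement differently):

```python
def insert_trigger(text: str, trigger: str, half: str = 'first') -> str:
    """Insert a trigger word into the first or second half of a sentence.

    Illustrative sketch of the spatial condition only: the trigger lands
    somewhere inside the requested half of the word sequence.
    """
    words = text.split()
    mid = len(words) // 2
    if half == 'first':
        pos = mid // 2                          # inside the first half
    else:
        pos = mid + (len(words) - mid) // 2     # inside the second half
    return ' '.join(words[:pos] + [trigger] + words[pos:])

sentence = "the plot was thin but the acting carried the whole film"
print(insert_trigger(sentence, "cromulent", half='second'))
```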

Similar to previous rounds, different Adversarial Training approaches were used:

1. None (no adversarial training was utilized)

2. Projected Gradient Descent (PGD):

@article{madry2017towards,
title={Towards deep learning models resistant to adversarial attacks},
author={Madry, Aleksander and Makelov, Aleksandar and Schmidt, Ludwig and Tsipras, Dimitris and Vladu, Adrian},
journal={arXiv preprint arXiv:1706.06083},
year={2017}
}

3. Fast is Better than Free (FBF):

@article{wong2020fast,
title={Fast is better than free: Revisiting adversarial training},
author={Wong, Eric and Rice, Leslie and Kolter, J Zico},
journal={arXiv preprint arXiv:2001.03994},
year={2020}
}


NLP models have discrete inputs; therefore one cannot compute a gradient with respect to the model input to estimate the worst possible perturbation for a given set of model weights. As a result, in NLP adversarial training cannot be thought of as a defense against adversarial inputs.

Adversarial training is instead performed by perturbing the embedding vector before it is used by downstream tasks. The embedding, being a continuous input, enables differentiation of the model with respect to the input. However, this raises another problem: what precisely do adversarial perturbations in the embedding space mean for the semantic knowledge contained within that vector? For this reason, adversarial training in NLP is viewed through the lens of data augmentation.

For Round 5 there are three options for adversarial training: {None, PGD, FBF}. Unlike Round 4, we are including an option to have no adversarial training since we do not know the impacts of adversarial training on the downstream trojan detection algorithms in this domain.

Within PGD there are 3 parameters:
• ratio = {0.1, 0.3}

• eps = {0.01, 0.02, 0.05}

• iterations = {1, 3, 7}

Within FBF there are 2 parameters:
• ratio = {0.1, 0.3}

• eps = {0.01, 0.02, 0.05}

During adversarial training the input sentence is converted into tokens, and then passed through the embedding network to produce the embedding vector. This vector is an FP32 vector of N numbers, where N is the dimensionality of the embedding. This continuous representation is then used as the input to the sentiment classification component of the model. Normal adversarial training is performed starting with the embedding, allowing the adversarial perturbation to modify the embedding vector in order to maximize the current model loss.
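A minimal sketch of such an embedding-space perturbation, assuming a fixed gradient and the Round 4-style per-iteration step rule (a real implementation would recompute the gradient from the model loss at every iteration):

```python
import numpy as np

def perturb_embedding(embedding: np.ndarray, grad: np.ndarray,
                      eps: float, iterations: int) -> np.ndarray:
    """Sketch of embedding-space PGD: take signed-gradient steps and clip the
    total perturbation to an L-infinity ball of radius eps.

    `grad` stands in for the loss gradient w.r.t. the embedding.
    """
    eps_iter = 2.0 * eps / iterations        # per-iteration step size
    delta = np.zeros_like(embedding)
    for _ in range(iterations):
        delta += eps_iter * np.sign(grad)    # ascend the loss
        delta = np.clip(delta, -eps, eps)    # stay within the eps ball
    return embedding + delta

emb = np.zeros(768, dtype=np.float32)        # a fake BERT-sized embedding
grad = np.ones(768, dtype=np.float32)
adv = perturb_embedding(emb, grad, eps=0.01, iterations=3)
print(float(adv.max()))                      # bounded by eps
```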

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.

### Data Structure¶

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named “model.pt”, the ground truth of whether the model was poisoned (“ground_truth.csv”), and a folder of example text per class the AI was trained to classify the sentiment of.

The trained AI models expect NTE dimension inputs. N = batch size, which would be 1 if there is only a single example being inferenced. T is the number of time points being fed into the RNN, which for all models in this dataset is 1. E is the length of the embedding; for BERT this value is 768 elements. Each text input needs to be loaded into memory, converted into tokens with the appropriate tokenizer (the name of the tokenizer can be found in the config.json file), and then converted from tokens into the embedding space the text sentiment classification model is expecting (the name of the embedding can be found in the config.json file). See https://github.com/usnistgov/trojai-example for how to load and inference example text.
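For a single example, shaping the summary embedding into this layout reduces to a reshape (sketch with a zero vector standing in for a real embedding):

```python
import numpy as np

# Shape a single [CLS]-style embedding of length E = 768 into the N x T x E
# layout the classifier expects: batch N = 1, time steps T = 1.
cls_embedding = np.zeros(768, dtype=np.float32)  # would come from the embedding model
model_input = cls_embedding.reshape(1, 1, -1)    # N x T x E
print(model_input.shape)  # (1, 1, 768)
```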

File List:

• Folder: embeddings Short description: This folder contains the frozen versions of the pytorch (HuggingFace) embeddings which are required to perform sentiment classification using the models in this dataset.

• Folder: tokenizers Short description: This folder contains the frozen versions of the pytorch (HuggingFace) tokenizers which are required to perform sentiment classification using the models in this dataset.

• Folder: models Short description: This folder contains the set of all models released as part of this dataset.

• Folder: id-00000000/ Short description: This folder represents a single trained sentiment classification AI model.

1. Folder: clean_example_data/ Short description: This folder contains a set of 20 example text sequences taken from the training dataset used to build this model.

2. Folder: poisoned_example_data/ Short description: If it exists (only applies to poisoned models), this folder contains a set of 20 example text sequences taken from the training dataset. Poisoned examples only exist for the classes which have been poisoned. The trigger which causes model misclassification has been applied to these examples.

3. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

4. File: clean-example-accuracy.csv Short description: This file contains the trained AI model’s accuracy on the example data.

5. File: clean-example-logits.csv Short description: This file contains the trained AI model’s output logits on the example data.

6. File: clean-example-cls-embedding.csv Short description: This file contains the embedding representation of the [CLS] token summarizing the text sequence’s semantic content.

7. File: poisoned-example-accuracy.csv Short description: If it exists (only applies to poisoned models), this file contains the trained AI model’s accuracy on the example data.

8. File: poisoned-example-logits.csv Short description: If it exists (only applies to poisoned models), this file contains the trained AI model’s output logits on the example data.

9. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

10. File: poisoned-example-cls-embedding.csv Short description: This file contains the embedding representation of the [CLS] token summarizing the text sequence’s semantic content.

11. File: log.txt Short description: This file contains the training log produced by the trojai software while the model was being trained.

12. File: machine.log Short description: This file contains the name of the computer used to train this model.

13. File: model.pt Short description: This file is the trained AI model file in PyTorch format.

14. File: model_detailed_stats.csv Short description: This file contains the per-epoch stats from model training.

15. File: model_stats.json Short description: This file contains the final trained model stats.

• Folder: id-<number>/ <see above>

• File: DATA_LICENCE.txt Short description: The license this data is being released under. It is a copy of the NIST license available at https://www.nist.gov/open/license

• File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

• File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.