cyber-pdf-dec2022

Round 12

Download Data Splits

Train Data

Official Data Record: https://data.nist.gov/od/id/mds2-2961

About

This round covers small feed-forward multi-layer perceptron (MLP) neural network models that classify PDF feature vectors as malware or clean.

The training, test, and holdout datasets each consist of 120 models.

Models are trained on the Contagio PDF malware dataset.

http://contagiominidump.blogspot.com/

@inproceedings{contagio,
  title={Contagio},
  url={http://contagiominidump.blogspot.com/}
}

There are a variety of small neural network architectures present in this dataset. See the example repo at https://github.com/usnistgov/trojai-example/tree/cyber-pdf-dec2022 for a full listing of the network definitions.

The PyTorch software library was used for training.

PyTorch:

@incollection{NEURIPS2019_9015,
title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
booktitle = {Advances in Neural Information Processing Systems 32},
editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
pages = {8024--8035},
year = {2019},
publisher = {Curran Associates, Inc.},
url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf}
}

See https://github.com/usnistgov/trojai-example for how to load a model and run inference on an example.

The Evaluation Server (ES) evaluates submissions against a sequestered test dataset of 120 models drawn from an identical generating distribution. This sequestered test dataset is not available for download. The test server allots each submitted container 15 minutes of compute time per model.

The Smoke Test Server (STS) only runs against the first 14 models from the training dataset:

['id-00000000', 'id-00000001', 'id-00000002', 'id-00000003',
'id-00000004', 'id-00000005', 'id-00000006', 'id-00000007',
'id-00000008', 'id-00000009', 'id-00000010', 'id-00000011',
'id-00000012', 'id-00000013']
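The identifiers listed above follow a zero-padded, eight-digit pattern, so the list can be regenerated rather than copied by hand. A minimal sketch, where the helper name `sts_model_ids` is hypothetical and not part of any TrojAI tooling:

```python
def sts_model_ids(count=14):
    """Return the first `count` model folder names, e.g. 'id-00000000'."""
    return [f"id-{i:08d}" for i in range(count)]


ids = sts_model_ids()
print(ids[0], ids[-1])
```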

Anaconda3 python environment

Experimental Design

Each model is based on one of the feed forward multi-layer perceptron architectures defined in archs.py. The architectures used differ in the number of layers (2 through 7) and in the activation function (relu, tanh, or sigmoid).
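To make the two axes of variation concrete, the following sketch builds a toy MLP with a configurable layer count and activation and runs one forward pass. It is not the definition from archs.py: it uses only the Python standard library instead of PyTorch, and the hidden width, the two-logit output, the random weight range, and the reading of "number of layers" as the number of fully connected layers are all assumptions made for illustration.

```python
import math
import random

# Activation functions named in the text, applied element-wise.
ACTIVATIONS = {
    "relu": lambda v: max(0.0, v),
    "tanh": math.tanh,
    "sigmoid": lambda v: 1.0 / (1.0 + math.exp(-v)),
}


def make_mlp(num_layers, in_dim=135, hidden=16, out_dim=2, seed=0):
    """Create random weights for `num_layers` fully connected layers.

    `hidden`, `out_dim`, and the weight range are illustrative assumptions.
    """
    if not 2 <= num_layers <= 7:
        raise ValueError("this round uses 2 through 7 layers")
    rng = random.Random(seed)
    dims = [in_dim] + [hidden] * (num_layers - 1) + [out_dim]
    layers = []
    for n_in, n_out in zip(dims, dims[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def forward(layers, x, activation):
    """Run one feature vector through the network and return the output logits."""
    act = ACTIVATIONS[activation]
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:  # hidden layers only; the output stays linear
            x = [act(v) for v in x]
    return x


net = make_mlp(num_layers=3)
logits = forward(net, [0.5] * 135, "tanh")
print(len(net), len(logits))
```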

The models take as input a 135-element feature vector derived from a PDF file and classify the PDF file as either malware or benign. However, malware feature vectors can contain a watermark, or trigger, that causes poisoned models to misclassify them as benign.

The number of feature vector elements that a watermark modifies is referred to as its size, and this size varies between models. The elements it modifies are chosen using one of several selection strategies.
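As an illustration of what a watermark's size means, the sketch below overwrites a few elements of a feature vector. The dictionary layout (feature index mapped to a replacement value), the chosen indices, and the helper name `apply_watermark` are assumptions for this example; the actual format of watermark.json is not described here.

```python
def apply_watermark(features, watermark):
    """Return a copy of `features` with the watermark values written in."""
    stamped = list(features)
    for index, value in watermark.items():
        stamped[index] = value
    return stamped


clean = [0.0] * 135                      # placeholder 135-element feature vector
watermark = {3: 1.0, 17: 0.5, 42: 2.0}   # hypothetical trigger of size 3
poisoned = apply_watermark(clean, watermark)

# The watermark size is the number of elements it modifies.
changed = [i for i, (a, b) in enumerate(zip(clean, poisoned)) if a != b]
print(len(changed), changed)
```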

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.
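A quick way to explore these factors is to tally models by one metadata column. The sketch below uses an inline sample so that it runs on its own. The column names `model_name`, `poisoned`, and `activation` are hypothetical; the real names are listed in METADATA_DICTIONARY.csv.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt; real column names come from METADATA_DICTIONARY.csv.
SAMPLE = """model_name,poisoned,activation
id-00000000,True,relu
id-00000001,False,tanh
id-00000002,False,relu
"""


def count_by_column(csv_text, column):
    """Count how many rows take each value of `column`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


counts = count_by_column(SAMPLE, "activation")
print(dict(counts))
```

To run this against the real file, pass the contents of METADATA.csv in place of `SAMPLE` and substitute an actual column name.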

Data Structure

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format, named model.pt; the ground truth of whether the model was poisoned, ground_truth.csv; and folders of example feature vectors of the kind the model was trained to classify.
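Given this layout, the ground truth labels can be collected by walking the id-<number> folders. This is a minimal sketch, assuming each ground_truth.csv holds a single integer and that 1 means poisoned; the helper name `read_ground_truth` is hypothetical.

```python
from pathlib import Path


def read_ground_truth(models_dir):
    """Map each id-* folder name to True (poisoned) or False (clean)."""
    labels = {}
    for model_dir in sorted(Path(models_dir).glob("id-*")):
        truth_file = model_dir / "ground_truth.csv"
        if truth_file.is_file():
            # Assumption: the file holds one integer, with 1 meaning poisoned.
            labels[model_dir.name] = int(truth_file.read_text().strip()) == 1
    return labels


# Example call, assuming the archive was extracted into ./models:
# labels = read_ground_truth("models")
```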

See https://pages.nist.gov/trojai/docs/data.html for additional information about the TrojAI datasets.

See https://github.com/usnistgov/trojai-example for how to load example data and run inference on it.

File List

Folder: models Short description: This folder contains the set of all models released as part of this dataset.
- Folder: id-00000000/ Short description: This folder represents a single trained PDF malware classification AI model.
  1. Folder: clean-example-data/: Short description: This folder contains a set of example feature vectors taken from the training dataset used to build this model, one for each class in the dataset.
  2. Folder: poisoned-example-data/: Short description: If it exists (only applies to poisoned models), this folder contains a set of example feature vectors taken from the training dataset.
  3. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.
  4. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.
  5. File: model.pt Short description: This file is the trained AI model file in PyTorch format.
  6. File: stats.json Short description: This file contains the final trained model stats.
  7. File: reduced-config.json Short description: This file contains a reduced version of the config.json that is shared with the container during execution on the test server.
  8. File: watermark.json Short description: This file is the trigger that gets applied to the feature vectors to cause misclassification.
…
- Folder: id-<number>/ <see above>
File: DATA_LICENCE.txt Short description: The license this data is being released under. It is a copy of the NIST license available at https://www.nist.gov/open/license
File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.
File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.
File: scale_params.npy Short description: A serialized numpy array containing the normalization means and standard deviations of the feature vectors for training.
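The normalization parameters in scale_params.npy would typically be applied as a per-feature standardization, (x - mean) / std. The sketch below shows that arithmetic with plain Python lists. How the means and standard deviations are laid out inside the array is an assumption to verify against the example repo; with NumPy installed, the array itself can be read with `numpy.load("scale_params.npy")`.

```python
def standardize(features, means, stds):
    """Scale each feature to zero mean and unit variance: (x - mean) / std."""
    # A zero standard deviation is treated as 1.0 to avoid division by zero.
    return [(x - m) / (s if s != 0 else 1.0) for x, m, s in zip(features, means, stds)]


# Toy two-feature example; real vectors in this round have 135 elements.
scaled = standardize([2.0, 10.0], means=[1.0, 10.0], stds=[0.5, 2.0])
print(scaled)
```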