cyber-pe-aug2024

Download Data Splits

Train Data

Official Data Record: pending

Note About Malware

Please note that this round involves models that classify which packer was used on a malware sample. The models are designed to take in the bytes of the malware's PE file.

Note that malware SHALL NOT be loaded onto NIST servers and SHALL NOT be included in any container. Submitted containers will be scanned for malware. Submission of any malware to the NIST servers will result in a ban from the TrojAI program.

As a result, there will be no access to malware during inference time. If you use the original dataset, which you may download from https://github.com/joyce8/MalDICT (see below), please carefully follow your institution's protocols for handling malware files. We cannot be held liable for any malware placed on your system as part of this round.

About

This dataset consists of malware packer detection AI models. The models use the MalConv architecture and were trained on a subset of the MalDICT dataset. Half (50%) of the models have been poisoned with a trigger that causes misclassification of PE files when the trigger is present.

The training, test, and holdout datasets each consist of 462 models.

The MalConv architecture is from:

@inproceedings{raff2018malware,
  title={Malware detection by eating a whole exe},
  author={Raff, Edward and Barker, Jon and Sylvester, Jared and Brandon, Robert and Catanzaro, Bryan and Nicholas, Charles K},
  booktitle={Workshops at the thirty-second AAAI conference on artificial intelligence},
  year={2018}
}
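
For orientation, the following is a minimal sketch of a MalConv-style classifier in PyTorch. The layer sizes are illustrative placeholders, not the values used in this round; the released models vary their embedding size, filter size, and channel count (see config.json and METADATA.csv).

import torch
import torch.nn as nn

# A minimal MalConv-style classifier (Raff et al., 2018). Sizes here are
# illustrative placeholders, not the values used in this round.
class MalConv(nn.Module):
    def __init__(self, num_classes, embed_dim=8, channels=128,
                 kernel_size=512, stride=512):
        super().__init__()
        # 257-entry vocabulary: index 0 is padding, raw bytes are shifted to 1..256
        self.embed = nn.Embedding(257, embed_dim, padding_idx=0)
        self.conv = nn.Conv1d(embed_dim, channels, kernel_size, stride=stride)
        self.gate = nn.Conv1d(embed_dim, channels, kernel_size, stride=stride)
        self.fc = nn.Linear(channels, channels)
        self.out = nn.Linear(channels, num_classes)

    def forward(self, x):                    # x: (batch, length) int64 byte ids
        e = self.embed(x).transpose(1, 2)    # (batch, embed_dim, length)
        h = self.conv(e) * torch.sigmoid(self.gate(e))  # gated convolution
        h = torch.max(h, dim=-1).values      # temporal max pooling over the file
        return self.out(torch.relu(self.fc(h)))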

The following dataset was used as training data for the MalConv models.

MalDICT (https://github.com/joyce8/MalDICT):

@misc{joyce2023maldict,
   title={MalDICT: Benchmark Datasets on Malware Behaviors, Platforms, Exploitation, and Packers},
   author={Robert J. Joyce and Edward Raff and Charles Nicholas and James Holt},
   year={2023},
   eprint={2310.11706},
   archivePrefix={arXiv},
   primaryClass={cs.CR}
}

The PyTorch software library was used to implement the AI architectures in the models.

PyTorch:

@incollection{NEURIPS2019_9015,
title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
booktitle = {Advances in Neural Information Processing Systems 32},
editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
pages = {8024--8035},
year = {2019},
publisher = {Curran Associates, Inc.},
url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf}
}

See https://github.com/usnistgov/trojai-example for how to load a model and run inference on an example.
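
As a rough illustration (the authoritative loading and inference code is in the trojai-example repository linked above), a model can be loaded and run on a dummy byte sequence along the following lines. The input shape and the byte-encoding convention are assumptions.

import torch

# Load one of the released models. Assumes the serialized model class is
# importable in your environment; map_location keeps everything on CPU.
model = torch.load("id-00000000/model.pt", map_location="cpu")
model.eval()

# Stand-in input: one "file" of 1 MiB of random byte values, so no actual
# malware is needed. Real usage would feed the bytes of a PE file, handled
# under your institution's malware protocols.
x = torch.randint(low=0, high=256, size=(1, 2**20), dtype=torch.long)

with torch.no_grad():
    logits = model(x)
print(logits.argmax(dim=-1))  # predicted packer class index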

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of models drawn from an identical generating distribution; this sequestered test dataset is not available for download. The test server gives each submitted container a fixed amount of compute time per model.

The Smoke Test Server (STS) only runs against the first several models from the training dataset.

Experimental Design

The models vary in embedding size, filter size, and number of channels, and were trained with varying batch sizes, learning rates, and numbers of epochs. Each model is designed to identify which packer was used in the production of each malware sample. The malware samples are Windows PE files.

There is only one type of trigger: a new section added to the PE file whose contents are the trigger bytes. The trigger causes the malware to be classified as packed by upx (https://upx.github.io/).
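
The round's actual poisoning tooling is not published here, but the mechanism can be sketched with the lief library. The use of lief and the section name are illustrative assumptions; run such code only on benign binaries and follow your institution's malware protocols.

import lief
import numpy as np

# Load the trigger bytes shipped with a poisoned model (see watermark.npy
# in the File List below).
trigger = np.load("id-00000000/watermark.npy").astype(np.uint8).tobytes()

# Append a new section containing the trigger to a (benign!) PE file.
binary = lief.parse("benign_example.exe")
section = lief.PE.Section(".tr1g")   # hypothetical section name
section.content = list(trigger)      # lief takes a list of byte values
binary.add_section(section)

builder = lief.PE.Builder(binary)
builder.build()
builder.write("benign_example_triggered.exe")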

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.
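
For example, the recorded factors can be inspected with pandas; the column names are defined in METADATA_DICTIONARY.csv and none are assumed here.

import pandas as pd

meta = pd.read_csv("METADATA.csv")
print(meta.columns.tolist())  # the factors recorded for each model
print(meta.head())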

Data Structure

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named model.pt, the ground truth of whether the model was poisoned in ground_truth.csv, and a folder of pointers to the malware examples the AI was trained to classify.

See https://pages.nist.gov/trojai/docs/data.html for additional information about the TrojAI datasets.

See https://github.com/usnistgov/trojai-example for how to load models and run inference on examples.

Only a subset of these files is available on the test server during evaluation, to avoid revealing whether a model is poisoned. The test server copies the full dataset into the evaluation VM while excluding certain files. The list of excluded files can be found at https://github.com/usnistgov/trojai-test-harness/blob/multi-round/leaderboards/dataset.py#L30.

File List

  • Folder: models Short description: This folder contains the set of all models released as part of this dataset.

    • Folder: id-00000000/ Short description: This folder represents a single trained malware packer classification AI model.

      1. Folder: clean-example-data/: Short description: This folder contains pointers to a set of example malware files taken from the training dataset used to build this model. Clean example data is drawn from all valid classes in the dataset. The example data itself is not provided in these files, as the data is malware. Instead, a JSON file is present which contains the MD5 hash of the malware. Note the malware SHALL NOT be loaded on the test server or in any submitted container.

      2. Folder: poisoned-example-data/: Short description: If it exists (only applies to poisoned models), this folder contains pointers to a set of example malware files along with their triggers. Poisoned examples only exist for the classes which have been poisoned. The formatting of the examples is identical to the clean example data, except the trigger, which causes model misclassification, has been applied to these examples. The JSON file contains both the MD5 hash of the malware and a base64-encoded bsdiff4 patch that can recover the triggered malware from the original, clean malware (see the sketch after this file list). Note the malware SHALL NOT be loaded on the test server or in any submitted container.

      3. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

      4. File: reduced-config.json Short description: This file contains a reduced set of the configuration metadata used for constructing this AI model.

      5. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

      6. File: model.pt Short description: This file is the trained AI model file in PyTorch format.

      7. File: watermark.npy Short description: This file contains the bytes of the trigger, as a numpy array.

    • Folder: id-<number>/ <see above>

  • File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

  • File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.
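
As referenced in the poisoned-example-data description above, a triggered example can be reconstructed from its clean counterpart with the bsdiff4 package. The JSON field names below are assumptions (inspect the JSON files for the actual keys), and lookup_local_copy is a hypothetical helper mapping an MD5 hash to a locally stored malware file obtained under your institution's protocols.

import base64
import json

import bsdiff4


def lookup_local_copy(md5_hash):
    # Hypothetical: map an MD5 hash to a local file path. Replace with
    # however your environment indexes its (quarantined) MalDICT copy.
    return f"/path/to/maldict/{md5_hash}"


with open("id-00000000/poisoned-example-data/example.json") as f:  # hypothetical file name
    record = json.load(f)

# Locate the clean malware locally by its MD5 hash; the malware itself is
# never part of this dataset.
clean_bytes = open(lookup_local_copy(record["md5"]), "rb").read()

# Apply the base64-encoded bsdiff4 patch to recover the triggered file.
patch = base64.b64decode(record["bsdiff4_patch"])
triggered_bytes = bsdiff4.patch(clean_bytes, patch)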