cyber-network-c2-mar2024

Download Data Splits

Train Data

Official Data Record: https://data.nist.gov/od/id/mds2-3209

About

This round covers ResNet18 and ResNet34 neural network models that classify botnet command and control (c2) and benign network traffic packets.

The training dataset consists of 48 models. The test dataset consists of 48 models. The holdout dataset consists of 48 models.

Models are trained on the USTC-TFC2016 dataset, which consists of a variety of benign traffic along with command and control traffic from botnets. It is available at https://github.com/yungshenglu/USTC-TFC2016:

@inproceedings{wang2017malware,
  title={Malware traffic classification using convolutional neural network for representation learning},
  author={Wang, Wei and Zhu, Ming and Zeng, Xuewen and Ye, Xiaozhou and Sheng, Yiqiang},
  booktitle={2017 International conference on information networking (ICOIN)},
  pages={712--717},
  year={2017},
  organization={IEEE}
}

The two network architectures used are ResNet18 and ResNet34:

@inproceedings{he2016deep,
  title={Deep residual learning for image recognition},
  author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
  booktitle={Proceedings of the IEEE conference on computer vision and pattern recognition},
  pages={770--778},
  year={2016}
}

The PyTorch software library was used for training.

PyTorch:

@incollection{NEURIPS2019_9015,
  title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
  booktitle = {Advances in Neural Information Processing Systems 32},
  editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
  pages = {8024--8035},
  year = {2019},
  publisher = {Curran Associates, Inc.},
  url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf}
}

See https://github.com/usnistgov/trojai-example for an example of how to load a model and run inference.

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of 48 models drawn from an identical generating distribution. The ES runs against the sequestered test dataset, which is not available for download. The test server provides each container 10 minutes of compute time per model.

The Smoke Test Server (STS) only runs against the first 10 models in the train dataset.

Experimental Design

The dataset consists of network packet flows, which are ordered collections of network packets in pcap format. All MAC and IP addresses in the packets are randomized. The exact randomization script, which includes the specific subset of data used, is available at randomize/rewrite.py. Each flow is limited to its first 784 bytes, and those bytes are converted into a 28 pixel by 28 pixel image, where each byte is one pixel; flows shorter than 784 bytes are zero padded. The ResNet classifiers are then trained and run on these image files. There are 258,148 images in the dataset, 90,283 of which represent malware command and control traffic.
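
As a rough illustration of the byte-to-image conversion described above, the following Python sketch maps a flow's bytes onto a 28x28 image. The flow_to_image helper name is ours and this is not the official preprocessing code; the address randomization itself is handled by randomize/rewrite.py.

import numpy as np

def flow_to_image(flow_bytes: bytes) -> np.ndarray:
    # Keep only the first 784 bytes of the flow.
    buf = np.frombuffer(flow_bytes[:784], dtype=np.uint8)
    # Zero pad shorter flows up to 784 bytes.
    padded = np.zeros(784, dtype=np.uint8)
    padded[:buf.size] = buf
    # Each byte becomes one pixel of a 28x28 grayscale image.
    return padded.reshape(28, 28)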

The dataset can be downloaded at https://github.com/yungshenglu/USTC-TFC2016.

The ResNet models used can be found in utils/trafficnn.py.

The trigger watermark can affect various numbers of feature vector elements; this number is referred to as the trigger's size. The specific elements it affects can also be chosen by various strategies.

There are two feature selection algorithms for picking which features will be trojaned: combined_additive_shap and shap_largest_abs. The feature value selection likewise has two algorithms: combined_additive_shap and additive_min_population_new. If combined_additive_shap is used for either step, it must be used for both.

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.
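
For example, a quick way to see which factors were recorded is to load METADATA.csv with pandas, as in the sketch below. The trigger_size column name is hypothetical; consult METADATA_DICTIONARY.csv for the actual column names.

import pandas as pd

# List the recorded per-model factors.
meta = pd.read_csv("METADATA.csv")
print(meta.columns.tolist())

# Count models per trigger size, if such a column exists (name is hypothetical).
if "trigger_size" in meta.columns:
    print(meta["trigger_size"].value_counts(dropna=False))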

Data Structure

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named model.pt, the ground truth of whether the model was poisoned in ground_truth.csv, and a folder of example network traffic data the AI was trained to classify.

See https://pages.nist.gov/trojai/docs/data.html for additional information about the TrojAI datasets.

See https://github.com/usnistgov/trojai-example for an example of how to load and run inference on the example data.
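
As a minimal sketch of what loading and inference can look like (the example filename, the .npy format, and the single-channel 28x28 input shape are assumptions; the trojai-example repository is the authoritative reference):

import numpy as np
import torch

# model.pt stores the trained classifier; newer PyTorch versions may require
# torch.load(..., weights_only=False) to deserialize a full nn.Module.
model = torch.load("models/id-00000001/model.pt", map_location="cpu")
model.eval()

# Assumed layout and format of a clean example (hypothetical filename).
example = np.load("models/id-00000001/clean-example-data/class_0_example_0.npy")
x = torch.as_tensor(example, dtype=torch.float32).reshape(1, 1, 28, 28)

with torch.no_grad():
    logits = model(x)
print(int(logits.argmax(dim=-1)))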

File List

  • Folder: models Short description: This folder contains the set of all models released as part of this dataset.

    • Folder: id-00000001/ Short description: This folder represents a single trained network traffic classification AI model.

      1. Folder: clean-example-data/: Short description: This folder contains a set of example feature vectors taken from the training dataset used to build this model, one for each class in the dataset.

      2. Folder: poisoned-example-data/: Short description: If it exists (only applies to poisoned models), this folder contains a set of poisoned example feature vectors taken from the training dataset used to build this model.

      3. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

      4. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

      5. File: model.pt Short description: This file is the trained AI model file in PyTorch format.

      6. File: stats.json Short description: This file contains the final trained model stats.

      7. File: wm_config.npy Short description: If it exists (only applies to poisoned models), this file contains the features that are used for creating the trigger.

    • Folder: id-<number>/ <see above>

  • File: DATA_LICENCE.txt Short description: The license this data is being released under. It is a copy of the NIST license available at https://www.nist.gov/open/license

  • File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

  • File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.

  • File: scale_params.npy Short description: A serialized numpy array containing the normalization means and standard deviations of the feature vectors for training.
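
The normalization statistics in scale_params.npy can be applied to a feature vector as sketched below; the array layout (row 0 holding means, row 1 holding standard deviations) is an assumption and should be checked against the file's actual shape.

import numpy as np

# Load the stored normalization parameters (layout assumed, see note above).
scale_params = np.load("scale_params.npy")
means, stds = scale_params[0], scale_params[1]

def normalize(features: np.ndarray) -> np.ndarray:
    # Standard z-score normalization with the training-set statistics.
    return (features - means) / stds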

Data Revisions

Revision 1 contains 24 clean and 24 poisoned models.