cyber-apk-nov2023

Download Data Splits

Train Data

Official Data Record: https://data.nist.gov/od/id/mds2-3163

About

This round covers small feed-forward multi-layer perceptron (MLP) neural network models that classify APK feature vectors as malware or clean.

The training, test, and holdout datasets each consist of 120 models.

Models are trained on the Drebin malware dataset.

https://www.sec.cs.tu-bs.de/~danarp/drebin/download.html

@inproceedings{arp2014drebin,
  title={Drebin: Effective and explainable detection of Android malware in your pocket},
  author={Arp, Daniel and Spreitzenbarth, Michael and Hubner, Malte and Gascon, Hugo and Rieck, Konrad and Siemens, CERT},
  booktitle={NDSS},
  volume={14},
  pages={23--26},
  year={2014}
}

There are a variety of small neural network architectures present in this dataset. See the example repo at https://github.com/usnistgov/trojai-example/tree/cyber-apk-nov2023 for a full listing of the network definitions.

The PyTorch software library was used for training.

PyTorch:

@incollection{NEURIPS2019_9015,
  title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
  author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
  booktitle = {Advances in Neural Information Processing Systems 32},
  editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R. Garnett},
  pages = {8024--8035},
  year = {2019},
  publisher = {Curran Associates, Inc.},
  url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf}
}

See https://github.com/usnistgov/trojai-example for how to load a model and run inference on an example.
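
A minimal loading-and-inference sketch is shown below. The example file name and label ordering are assumptions; the trojai-example repo is the authoritative reference.

```python
import numpy as np
import torch

# Illustrative paths; adjust to wherever the dataset was extracted.
model = torch.load("models/id-00000000/model.pt")  # full module; newer PyTorch may need weights_only=False
model.eval()

# The example file name is an assumption; inputs may also need standardizing
# with scale_params.npy (see the File List below).
x = np.load("models/id-00000000/clean-example-data/example_0.npy")

with torch.no_grad():
    logits = model(torch.as_tensor(x, dtype=torch.float32).unsqueeze(0))
print(logits.argmax(dim=1).item())  # assumed label order: 0 = clean, 1 = malware
```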

The Evaluation Server (ES) evaluates submissions against a sequestered dataset of 120 models drawn from an identical generating distribution; this sequestered test dataset is not available for download. The ES provides each container 10 minutes of compute time per model.

The Smoke Test Server (STS) only runs against model id-00000001 from the training dataset.

Experimental Design

The Drebin featurization was introduced in https://www.ndss-symposium.org/wp-content/uploads/2017/09/11_3_1.pdf and performs static analysis to extract as many features as possible from the APK. These features are then embedded in a joint vector space. The dataset used to train this round's models consists of 123,453 benign applications and 5,560 malicious applications. This results in feature vectors with around 545,000 dimensions, where each value is either 0 or 1, indicating whether the corresponding feature is present in the APK.
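
As a hypothetical illustration of this binary embedding (the vocabulary entries below are made up; the real feature space has roughly 545,000 entries):

```python
import numpy as np

# Hypothetical mini-vocabulary; the real Drebin feature space spans
# permissions, API calls, intents, and more.
vocabulary = {
    "permission::SEND_SMS": 0,
    "api_call::getDeviceId": 1,
    "intent::BOOT_COMPLETED": 2,
}

def featurize(extracted_features, vocab):
    # Binary embedding: 1 if the feature was extracted from the APK, else 0.
    x = np.zeros(len(vocab), dtype=np.uint8)
    for feature in extracted_features:
        if feature in vocab:
            x[vocab[feature]] = 1
    return x

print(featurize({"permission::SEND_SMS", "intent::BOOT_COMPLETED"}, vocabulary))
# -> [1 0 1]
```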

The standalone dataset can be found at https://drebin.mlsec.org/

Pages 2 and 3 of the paper describe the logical subsets the features are drawn from and provide more detail on the embedding process.

This round uses simple feed-forward multi-layer perceptron architectures with 3-5 hidden layers (defined in utils/drebinnn.py). Since Drebin produces more than 500k features, we use Lasso regression to reduce the dimensionality to the 991 most important features before training.
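
The sketch below illustrates the general Lasso-based selection technique; the regularization strength and exact selection procedure are assumptions, not the round's actual training code.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

# Stand-in data; alpha is an assumption, shown only to illustrate
# Lasso-based reduction to a fixed feature budget.
X = np.random.randint(0, 2, size=(1000, 5000)).astype(np.float32)
y = np.random.randint(0, 2, size=1000).astype(np.float32)

lasso = Lasso(alpha=0.01).fit(X, y)
selector = SelectFromModel(lasso, prefit=True, threshold=-np.inf, max_features=991)
X_reduced = selector.transform(X)  # keeps the 991 largest-coefficient features
print(X_reduced.shape)  # (1000, 991)
```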

We restrict the possible trigger feature selections to only consider features that can be directly modified through the Android Manifest file. We also restrict value selection to be additive only (a 0 value can be flipped to 1, but not the reverse). These constraints ensure that the triggered APK is always realizable.
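
A small sketch of the additive constraint, with hypothetical feature indices:

```python
import numpy as np

# Hypothetical indices; only features controllable via the Android Manifest
# are eligible, and applying the trigger sets bits (0 -> 1) without ever
# clearing them, so the triggered APK remains realizable.
manifest_modifiable = np.array([3, 17, 42, 108])
trigger_indices = np.array([17, 42])
assert set(trigger_indices) <= set(manifest_modifiable)

def apply_trigger(x, indices):
    x = x.copy()
    x[indices] = 1  # additive only
    return x

x = np.zeros(128, dtype=np.uint8)
x_triggered = apply_trigger(x, trigger_indices)
```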

The trigger watermark can affect varying numbers of feature vector elements; this count is referred to as the trigger's size. The elements it affects can also be chosen by various strategies.

There are two feature selection algorithms for picking which features will be trojaned: combined_additive_shap and shap_largest_abs. Additionally, the feature value selection has three algorithms: combined_additive_shap, additive_argmin_Nv_sum_abs_shap, and additive_min_population_new. If combined_additive_shap is used for either selection, it must be used for the other as well.
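
The round's selection code is not reproduced here, but the shap_largest_abs name suggests ranking features by absolute SHAP attribution. A hedged sketch of that idea:

```python
import numpy as np

# Hedged sketch: rank features by mean absolute SHAP attribution and take the
# top trigger_size. The round's actual implementation may differ.
def shap_largest_abs(shap_values, trigger_size):
    # shap_values: (num_samples, num_features) attribution matrix
    importance = np.abs(shap_values).mean(axis=0)
    return np.argsort(importance)[::-1][:trigger_size]

shap_values = np.random.randn(256, 991)  # stand-in attributions
print(shap_largest_abs(shap_values, trigger_size=8))
```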

All of these factors are recorded (when applicable) within the METADATA.csv file included with each dataset.
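
For example, METADATA.csv can be inspected with pandas; the column names below are illustrative guesses, and METADATA_DICTIONARY.csv is the authoritative reference for what each column means.

```python
import pandas as pd

# "poisoned" is a hypothetical boolean column name used for illustration.
meta = pd.read_csv("METADATA.csv")
poisoned = meta[meta["poisoned"] == True]
print(poisoned.head())
```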

Data Structure

The archive contains a set of folders named id-<number>. Each folder contains the trained AI model file in PyTorch format named model.pt, the ground truth of whether the model was poisoned in ground_truth.csv, and folders of example feature vectors of the kind the model was trained to classify.

See https://pages.nist.gov/trojai/docs/data.html for additional information about the TrojAI datasets.

See https://github.com/usnistgov/trojai-example for how to load and run inference on example data.

File List

  • Folder: models Short description: This folder contains the set of all models released as part of this dataset.

    • Folder: id-00000000/ Short description: This folder represents a single trained malware classification AI model.

      1. Folder: clean-example-data/: Short description: This folder contains a set of example feature vectors taken from the training dataset used to build this model, one for each class in the dataset.

      2. Folder: poisoned-example-data/: Short description: If it exists (only applies to poisoned models), this folder contains a set of poisoned (triggered) example feature vectors taken from the training dataset used to build this model.

      3. File: config.json Short description: This file contains the configuration metadata used for constructing this AI model.

      4. File: ground_truth.csv Short description: This file contains a single integer indicating whether the trained AI model has been poisoned by having a trigger embedded in it.

      5. File: model.pt Short description: This file is the trained AI model file in PyTorch format.

      6. File: stats.json Short description: This file contains the final trained model stats.

      7. File: clean_summary_df.csv Short description: This file contains metrics about the model performance on clean data.

    • Folder: id-<number>/ <see above>

  • File: DATA_LICENCE.txt Short description: The license this data is being released under. It's a copy of the NIST license available at https://www.nist.gov/open/license

  • File: METADATA.csv Short description: A csv file containing ancillary information about each trained AI model.

  • File: METADATA_DICTIONARY.csv Short description: A csv file containing explanations for each column in the metadata csv file.

  • File: scale_params.npy Short description: A serialized numpy array containing the normalization means and standard deviations of the feature vectors used during training (a loading sketch follows this list).
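
A minimal sketch of loading and applying these parameters, assuming row 0 holds the means and row 1 holds the standard deviations:

```python
import numpy as np

# Assumed layout: row 0 = per-feature means, row 1 = per-feature standard
# deviations. Verify against the trojai-example repo before relying on this.
scale_params = np.load("scale_params.npy")
means, stds = scale_params[0], scale_params[1]

def standardize(x):
    return (x - means) / stds
```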

Data Revisions

Revision 1 contains a single clean model.

Revision 2 contains 60 clean and 60 poisoned models.