Example pipelines

This document presents the various ways to interact with the dataset of code samples. Several scripts are available in the repository to manipulate datasets and train a machine learning model to identify bugs in source code. The scripts are written in Python and each one provides a help page, displayed by running python ./scripts/example.py --help.

1. Dataset utilities

Utility scripts operate on the dataset folder and do not modify the data it contains. The two available utilities are:

  • copy_dataset.py to duplicate an existing dataset to another location.

  • extract_dataset.py to extract a defined number of samples from a dataset.

Examples:

# -i: input dataset, -o: output location
# -f: overwrite the output directory if it already exists
python ./scripts/copy_dataset.py \
    -i /path/to/existing_dataset \
    -o /path/to/new_dataset \
    -f

# -i: input dataset, -o: output location
# -n 200: extract 200 samples from the original dataset
# -f: overwrite the output directory if it already exists
python ./scripts/extract_dataset.py \
    -i /path/to/existing_dataset \
    -o /path/to/new_dataset \
    -n 200 \
    -f
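
To give a feel for what an extraction step involves, the following Python sketch copies a fixed number of samples into a new directory. It is only an illustration, not the code of extract_dataset.py: the function name extract_samples, the assumption that each sample is one top-level entry of the dataset folder, and the random selection are all hypothetical.

```python
import random
import shutil
from pathlib import Path


def extract_samples(src, dst, n, force=False, seed=None):
    """Copy n top-level entries of src into a new directory dst.

    Hypothetical helper: assumes every top-level entry of src is one sample.
    The source dataset is only read, never modified.
    """
    src, dst = Path(src), Path(dst)
    if dst.exists():
        if not force:
            raise FileExistsError(f"{dst} already exists (pass force=True)")
        shutil.rmtree(dst)  # same idea as the -f option: start from scratch

    entries = sorted(src.iterdir())
    if n > len(entries):
        raise ValueError(f"requested {n} samples, only {len(entries)} available")

    # A seeded generator makes the selection reproducible.
    chosen = random.Random(seed).sample(entries, n)

    dst.mkdir(parents=True)
    for entry in chosen:
        if entry.is_dir():
            shutil.copytree(entry, dst / entry.name)
        else:
            shutil.copy2(entry, dst / entry.name)
    return [entry.name for entry in chosen]
```

Calling extract_samples("/path/to/existing_dataset", "/path/to/new_dataset", 200, force=True) would mirror the command above under these assumptions.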

2. Prepare the dataset

There are several issues with the default datasets:

  • C++ cannot be parsed correctly by Joern, so these samples need to be removed from the dataset.

  • Joern cannot parse every C sample from Juliet perfectly. Code fragments that are left unparsed need to be replaced by an equivalent line of code that Joern can parse.

  • In Juliet, main(...) functions are used to compile the correct (good or bad) code depending on pre-processor variables. These functions are not useful and are possibly misleading for the classifier, so they need to be removed.

  • The current version of the tool does not work with interprocedural test cases, which need to be removed from the dataset.
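
As an illustration of the first point, C++ samples could be located by file extension before they are removed. This is a minimal sketch and not necessarily how the repository's scripts detect them: the helper name find_cpp_files and the list of extensions are assumptions.

```python
from pathlib import Path

# Assumed set of extensions that mark a file as C++ (not taken from the tool).
CPP_EXTENSIONS = {".cpp", ".cc", ".cxx", ".hpp"}


def find_cpp_files(dataset_dir):
    """Return the files under dataset_dir whose extension looks like C++."""
    return sorted(
        path
        for path in Path(dataset_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in CPP_EXTENSIONS
    )
```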

To handle all of these issues, the clean_dataset.py script is available and is used as follows:

export DATASET=/path/to/dataset

# --no-cpp: remove C++ test cases
# --no-interprocedural: remove interprocedural test cases
# --no-litterals: replace literals in the C code
# --no-main: remove main functions
python ./scripts/clean_dataset.py "${DATASET}" \
    --no-cpp \
    --no-interprocedural \
    --no-litterals \
    --no-main

N.B.: If interprocedural features are computed, keep the interprocedural test cases (do not use --no-interprocedural) and do not remove the main functions (do not use --no-main).
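
The rule in the note above can be captured in a small wrapper that builds the clean_dataset.py command line. This is a sketch under stated assumptions: build_clean_command and its interprocedural_features parameter are hypothetical names, and only the four flags shown above are used.

```python
import sys


def build_clean_command(dataset, interprocedural_features=False):
    """Build the argument list for clean_dataset.py (hypothetical wrapper)."""
    cmd = [
        sys.executable,
        "./scripts/clean_dataset.py",
        str(dataset),
        "--no-cpp",
        "--no-litterals",
    ]
    if not interprocedural_features:
        # These two flags are only added when no interprocedural
        # features will be computed later on.
        cmd += ["--no-interprocedural", "--no-main"]
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the repository root.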

3. Pipelines