1. Bag of words

The bag of words pipeline was the initial pipeline designed in the AI bugfinder, it has been improved over time and is the use case for the design of this software.

1.1. Run Joern

Joern then needs to be executed with the script run_joern.py. Once the execution is done, the .joernIndex is moved to data/graph.db. A Neo4j DB then loads the data for further processing.

Run the tool with python ./scripts/run_joern.py ${DATASET} -v ${JOERN_VERSION}. Use --help to see which version are available.

1.2. AST Markup

The next step is to add labels to the nodes and build the AST notation for feature extraction. Run the following command to enhance the dataset with the additional markup:

python ./scripts/run_ast_markup.py ${DATASET} \
    -v ${AST_VERSION}  # AST markup version. See --help for details.

1.3. Extract features

Several feature extractors have been created for this classification task. The features need to be extracted with the following command:

# Create the feature maps
python ./scripts/run_feature_extraction.py ${DATASET} \
    -e ${FEATURE_EXTRACTOR} \  # Choose a feature extractor.
    -m  # To create the feature maps.

# Run the extractor
python ./scripts/run_feature_extraction.py ${DATASET} \
    -e ${FEATURE_EXTRACTOR} \  # Choose a feature extractor

1.4. Reduce feature dimension

To fasten training of the model, feature reduction can be applied with the following command:

# Create the feature maps
python ./scripts/run_feature_selection.py ${DATASET} \
    -s ${FEATURE_SELECTOR} \  # Choose a feature selector.
    ${FEATURES_SELECTOR_ARGS} # Parametrize the selector correctly

N.B.: Several feature reducer can be applied successively if necessary. Use –dry-run to preview the final training set dimension.

1.5. Run model training

The last step is to train the model. Execute the TensorFlow script by typing:

python ./scripts/run_model_training.py ${DATASET} \
    -m ${MODEL}  # Model to train. See help for details.