1. Bag of words
The bag-of-words pipeline was the first pipeline designed for the AI bugfinder. It has been improved over time and serves as the reference use case for the design of this software.
1.1. Run Joern
Joern then needs to be executed with the script run_joern.py. Once the execution is done, the .joernIndex is moved to data/graph.db. A Neo4j DB then loads the data for further processing.
Run the tool with:
python ./scripts/run_joern.py ${DATASET} -v ${JOERN_VERSION}
Use --help to see which versions are available.
1.2. AST Markup
The next step is to add labels to the nodes and build the AST notation for feature extraction. Run the following command to enhance the dataset with the additional markup:
python ./scripts/run_ast_markup.py ${DATASET} \
-v ${AST_VERSION} # AST markup version. See --help for details.
1.3. Extract features
Several feature extractors have been created for this classification task. Extract the features with the following two commands:
# Create the feature maps: choose a feature extractor with -e and pass -m.
python ./scripts/run_feature_extraction.py ${DATASET} \
-e ${FEATURE_EXTRACTOR} \
-m
# Run the extractor: same feature extractor, without -m.
python ./scripts/run_feature_extraction.py ${DATASET} \
-e ${FEATURE_EXTRACTOR}
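The two invocations above suggest a two-pass scheme: a first pass builds the feature maps, and a second pass uses them to extract the features. The following is a minimal Python sketch of that idea for a bag-of-words representation. It is not the project's actual extractor. The function names (build_feature_map, extract_features) and the toy token lists are hypothetical, and it assumes each sample has already been reduced to a list of tokens such as labelled AST nodes.

```python
from collections import Counter


def build_feature_map(samples):
    """First pass: assign a stable column index to every distinct token."""
    vocabulary = sorted({token for tokens in samples for token in tokens})
    return {token: index for index, token in enumerate(vocabulary)}


def extract_features(tokens, feature_map):
    """Second pass: count how often each mapped token occurs in one sample."""
    counts = Counter(tokens)
    vector = [0] * len(feature_map)
    for token, count in counts.items():
        index = feature_map.get(token)
        if index is not None:  # tokens unseen in the first pass are ignored
            vector[index] = count
    return vector


# Hypothetical samples: token lists standing in for labelled AST nodes.
samples = [
    ["CallExpression", "Identifier", "Identifier"],
    ["IfStatement", "CallExpression"],
]
feature_map = build_feature_map(samples)
features = [extract_features(tokens, feature_map) for tokens in samples]
```

Building the map in a separate pass keeps the column order identical for every sample, which is presumably why the feature maps have to exist before the extractor runs.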
1.4. Reduce feature dimension
To speed up model training, feature reduction can be applied with the following command:
# Reduce the feature dimension: choose a feature selector with -s,
# followed by the arguments that parametrize it.
python ./scripts/run_feature_selection.py ${DATASET} \
-s ${FEATURE_SELECTOR} \
${FEATURES_SELECTOR_ARGS}
N.B.: Several feature reducers can be applied successively if necessary. Use --dry-run to preview the final training set dimension.
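Applying reducers successively amounts to feeding the output of one selector into the next, with the number of columns shrinking at each step. The following is a minimal sketch in plain Python, assuming a small dense feature matrix. The two selectors (drop_constant_columns, keep_top_k_by_variance) are hypothetical examples and are not necessarily the selectors shipped with the tool.

```python
from statistics import pvariance


def drop_constant_columns(rows):
    """Remove columns whose value is identical in every row."""
    keep = [j for j in range(len(rows[0])) if len({row[j] for row in rows}) > 1]
    return [[row[j] for j in keep] for row in rows]


def keep_top_k_by_variance(rows, k):
    """Keep the k columns with the highest variance, preserving column order."""
    variances = [pvariance([row[j] for row in rows]) for j in range(len(rows[0]))]
    ranked = sorted(range(len(variances)), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])
    return [[row[j] for j in keep] for row in rows]


# Hypothetical feature matrix: 3 samples x 4 features.
features = [
    [1, 0, 5, 2],
    [1, 1, 9, 2],
    [1, 0, 1, 6],
]
step1 = drop_constant_columns(features)     # first reducer
step2 = keep_top_k_by_variance(step1, k=2)  # second reducer, applied to the result
print(len(features[0]), "->", len(step1[0]), "->", len(step2[0]))
```

Printing the dimension after each step mirrors what a dry run is meant to report: the size of the final training set, before any training time is spent.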
1.5. Run model training
The last step is to train the model. Execute the TensorFlow script by typing:
python ./scripts/run_model_training.py ${DATASET} \
-m ${MODEL} # Model to train. See --help for details.
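The five steps above run in a fixed order on the same dataset, so they can be chained from a small driver. The sketch below only assembles the command lines documented in this section and prints them. The function name build_pipeline_commands, its parameters, and the example values are hypothetical placeholders, and the real scripts may accept further options (see --help).

```python
import shlex


def build_pipeline_commands(dataset, joern_version, ast_version,
                            extractor, selector, selector_args, model):
    """Return the bag-of-words pipeline as a list of argument lists, in order."""
    python = "python"
    return [
        [python, "./scripts/run_joern.py", dataset, "-v", joern_version],
        [python, "./scripts/run_ast_markup.py", dataset, "-v", ast_version],
        # Feature extraction runs twice: first with -m to create the feature maps.
        [python, "./scripts/run_feature_extraction.py", dataset, "-e", extractor, "-m"],
        [python, "./scripts/run_feature_extraction.py", dataset, "-e", extractor],
        # Selector arguments are passed through unchanged.
        [python, "./scripts/run_feature_selection.py", dataset, "-s", selector,
         *selector_args],
        [python, "./scripts/run_model_training.py", dataset, "-m", model],
    ]


# All values below are placeholders, not real versions or component names.
commands = build_pipeline_commands(
    dataset="./data/my_dataset",
    joern_version="JOERN_VERSION",
    ast_version="AST_VERSION",
    extractor="FEATURE_EXTRACTOR",
    selector="FEATURE_SELECTOR",
    selector_args=[],
    model="MODEL",
)
for command in commands:
    print(shlex.join(command))
    # To execute instead of printing: subprocess.run(command, check=True)
```

Keeping each command as an argument list avoids shell quoting issues if the dataset path contains spaces, and stopping at the first failing step (check=True) prevents later steps from running on incomplete data.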