3. Node2Vec
Node2vec is an algorithm that generates embeddings from a corpus built from several graphs. To generate the corpus, follow the instructions in chapter 2.1 up to the run_joern.py script, since the model is trained on the CSV files generated by Joern.
3.1. Run Joern
Joern then needs to be executed with the run_joern.py script. Once the execution is done, the .joernIndex is moved to data/graph.db. A Neo4j DB then loads the data for further processing. Run the tool with:
python ./scripts/run_joern.py ${DATASET} -v ${JOERN_VERSION}
Use --help to see which versions are available.
3.2. Training the node2vec model
After running Joern and obtaining the AST and the control and data flows, the corpus can be generated using the run_node2vec.py script:
python ./scripts/run_node2vec.py /path/to/dataset \
    --m node2vec \
    --n {MODEL_NAME} \
    --vl {VECTOR_LENGTH}
Here --m node2vec selects the node2vec algorithm, --n is the path where the model will be saved, and --vl is the size of the vector representation of each node in the corpus.
The model has several parameters which can be tuned for training. See --help for details. The most important parameter is the vector length of the node representation: it must be the same when generating the embeddings and when training the BLSTM. The values used during testing were 64 and 128.
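To illustrate what a corpus "generated from graphs" looks like, here is a minimal sketch of node2vec-style biased random walks. Each walk becomes one "sentence" of node labels, and a skip-gram model is then trained on those sentences. Whether run_node2vec.py works exactly this way is an assumption. The function names, the toy edge list, and the node labels are hypothetical and are not taken from the project.

```python
import random
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, target) pairs."""
    adjacency = defaultdict(set)
    for source, target in edges:
        adjacency[source].add(target)
        adjacency[target].add(source)
    return adjacency


def biased_walk(adjacency, start, length, p=1.0, q=1.0, rng=random):
    """Second-order random walk in the style of node2vec.

    p controls how likely the walk is to step back to the previous node,
    q controls how likely it is to move away from the previous node.
    """
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbours = sorted(adjacency[current])
        if not neighbours:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)  # return to the previous node
            elif candidate in adjacency[previous]:
                weights.append(1.0)      # stay at distance 1 from it
            else:
                weights.append(1.0 / q)  # explore outward
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def build_corpus(adjacency, walks_per_node=2, walk_length=5, p=1.0, q=1.0, seed=0):
    """Start several walks from every node; each walk is one sentence."""
    rng = random.Random(seed)
    corpus = []
    for _ in range(walks_per_node):
        nodes = sorted(adjacency)
        rng.shuffle(nodes)
        for node in nodes:
            corpus.append(biased_walk(adjacency, node, walk_length, p, q, rng))
    return corpus


# Hypothetical toy graph; real input would come from the Joern CSV files.
edges = [
    ("FunctionDef", "Parameter"),
    ("FunctionDef", "CompoundStatement"),
    ("CompoundStatement", "IfStatement"),
    ("IfStatement", "Condition"),
    ("CompoundStatement", "ReturnStatement"),
]
adjacency = build_adjacency(edges)
corpus = build_corpus(adjacency)
for sentence in corpus[:3]:
    print(" ".join(sentence))
```

The vector length discussed above is the dimensionality of the vector the skip-gram model learns for each node label appearing in these walks.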
3.3. Generate the embeddings for the BLSTM model
After the model training is complete, it’s necessary to generate embeddings which will be used as input for the BLSTM model. These embeddings are saved in a folder with the dataset, in .CSV format. Execute the following script:
python ./scripts/run_embeddings.py /path/to/dataset \
    -m node2vec \
    -n {MODEL_DIR} \
    -el {EMBEDDINGS_LENGTH} \
    -vl {VECTOR_LENGTH}
Here -m node2vec selects the node2vec-generated embeddings, -n is the previously trained word2vec/node2vec model, -el is the size of the embeddings to be generated, and -vl is the size of the vector which represents each node.
It’s important that the vector length of the generated embeddings is the same as the one used in the model training.
3.4. Train the BLSTM model with the node2vec embeddings
After generating the embeddings, the BLSTM model can be trained. Execute the following script:
python ./scripts/run_model_training.py /path/to/dataset \
    -m bidirectional_lstm \
    -n {MODEL_NAME} \
    -e {EPOCHS} \
    -b {BATCH_SIZE} \
    -el {EMBEDDINGS_LENGTH} \
    -vl {VECTOR_LENGTH}
Here -m bidirectional_lstm selects the BLSTM, -n is the path where the model will be saved, -e is the number of epochs, -b is the batch size used for training, -el is the size of the embeddings generated in the previous step, and -vl is the size of the vector which represents each node.
The embeddings length and vector length values need to be the same as the ones used in the embeddings creation process.
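Because the vector length and embeddings length must match across sections 3.2 to 3.4, one way to avoid mismatches is to derive all three command lines from a single settings object. The sketch below only builds the argument lists and does not run anything. The flags are copied as they appear in the commands above. The PipelineSettings class, the helper names, the model paths, and the values for the embeddings length, epochs, and batch size are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PipelineSettings:
    """Hypothetical container for the values shared by sections 3.2 to 3.4."""
    dataset: str
    node2vec_model: str
    blstm_model: str
    vector_length: int
    embeddings_length: int
    epochs: int
    batch_size: int


def node2vec_command(s: PipelineSettings) -> List[str]:
    # Flags as written in section 3.2.
    return ["python", "./scripts/run_node2vec.py", s.dataset,
            "--m", "node2vec",
            "--n", s.node2vec_model,
            "--vl", str(s.vector_length)]


def embeddings_command(s: PipelineSettings) -> List[str]:
    # Flags as written in section 3.3.
    return ["python", "./scripts/run_embeddings.py", s.dataset,
            "-m", "node2vec",
            "-n", s.node2vec_model,
            "-el", str(s.embeddings_length),
            "-vl", str(s.vector_length)]


def training_command(s: PipelineSettings) -> List[str]:
    # Flags as written in section 3.4.
    return ["python", "./scripts/run_model_training.py", s.dataset,
            "-m", "bidirectional_lstm",
            "-n", s.blstm_model,
            "-e", str(s.epochs),
            "-b", str(s.batch_size),
            "-el", str(s.embeddings_length),
            "-vl", str(s.vector_length)]


def flag_value(command: List[str], flag: str) -> str:
    """Return the argument that follows `flag` in a command list."""
    return command[command.index(flag) + 1]


settings = PipelineSettings(
    dataset="/path/to/dataset",
    node2vec_model="models/node2vec_64",  # hypothetical path
    blstm_model="models/blstm_64",        # hypothetical path
    vector_length=64,                     # one of the values used during testing
    embeddings_length=50,                 # placeholder value
    epochs=10,                            # placeholder value
    batch_size=32,                        # placeholder value
)
commands = [
    node2vec_command(settings),
    embeddings_command(settings),
    training_command(settings),
]
for command in commands:
    print(" ".join(command))
```

Each list can be passed to subprocess.run(command, check=True) from the repository root to execute the corresponding step.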