3. Node2Vec

Node2vec is an algorithm that generates embeddings based on a corpus built from several graphs. To generate the corpus, follow the instructions in chapter 2.1 up to the run_joern.py script, since the model is trained using the CSV files generated by Joern.
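To make the idea of a corpus generated from graphs more concrete, the following is a minimal sketch of the biased random walks that node2vec uses to turn a graph into "sentences" of node identifiers. It is not the code used by run_node2vec.py: the function names, the adjacency-dict graph format and the toy graph are assumptions made for illustration. Only the return parameter p and the in-out parameter q come from the node2vec algorithm itself.

```python
import random


def node2vec_walk(graph, start, walk_length, p=1.0, q=1.0, rng=None):
    """Return one biased second-order random walk beginning at `start`.

    graph: dict mapping each node to a list of its neighbours (hypothetical format).
    p: return parameter; q: in-out parameter (as defined by node2vec).
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < walk_length:
        current = walk[-1]
        neighbours = graph.get(current, [])
        if not neighbours:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbours))
            continue
        previous = walk[-2]
        previous_neighbours = set(graph.get(previous, []))
        weights = []
        for candidate in neighbours:
            if candidate == previous:
                weights.append(1.0 / p)  # step back to the previous node
            elif candidate in previous_neighbours:
                weights.append(1.0)      # stay close to the previous node
            else:
                weights.append(1.0 / q)  # move further away
        walk.append(rng.choices(neighbours, weights=weights, k=1)[0])
    return walk


def build_corpus(graphs, walks_per_node, walk_length, p=1.0, q=1.0, seed=0):
    """Collect walks from every node of every graph into one corpus."""
    rng = random.Random(seed)
    corpus = []
    for graph in graphs:
        for _ in range(walks_per_node):
            for node in graph:
                walk = node2vec_walk(graph, node, walk_length, p, q, rng)
                corpus.append([str(n) for n in walk])
    return corpus


# Toy undirected graph, purely illustrative.
toy_graph = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
corpus = build_corpus([toy_graph], walks_per_node=2, walk_length=5)
```

Each walk in the corpus plays the role of a sentence, and a skip-gram model trained on these sentences yields one vector per node.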

3.1. Run Joern

Joern then needs to be executed with the script run_joern.py. Once the execution is done, the .joernIndex is moved to data/graph.db. A Neo4j DB then loads the data for further processing.

Run the tool with python ./scripts/run_joern.py ${DATASET} -v ${JOERN_VERSION}. Use --help to see which versions are available.

3.2. Training the node2vec model

After running Joern and obtaining the AST and the control and data flows, the corpus can be generated using the run_node2vec.py script:

# --m:  use the node2vec algorithm
# --n:  path where the model will be saved
# --vl: size of the vector representation of each node in the corpus
python ./scripts/run_node2vec.py /path/to/dataset \
    --m node2vec \
    --n {MODEL_NAME} \
    --vl {VECTOR_LENGTH}

The model has several parameters which can be tuned for training. See --help for details. The most important parameter to choose is the vector length of the node representation: it must be the same when generating the embeddings and when training the BLSTM. The values used during testing were 64 and 128.

3.3. Generate the embeddings for the BLSTM model

Once model training is complete, generate the embeddings which will be used as input for the BLSTM model. These embeddings are saved in a folder with the dataset, in CSV format. Execute the following script:

# -m:  use the node2vec-generated embeddings
# -n:  previously trained word2vec/node2vec model
# -el: size of the embeddings to be generated
# -vl: size of the vector which represents the node
python ./scripts/run_embeddings.py /path/to/dataset \
    -m node2vec \
    -n {MODEL_DIR} \
    -el {EMBEDDINGS_LENGTH} \
    -vl {VECTOR_LENGTH}

It’s important that the vector length of the generated embeddings is the same as the one used in the model training.

3.4. Train the BLSTM model with the node2vec embeddings

After generating the embeddings, the BLSTM model is ready to be trained. Execute the following script:

# -m:  BLSTM
# -n:  path where the model will be saved
# -e:  number of epochs
# -b:  size of the batch used for training
# -el: size of the generated embeddings
# -vl: size of the vector which represents the node
python ./scripts/run_model_training.py /path/to/dataset \
    -m bidirectional_lstm \
    -n {MODEL_NAME} \
    -e {EPOCHS} \
    -b {BATCH_SIZE} \
    -el {EMBEDDINGS_LENGTH} \
    -vl {VECTOR_LENGTH}

The embeddings and vector length values need to be the same as those used in the embeddings creation process.
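Because the same vector and embeddings lengths have to be passed to three separate scripts, one way to avoid mismatches is to build all three command lines from a single set of values. The sketch below does only that: it assembles and prints the commands, without running them. The helper name build_commands, the model paths, and the values 50, 10 and 32 are hypothetical. It also assumes that the model path given to run_embeddings.py is the same path the node2vec model was saved under. The script names and flags are copied from the commands in sections 3.2 to 3.4.

```python
import shlex


def build_commands(dataset, node2vec_model, blstm_model,
                   vector_length, embeddings_length, epochs, batch_size):
    """Build the three pipeline commands from one shared set of sizes."""
    vl = str(vector_length)
    el = str(embeddings_length)
    train_node2vec = [
        "python", "./scripts/run_node2vec.py", dataset,
        "--m", "node2vec", "--n", node2vec_model, "--vl", vl,
    ]
    make_embeddings = [
        "python", "./scripts/run_embeddings.py", dataset,
        "-m", "node2vec", "-n", node2vec_model, "-el", el, "-vl", vl,
    ]
    train_blstm = [
        "python", "./scripts/run_model_training.py", dataset,
        "-m", "bidirectional_lstm", "-n", blstm_model,
        "-e", str(epochs), "-b", str(batch_size), "-el", el, "-vl", vl,
    ]
    return [train_node2vec, make_embeddings, train_blstm]


# Placeholder paths and hyperparameters; 64 is one of the vector lengths
# mentioned in section 3.2, the other values are illustrative only.
commands = build_commands("/path/to/dataset", "models/n2v", "models/blstm",
                          vector_length=64, embeddings_length=50,
                          epochs=10, batch_size=32)
for command in commands:
    print(shlex.join(command))
```

The printed lines can be pasted into a shell one after another, or each list can be handed to subprocess.run if the pipeline should be driven from Python.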