2. Word2Vec
2.1. Additional dataset cleaning
If you want to train a word2vec model on this dataset, there is no need to run
Joern. After you have finished preparing the dataset with the clean_dataset.py
script, an additional cleanup pass is needed to deal with:
Removal of code comments
Replacement of variable names with generic tokens
Replacement of function names with generic tokens
Comment removal is handled by running the clean_dataset.py script again with an
extra flag (the name replacements are applied later, during tokenization, as
shown in Section 2.2):
python ./scripts/clean_dataset.py ${DATASET} \
--no-comments # Remove comments
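The internals of clean_dataset.py are not shown here, but the comment-removal step it performs could be sketched as follows. This is a minimal, hypothetical illustration using regular expressions; a naive regex like this does not handle comment markers inside string literals, which a production cleaner would need to account for.

```python
import re

def strip_c_comments(code: str) -> str:
    """Remove // line comments and /* ... */ block comments from C/C++ code."""
    # Remove block comments first (non-greedy, may span multiple lines).
    code = re.sub(r"/\*.*?\*/", "", code, flags=re.DOTALL)
    # Then remove line comments up to the end of each line.
    code = re.sub(r"//[^\n]*", "", code)
    return code

sample = "int x = 0; // counter\n/* init\n   block */int y = 1;\n"
cleaned = strip_c_comments(sample)
print(cleaned)
```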
2.2. Tokenizing the dataset
After finishing the cleanup, the code must be split into tokens to be used as
input for the word2vec model. This is done with additional parameters to the
run_tokenizer.py script, so after the previous command finishes, run:
# --replace-funcs: replace function names with a FUN token
# --replace-vars:  replace variable names with a VAR token
# --tokenize:      split the code into tokens
python ./scripts/run_tokenizer.py ${DATASET} \
--replace-funcs \
--replace-vars \
--tokenize
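The tokenization and replacement logic of run_tokenizer.py is not documented here, but the idea can be sketched as below. The regexes, the keyword list, and the heuristic of treating an identifier followed by "(" as a function name are all assumptions for illustration; the real script may use a proper lexer.

```python
import re

# Small illustrative subset of C keywords that should NOT be replaced.
C_KEYWORDS = {"int", "char", "void", "if", "else", "for", "while", "return"}

def tokenize(code, replace_funcs=True, replace_vars=True):
    """Split C code into tokens, mapping user identifiers to FUN / VAR."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    out = []
    for i, tok in enumerate(tokens):
        is_ident = re.fullmatch(r"[A-Za-z_]\w*", tok) and tok not in C_KEYWORDS
        if is_ident and replace_funcs and i + 1 < len(tokens) and tokens[i + 1] == "(":
            out.append("FUN")  # identifier followed by '(' treated as a function name
        elif is_ident and replace_vars:
            out.append("VAR")
        else:
            out.append(tok)
    return out

toks = tokenize("int add(int a, int b) { return a + b; }")
print(toks)
# → ['int', 'FUN', '(', 'int', 'VAR', ',', 'int', 'VAR', ')', '{', 'return', 'VAR', '+', 'VAR', ';', '}']
```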
2.3. Training the word2vec model
After the tokenization process, you can train the word2vec model using the
run_model_training.py script with word2vec as the model parameter. Run the
command:
# -m: model type (word2vec)
# -n: path where the model will be saved
python ./scripts/run_model_training.py ${DATASET} \
-m word2vec \
-n {MODEL_NAME}
2.4. Generate the embeddings for the BLSTM model
After the model training is complete, it is necessary to generate the embeddings that will be used as input for the BLSTM model. These embeddings are saved alongside the dataset in CSV format. Execute the following script:
# -m: type of the model (word2vec)
# -n: previously trained word2vec model
python ./scripts/run_embeddings.py ${DATASET} \
-m word2vec \
-n {MODEL_DIR}
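The exact CSV layout produced by run_embeddings.py is not documented here. A plausible sketch, assuming one row per token with its vector components (the column names and the token-to-vector mapping below are hypothetical; in practice the vectors would come from the trained word2vec model's lookup):

```python
import csv
import io

# Hypothetical token -> embedding mapping; the real values come from the
# trained word2vec model (e.g. model.wv[token]).
embeddings = {
    "FUN": [0.12, -0.40, 0.33],
    "VAR": [0.05, 0.21, -0.17],
}

buf = io.StringIO()
writer = csv.writer(buf)
# Header: token name followed by one column per embedding dimension.
writer.writerow(["token"] + [f"dim_{i}" for i in range(3)])
for token, vec in embeddings.items():
    writer.writerow([token] + vec)

csv_text = buf.getvalue()
print(csv_text)
```

In the real pipeline this would be written to a file in the dataset folder rather than an in-memory buffer.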
2.5. Train the BLSTM model with the word2vec embeddings
After generating the embeddings, the BLSTM model is ready to be trained. Execute the following script:
# -m: model type (bidirectional LSTM)
# -n: path where the model will be saved
# -e: number of training epochs
# -b: batch size used for training
python ./scripts/run_model_training.py ${DATASET} \
-m bidirectional_lstm \
-n {MODEL_NAME} \
-e {EPOCHS} \
-b {BATCH_SIZE}
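The BLSTM architecture used by run_model_training.py is not shown in this document. A minimal sketch of a bidirectional LSTM classifier in Keras, assuming the model consumes fixed-length sequences of word2vec vectors and predicts a binary label (the shapes, layer sizes, and random training data below are all illustrative assumptions):

```python
import numpy as np
from tensorflow import keras

# Illustrative shapes: sequences of 50 tokens, each a 100-dim word2vec vector.
seq_len, embed_dim = 50, 100
x = np.random.rand(32, seq_len, embed_dim).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))  # hypothetical binary target

model = keras.Sequential([
    keras.layers.Input(shape=(seq_len, embed_dim)),
    # Bidirectional wrapper runs the LSTM forwards and backwards over the
    # sequence and concatenates both directions' final states.
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# One quick epoch on the toy data, mirroring the -e / -b flags above.
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
print(model.output_shape)
```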