bugfinder.features.extraction.word2vec.embeddings

class bugfinder.features.extraction.word2vec.embeddings.Word2VecEmbeddings(dataset)

Bases: AbstractProcessing

execute(**kwargs): Run the processing. Retrieves each tokenized file as a dictionary, loads the model, generates the embeddings for each token in the file, and saves the embeddings in a CSV file for future processing.

get_token_list()

Reads each file, retrieves its tokens, and collects them into a single list that serves as the corpus. Unlike its counterpart in the Word2VecModel class, this function stores the tokens of each file in a dictionary keyed by the name of the processed file, so each entry can be identified later for testing or training.

Returns: list of dictionaries containing all the tokens in the dataset, processed from the files
Return type: list
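
The return structure described above can be sketched as follows. This is only an illustration under stated assumptions, not the class's actual implementation: the standalone function, its `directory` argument, and the whitespace-separated token file format are all hypothetical.

```python
import tempfile
from pathlib import Path


def get_token_list(directory):
    """Return one {file name: tokens} dictionary per file in `directory`."""
    token_list = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            # Assumption: each tokenized file holds whitespace-separated tokens.
            tokens = path.read_text(encoding="utf-8").split()
            # Key the tokens by file name so the instance can be identified later.
            token_list.append({path.name: tokens})
    return token_list


# Build two small tokenized files in a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.c.txt").write_text("int x = 0 ;", encoding="utf-8")
    Path(tmp, "b.c.txt").write_text("return x ;", encoding="utf-8")
    token_list = get_token_list(tmp)
```

Keeping one dictionary per file, rather than a flat list of tokens, preserves the link between each token sequence and the file it came from.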

save_dataframe(embeddings)

Saves the generated embeddings in CSV format.

Parameters: embeddings (pd.DataFrame) – Dataframe containing the generated embeddings

vectorize(model, tokens)

Processes the token list and generates a matrix containing the tokens’ embeddings. The matrix shape is embedding length × vector_length, as defined in the initialization of the class. If the instance has fewer tokens than the embedding length, the remaining rows of the matrix are filled with zeros. If it has more, the token list is truncated to the embedding length.

Parameters

model (gensim.Word2Vec) – trained skip-gram model
tokens (list) – list containing the tokens of the instance to embed

Returns

a numpy matrix containing the embeddings from the model.

Return type

vectors
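
The padding and truncation behaviour described for vectorize can be sketched as below. This is a minimal illustration, not the class's code: the standalone function, the explicit `embedding_length` and `vector_length` arguments, and the plain dictionary standing in for the trained model are assumptions, as is the choice to leave out-of-vocabulary tokens as zero rows.

```python
import numpy as np


def vectorize(lookup, tokens, embedding_length, vector_length):
    """Return an (embedding_length, vector_length) matrix of token vectors."""
    # Start from zeros so that unused rows act as padding.
    vectors = np.zeros((embedding_length, vector_length), dtype=np.float32)
    # Slicing drops any tokens beyond embedding_length (truncation).
    for row, token in enumerate(tokens[:embedding_length]):
        # Assumption: tokens missing from the vocabulary stay as zero rows.
        if token in lookup:
            vectors[row] = lookup[token]
    return vectors


# Toy stand-in for a trained model's token-to-vector lookup.
lookup = {
    "int": np.array([1.0, 2.0, 3.0]),
    "x": np.array([4.0, 5.0, 6.0]),
}

# Fewer tokens than rows: the last two rows remain zero.
padded = vectorize(lookup, ["int", "x"], embedding_length=4, vector_length=3)
# More tokens than rows: the third token is dropped.
truncated = vectorize(lookup, ["int", "x", "int"], embedding_length=2, vector_length=3)
```

With a trained gensim model, the `wv` attribute (a `KeyedVectors` object supporting `token in model.wv` and `model.wv[token]`) could plausibly take the place of `lookup`.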