bugfinder.processing.tokenizers.tokenize_code

class bugfinder.processing.tokenizers.tokenize_code.TokenizeCode(dataset, deprecation_warning=None)

Bases: AbstractTokenizer

Processing step that transforms source code into tokens while keeping certain operators unified as single tokens.

execute()

Run the processing.

process_file(filepath)

Process a single file, transforming its content into tokens to build the corpus. Additional processing looks for certain operators, such as <= or =>, that need to be kept as single tokens.

Parameters

filepath (str) – Path of the file to be processed
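
The idea of keeping multi-character operators together can be sketched as follows. This is a minimal illustration, not the TokenizeCode implementation: the function name tokenize, the operator list MULTI_CHAR_OPS, and the fallback rules for words and punctuation are all assumptions made for the example.

```python
import re

# Hypothetical operator list; the list used by TokenizeCode is not shown here.
MULTI_CHAR_OPS = ["<=", "=>", "==", "!="]


def tokenize(text, ops=MULTI_CHAR_OPS):
    """Split text into tokens, keeping each operator in ops as one token."""
    # Put longer operators first so the alternation prefers them.
    ordered = sorted(ops, key=len, reverse=True)
    alternation = "|".join(re.escape(op) for op in ordered)
    # Try, in order: a listed operator, a word, any other single
    # non-whitespace character.
    pattern = re.compile(rf"{alternation}|\w+|[^\w\s]")
    return pattern.findall(text)


tokens = tokenize("if (a <= b) x = 1;")
# "<=" stays one token, while the lone "=" is emitted on its own.
```

Without the operator alternation at the front of the pattern, "<=" would be split into "<" and "=", which is the case the description above is guarding against.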

to_regex(ops)

Joins a list of strings into a single string, using a separator, so that the result can be used in a regular expression.

Parameters

ops (list) – List of strings

Returns

joined string

Return type

str
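
A rough sketch of such a join is shown below. The separator used by to_regex is not stated here, so this example assumes the regex alternation character "|" and escapes each operator with re.escape; the helper name join_ops is hypothetical.

```python
import re


def join_ops(ops, sep="|"):
    """Join operator strings into one regex fragment (assumed separator: '|')."""
    # Escape each operator so characters like '+' or '*' are matched literally.
    return sep.join(re.escape(op) for op in ops)


fragment = join_ops(["<=", "=>", "++"])
matches = re.findall(fragment, "a <= b => c; i++")
```

Escaping matters for operators such as "++" or "*=", whose characters would otherwise be interpreted as regex quantifiers.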