Exploration

Sketch generation

For testing purposes, the script can be run standalone:

python sketch_generation.py <code_snippet>

for example:

$ python sketch_generation.py "xs = [f(x) for f, x in self.get('func', 10)]"
NAME = [ FUNC#1 ( NAME ) for FUNC#1 , NAME in self . FUNC#2 ( STRING , NUMBER ) ]

Using it in a pre-processing stage:

from sketch_generation import Sketch
...
code: str = ...  # e.g. "ps = filter(primes, xs)"
verbose: bool = False
sketch = Sketch(code, verbose).generate()
print(sketch)

Algorithm

A sketch is obtained by tokenizing the code snippet and abstracting the tokens: identifiers become NAME, numeric literals become NUMBER, string literals become STRING, and called functions become FUNC#k, where k is the number of arguments (the arity). Keywords, operators, and delimiters are kept as-is.

Examples

x = 1 if True else 0
NAME = NUMBER if True else NUMBER

result = SomeFunc(1, 2, 'y', arg)
NAME = FUNC#4 ( NUMBER , NUMBER , STRING , NAME )

result = [x for x in DoWork(xs) if x % 2 == 0]
NAME = [ NAME for NAME in FUNC#1 ( NAME ) if NAME % NUMBER == NUMBER ]
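Below is a minimal, illustrative re-implementation of this scheme using Python's ast and tokenize modules. It is only a sketch: the actual Sketch class handles additional cases (e.g. it keeps self literal, as in the self.get example above).

import ast
import io
import keyword
import tokenize

def generate_sketch(code: str) -> str:
    # Map each called function name to its arity (number of arguments),
    # covering both plain calls f(x) and method calls obj.f(x).
    arity = {}
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                arity[node.func.id] = len(node.args) + len(node.keywords)
            elif isinstance(node.func, ast.Attribute):
                arity[node.func.attr] = len(node.args) + len(node.keywords)

    out = []
    for tok in tokenize.generate_tokens(io.StringIO(code).readline):
        if tok.type == tokenize.NAME:
            if keyword.iskeyword(tok.string):
                out.append(tok.string)              # keep keywords: for, in, if, ...
            elif tok.string in arity:
                out.append(f"FUNC#{arity[tok.string]}")
            else:
                out.append("NAME")
        elif tok.type == tokenize.NUMBER:
            out.append("NUMBER")
        elif tok.type == tokenize.STRING:
            out.append("STRING")
        elif tok.type == tokenize.OP:
            out.append(tok.string)
    return " ".join(out)

print(generate_sketch("result = [x for x in DoWork(xs) if x % 2 == 0]"))
# NAME = [ NAME for NAME in FUNC#1 ( NAME ) if NAME % NUMBER == NUMBER ]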

Fine-tune word embeddings

The fine-tuning process makes use of the mittens framework. For the moment, only GloVe embeddings are considered. Mittens extends GloVe into a retrofitting model by adding a regularization term that penalizes the distance between an original ($r_i$) and a new ($\hat{w}_i$) word vector, with the goal of keeping the new embedding close to the original one:

$$J_{\text{Mittens}} = J_{\text{GloVe}} + \mu \sum_{i \in R} \lVert \hat{w}_i - r_i \rVert^2$$

where $R$ is the set of vocabulary words that have a pre-trained embedding $r_i$, $\hat{w}_i = w_i + \tilde{w}_i$ is the sum of the learned word and context vectors, and $\mu$ controls the strength of the penalty.

The fine-tuning algorithm first constructs a vocabulary of the top $N$ most frequent words in the given corpora, ignoring words with a frequency below min_freq, as well as words that are not already present in the pre-trained embedding matrix (if so specified). The vocabulary size is thus $V \le N$. Afterwards, a co-occurrence matrix of size $V \times V$ is constructed by scanning all considered words within a window of the specified size. Finally, after this pre-processing part, the mittens module is invoked, performing the specified number of fine-tuning steps with the given co-occurrence matrix, vocabulary, pre-trained matrix, and regularization value $\mu$. Once the fine-tuned vectors have been constructed, the final fine-tuned embedding space is computed as:

$$w = \alpha \cdot w_{pt} + \beta \cdot w_{ft}$$

where $w$ is the combined word vector, $w_{pt}$ is the pre-trained word vector, $w_{ft}$ is the fine-tuned word vector, and $\alpha, \beta$ are weighting constants with $\alpha + \beta = 1$. In most of our experiments, we have found that the best performance is achieved with an empirically tuned pair of $\alpha$ and $\beta$. Also, after some exploratory data analysis, we settled on the defaults listed under Usage below: the top 20000 most frequent words in the corpus, a window size of 5, a regularization factor $\mu = 0.1$, and 1000 fine-tuning iterations.
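For concreteness, here is a minimal sketch of the fine-tuning and combination steps, using the Mittens class from the mittens package. The toy inputs stand in for the real pre-processing outputs, and the alpha / beta values are illustrative only:

import numpy as np
from mittens import Mittens

# Toy stand-ins for the real pre-processing outputs:
vocab = ["def", "return", "list"]                       # V = 3 words
cooccurrence = np.array([[0, 2, 1],
                         [2, 0, 3],
                         [1, 3, 0]], dtype=np.float64)  # (V x V) counts
rng = np.random.default_rng(0)
pretrained = {w: rng.normal(size=200) for w in vocab}   # fake GloVe vectors

# Fine-tune: n = embedding dim, max_iter = steps, mittens = mu (regularization)
model = Mittens(n=200, max_iter=1000, mittens=0.1)
w_ft = model.fit(cooccurrence, vocab=vocab, initial_embedding_dict=pretrained)

# Final embedding space: w = alpha * w_pt + beta * w_ft, with alpha + beta = 1
alpha, beta = 0.5, 0.5                                  # illustrative weights
w_pt = np.vstack([pretrained[w] for w in vocab])        # (V x emb_dim)
w = alpha * w_pt + beta * w_ft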

Usage

Arguments:

-root_dir    # root directory used to store files
-data_source # path to dir with files OR file listing (for corpus)
-exp_name    # experiment name (%Y-%m-%d_%H-%M-%S timestamp will be automatically prepended)
-pt_emb_file # pre-trained embeddings file
-num_ft_iter # number of fine-tuning iterations (default = 1000)
-vocab_size  # number of unique words to consider (at most!) (default = 20000)
-window_size # consider words in window (w_i-N ... w_i ... w_i+N) (default = 5)
-min_freq    # consider words with frequency >= min_freq (default = 1)
-mu          # regularization factor (mu from mittens paper) (default = 0.1)
-only_in_emb # if true, only use words that already exist in the pre-trained embeddings
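
A hypothetical invocation (the script name finetune.py and the paths are assumptions; the flags are those listed above):

python finetune.py \
    -root_dir ./experiments \
    -data_source ./corpora/python-so \
    -exp_name python-so-200 \
    -pt_emb_file ./embeddings/glove.6B.200d.txt \
    -num_ft_iter 1000 \
    -vocab_size 20000 \
    -window_size 5 \
    -min_freq 1 \
    -mu 0.1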

For example, fine-tuning GloVe (glove.6B.200d.txt) on Python questions from StackOverflow results in an experiment folder named 2019-05-14_18-19-19-python-so-200, containing:

2019-05-14_18-19-19-python-so-200
├── config.txt      # dump of fine-tune settings
├── python-so.emb   # fine-tuned embeddings (mittens output): (V x emb_dim) float32 numpy array
├── python-so.mat   # co-occurrence matrix: (V x V) uint16 numpy array
└── python-so.vocab # vocabulary: Dict[str, int]
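
The artifacts can then be loaded for downstream use; the snippet below assumes numpy / pickle serialization, which is an assumption about the actual on-disk format:

import pickle
import numpy as np

# Assumes .emb / .mat are numpy-format arrays and .vocab is pickled (assumption).
emb = np.load("python-so.emb")      # (V x emb_dim) float32
mat = np.load("python-so.mat")      # (V x V) uint16
with open("python-so.vocab", "rb") as f:
    vocab = pickle.load(f)          # Dict[str, int]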

Data pre-processing

Pre-processing scripts will generate train / dev / test splits for the dataset. This stage also includes sketch generation and input sanitization.

The output format follows the Coarse-to-Fine model (Dong & Lapata, 2018): token = actual code tokens, type = sketch, src = natural-language intent.

Django

Arguments:

-in_dir     # input directory containing all.code and all.anno files
-out_dir    # output directory where {train, dev, test}.json files will be dumped
-dev_split  # % of all examples to use for validation (default = 0.05)
-test_split # % of all examples to use for testing (default = 0.1)
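
A hypothetical invocation (the script name preprocess_django.py is an assumption; the arguments are those listed above):

python preprocess_django.py \
    -in_dir ./data/django \
    -out_dir ./data/django/processed \
    -dev_split 0.05 \
    -test_split 0.1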

Sanitization includes replacing string literals in both the code and the intent with indexed _STR:<n>_ placeholders, as in the example below.

Final training example:

{
    "token": ["decimal_separator", "=", "get_format", "(", "\" _STR:0_ \"", ")"],
    "type" : ["NAME", "=", "FUNC#1", "(", "STRING", ")"],
    "src"  : ["call", "the", "function", "get_format", "with", "an", "argument", "string", "_STR:0_", "substitute", "the", "result", "for", "decimal_separator"]
}

CoNaLa

CoNaLa pre-processing is similar to Django’s, but without _STR:<n>_ markers.

Final training example:

{
    "token": ["[", "x", "for", "x", "in", "file", ".", "namelist", "(", ")", "if", "x", ".", "endswith", "(", "'/'", ")", "]"],
    "type" : ["[", "NAME", "for", "NAME", "in", "NAME", ".", "FUNC#0", "(", ")", "if", "NAME", ".", "FUNC#1", "(", "STRING", ")", "]"]
    "src"  : ["list", "folders", "in", "zip", "file", "'file'", "that", "ends", "with", "'/'"],
}

Mining docstrings


Code equivalence
