COMP 790.139: Natural Language Processing (Fall 2017): Coding Homework 2 (Sequence-to-Label Learning, Entailment Recognition)

Created by TA Yixin Nie (Instructor: Mohit Bansal)

Instructions

All instructions are in the Jupyter notebook (as shown in class; see also the preview below).
Install Jupyter Notebook in your Python environment and download the files from the folder below.
https://drive.google.com/drive/folders/0B6i0pVGwapCdVHV1cnVNdWhvM1k?usp=sharing

Use this directory as your workspace and write your code in the “hw2.ipynb” file. You may also add extra images or tables to the directory and link them into the “hw2.ipynb” file, but grading will be based only on the “hw2.ipynb” file.

Make sure to name your directory as "<your_name>_hw2" and compress it to "<your_name>_hw2.zip".
Email the file to comp790.hw@gmail.com for submission.

Homework 2 Preview

The main goals for homework 2 are:

Going through the setup procedures for training deep neural networks for NLP.
Getting familiar with recurrent neural networks.
Learning to handle variable-length inputs in a deep learning framework (PyTorch, TensorFlow).
Building a sequence-to-label model for the natural language inference task.
Learning to improve your model by interpreting your experimental results.

Notice:
You can test or run your code in any environment, but you may only present your code, results, and write-ups in this single notebook file. We will not re-run your code for grading.

**Fill in the #TODOs and try to stick to the provided APIs.**

import json
import numpy as np
from tqdm import tqdm

1. Loading and cleaning raw dataset (3 pt)

In this section, you are required to obtain the provided vocabulary and to load and clean the raw SNLI dataset.

The original SNLI dataset contains about 550k/10k/10k train/dev/test sentence pairs. We also provide a smaller training set that contains 0.05 of the original training set in 'snli_1.0/small_0.05_snli_train.jsonl'. You can pick either training set depending on your computational resources. Each line of the dataset file is a data point in JSON format. The 'gold_label' field contains the label ('entailment', 'neutral' or 'contradiction') for this data point. Some data points do not have a gold label (their value for the 'gold_label' field is '-'). We will only use the examples that have a gold label. For a detailed description of the task and the dataset file, please refer to https://nlp.stanford.edu/projects/snli/.

The vocabulary we will use for this homework is in the vocabu.txt file. It contains 19,007 tokens, and all words in the SNLI dataset that are missing from the vocabulary should be mapped to the last token in the vocabu.txt file, which is '<unk-0>'.

Finish the preprocessing code in the obtain_data() function.

def obtain_data(file_name='snli_1.0/small_0.05_snli_train.jsonl', vocab_file_name='vocabu.txt'):
    json_data = []
    
    with open(file_name) as data_file:
        for l in data_file:
            json_data.append(json.loads(l))
    
    data = []
    for item in json_data:
        pass
#       TODO: (2 pt)
#         Write your code here to 
#         1. Filter out data points without a gold label
#         2. Extract each sentence pair into a triple and concatenate them altogether into a list.
    
#       example = (s1, s2, label)
#       data.append(example)
    
    stoi = []
    with open(vocab_file_name) as voc_file:
        pass
#       TODO: (1 pt)
#         Write your code here to extract each token in the vocabulary file and concatenate them into a list.

#       stoi.append(token)
    
    return data, stoi
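
The filtering and extraction steps in the first TODO might look like the sketch below. It assumes the standard SNLI field names 'sentence1', 'sentence2' and 'gold_label', and it accepts any iterable of JSON lines so it can be tried without the dataset. The helper name filter_labeled_pairs is hypothetical and not part of the provided API.

```python
import json

def filter_labeled_pairs(lines):
    """Return (sentence1, sentence2, gold_label) triples, skipping unlabeled pairs.

    Assumption: each line is a JSON object with the SNLI fields
    'sentence1', 'sentence2' and 'gold_label'.
    """
    data = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        item = json.loads(line)
        if item['gold_label'] == '-':
            continue  # this data point has no gold label
        data.append((item['sentence1'], item['sentence2'], item['gold_label']))
    return data

# Tiny in-memory stand-in for a .jsonl file.
sample = [
    '{"gold_label": "neutral", "sentence1": "A man sleeps.", "sentence2": "A man dreams."}',
    '{"gold_label": "-", "sentence1": "A dog runs.", "sentence2": "A cat sits."}',
]
pairs = filter_labeled_pairs(sample)
print(pairs)
```

Inside obtain_data(), the same loop body could run over the already parsed json_data items instead of raw lines.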

Run the code below (according to your choice of training set) and show how many data points are loaded.

data, stoi = obtain_data('snli_1.0/small_0.05_snli_train.jsonl')
print('# of words in vocabulary')
print(len(stoi))

print('snli_1.0/small_0.05_snli_train.jsonl')
print(len(data))

print('snli_1.0/snli_1.0_train.jsonl')
data, stoi = obtain_data('snli_1.0/snli_1.0_train.jsonl')
print(len(data))

2. Build your vocabulary (6 pt)

In this section, we will build a vocabulary Python object and load pretrained GloVe embeddings for your vocabulary. Both will be important throughout this assignment.

The Vocabulary object will have three attributes, namely embeddings, itos and stoi.

stoi is a Python list of tokens; the position of a token in the list is its unique id, so stoi[i] is the word whose id is i.
itos is a Python dictionary which maps a given word to its id.
embeddings is a numpy array with shape: (size_of_vocabulary, embedding_dimension). Note: The ith row of your embeddings will be the vector for the word whose id is i.

Example:

embeddings[itos['happy']] # This will give you the corresponding vector for word 'happy'.
stoi[100] # This will give you the word whose id is 100.

In the cell below:

Directly use the stoi list you loaded in the last section to initialize your vocabulary.stoi attribute. (0.5 pt)
Build itos from stoi. (0.5 pt)
Load the GloVe embeddings in the function set_word_embedding(self, embedding_file) and initialize your embeddings with the pretrained GloVe vectors. For a detailed description of GloVe and its file format, refer to https://nlp.stanford.edu/projects/glove/. (4 pt)
Numericalize your data in the process_data(self, data) function, namely convert each sentence into a list of token ids and each label into its label id. (Use the dictionary label_id in the cell below for the mapping.) (1 pt)

Remember that the output of process_data(self, data) should be a list of triples [(token_id_list, token_id_list, int)].
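
As a rough illustration of the numericalization step, the sketch below maps each token to its id and falls back to the id of '<unk-0>' for unknown words. The function name numericalize, the word_to_id lookup and the whitespace tokenization are assumptions made for this example; the assignment does not prescribe a tokenizer.

```python
label_id = {'entailment': 0, 'neutral': 1, 'contradiction': 2}

def numericalize(data, tokens, unk_token='<unk-0>'):
    """Convert (s1, s2, label) string triples into (ids, ids, int) triples.

    tokens: vocabulary list in which the position of a token is its id.
    Assumption: sentences are tokenized by splitting on whitespace.
    """
    word_to_id = {w: i for i, w in enumerate(tokens)}
    unk_id = word_to_id[unk_token]
    out = []
    for s1, s2, y in data:
        ids1 = [word_to_id.get(w, unk_id) for w in s1.split()]
        ids2 = [word_to_id.get(w, unk_id) for w in s2.split()]
        out.append((ids1, ids2, label_id[y]))
    return out

# Toy vocabulary with the unknown token last, as in vocabu.txt.
tokens = ['a', 'man', 'sleeps', '<unk-0>']
result = numericalize([('a man sleeps', 'a man dreams', 'neutral')], tokens)
print(result)  # [([0, 1, 2], [0, 1, 3], 1)]
```

Building the lookup dictionary once, rather than searching the list for every token, keeps the conversion linear in the number of tokens.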

label_id = {'entailment': 0, 'neutral': 1, 'contradiction': 2}

class Vocabulary(object):
    def __init__(self, stoi):
        self.embeddings = None
        self.itos = dict()
        self.stoi = stoi
    
    def set_word_embedding(self, embedding_file):
        # TODO:
        #   Load the GloVe word embeddings into self.embeddings, with the ith row of
        #   self.embeddings corresponding to the word whose id is i in your vocabulary.
        # Important: self.embeddings should be a numpy ndarray with dtype=np.float32
        pass
    
    def dummy_embedding(self, embed_d=300):
        # This is only a dummy embedding using random initialization. Do not use this function in the assignment.
        for i, w in enumerate(self.stoi):
            self.itos[w] = i
        vocab_size = len(self.stoi)
        self.embeddings = np.asarray(np.random.randn(vocab_size, embed_d), dtype=np.float32)
        
    def process_data(self, data):
        numericalized_data = []
        for s1, s2, y in data:
            # TODO:
            #   Convert each sentence pair into a triple (token_id_list, token_id_list, int)
            #   and append it to the list.
            #   Out-of-vocabulary tokens should be mapped to '<unk-0>'.

            numericalized_data.append((n_s1, n_s2, n_y))

        return numericalized_data
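
One possible way to fill an embedding matrix from a GloVe text file is sketched below. It relies on the GloVe text format, where each line holds a word followed by its space-separated vector components. The helper name load_glove_rows, the seed parameter and the small random initialization for words missing from the file are assumptions for illustration only.

```python
import numpy as np

def load_glove_rows(lines, tokens, embed_d=300, seed=0):
    """Build a (len(tokens), embed_d) float32 matrix from GloVe-style lines.

    tokens: vocabulary list in which the position of a token is its id.
    Returns the matrix and the number of vocabulary words found in the file.
    """
    word_to_id = {w: i for i, w in enumerate(tokens)}
    rng = np.random.RandomState(seed)
    # Assumption: words absent from the GloVe file keep a small random vector.
    embeddings = (rng.randn(len(tokens), embed_d) * 0.1).astype(np.float32)
    found = 0
    for line in lines:
        # Split from the right so that the last embed_d fields are the vector.
        parts = line.rstrip().rsplit(' ', embed_d)
        if len(parts) != embed_d + 1:
            continue  # skip malformed lines
        idx = word_to_id.get(parts[0])
        if idx is not None:
            embeddings[idx] = [float(x) for x in parts[1:]]
            found += 1
    return embeddings, found

# Toy 3-dimensional "GloVe file" held in memory.
fake_glove = ['man 1.0 2.0 3.0', 'dog 4.0 5.0 6.0']
tokens = ['a', 'man', '<unk-0>']
emb, found = load_glove_rows(fake_glove, tokens, embed_d=3)
print(emb.shape, emb.dtype, found)  # (3, 3) float32 1
```

With a real file, the lines argument could simply be the open file handle, since iterating over a text file yields one line at a time.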

# You can use this code to check the correctness of your code.

vocab = Vocabulary(stoi=stoi)
vocab.set_word_embedding('<your GloVe file>')  # replace with the path to your GloVe embedding file
n_data = vocab.process_data(data)

print(vocab.embeddings.shape)
print(len(n_data))

print(vocab.itos['hello'])
print(vocab.stoi[6030])

Output:

(19007, 300)
27284
6030
hello

3. Padding (5 pt)

To train a neural network model on sequential natural language data, we need to pad each sequence to a fixed length.

The problem with sequential data is that different examples might have different lengths, so we pad all the data (or the data in a batch) to a fixed length (say 50) to allow parallel computation.

In the cell below, finish the convert_to_numpy(data, padding_length=50) function, which converts the data into padded numpy ndarrays. We recommend a padding value of zero.

You can pad at the corpus level (padding every example in the dataset to the same fixed length) or at the batch level (padding the examples in a batch to a common length, so different batches might have different lengths).

Some neural network frameworks (like TensorFlow) use a more advanced batching technique called bucketing. We will not implement bucketing in this assignment, but refer to https://www.tensorflow.org/tutorials/seq2seq for more details.

The padded data could be something like this:

for start, end in batch_index_gen(3, len(y)):
    print(s1[start:end])
    print(s2[start:end])
    print(y[start:end])
    break

Output:

[[   5   26    7   99   10  281    6    8  293   13    4   39  388    3    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]
 [  61  147    9    7  282  104 1797   19    4  242    3    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]
 [   5    9   11    4   16  586    8   45    6   46   13    4 1043   11 8421  242    3    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]]
[[  15   26    7   22  593  480    3    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]
 [   5    9    6    4  242    7  164   40   23  682   18 2635    3    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]
 [5734    4  130   13 5079   17   52   20  152  234    8   45    3    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0]]
[1 1 0]

You don't have to produce the same output. All you need to do is convert your data into a format that can be fed into a neural network.

def batch_index_gen(batch_size, size):
    batch_indexer = []
    start = 0
    while start < size:
        end = start + batch_size
        if end > size:
            end = size
        batch_indexer.append((start, end))
        start = end
    return batch_indexer

def convert_to_numpy(data, padding_length=50):
    # TODO:
    #   Write your code here for padding.
    
    total_size = len(data)
    
    for i, (s1, s2, y) in enumerate(data):
        # Pad each sequence to padding_length and store it in a numpy array here.
        pass
        
#     return <your data>
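
Corpus-level padding could be sketched roughly as follows. The helper name pad_sequences, the truncation of over-long sequences and the extra array of true lengths are illustrative choices rather than requirements of the assignment.

```python
import numpy as np

def pad_sequences(seqs, padding_length=50, pad_value=0):
    """Pad (or truncate) lists of token ids to a fixed length.

    Returns an int64 array of shape (len(seqs), padding_length) and an
    int64 array holding the true (possibly truncated) length of each row.
    """
    padded = np.full((len(seqs), padding_length), pad_value, dtype=np.int64)
    lengths = np.zeros(len(seqs), dtype=np.int64)
    for i, seq in enumerate(seqs):
        seq = seq[:padding_length]       # truncate sequences that are too long
        padded[i, :len(seq)] = seq       # copy the real tokens, leave the rest padded
        lengths[i] = len(seq)
    return padded, lengths

padded, lengths = pad_sequences([[5, 26, 7], [61, 147]], padding_length=5)
print(padded)
print(lengths)
```

Keeping the true lengths alongside the padded array may be useful later, because a padding value of zero is also a valid token id in the vocabulary and the model cannot tell the two apart from the ids alone.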

4. RNN Sequence-to-label Model (15 pt + 5 bonus)

It's time to build your RNN sequence-to-label model.

Remember that the input of the model is two sequences, and you will use the label to train your model.
An RNN component is mandatory for the model in this assignment.
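
To make the data flow concrete, here is a forward-pass-only sketch in plain numpy of one simple design: a shared tanh RNN encodes each sentence, padded time steps are masked out, and the two final states are concatenated and fed to a linear layer that produces three logits. All names, sizes and the choice of concatenation are assumptions for illustration; a real solution would normally use the RNN modules and automatic differentiation of a deep learning framework.

```python
import numpy as np

def rnn_encode(emb, lengths, W_xh, W_hh, b_h):
    """Run a tanh RNN over a padded batch and return the last valid state.

    emb: (batch, time, embed_d) embedded tokens.
    lengths: (batch,) true sequence lengths.
    """
    batch, time, _ = emb.shape
    h = np.zeros((batch, W_hh.shape[0]), dtype=emb.dtype)
    for t in range(time):
        h_new = np.tanh(emb[:, t] @ W_xh + h @ W_hh + b_h)
        mask = (t < lengths)[:, None]   # True where step t is a real token
        h = np.where(mask, h_new, h)    # padded steps leave the state unchanged
    return h

def classify_pair(u, v, W_out, b_out):
    """Concatenate the two sentence vectors and map them to 3 class logits."""
    features = np.concatenate([u, v], axis=1)
    return features @ W_out + b_out

rng = np.random.RandomState(0)
embed_d, hidden = 4, 5
W_xh = rng.randn(embed_d, hidden) * 0.1
W_hh = rng.randn(hidden, hidden) * 0.1
b_h = np.zeros(hidden)
W_out = rng.randn(2 * hidden, 3) * 0.1
b_out = np.zeros(3)

# Two random "embedded" sentence pairs, padded to 6 time steps.
s1 = rng.randn(2, 6, embed_d)
s2 = rng.randn(2, 6, embed_d)
u = rnn_encode(s1, np.array([6, 3]), W_xh, W_hh, b_h)
v = rnn_encode(s2, np.array([2, 5]), W_xh, W_hh, b_h)
logits = classify_pair(u, v, W_out, b_out)
print(logits.shape)  # (2, 3)
```

Because of the mask, the encoding of a sentence does not depend on how much padding follows it, which is the property a variable-length model needs.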

5. Training your model (3 pt)

You can now train your model batch by batch using whatever optimizer you want.
To keep track of your training, you should also print out the loss every 1000*X batches.

Write your code in the cell below. Print out the loss every 1000*X batches, along with your final average loss.
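
The logging requirement is independent of the framework used, so it can be sketched on its own. In the example below, batch_losses stands for whatever sequence of per-batch loss values your training step produces, and the name track_loss and the parameter log_every (playing the role of 1000*X) are hypothetical.

```python
def track_loss(batch_losses, log_every=1000):
    """Print the loss every log_every batches and return the average loss.

    batch_losses: any iterable of per-batch loss values, for example a
    generator that runs one optimization step per item.
    """
    total, count = 0.0, 0
    for loss in batch_losses:
        total += float(loss)
        count += 1
        if count % log_every == 0:
            print('batch %d: loss %.4f, running average %.4f'
                  % (count, float(loss), total / count))
    return total / count if count else float('nan')

# Toy loss values standing in for a real training run.
avg = track_loss([1.2, 1.0, 0.8, 0.6], log_every=2)
print('final average loss: %.4f' % avg)
```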

6. Evaluation (8 pt)

Complete the eval_model(model, mode='dev') function in the cell below to evaluate your model on the dev and test sets. The return value of this function should be the accuracy. Tune your model on the dev set, then evaluate the model with the best dev result on the test set and report the final test set result. Note: You should run your model on the test set only once.

If you are not satisfied with the evaluation result (which happens in most cases), you could make some changes to your model and try again.

If you are running most things correctly, your accuracy should easily reach at least 60% on the dev set if you are using 0.05 of the SNLI training data, and at least 80% on the dev set if you are using the full training set.

def eval_model(model, mode='dev'):
    file_name = 'snli_1.0/snli_1.0_dev.jsonl' if mode == 'dev' else 'snli_1.0/snli_1.0_test.jsonl'
        
    dev_data, _ = obtain_data(file_name)
    dev_n_data = vocab.process_data(dev_data)
    
    `<your dev data>` = convert_to_numpy(dev_n_data)
    
    model.eval()
    
    total = 0
    hit = 0
    
    # TODO:
    #   Write your code here to compute the result of your model on the selected (dev or test) set.
    
    return hit / float(total)
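
The hit and total counters above can be accumulated batch by batch. The sketch below assumes the model's predictions are available as a numpy array of logits with shape (batch, 3); the helper name count_hits and the toy values are made up for illustration.

```python
import numpy as np

def count_hits(logits, labels):
    """Count how many argmax predictions match the gold label ids."""
    predictions = np.argmax(logits, axis=1)
    return int(np.sum(predictions == labels))

# Toy logits for three examples and their gold label ids.
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 0.1, 1.5],
                   [0.9, 1.1, 0.2]])
labels = np.array([0, 2, 0])

hit, total = 0, 0
batch_size = 2
for start in range(0, len(labels), batch_size):
    end = min(start + batch_size, len(labels))
    hit += count_hits(logits[start:end], labels[start:end])
    total += end - start

print(hit / float(total))
```

In eval_model(), the toy logits would be replaced by the model's output for each batch of the padded dev or test data.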

7. Additional analysis (5 pt)

This course is designed to train you as an NLP researcher. A researcher should not only be able to implement newly emerged models and algorithms and get them to work, but also be able to give the reasons and intuitions behind every decision made during the research (e.g., parameter and structure design).
In this section, write down anything you think is important in this homework.
It could be:

The problems you encountered during the implementation and how you resolved them.
Changes you made because you were not satisfied with the results, whether or not they improved them. Why do you think those changes could be helpful?

Use your imagination and try to record every detail of your experiments. Bonus points will be given for novel and reasonable thoughts.