Created by TA Yixin Nie (Instructor: Mohit Bansal)
All the instructions are present in the jupyter notebook (as shown in the class; and see the preview below).
Install jupyter notebook in your python environment and download the file below.
https://drive.google.com/drive/folders/0B6i0pVGwapCdVHV1cnVNdWhvM1k?usp=sharing
Use this directory as your workspace and write your code in the “hw2.ipynb” file. You may also add extra images or tables to the directory and link them into the “hw2.ipynb” file, but grading will be based only on the “hw2.ipynb” file.
Make sure to name your directory "<your_name>_hw2" and compress it to "<your_name>_hw2.zip".
Email the file to comp790.hw@gmail.com for submission.
The main goals for homework 2 are:
Notice:
You can test or run your code in any environment, but you may only show your code, results, and write-ups in this single notebook file. We will not re-run your code for grading.
**Fill in the #TODOs and try to stick to the provided APIs.**
import json
import numpy as np
from tqdm import tqdm
In this section, you are required to obtain the provided vocabulary and to load and clean the raw SNLI dataset.
The original SNLI dataset contains about 550k/10k/10k train/dev/test sentence pairs. We also provide a smaller training set that contains 0.05 of the original training set in 'snli_1.0/small_0.05_snli_train.jsonl'. You can pick either training set depending on your computational resources. Each line of the dataset file is a data point in JSON format. The 'gold_label' field contains the label ('entailment', 'neutral' or 'contradiction') for this data point. Some data points don't have a gold label (their value for the 'gold_label' field is '-'). We will only use the examples that have a gold label. For a detailed description of the task and the dataset file, please refer to https://nlp.stanford.edu/projects/snli/.
The vocabulary we will use for this homework is in the vocabu.txt file. It contains 19,007 tokens, and all words in the SNLI dataset that are missing from the vocabulary should be mapped to the last token in the vocabu.txt file, which is '<unk-0>'.
Finish the preprocessing code in the obtain_data() function.
def obtain_data(file_name='snli_1.0/small_0.05_snli_train.jsonl', vocab_file_name='vocabu.txt'):
    json_data = []
    with open(file_name) as data_file:
        for l in data_file:
            json_data.append(json.loads(l))

    data = []
    for item in json_data:
        pass
        # TODO: (2 pt)
        # Write your code here to
        # 1. Filter out data points without a gold label
        # 2. Extract each sentence pair into a triple and concatenate them altogether into a list.
        # example = (s1, s2, label)
        # data.append(example)

    stoi = []
    with open(vocab_file_name) as voc_file:
        pass
        # TODO: (1 pt)
        # Write your code here to extract each token in the vocabulary file and concatenate them into a list.
        # stoi.append(token)

    return data, stoi
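As a rough illustration of the two steps described above (filtering unlabeled pairs and reading the vocabulary), here is a minimal sketch rather than a reference solution. It assumes the SNLI field names 'sentence1' and 'sentence2', lowercased whitespace tokenization, and a vocabulary file with one token per line; the tokenization and the vocabulary file layout are assumptions not confirmed by this handout. The helper names parse_snli_lines and read_vocab are hypothetical, and in-memory strings stand in for the real files.

```python
import io
import json


def parse_snli_lines(lines):
    """Return (s1_tokens, s2_tokens, label) triples, skipping unlabeled pairs."""
    examples = []
    for line in lines:
        item = json.loads(line)
        if item['gold_label'] == '-':  # no gold label: skip this pair
            continue
        # Assumption: lowercase + whitespace tokenization.
        s1 = item['sentence1'].lower().split()
        s2 = item['sentence2'].lower().split()
        examples.append((s1, s2, item['gold_label']))
    return examples


def read_vocab(lines):
    """Assumption: the vocabulary file holds one token per line."""
    return [line.rstrip('\n') for line in lines if line.strip()]


# Tiny in-memory stand-ins for the real dataset and vocabulary files.
fake_jsonl = io.StringIO(
    '{"gold_label": "neutral", "sentence1": "A man sleeps.", "sentence2": "He is tired."}\n'
    '{"gold_label": "-", "sentence1": "A dog runs.", "sentence2": "A cat sits."}\n'
)
fake_vocab = io.StringIO('a\nman\n<unk-0>\n')

examples = parse_snli_lines(fake_jsonl)
tokens = read_vocab(fake_vocab)
print(len(examples), tokens[-1])
```

Only the first fake line survives the filter, because the second one carries the '-' label.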
Run the code below (according to your choice of training set) and show how many data points are loaded.
data, stoi = obtain_data('snli_1.0/small_0.05_snli_train.jsonl')
print('# of words in vocabulary')
print(len(stoi))
print('snli_1.0/small_0.05_snli_train.jsonl')
print(len(data))
print('snli_1.0/snli_1.0_train.jsonl')
data, stoi = obtain_data('snli_1.0/snli_1.0_train.jsonl')
print(len(data))
In this section, we will build a Python vocabulary object and load pretrained GloVe embeddings for your vocabulary. Both will be important throughout this assignment.
The Vocabulary object will have three attributes, namely embeddings, itos and stoi.
In the provided starter code, stoi is a Python list of tokens, so stoi[i] is the word whose id is i; itos is a Python dictionary that maps a word to its unique id. embeddings is a numpy array with shape (size_of_vocabulary, embedding_dimension). Note: the i-th row of your embeddings will be the vector for the word whose id is i.
Example:
embeddings[itos['happy']] # This will give you the corresponding vector for the word 'happy'.
stoi[100] # This will give you the word whose id is 100.
In the cell below:
- Use the stoi (you loaded in the last section) to initialize your vocabulary.stoi attribute. (0.5 pt)
- Build itos according to the stoi. (0.5 pt)
- Finish set_word_embedding(self, embedding_file) and initialize your embeddings using pretrained GloVe embeddings. For a detailed description and the format of GloVe embeddings, refer to https://nlp.stanford.edu/projects/glove/. (4 pt)
- Finish the process_data(self, data) function, namely convert the sentences into lists of ids and the labels into label ids. (Use the dictionary label_id in the cell below for mapping.) (1 pt)
Remember that the output of process_data(self, data) should be a list of triples [(token_id_list, token_id_list, int)].
label_id = {'entailment': 0, 'neutral': 1, 'contradiction': 2}
class Vocabulary(object):
    def __init__(self, stoi):
        self.embeddings = None
        self.itos = dict()
        self.stoi = stoi

    def set_word_embedding(self, embedding_file):
        # TODO:
        # Load GloVe word embeddings into self.embeddings, with the ith row of
        # self.embeddings corresponding to the word whose id is i in your vocabulary.
        # Important: self.embeddings should be a numpy ndarray with dtype=np.float32
        pass

    def dummy_embedding(self, embed_d=300):
        # This is only a dummy embedding using random initialization. Do not use this function in the assignment.
        for i, w in enumerate(self.stoi):
            self.itos[w] = i
        vocab_size = len(self.stoi)
        self.embeddings = np.asarray(np.random.randn(vocab_size, embed_d), dtype=np.float32)

    def process_data(self, data):
        numerialized_data = []
        for s1, s2, y in data:
            # TODO:
            # Convert each sentence pair into a triple (token_id_list, token_id_list, int) and concatenate them altogether into a list.
            # Out-of-vocabulary tokens should be mapped to '<unk-0>'.
            numerialized_data.append((n_s1, n_s2, n_y))
        return numerialized_data
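To make the embedding-loading step more concrete, the sketch below builds an embedding matrix from GloVe's plain-text format, where each line holds a word followed by its space-separated vector components. It is an illustration under assumptions, not the required implementation: the helper name load_glove_rows is hypothetical, words missing from the GloVe file are given small random vectors (one possible choice among several), and a three-dimensional in-memory file stands in for the real embeddings.

```python
import io

import numpy as np


def load_glove_rows(lines, vocab_tokens, dim):
    """Build a (len(vocab_tokens), dim) float32 matrix; row i belongs to vocab_tokens[i]."""
    word_to_id = {w: i for i, w in enumerate(vocab_tokens)}
    # Assumption: words absent from the GloVe file keep a small random vector.
    rng = np.random.RandomState(0)
    emb = (rng.randn(len(vocab_tokens), dim) * 0.1).astype(np.float32)
    found = 0
    for line in lines:
        parts = line.rstrip().split(' ')
        word, values = parts[0], parts[1:]
        if word in word_to_id and len(values) == dim:
            emb[word_to_id[word]] = [float(v) for v in values]
            found += 1
    return emb, found


# Tiny stand-in for a GloVe text file (real files use 50 to 300 dimensions).
fake_glove = io.StringIO(
    'the 0.1 0.2 0.3\n'
    'cat 0.4 0.5 0.6\n'
    'dog 1.0 1.0 1.0\n'
)
vocab_tokens = ['cat', 'the', '<unk-0>']
emb, found = load_glove_rows(fake_glove, vocab_tokens, dim=3)
print(emb.shape, emb.dtype, found)
```

Reading the file line by line and keeping only in-vocabulary words avoids holding the full GloVe table in memory.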
# You can use this code to check the correctness of your code.
vocab = Vocabulary(stoi=stoi)
vocab.set_word_embedding(embedding_file)  # embedding_file: path to your GloVe embedding file
n_data = vocab.process_data(data)
print(vocab.embeddings.shape)
print(len(n_data))
print(vocab.itos['hello'])
print(vocab.stoi[6030])
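The conversion of one example into ids can be sketched as follows. This is a minimal illustration, assuming the sentences are already tokenized; the names numericalize and word_to_id are hypothetical and the four-token vocabulary is made up for the example. The one behavior taken from the assignment is that unknown words fall back to the id of '<unk-0>'.

```python
label_id = {'entailment': 0, 'neutral': 1, 'contradiction': 2}


def numericalize(tokens, word_to_id, unk_token='<unk-0>'):
    """Map tokens to ids, sending out-of-vocabulary tokens to the unk id."""
    unk_id = word_to_id[unk_token]
    return [word_to_id.get(t, unk_id) for t in tokens]


# Made-up vocabulary; '<unk-0>' is the last token, as in the assignment.
vocab_tokens = ['a', 'man', 'sleeps', '<unk-0>']
word_to_id = {w: i for i, w in enumerate(vocab_tokens)}

example = (['a', 'man', 'snores'], ['a', 'man', 'sleeps'], 'neutral')
triple = (
    numericalize(example[0], word_to_id),
    numericalize(example[1], word_to_id),
    label_id[example[2]],
)
print(triple)
```

Using dict.get with a default keeps the lookup to a single pass and avoids a separate membership test per token.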
For a neural network model to be trained on sequential natural language data, we need to pad each sequence to a fixed length.
The problem with sequential data for neural networks is that different examples might have different lengths. So we need to pad all the data (or the data in a batch) to a fixed length (say 50) for parallel computation.
In the cell below, finish the convert_to_numpy(data, padding_length=50) function, which converts the data into padded numpy ndarrays. We recommend a padding value of zero.
You can pad at the corpus level (pad every example in the dataset to one fixed length, so that all examples have the same length) or at the batch level (pad the examples in a batch to a fixed length, so that different batches might have different lengths).
Some neural network frameworks (like TensorFlow) use a more advanced batching technique called bucketing. We will not implement bucketing in this assignment, but refer to https://www.tensorflow.org/tutorials/seq2seq for more details.
The padded data could be something like this:
for start, end in batch_index_gen(3, len(y)):
    print(s1[start:end])
    print(s2[start:end])
    print(y[start:end])
    break
Output:
[[ 5 26 7 99 10 281 6 8 293 13 4 39 388 3 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
[ 61 147 9 7 282 104 1797 19 4 242 3 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
[ 5 9 11 4 16 586 8 45 6 46 13 4 1043 11 8421 242 3 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]]
[[ 15 26 7 22 593 480 3 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
[ 5 9 6 4 242 7 164 40 23 682 18 2635 3 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]
[5734 4 130 13 5079 17 52 20 152 234 8 45 3 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0]]
[1 1 0]
Your output does not have to match this. All you need to do is convert your data into a format that can be fed into a neural network.
def batch_index_gen(batch_size, size):
    batch_indexer = []
    start = 0
    while start < size:
        end = start + batch_size
        if end > size:
            end = size
        batch_indexer.append((start, end))
        start = end
    return batch_indexer


def convert_to_numpy(data, padding_length=50):
    # TODO:
    # Write your code here for padding.
    total_size = len(data)
    for i, (s1, s2, y) in enumerate(data):
        pass
        # do something to padding and converting.
    # return <your data>
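Corpus-level padding can be sketched with a preallocated array. The example below is one possible approach, not the expected solution: the helper name pad_sequences is hypothetical, sequences longer than padding_length are truncated (an assumption, since the handout does not say how to treat them), and a short padding length is used so the result is easy to read.

```python
import numpy as np


def pad_sequences(seqs, padding_length=50, pad_value=0):
    """Return an int64 array of shape (len(seqs), padding_length)."""
    out = np.full((len(seqs), padding_length), pad_value, dtype=np.int64)
    for i, seq in enumerate(seqs):
        trimmed = seq[:padding_length]  # assumption: truncate overly long sequences
        out[i, :len(trimmed)] = trimmed
    return out


# Two made-up numericalized examples: (s1_ids, s2_ids, label_id).
n_data = [
    ([5, 26, 7], [15, 26], 1),
    ([61, 147, 9, 7, 282, 104], [5, 9, 6], 0),
]
s1 = pad_sequences([ex[0] for ex in n_data], padding_length=5)
s2 = pad_sequences([ex[1] for ex in n_data], padding_length=5)
y = np.asarray([ex[2] for ex in n_data], dtype=np.int64)
print(s1)
print(s2)
print(y)
```

For batch-level padding, the same helper could be called per batch with padding_length set to the longest sequence in that batch.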
It's time to build your RNN sequence-to-label model.
Remember that the input of the model is two sequences and you will use the label to train your model.
An RNN component is mandatory for the model in this assignment.
Recommended reading:
Bonus will be given to those who correctly handle variable length inputs.
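One common starting point for handling variable-length inputs is to recover the true length of each padded sequence and a mask of real tokens. The sketch below does this in numpy under a loudly stated assumption: the padding value 0 never occurs as a real token id inside a sentence. If id 0 is a real word in your vocabulary, store the lengths before padding instead. The lengths are the kind of information that utilities such as torch.nn.utils.rnn.pack_padded_sequence take as input, and some PyTorch versions expect the batch sorted by decreasing length, which is why a sort order is also computed.

```python
import numpy as np

# Made-up padded batch; 0 is assumed to be padding only.
padded = np.array([
    [5, 26, 7, 0, 0],
    [61, 147, 9, 7, 282],
    [15, 0, 0, 0, 0],
])

mask = padded != 0                            # True at real tokens
lengths = mask.sum(axis=1)                    # true length of each sequence
order = np.argsort(-lengths, kind='stable')   # indices from longest to shortest
sorted_batch = padded[order]

print(lengths)
print(order)
```

After running the RNN on the sorted batch, the original order can be restored with np.argsort(order).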
import torch
import torch.nn as nn
import torch_util
from torch.autograd import Variable
from torch import optim
class YourModel(nn.Module):
    # TODO
    # Build your model in this cell.
    def __init__(self, h_size=128, v_size=10, d=300, mlp_d=256):
        super(YourModel, self).__init__()
        # self.embedding
        # self.lstm

    def display(self):
        for param in self.parameters():
            print(param.data.size())

    def forward(self):
        pass
After finishing your model, run the code in the cell below to show your model.
Here is a sample output: (You don't need to have the same output)
embedding.weight torch.Size([19007, 300])
lstm.weight_ih_l0 torch.Size([512, 300])
lstm.weight_hh_l0 torch.Size([512, 128])
lstm.bias_ih_l0 torch.Size([512])
lstm.bias_hh_l0 torch.Size([512])
lstm.weight_ih_l0_reverse torch.Size([512, 300])
lstm.weight_hh_l0_reverse torch.Size([512, 128])
lstm.bias_ih_l0_reverse torch.Size([512])
lstm.bias_hh_l0_reverse torch.Size([512])
mlp_1.weight torch.Size([256, 1024])
mlp_1.bias torch.Size([256])
sm.weight torch.Size([3, 256])
sm.bias torch.Size([3])
model = YourModel()
model.embedding.weight.data = torch.from_numpy(vocab.embeddings)
model.display()
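The LSTM shapes in the sample output can be checked by hand. PyTorch's nn.LSTM stacks the four gate matrices along the first dimension, so with h_size=128 and d=300 the input-to-hidden weight has shape (4 * 128, 300) = (512, 300). The short calculation below reproduces those shapes and counts the parameters; the helper names lstm_param_shapes and count are hypothetical and exist only for this illustration.

```python
def lstm_param_shapes(input_size, hidden_size):
    """Parameter shapes of one direction of a single-layer LSTM."""
    gates = 4  # input, forget, cell, and output gates are stacked together
    return {
        'weight_ih': (gates * hidden_size, input_size),
        'weight_hh': (gates * hidden_size, hidden_size),
        'bias_ih': (gates * hidden_size,),
        'bias_hh': (gates * hidden_size,),
    }


def count(shape):
    """Number of scalars in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


shapes = lstm_param_shapes(input_size=300, hidden_size=128)
per_direction = sum(count(s) for s in shapes.values())
bidirectional_total = 2 * per_direction  # forward and reverse directions

print(shapes)
print(per_direction, bidirectional_total)
```

A check like this is a quick way to confirm that your own model's display() output matches the sizes you intended.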
You can now train your model batch by batch using whatever optimizer you want.
To keep track of your training, you should also print out the loss every 1000*X batches.
Write your code in the cell below. Print out the loss every 1000*X batches and your final average loss.
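The loss bookkeeping itself is independent of the model and can be sketched in plain Python. In the example below, track_losses and print_every are hypothetical names, and a short list of made-up numbers stands in for the scalar loss of each batch; in a real training loop each value would come from your loss function after the forward pass.

```python
def track_losses(batch_losses, print_every=1000):
    """Print the mean loss of each window and return (reports, overall mean)."""
    running = 0.0   # sum of losses in the current window
    total = 0.0     # sum of all losses so far
    reports = []
    for step, loss in enumerate(batch_losses, start=1):
        running += loss
        total += loss
        if step % print_every == 0:
            reports.append((step, running / print_every))
            print('batch %d: average loss %.4f' % reports[-1])
            running = 0.0
    final_average = total / max(len(batch_losses), 1)
    return reports, final_average


# Made-up per-batch losses; print_every=2 keeps the demo short.
fake_losses = [1.0, 0.8, 0.6, 0.4]
reports, final_average = track_losses(fake_losses, print_every=2)
print('final average loss %.4f' % final_average)
```

Averaging over a window rather than printing a single batch's loss gives a smoother picture of whether training is progressing.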
Complete the eval_model(model, mode='dev') function in the cell below to evaluate your model on the dev and test sets. The return value of this function should be the accuracy. Try to tune your model on the dev set, then evaluate the model with the best dev result on the test set and report the final test set result. Note: You should run your model on the test set only once.
If you are not satisfied with the result of the evaluation (as in most cases), you can make some changes to your model and try again.
If you are running most things correctly, your result should easily be at least 60% on the dev set if you are using 0.05 of the SNLI training data, and at least 80% on the dev set if you are using the full training set.
def eval_model(model, mode='dev'):
    file_name = 'snli_1.0/snli_1.0_dev.jsonl' if mode == 'dev' else 'snli_1.0/snli_1.0_test.jsonl'

    dev_data, _ = obtain_data(file_name)
    dev_n_data = vocab.process_data(dev_data)  # use the freshly loaded dev/test data, not the training data
    # <your dev data> = convert_to_numpy(dev_n_data)

    model.eval()
    total = 0
    hit = 0
    # TODO:
    # Write your code here to show the result of your model on the dev or test set.

    return hit / float(total)
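The hit and total counters can be accumulated batch by batch from the model's output scores. The sketch below shows only that counting step, using made-up scores in numpy instead of a real model; count_hits is a hypothetical helper, and the assumption is that the model produces one score per label with the highest score taken as the prediction.

```python
import numpy as np


def count_hits(logits, labels):
    """Number of rows whose highest-scoring label equals the gold label."""
    preds = np.argmax(logits, axis=1)
    return int((preds == labels).sum())


# Made-up scores for four examples over the three labels.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.2, 0.3, 0.1],
    [0.0, 0.5, 1.5],
    [1.0, 0.9, 0.8],
])
labels = np.array([0, 2, 2, 0])

hit, total = 0, 0
batch_size = 2
for start in range(0, len(labels), batch_size):
    end = start + batch_size
    hit += count_hits(logits[start:end], labels[start:end])
    total += len(labels[start:end])

accuracy = hit / float(total)
print(accuracy)
```

Counting hits per batch and dividing once at the end gives the right accuracy even when the last batch is smaller than the others.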
This course is designed to train you as an NLP researcher. A researcher should be able not only to implement newly emerged models and algorithms and get them to work, but also to give the reasons and intuitions behind every decision made during the research (e.g. parameter and structure design).
In this section, write down anything you think is important in this homework.
It could be:
Use your imagination and try to record every detail of your experiments. Bonus points will be given for novel and reasonable thoughts.