Description
1 Introduction
How do we represent word meaning so that we can analyze it, compare different words' meanings, and use these representations in NLP tasks? One way to learn word meaning is to find regularities in how a word is used. Two words that appear in very similar contexts probably mean similar things. One way you could capture these contexts is to simply count which words appeared nearby. If we had a vocabulary of V words, we would end up with each word being represented as a vector of length |V|,[1] where for a word w_i, each dimension j of w_i's vector, w_i,j, refers to how many times w_j appeared in a context where w_i was used.
[1] You'll often just see the number of words in a vocabulary abbreviated as V in blog posts and papers. This notation is shorthand; typically V is somehow related to the vocabulary, so you can use your judgment on how to interpret whether it's referring to the set of words or to the total number of words.
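To make the counting idea concrete, here is a minimal sketch (not part of the assignment code; the function name is ours) of how such co-occurrence counts could be collected with a small context window. Each word's row of counts is exactly the length-|V| vector described above.

    from collections import Counter, defaultdict

    def cooccurrence_counts(tokens, window=2):
        # counts[target][context] = times context appeared within +/-window of target
        counts = defaultdict(Counter)
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[target][tokens[j]] += 1
        return counts

    counts = cooccurrence_counts("the cat sat on the mat near the other cat".split())
    print(counts["cat"].most_common(3))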
The simple counting model we described actually works pretty well as a baseline! However, it has two major drawbacks. First, if we have a lot of text and a big vocabulary, our word vector representations become very expensive to compute and store. A vocabulary of 1,000 words that all co-occur with some frequency would take a matrix of size |V|^2, which has a million elements! Even though not all words will co-occur in practice, when we have hundreds of thousands of words, the matrix can become infeasible to compute.
Second, this count-based representation has a lot of redundancy in it. If "ocean" and "sea" appear in similar contexts, we probably don't need the co-occurrence counts for all |V| words to tell us they are synonyms. In mathematical terms, we're trying to find a lower-rank matrix that doesn't need all |V| dimensions. Word embeddings solve both of these problems by trying to encode the kinds of contexts a word appears in as a low-dimensional vector. There are many (many) solutions for how to find lower-dimensional representations, with some of the earliest and most successful ones based on the Singular Value Decomposition (SVD); one you may have heard of is Latent Semantic Analysis. In Homework 2, you'll learn about a relatively recent technique, word2vec, that outperforms prior approaches for a wide variety of NLP tasks and is very widely used. This homework will build on your experience with stochastic gradient descent (SGD) and log-likelihood (LL) from Homework 1. You'll (1) implement a basic version of word2vec that will learn word representations and then (2) try using those representations in intrinsic tasks that measure word similarity and an extrinsic task for sentiment analysis.
For this homework, we've provided skeleton code in Python 3 that you can use to finish the implementation of word2vec, with comments within to hint at how to turn some of the math into Python code. You'll want to start early on this homework so you can familiarize yourself with the code and implement each part. This homework has the following learning goals:
• Develop your PyTorch programming skills by working with more of the library
• Learn how word2vec works in practice
• Learn how to modify the loss to have networks learn (or avoid learning) multiple things
• Improve your advanced data science debugging skills
• Work with large corpora
• Identify bias in pretrained models and attempt to mitigate it
• Learn how to use tensorboard
This homework is a mix of conceptual and skills-based learning. As you get the hang of programming neural networks, you'll be able to teach them to do many more advanced tasks. This homework will hopefully help prepare you by again having you advance your skills while also getting you thinking about what training word embeddings can do for us (as practitioners).
2 Notes
We've made the implementation easy to follow and avoided some of the useful but opaque optimizations that can make the code much faster.[2] As a result, training your model may take some time. We estimate that on a regular laptop, it might take 30-45 minutes to finish training a single epoch of your model. That said, you can still quickly run the model for ∼10K steps in a few minutes and check whether it's working. A good way to check is to see what words are most similar to some high frequency words, e.g., "january" or "good." If the model is working, similar-meaning words should have similar vector representations, which will be reflected in the most similar word lists. We have included this as an automated test which will print out the most similar words.
[2] You'll also find a lot of wrong implementations of word2vec online, if you go looking. Those implementations will be much slower and produce worse vectors. Beware!
The skeleton code also includes methods for writing word2vec data in a common format readable by the Gensim library. This means you can save your model and load the data with any other common libraries that work with word2vec. Once you're able to run your model for ∼100K iterations (or more), we recommend saving a copy of its vectors and loading them in a notebook to test. We've included an exploratory notebook. On a final note, this is the most challenging homework in the class. Much of your time will be spent on Task 1, which is just implementing word2vec. It's a hard but incredibly rewarding homework and the process of doing the homework will help turn you into a world-class information and data scientist!
3 Data
For data, we'll be using a sample of cleaned Wikipedia biographies that's been shrunk down to make it manageable. This is pretty fun data to use since it lets us use word vectors to probe for knowledge (e.g., what's similar to chemistry?). If you're very ambitious, we've included the full cleaned biographies, which will be slow. Feel free to see how the model works and whether you can get through a single epoch! We've provided several files for you to use:
1. wiki-bios.med.txt – Eventually train your word2vec model on this data.
2. wiki-bios.DEBUG.txt – The first 100 biographies. Your model won't learn much from this but you can use the file to quickly test and debug your code without having to wait for the tokenization to finish.
3. wiki-bios.10k.txt – A smaller sample of 10K biographies that you can use to verify your word2vec is learning something without having to train on the final corpus.
4. wiki-bios.HUGE.txt – A much larger dataset of biographies. You should only use this if you want to test scalability or see how much you can optimize your method.
5. word-pair-similarity-predictions.csv – This is the data for the debiasing evaluation that you'll use to estimate similarities for CodaLab. You only need this for the very last part of the homework.
4 Task 1: Word2vec
In Task 1, you'll implement parts of word2vec in various stages. Word2vec itself is a complex piece of software and you won't be implementing all the features in this homework. In particular, you will implement:
1. Skip-gram negative sampling (you might see this abbreviated as SGNS)
2. Rare word removal
3. Frequent word subsampling
You'll spend the majority of your time on Part 1 of that list, which involves writing the gradient descent part. You'll start by getting the core part of the algorithm up and running without parts 2 and 3, using gradient descent and using negative sampling to generate (incorrect) output examples. Then, you'll work on ways to improve efficiency and quality by removing overly common words and removing rare words.
Parameters and notation. The vocabulary size is V, and the hidden layer size is k.
The hidden layer size k is a hyperparameter that will determine the size of our embeddings. The units on these adjacent layers are fully connected. The input is a one-hot encoded vector x, which means for a given input word, only one out of V units, {x_1, . . . , x_V}, will be 1, and all other units are 0. The output layer consists of a number of context words, which are also V-dimensional one-hot encodings of a number of words before and after the input word in the sequence. So if your input word was word w in a sequence of text and you have a context window[3] of ±2, this means you will have four V-dimensional one-hot outputs in your output layer, each encoding words w−2, w−1, w+1, w+2 respectively. Unlike the input-hidden layer weights, the hidden-output layer weights are shared: the weight matrix that connects the hidden layer to output word w_j will be the same one that connects to output word w_k for all context words.
[3] Typically, when describing a window around a word, we use negative indices to refer to words before the target, so a ±2 window around index i starts at i − 2 and ends at i + 2 but excludes index i.
The weights between the input layer and the hidden layer can be represented by a V × k matrix W, and the weights between the hidden layer and each of the output contexts are similarly represented as C with the same dimensions. Each row of W is the k-dimensional embedded representation v_I of the associated word w_I of the input layer—these rows are effectively the word embeddings we want to produce with word2vec. Let input word w_I have one-hot encoding x and h be the output produced at the hidden layer.
Then, we have:
h = W^T x = v_I    (1)
Similarly, v_I acts as an input to the second weight matrix C to produce the output neurons, which will be the same for all context words in the context window. That is, each output word vector is:
u = C h    (2)
and for a specific word w_j, we have the corresponding embedding in C as v′_j, and the corresponding neuron in the output layer gets u_j as its input, where:
u_j = v′_j^T h    (3)
Note that in both of these cases, multiplying the one-hot vector for a word w_i by the corresponding matrix is the same thing as simply selecting the row of the matrix corresponding to the embedding for w_i. If it helps to think about this visually, think about the case for the inputs to the network: the one-hot encoding represents which word is the center word, with all other words not being present. As a result, their inputs are zero and never contribute to the activation of the hidden layer (only the center word does!), so we don't need to even do the multiplication. In practice, we typically never represent these one-hot vectors for word2vec as it's much more efficient to simply select the appropriate row.
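As a quick illustration of that last point, the two computations below give identical results, which is why implementations store an embedding matrix and index into it rather than ever building one-hot vectors (a toy sketch, not the assignment's code):

    import torch

    V, k = 6, 4
    W = torch.randn(V, k)          # input-side embedding matrix

    word_id = 2
    one_hot = torch.zeros(V)
    one_hot[word_id] = 1.0

    h_matmul = W.T @ one_hot       # h = W^T x, the "textbook" version
    h_lookup = W[word_id]          # simply selecting the appropriate row

    print(torch.allclose(h_matmul, h_lookup))  # True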
An unoptimized, naive version of word2vec would predict which context word w_c was present given an input word w_I by estimating the probabilities across the whole vocabulary using the softmax function:
P(w_c = w*_c | w_I) = y_c = exp(u_c) / Σ_{i=1}^{V} exp(u_i)    (4)
The original log-likelihood objective is then to maximize the probability that the context words (in this case, w−2, . . . , w+2) were all guessed correctly given the input word w_I. Note that you are not implementing this function! Showing this function raises two important questions: (1) why is it still being described, and (2) why aren't you implementing it? First, the equation represents an ideal case of what the model should be doing: given some positive value to predict for one of the outputs (w_c), everything else should be close to zero.
This objective is similar to the likelihood you implemented for logistic regression: given some input, the weights need to be moved to push the predictions closer to 0 or closer to 1. However, think about how many weights you'd need to update to optimize this particular log-likelihood. For each positive prediction, you'd need to update |V| − 1 other vectors to make their predictions closer to 0. That strategy, which uses the softmax, results in a huge computational overhead—despite being the most conceptually sound.
The success of word2vec is, in part, due to coming up with a smart way to achieve nearly the same result without having to apply the softmax. Therefore, to answer the second question: now that you know what the goal is, you'll be implementing a far more efficient method known as negative sampling that approximates a model optimizing this objective! If you read the original word2vec paper, you might find some of the notation hard to follow. Thankfully, several papers have tried to unpack the paper in a more accessible format. If you want another description of how the algorithm works, try reading Goldberg and Levy [2014][4] or Rong [2014][5] for more explanation. There are also plenty of good blog tutorials for how word2vec works and you're welcome to consult those[6] as well as some online demos that show how things work.[7]
[4] https://arxiv.org/pdf/1402.3722.pdf
[5] https://arxiv.org/pdf/1411.2738.pdf
[6] E.g., http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
[7] https://ronxin.github.io/wevi/
There's also a very nice illustrated guide to word2vec at https://jalammar.github.io/illustrated-word2vec/ that can provide more intuition too.
4.1 Getting Started: Preparing the Corpus
Before we can even start training, we'll need to determine the vocabulary of the input text and then convert the text into a sequence of IDs that reflect which input neuron corresponds to which word. Word2vec typically treats all text as one long sequence, which ignores sentence boundaries, document boundaries, or otherwise-useful markers of discourse. We will follow suit. In the code, you'll see general instructions on which steps are needed to (1) create a mapping of word to ID and (2) process the input sequence of tokens and convert it to a sequence of IDs that we can use for training. This sequence of IDs is what we'll use to create our training data. As a part of this process, we'll also keep track of all the token frequencies in our vocabulary.
■ Problem 1. Modify the function load_data in the Corpus class to read in the text data and fill in the id_to_word, word_to_id, and full_token_sequence_as_ids fields. You can safely skip the rare word removal and subsampling for now.
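If it helps to see the shape of this preprocessing before diving into the skeleton, here is a rough sketch of the idea behind Problem 1 (the skeleton's actual method and field layout may differ; treat this only as a starting point):

    from collections import Counter

    def build_vocab(tokens):
        word_counts = Counter(tokens)
        word_to_id, id_to_word = {}, {}
        for word in word_counts:
            idx = len(word_to_id)
            word_to_id[word] = idx
            id_to_word[idx] = word
        # the whole corpus as one long sequence of IDs
        full_token_sequence_as_ids = [word_to_id[t] for t in tokens]
        return word_to_id, id_to_word, word_counts, full_token_sequence_as_ids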
4.2 Negative sampling
For a target word, the nearby words in the context form the positive examples for training our prediction model. Rather than train word2vec like a regular multiclass classification model (which uses the softmax function to predict outputs[8]), word2vec uses a small number of randomly-selected words as negative examples.[9] These negative examples are referred to as the negative samples.
[8] When using the softmax, you would update the parameters (embeddings) for all the words after seeing each training instance. However, consider how many parameters we have to adjust: for one prediction, we would need to change |V| × N weights—this is expensive to do! Mikolov et al. proposed a slightly different update rule to speed things up. Instead of updating all the weights, we update only a small percentage by updating the weights for the predictions of the words in context and then performing negative sampling to choose a few words at random as negative examples of words in the context (i.e., words that shouldn't be predicted to be in the context) and updating the weights for these negative predictions.
[9] There is another formulation of word2vec that uses a hierarchical softmax to speed up the softmax computation (which is the bottleneck) but few use this in practice.
The negative samples are chosen using a unigram distribution raised to the 3/4 power: each word is given a weight equal to its frequency (word count) raised to the 3/4 power. The probability of selecting a word is just its weight divided by the sum of weights for all words. The decision to raise the frequency to the 3/4 power is fairly empirical, and this function was reported in their paper to outperform other ways of biasing the negative sampling towards infrequent words. Computing this function each time we sample a negative example is expensive, so one important implementation efficiency is to create a table that we can quickly sample words from. We've provided some notes in the code and your job will be to fill in a table that can be efficiently sampled.[10]
[10] Hint: In the slides, we showed how to sample from a multinomial (e.g., a die with different weights per side) by turning it into a distribution that can be sampled by choosing a random number in [0,1]. You'll be doing something similar here.
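One way to build such a table is sketched below, assuming you already have a word-count dictionary and a word-to-ID mapping (the names and the table size are illustrative choices, not the skeleton's): fill a long array so that each word ID occupies a share proportional to count^0.75; sampling a uniform random index then samples from the desired distribution.

    import numpy as np

    def build_sampling_table(word_counts, word_to_id, table_size=1_000_000, power=0.75):
        words = list(word_counts)
        ids = np.array([word_to_id[w] for w in words])
        weights = np.array([word_counts[w] for w in words], dtype=np.float64) ** power
        probs = weights / weights.sum()
        # each word ID is repeated (approximately) proportionally to its probability
        return np.repeat(ids, np.round(probs * table_size).astype(int))

    # sampling two negatives is then just indexing with random integers:
    # negatives = table[np.random.randint(0, len(table), size=2)]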
■ Problem 2. Modify the function generate_negative_sampling_table to create the negative sampling table.
4.3 Generating the Training Data
Once you have the tokens in place, the next step is to get the training data in place to actually train the model. Say we have the input word "fox" and observed context word "quick". When training the network on the word pair ("fox", "quick"), we want the model to predict an output of 1, signalling that this word ("quick") was present in the context. With negative sampling, we will randomly select a small number of negative examples (let's say 2) for each positive example and update their weights too. (In this context, a negative example is one for which we want the network to output a 0.) When updating the model (later), our parameters will be updated based on our current ability to predict 1 for the positive examples and 0 for the negative examples. To generate the training data, you'll iterate through all token IDs in the sequence. At each time step, the current token ID will become the target word. You'll use the window size parameter to decide how many nearby tokens should be included as positive training examples. The original word2vec paper says that selecting 5-20 negative words works well for smaller datasets, and you can get away with only 2-5 words for large datasets. In this assignment, you will update with 2 negative words per context word. This means that if your context window selects four words, you will randomly sample 8 words as negative examples of context words. We recommend keeping the negative sampling rate at 2, but you're welcome to try changing this and seeing its effect (we recommend doing this after you've completed the main assignment). Note: There is one important PyTorch-related wrinkle that you will need to account for, which is described in detail in the code.
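A rough sketch of the instance-generation loop is below; the exact output format your skeleton expects (and the PyTorch wrinkle mentioned above) is described in the code, so treat this only as an outline of the logic:

    import random

    def generate_training_instances(token_ids, neg_table, window=2, negs_per_context=2):
        instances = []
        for i, target in enumerate(token_ids):
            for j in range(max(0, i - window), min(len(token_ids), i + window + 1)):
                if j == i:
                    continue
                context = token_ids[j]
                negatives = [random.choice(neg_table) for _ in range(negs_per_context)]
                # one positive context word plus its sampled negatives, with 1/0 labels
                instances.append((target, [context] + negatives, [1] + [0] * negs_per_context))
        return instances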
■ Problem 3. Generate the list of training instances according to the specifications in the code.
4.4 Define Your word2vec Network
Now that the data is ready, we can define our PyTorch neural network for word2vec. Here, we will not use layers but instead use PyTorch's Embedding class to keep track of our target word and context word embeddings.
■ Problem 4. Modify the init_weights function to initialize the values in the two Embedding objects based on the size of the vocabulary |V| and the size of the embeddings. Unlike in logistic regression where we initialized our β vector to be zeros, here we'll initialize the weights to have small non-zero values centered on zero and sampled from (-init_range, init_range).[11]
[11] Why initialize this way? Consider what would happen if our initial matrices were all zero and we had to compute the inner product of the word and context vectors. The value would always be zero and the model would never be able to learn anything!
The next step is to update the forward function, which takes as input some target word and context words and predicts 0 or 1 for whether each context word was present. Formally, for some target word vector v_t and context word vector v_c, word2vec makes its predictions as
σ(v_t · v_c)    (5)
where σ is the sigmoid function (like in Homework 1). Word2vec aims to learn parameters (its two embedding matrices) such that this function is maximized for positive examples and minimized for negative examples.
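Putting Problems 4 and 5 together, the network might look roughly like the sketch below; the class and argument names are ours rather than necessarily the skeleton's, and the batching shapes are one reasonable choice:

    import torch
    import torch.nn as nn

    class Word2Vec(nn.Module):
        def __init__(self, vocab_size, embedding_size, init_range=0.1):
            super().__init__()
            self.target_embeddings = nn.Embedding(vocab_size, embedding_size)
            self.context_embeddings = nn.Embedding(vocab_size, embedding_size)
            # small random values centered on zero (all zeros would never learn)
            nn.init.uniform_(self.target_embeddings.weight, -init_range, init_range)
            nn.init.uniform_(self.context_embeddings.weight, -init_range, init_range)

        def forward(self, target_ids, context_ids):
            # target_ids: (batch,); context_ids: (batch, n) with positives and negatives
            v_t = self.target_embeddings(target_ids).unsqueeze(1)   # (batch, 1, k)
            v_c = self.context_embeddings(context_ids)              # (batch, n, k)
            return torch.sigmoid((v_t * v_c).sum(dim=-1))           # sigma(v_t . v_c), (batch, n)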
■ Problem 5. Modify the forward function.
4.5 Train Your Model
Once you have the data in the right format, you're ready to train your model! You will need to implement the core training loop like you did in Homework 1, where you iterate over all the instances in a single epoch and potentially train for multiple epochs. One key difference this time is that you will use batching. In Homework 1 we had a stark contrast between (1) full gradient descent, where a single step required us to compute the gradient with respect to all the data, and (2) stochastic gradient descent, where we take a step based on the prediction error for a single instance. However, there is a middle ground! Often we can improve the gradient by computing it with respect to a few instances instead of just one. As an analogy, consider if you wanted to know whether you were on the right track: it can help to ask a few folks, but you don't need to ask everyone (and asking just one person could be risky and send you on the wrong track).
Batched gradient descent is the same way. Conveniently, PyTorch works nearly seamlessly with batching. We can tell the DataLoader class our batch size and it will return a random sample of instances of that size. The code you write for the forward function will also work with a batch with no modifications (most of the time). This behavior is even better for us because computers are often much faster at larger computations—especially GPUs—so doing the forward/backward passes for an entire batch is often just as fast as doing them for a single instance.
Note: One caveat to things just working is that sometimes your forward-pass code will be set up so that it can't work with batching. The code hints and description in the notebook will hopefully help you avoid these, but we're also here to support you on Piazza. In your implementation we recommend starting with these default parameter values:
• batch_size = 16
• k = 50 (embedding size)
• η = 5e−5 (learning rate)
• window ±2
• min_token_freq = 5
• epochs = 1
• optimizer = AdamW
You can experiment with other values to see how they affect your results. Your final submission should use a batch size > 1. For more details on the equations and details of word2vec, consult Rong's paper [Rong, 2014], especially Equations 59 and 61.
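For orientation, a bare-bones version of such a training loop might look like the sketch below, reusing the Word2Vec sketch from earlier and assuming the training instances have already been wrapped in a PyTorch Dataset whose items are tensors of the right shapes (the dataset and vocab_size names are placeholders):

    import torch
    from torch.utils.data import DataLoader

    loader = DataLoader(dataset, batch_size=16, shuffle=True)   # yields (targets, contexts, labels)
    model = Word2Vec(vocab_size, embedding_size=50)
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    loss_fn = torch.nn.BCELoss()

    for epoch in range(1):
        for target_ids, context_ids, labels in loader:
            optimizer.zero_grad()
            preds = model(target_ids, context_ids)
            loss = loss_fn(preds, labels.float())
            loss.backward()
            optimizer.step()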
■ Problem 6. Modify the cell containing the training loop to complete the required PyTorch training process. The notebook describes all the steps in more detail.
■ Problem 7. Check that your model actually works. We recommend running your code on the wiki-bios.10k.txt file for one epoch. After this much data, your model should know enough about common words that the nearest neighbors (words with the most similar vectors) of words like "january" will be month-related words. We've provided code at the end of the notebook to explore. Try a few examples and convince yourself that your model/code is working. Once you're finished here, you're not yet ready to run everything but you're close!
4.6 Implement stop-word and rare-word removal
Using all the unique words in your source corpus is often not necessary, especially when considering words that convey very little semantic meaning like "the", "of", "we".
As a preprocessing step, it can be helpful to remove any instance of these so-called "stop words". Note that when you remove stop words, you should keep track of their position so that the context doesn't include words outside of the window. This means that for a sentence like "my big cats of the kind that…" with a context window of ±2 around "cats", you would only have "my" and "big" as context words (since "of" and "the" get removed) and would not include "kind."
4.6.1 Minimum frequency threshold.
In addition to removing words that are so frequent that they have little semantic value for comparison purposes, it is also often a good idea to remove words that are so infrequent that they are likely very unusual words or words that don't occur often enough to get sufficient training during SGD. While the minimum frequency can vary depending on your source corpus and requirements, we will set min_count = 5 as the default in this assignment. Instead of just removing words that have fewer than min_count occurrences, we will replace them all with a unique <UNK> token. In the training phase, you will skip over any input word that is <UNK>, but you will still keep these as possible context words.
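The replacement itself can be as simple as the sketch below (assuming you already have per-word counts; the variable names here are just for illustration):

    UNK = "<UNK>"
    min_count = 5

    # replace every occurrence of a rare word with the single <UNK> token
    filtered_tokens = [t if word_counts[t] >= min_count else UNK for t in tokens]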
■ Problem 8. Modify the function load_data to convert all words with fewer than min_count occurrences into <UNK> tokens. Modify your dataset generation code to avoid creating a training instance when the target word is <UNK>.
4.6.2 Frequent word subsampling
Words appear with varying frequencies: some words like "the" are very common, whereas others are quite rare. In the current setup, most of our positive training examples will be for predicting very common words as context words. These examples don't add much to learning since they appear in many contexts.
The word2vec library offers an alternative to ensure that contexts are more likely to have meaningful words. When creating the sequence of words for training (i.e., what goes in full_token_sequence_as_ids), the software will randomly drop words based on their frequency so that more common words are less likely to be included in the sequence. This subsampling effectively increases the context window too—because the context window is defined with respect to full_token_sequence_as_ids (not the original text), dropping a nearby common word means the context gets expanded to include the next-nearest word that was not dropped. To determine whether a token in full_token_sequence_as_ids should be subsampled, the word2vec software uses this equation to compute the probability p_k(w_i) of a token for word w_i being kept for training:
p_k(w_i) = ( sqrt( p(w_i) / 0.001 ) + 1 ) · ( 0.001 / p(w_i) )    (6)
where p(w_i) is the probability of the word appearing in the corpus initially. Using this probability, each occurrence of w_i in the sequence is randomly decided to be kept or removed based on p_k(w_i).
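A sketch of how equation (6) might be computed and applied, assuming you have the word counts and ID mappings from earlier (names are illustrative):

    import math
    import random

    total_tokens = sum(word_counts.values())
    keep_prob = {}
    for word, count in word_counts.items():
        p_w = count / total_tokens
        # equation (6): can exceed 1 for rare words, which simply means "always keep"
        keep_prob[word] = (math.sqrt(p_w / 0.001) + 1) * (0.001 / p_w)

    subsampled_ids = [wid for wid in full_token_sequence_as_ids
                      if random.random() < keep_prob[id_to_word[wid]]]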
■ Problem 9. Modify the function load_data to compute the probability p_k(w_i) of being kept during subsampling for each word w_i.
■ Problem 10. Modify the function load_data so that after the initial full_token_sequence_as_ids is constructed, tokens are subsampled (i.e., removed) according to their probability of being kept p_k(w_i).
4.7 Getting Tensorboard Running
As you might guess, training word2vec on a lot of data can take some time. The waiting will only grow as you train larger and larger models (not just word2vec). However, the larger PyTorch ecosystem provides some fantastic tools for you, the practitioner, to monitor the progress. In this subtask, you'll be using one of those tools, tensorboard, which allows you to log how your model is doing; you can then connect to the tensorboard interface and see the plot. Figure 1 shows an example of the tensorboard plot for our reference implementation after one epoch of training. Here, we've just recorded a running sum of the loss every 100 steps. You will want to do the same. This will help you see how quickly your model is converging. If you train multiple models, tensorboard will show all of their training plots so you can see how your choice of hyperparameters affects training speed and which model has learned the most (has the lowest loss). In practice, many people use tensorboard to determine when to stop training after seeing that their model has effectively converged.
Figure 1: An example tensorboard run from the reference solution where the running sum of loss is reported every 100 steps (i.e., the sum of those steps' loss) across one epoch on the training data. Hovering over any point shows the loss at that time, as well as the relative wall-clock time. As you can see, after one epoch the model has learned something but has probably not fully converged!
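Logging to tensorboard only takes a few extra lines; a sketch building on the earlier training-loop sketch (the tag name is an arbitrary choice) might look like:

    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter()          # writes event files under ./runs/ by default
    running_loss = 0.0
    for step, (target_ids, context_ids, labels) in enumerate(loader):
        optimizer.zero_grad()
        preds = model(target_ids, context_ids)
        loss = loss_fn(preds, labels.float())
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if (step + 1) % 100 == 0:
            writer.add_scalar("running-sum-of-loss", running_loss, step + 1)
            running_loss = 0.0
    writer.close()

You can then launch the interface with tensorboard --logdir runs and watch the curve update as training proceeds.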
■ Problem 11. Add tensorboard logging to your training loop so that you keep track of the sum of the losses for the past 100 steps and record the value with tensorboard.
4.8 Train Your Final Model
All the pieces are now in place and you can verify the model has learned something. For your final vectors, we'll have you train for at least one epoch. Before you do that, we'll have you do one quick exploration to see how batch size impacts training speed.
■ Problem 12. Try batch sizes of 2, 8, 32, 64, 128, 256, and 512 to see how fast each step (one batch worth of updates) is and the total estimated time. For this, you'll set the batch size parameter and then run the training long enough to get an estimate of both, with tqdm wrapped around your batch iterator. You do not need to finish training for the full epoch. Make a plot where batch size is on the x-axis and the tqdm-estimated time to finish one epoch is on the y-axis. (You may want to log-scale one or both of the axes.) You can try other batch sizes too in this plot if you're curious. In your write-up, describe what you see. What batch size would you choose to maximize speed? Side note: You might also want to watch your memory usage, as larger batches can sometimes dramatically increase memory.
■ Problem 13. Train your model on at least one epoch worth of data. You are welcome to change the hyperparameters as you see fit for your final model (although batch size must be > 1). Record the full training process and save a picture of the tensorboard plot from your training run in your report. We need to see the plot. It will probably look something like Figure 1.
5 Task 2: Save Your Outputs
Once you've finished training the model for at least one epoch, save your vector outputs. The rest of the homework will use these vectors so you don't even have to re-run the learning code (until the very last part, but ignore that for now). Task 2 is here just so that you have an explicit reminder to save your vectors. We've provided a function to do this for you.
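If your vectors are written in the standard word2vec text format, loading and poking at them later is only a couple of lines with Gensim (the filename below is hypothetical):

    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("word2vec.wv.txt", binary=False)
    print(vectors.most_similar("january", topn=10))   # nearest neighbors by cosine similarity
    print(vectors.similarity("ocean", "sea"))          # cosine similarity between two words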
6 Task 3: Qualitative Evaluation of Word Similarities
Once you've learned the word2vec embeddings from how a word is used in context, we can now use them! How can we tell whether what it's learned is useful? As a part of training, we put in place code that shows the nearest neighbors, which is often a good indication of whether words that we think are similar end up getting similar representations. However, it's often better to get a more quantitative estimate of similarity. In Task 3, we'll begin evaluating the model by hand by looking at which words are most similar to another word based on their vectors. Here, we'll compare words using the cosine similarity between their vectors. Cosine similarity measures the angle between two vectors, and in our case, words that have similar vectors end up having similar (or at least related) meanings.
■ Problem 14. Load the model (vectors) you saved in Task 2 by using the Jupyter notebook provided (or code that does something similar) that uses the Gensim package to read the vectors. Gensim has a number of useful utilities for working with pretrained vectors.
■ Problem 15. Pick 10 target words and compute the most similar words for each using Gensim's function. Qualitatively looking at the most similar words for each target word, do these predicted words seem to be semantically similar to the target word? Describe what you see in 2-3 sentences. Hint: For maximum effect, try picking words across a range of frequencies (common, occasional, and rare words).
■ Problem 16. Given the analogy function, find five interesting word analogies with your word2vec model. For example, when representing each word by a word vector, we can form the equation king – man + woman = queen. In other words, you can read the equation as queen – woman = king – man, which means the vector relationship between queen and woman mirrors that between king and man. What kinds of other analogies can you find? (NOTE: Any analogies shown in the class recording cannot be used for this problem.) What approaches worked and what approaches didn't? Write 2-3 sentences in a cell in the notebook.
7 Task 4: Debiasing word2vec
Once you have completed all other steps, only then start on Task 4! Methods that learn word meaning from observing how words are used in practice are known to pick up on latent biases in the data. As a result, these biases persist in the vectors and any downstream applications that use them—something we don't want if, for example, the vectors are used in an NLP program screening resumes to decide whether to interview. In Task 4, we'll try our hand at preventing these biases by modifying the training procedure. You won't need to completely eliminate bias by any means, but the act of trying to reduce the biases will open up a whole new toolbox for how you (the experimenter/practitioner) can change how and what models learn. There are many potential ways to debias word embeddings so that their representations are not skewed along one "latent dimension" like gender. In Task 4, you'll try to remove gender bias! The point of this part of the assignment is to have you start grappling with a hard challenge, and there is no penalty for doing less well! One common technique to have models avoid learning bias is similar to another one you already know—regularization.
In logistic regression, we could use L2 regularization to have our model avoid learning β weights that are overfitted to specific or low-frequency features by adding a regularizer penalty where the larger the weight, the more penalty the model pays in its loss (remember that the model parameters are trying to lower this loss). Recall that this forces the model to only pick the most useful (generalizable) weights, since it has to pay a penalty for any non-zero weight. In word2vec, we can adapt the idea to think about whether our model's embeddings are closer or farther from different gender dimensions. For example, if we consider the embedding for "president", ideally we'd want it to be equally similar to the embeddings for "man" and "woman". One idea then is to penalize the model based on how uneven the similarity is. We can do this by directly modifying the loss:
loss = loss_criterion(preds, actual_vals) \
       + some_bias_measuring_function(model)
Here, the some_bias_measuring_function function takes in your model as input and returns how much bias you found. The example code provides a few ideas like:
1. penalizing the model for having dissimilar vectors for words like "man" and "woman" (i.e., those two vectors should have high cosine similarity)
2. penalizing the model if some word like "president" is more similar to the vector for either "man" or "woman" (i.e., the vector should have the same similarity to each).
We can easily add in these kinds of terms because of how PyTorch tracks the gradient (compare that with how you might have done this in Homework 1!).
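As one concrete (but by no means prescribed) example, a penalty based on idea 2 could be sketched as follows, assuming the Word2Vec class from the earlier sketch and that the chosen words are in the vocabulary; the word list and the weighting are choices you should experiment with:

    import torch.nn.functional as F

    def some_bias_measuring_function(model, word_to_id, weight=1.0):
        emb = model.target_embeddings.weight
        man = emb[word_to_id["man"]]
        woman = emb[word_to_id["woman"]]
        president = emb[word_to_id["president"]]
        # penalize "president" for being closer to one gendered word than the other
        gap = (F.cosine_similarity(president, man, dim=0)
               - F.cosine_similarity(president, woman, dim=0))
        return weight * gap.abs()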
Given the current loss for the context word prediction, we can add the penalty to this loss so that our word2vec model (1) learns to predict the right context words while (2) avoiding learning biases. As a result, the backpropagation will update the embedding weights to reduce the loss with respect to both our context word prediction and the bias penalty. Wow! There are many possible extensions and modifications to how to compute the penalty for bias. For example,
• Why just "president"? Maybe add more words.
• Why just "man" and "woman"? Why not other words or other gender-related words or other gender identities?
• Could you use both ideas above at once?
• Could you force the gender information into a specific embedding dimension and then drop it? (kind of like how we dropped dimensions with the SVD)
• Could you train a model to predict gender from embeddings and then penalize the model based on how well it does?
• How often do you need to apply these penalties when learning? Every step? Every few steps? (how much do they slow the training down?)
• How much should you weight the penalty compared with the penalty for wrong context-word predictions?
Your task is to build upon the very simple ideas here and in the code to define some new "bias penalty". You'll add this penalty value to the loss value that you calculated from your loss function. There is no right way to do this, and even some right-looking approaches may not work—or might work but simultaneously destroy the information in the word vectors (all-zero vectors are unbiased but also uninformative!). Once you have generated your model, record word vector similarities for the pairs listed on Canvas in word-pair-similarity-predictions.csv using the cosine similarity. You'll upload the file to CodaLab (details on Piazza), which is kind of like Kaggle but lets us use a custom scoring program to measure bias.
We'll evaluate your embedding similarities based on how unbiased they are and how much information they still capture after debiasing. Your grade does not depend on how well you do in CodaLab, just that you tried something and submitted. However, the CodaLab leaderboard will hopefully provide a fun and insightful way of comparing just how much bias we can remove from our embeddings. You are welcome to keep trying new ideas and submit them to see just how much bias you can remove! We're looking forward to seeing what the most successful approach is!!
7.1 CodaLab Submission
To make a submission to CodaLab, go to https://codalab.lisn.fr/ and register a new account. Then search for the competition "SI630 W22 Homework 2" and apply to join. You can make submissions under Participate – Submit/View Results. You need to submit a zip file containing only the word-pair-similarity-predictions.csv file.
8 Optional Tasks
Word2vec has spawned many different extensions and variants. For anyone who wants to dig into the model more, we've included a few optional tasks here. Before attempting any of these tasks, please finish the rest of the homework and then save your code in a separate file so if anything goes wrong, you can still get full credit for your work. These optional tasks are intended entirely for educational purposes and no extra credit will be awarded for doing them.
8.1 Optional Task 1: Modeling Multi-word Expressions
In your implementation, word2vec simply iterates over each token one at a time. However, words can sometimes be a part of phrases whose meaning isn't conveyed by the words individually. For example, "White House" is a specific concept, which in NLP is an example of what's called a multi-word expression.[12] In our particular data, there are lots of multi-word expressions. As these are biographies, a lot of people are born in the United States, which ends up being modeled as "united" and "states"—not ideal! We'll give you two ideas. In Option 1 of Optional Task 1, we've provided a list of common multi-word expressions in our data on Canvas (common-mwes.txt). Update your program to read these in and, during the load_data function, use them to group multi-word expressions into a single token. You're free to use whatever way you want, recognizing that not all instances of a multi-word expression are actually a single token, e.g., "We were united states the leader." This option is actually fairly easy and a fun way to get multi-word expressions to show up in the analogies too, which leads to lots of fun analogies about people.
[12] https://en.wikipedia.org/wiki/Multiword_expression
Option 2 is a bit more challenging. Mikolov et al. describe a way to automatically find these phrases as a preprocessing step to word2vec so that they get their own word vectors. In this option, you will implement their phrase detection as described in the "Learning Phrases" section of Mikolov et al. [2013].[13]
[13] http://arxiv.org/pdf/1310.4546.pdf
8.2 Optional Task 2: Better UNKing
Your current code treats all low frequency words the same by replacing them with an <UNK> token. However, many of these words could be collapsed into specific types of unknown tokens based on their prefixes (e.g., "anti" or "pre") or suffixes (e.g., "ly" or "ness") or even the fact that they are numbers or all capital letters! Knowing something about the context in which words occur can still potentially improve your vectors. In Optional Task 2, try modifying the code that replaces a token with <UNK> with something of your own creation.
8.3 Optional Task 3: Some performance tricks for Word2Vec
Word2vec, and deep learning in general, has many performance tricks you can try to improve how the model learns in both speed and quality. For Optional Task 3, you can try two tricks:
• Dropout: One useful and deceptively-simple trick is known as dropout. The idea is that during training, you randomly set some of the inputs to zero. This forces the model to not rely on any one specific neuron in making its predictions. There are many good theoretical reasons for doing this [e.g., Baldi and Sadowski, 2013]. To try this trick out, during training (and only then!), when making a prediction, randomly choose a small percentage (10%) of the total dimensions (e.g., 5 of the 50 dimensions of your embeddings) and set these to zero before computing anything for predictions.
• Learning Rate Decay: The current model uses the same learning rate for all steps. Yet, as we learn, the vectors are hopefully getting better at approximating the task. As a result, we might want to make smaller changes to the vectors as time goes on to keep them close to the values that are producing good results. This idea is formalized in a trick known as learning rate decay, where as training continues, you gradually lower the learning rate in hopes that the model converges better to a local minimum. There are many (many) approaches to this trick, but as an initial idea, try setting a lower bound on the learning rate (which could be zero!) and linearly decrease the learning rate with each step (see the sketch after this list). You might even do this after the first epoch. If you want to get fancier, you can try to only start decreasing the learning rate when the change in log-likelihood becomes smaller, which signals that the model is converging but could still potentially be fine-tuned a bit more.
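Here is one possible sketch of linear learning rate decay using PyTorch's built-in scheduler, reusing names from the earlier sketches (the total step count and the floor of 10% are arbitrary illustrative choices):

    import torch

    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
    total_steps = 100_000
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: max(0.1, 1.0 - step / total_steps))

    for step, batch in enumerate(loader):
        # ... forward pass, loss.backward(), optimizer.step() as before ...
        scheduler.step()   # lowers the learning rate a little after every step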
8.4 Optional Task 4: Incorporating Synonyms
As a software library, word2vec offers a powerful and extensible approach to learning word meaning. Many follow-up approaches have extended this core approach with aspects like (1) adding more information on the context with additional parameters or (2) modifying the task so that the model learns multiple things. In Optional Task 4 you'll try one easy extension: using external knowledge! Even though word2vec learns word meaning from scratch, we still have quite a few resources around that can tell us about words. One of those resources, which we'll talk much more about in Week 12 (Semantics), is WordNet, which encodes word meanings in a large knowledge base. In particular, WordNet contains information on which word meanings are synonymous. For example, "couch" and "sofa" have meanings that are synonymous.
In Optional Task 4, we've provided you with a set of synonyms (synonyms.txt) that you'll use during training to encourage word2vec to learn similar vectors for words with synonymous meanings. How will we make use of this extra knowledge of which words should have similar vectors? There have been many approaches to modifying word2vec, some of which are in your weekly readings for the word vector week. However, we'll take a simple approach: during training, if you encounter a token that has one or more synonyms, replace that token with a token sampled from among the synonymous tokens (which includes that token itself). For example, if "state" and "province" are synonyms, when you encounter the token "state" during training, you would randomly swap that token for one sampled from the set ("state", "province"). Sometimes this would keep the same token, but other times you force the model to use the synonym—which requires that synonym's embedding to predict the context for the original token.
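The swapping itself is straightforward; a sketch (with an illustrative data structure mapping each word to its synonym set, including the word itself) might look like:

    import random

    # e.g., synonym_sets = {"state": ["state", "province"], "province": ["state", "province"]}
    def maybe_swap_synonym(token, synonym_sets):
        if token in synonym_sets:
            return random.choice(synonym_sets[token])
        return token

    # applied while generating training targets:
    # training_tokens = [maybe_swap_synonym(t, synonym_sets) for t in tokens]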
Given more epochs, you may even predict the same context for each of the synonyms. One of the advantages of this approach is that if a synonymous word shows up much less frequently (e.g., "storm" vs. "tempest"), the random swapping may increase the frequency of the rare word and let you learn a similar vector for both. Update the main function to take in an optional file with a list of synonyms. Update the train function (as you see fit) so that if synonyms are provided, during training, tokens with synonyms are recognized and randomly replaced with a token from the set of synonyms (which includes the original token too!). Train the synonym-aware model for the same number of epochs as you used to solve Task 3 and save this model to file. As you might notice, synonyms.txt has synonyms that only make sense in some contexts! Many words have multiple meanings and not all of these meanings are equally common. Moreover, a word can have two parts of speech (e.g., be a noun and a verb), which word2vec is unaware of when modeling meaning.
As a result, the word vectors you learn are effectively trying to represent all the meanings for a word in a single vector—a tough challenge! The synonyms we've provided are an initial effort at identifying common synonyms, yet even these may shift the word vectors in unintended ways. In the next problem, you'll assess whether your changes have improved the quality of the models. Load both the original (not-synonym-aware) model and your new model into a notebook and examine the nearest neighbors of some of the same words. For some of the words in the synonyms.txt file, which vector space learns word vectors that have more reasonable nearest neighbors? Does the new model produce better vectors, in your opinion? Show at least five examples of nearest neighbors that you think help make your case and write at least two sentences describing why you think one model is better than the other.
9 Hints
1. Start early; this homework will take time to debug and you'll need time to wait for the model to train for Task 1 before moving on to Tasks 2-4, which use its output or modify the core code.
2. Run on a small amount of data at first to get things debugged.
3. The total time for load_data on the reference implementation loading the normal training data is under 20 seconds.
4. The training data generating code takes around 10-15 minutes on the wiki-bios.med.txt file.
5. Average time per epoch on our tests was ∼ one hour for the wiki-bios.med.txt file. If you are experiencing times much greater than that, it's likely due to a performance bug somewhere. We recommend checking for extra loops somewhere.
6. If your main computer is a tablet, please consider using Great Lakes for training the final models (you can develop/debug locally though!)
10 Submission
Please upload the following to Canvas by the deadline:
1. your code for word2vec
2. your Jupyter notebook for Parts 3 and 4
3. a PDF copy of your notebook (which has the written answers) that also includes what your username is on CodaLab.
Please upload your code and response document separately. We reserve the right to run any code you submit; code that does not run or produces substantially different outputs will receive a zero. In addition, you should upload your debiased model's similarity scores to the CodaLab site.
References
Pierre Baldi and Peter J. Sadowski. Understanding dropout. Advances in Neural Information Processing Systems, 26:2814–2822, 2013.
Yoav Goldberg and Omer Levy. word2vec explained: deriving Mikolov et al.'s negative-sampling word-embedding method. arXiv preprint arXiv:1402.3722, 2014.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119, 2013.
Xin Rong. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738, 2014.