Slot Machines

Last time on this series of blog posts:

Most if not all distributional semantics models can be framed as mathematical estimators of distributional substitutability—to put it plainly, all the vectors we get come from machines that learn how to fill in word slots.

I'd suggest going back to the previous posts on how to derive word vectors (in particular #5, #9 and #10) since all of what we discussed then will be relevant now.

If all of it is fresh in your mind, you should be able to make the connection with the previous post easily. I'll focus on BERT here for simplicity, but what I'm about to discuss also applies to other models; it's just that BERT happens to be more convenient for our purposes.

As you recall, BERT is trained to fill in the blanks, so if I pass it a sentence with a blank token, such as:

The most ____ animal is the underfed mallard

then it has to come up with plausible fillers for this blank slot—say violent, vicious, dangerous... As you can see, there's a whole bunch of words you could put in that slot. In fact, valid slot-filling words correspond quite precisely to a set of words in a paradigmatic relation, as we saw in the previous post. Or, to use Harris' jargon, we'd say that valid words here should be distributionally substitutable.
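
If you want to play with this yourself, here's a minimal sketch using the Hugging Face transformers library (the model name is just an illustrative choice; any masked language model would do):

```python
# A minimal sketch of probing a masked language model for slot fillers,
# using the Hugging Face transformers library.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT marks the blank slot with its special [MASK] token.
for prediction in fill_mask("The most [MASK] animal is the underfed mallard."):
    print(f"{prediction['token_str']:>12} {prediction['score']:.4f}")
```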

There's one more thing we get out of this: what we actually do when we train a BERT model is task it with maximizing the probability of the masked word in a given context. In other words, we train the model so that the following equation:

$$P(X \mid \text{The most \_\_\_\_ animal is the underfed mallard})$$

is maximized when X corresponds to a word we actually observe in this context, such as "violent" or "fearsome". The model's guess as to what should fill in this blank corresponds to whichever word X maximizes the equation above. In jargonese, we say that models are trained to maximize the probability of the masked word conditioned on its context of occurrence.
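
To make that concrete, here's a toy version of the training signal, again with the transformers library. Real pretraining masks words at scale over a huge corpus, but the loss for a single example boils down to this (the masked position below is hard-coded for this particular sentence, so treat it as illustrative):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The most violent animal is the underfed mallard.",
                   return_tensors="pt")

# Keep the true token ids as labels, then hide the word we want predicted.
labels = inputs["input_ids"].clone()
masked_position = 3  # position of "violent" after tokenization (illustrative)
inputs["input_ids"][0, masked_position] = tokenizer.mask_token_id
# Only score the masked position; -100 tells the loss to ignore the rest.
labels[inputs["input_ids"] != tokenizer.mask_token_id] = -100

# The loss is the negative log-probability of the observed word given its
# context; training pushes it down, i.e. pushes P(X | context) up.
loss = model(**inputs, labels=labels).loss
print(loss.item())
```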

This little mathematical detail has two consequences. First, we don't actually get a single answer from a model such as BERT: rather, we can probe the model for classes of words. Second, the model doesn't actually produce a yes/no answer: instead, it assigns probability scores to all possible words. Let's connect these two points back to our previous observations on distributional substitutability: if we expect a class of words to fit equally well in a given slot—if any of the words violent, vicious, dangerous, fearsome could fit—then the model should assign roughly the same probabilities to all of these words. Conversely, the model should also assign lower probabilities to less likely words (such as cuddly or eloquent in the example above) and zero probability to syntagmatically forbidden words (say the, of or window).
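
We can probe exactly this. Here's a sketch that pulls out the full probability distribution BERT assigns at the masked position and looks up our candidate words in it (it assumes each candidate is a single token in BERT's vocabulary, which isn't guaranteed in general):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

inputs = tokenizer("The most [MASK] animal is the underfed mallard.",
                   return_tensors="pt")
mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax turns the raw scores at the masked position into a probability
# distribution over the entire vocabulary.
probs = logits[0, mask_index].softmax(dim=-1)

# Assumes each candidate word is a single token in BERT's vocabulary.
for word in ["violent", "vicious", "dangerous", "cuddly", "the"]:
    print(f"{word:>10} {probs[tokenizer.convert_tokens_to_ids(word)].item():.6f}")
```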

In other words: distributional models can be thought of as models that learn how likely it is that a word is substitutable in a given context. We can come up with a fancy equation to celebrate our finding (I insist):

$$P(t \mid c)$$

or the probability assigned to a target word t given some context c. From which we can also give an equation for distributional substitutability (because who doesn't like equations, am I right?):

$$\forall c,\quad P(x \mid c) = P(y \mid c)$$

We'd say that x and y are always substitutable if, for all contexts c we can find, the model assigns them equal probability.
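
We obviously can't enumerate all contexts, but we can fake it over a handful of them. Here's a toy check (the two contexts below are just examples I made up) of whether violent and vicious get comparable probabilities across slots:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def slot_probability(context: str, word: str) -> float:
    """P(word | context), where the slot is marked with [MASK].
    Assumes `word` is a single token in BERT's vocabulary."""
    inputs = tokenizer(context, return_tensors="pt")
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        probs = model(**inputs).logits[0, mask_index].softmax(dim=-1)
    return probs[tokenizer.convert_tokens_to_ids(word)].item()

contexts = [
    "The most [MASK] animal is the underfed mallard.",
    "Beware of the dog, it is very [MASK].",
]
for c in contexts:
    p_x = slot_probability(c, "violent")
    p_y = slot_probability(c, "vicious")
    print(f"{c}\n  P(violent | c) = {p_x:.6f}   P(vicious | c) = {p_y:.6f}")
```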

Why is it a big deal to have an equation for all of this stuff? Well, ask yourself the same question with physics: why is it a big deal that we have an equation for gravity? It lets us make predictions that we can then compare to real data, and see how our theory and our models fare. We have some idea that word embedding models are estimators of distributional substitutability: now we'll have to test it.

That will have to wait until the next post. Here's a sneak peek of what's to come:

Squids. And squirrels.