The Basics of Distributional Semantics

What is the meaning of meaning? I know this question seems borderline smartass-territory, but it has been a major unanswered question in the study of language for a few millennia now. It was already being debated in Plato's Cratylus. There's a rather famous book in linguistics called (no jokes) "The Meaning of Meaning", by Ogden and Richards, written in 1923, but it's not like this question has been solved in the last century either.

Perhaps, if we can't say what meaning is, we should say what meaning does instead. And here's an idea that many have worked with: take an infinite spreadsheet, and for every sentence you encounter, record every object and every action that sentence implies. So one row of your spreadsheet will read "In the sentence I bashed Plato's head with my copy of the Cratylus, there is one "Plato", one "I", one book, one act of bashing someone's head, etc."

This looks like a very convoluted way of paraphrasing our original sentence, but it does have some advantages: now you know exactly what's in that sentence, and which other sentences have similar bits of meaning. You can decide whether two sentences are paraphrases, i.e., have the same meaning, by checking whether their spreadsheet rows are equivalent. You have atoms of meaning, and the power of the atom is formidable (or so I've heard).
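If it helps, you can picture a tiny, very finite version of this spreadsheet in code. Here's a minimal sketch in Python, where each row is a hand-labelled multiset of meaning atoms (the sentences and the choice of atoms are entirely invented for the example), and two sentences count as paraphrases exactly when their rows match:

```python
from collections import Counter

# A toy, hand-labelled "spreadsheet": one row of meaning atoms per sentence.
rows = {
    "I bashed Plato's head with my copy of the Cratylus":
        Counter({"I": 1, "Plato": 1, "book": 1, "head-bashing": 1}),
    "Using my copy of the Cratylus, I bashed Plato's head":
        Counter({"I": 1, "Plato": 1, "book": 1, "head-bashing": 1}),
    "Plato wrote the Cratylus":
        Counter({"Plato": 1, "book": 1, "writing": 1}),
}

def are_paraphrases(s1: str, s2: str) -> bool:
    """Under this (naive) theory, same atoms means same meaning."""
    return rows[s1] == rows[s2]

print(are_paraphrases("I bashed Plato's head with my copy of the Cratylus",
                      "Using my copy of the Cratylus, I bashed Plato's head"))  # True
```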

There are shortcomings to our nuclear spreadsheet. To start with, how should we decide what to include? To go back to our previous example: should we include the fact that it is my copy of the Cratylus, and add some sort of "owning" situation to our spreadsheet? And what if it's only the copy I borrowed from the local library? Must we have two different rows on our spreadsheet, then? Here's another problem: perhaps you are convinced that "the Cratylus" is a mid-70s rock album, in which case your spreadsheet and mine will not agree. Also, how long would it take to catalogue every last one of the infinite number of sentences we can come up with?

And lastly, what have we explained of the meaning of our atoms? What would such a spreadsheet tell us about what a book is? Perhaps there are some things we could say: in principle, our spreadsheet would contain a row indicating that "books" have those things called "pages", and another stating that said "pages" are "written on", etc. We could certainly chase these references through our spreadsheet indefinitely, but each row would only send us off to yet more rows. Not that we would glean nothing from this journey through our spreadsheet: it does allow us to link words to one another, depending on how they are related, although in a very inconvenient way.
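To make that circularity concrete, here's a quick sketch of what chasing references through the spreadsheet looks like (the mini-definitions are made up for the occasion): every atom only ever points to more atoms, and the walk never bottoms out.

```python
# Toy illustration of the circularity: each "atom" is only ever
# described in terms of other atoms.
atoms = {
    "book": ["pages", "written on"],
    "pages": ["paper", "book"],
    "written on": ["words", "pages"],
    "paper": ["pages"],
    "words": ["written on"],
}

def chase(atom: str, steps: int) -> None:
    """Follow references from atom to atom; we never hit bottom."""
    for _ in range(steps):
        print(atom, "->", atoms[atom])
        atom = atoms[atom][0]  # keep following the first reference

chase("book", 5)  # book -> pages -> paper -> pages -> paper
```

Still, the walk does tell us which words are related to which; hold that thought.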

Out with the spreadsheet, then, in with Zellig Harris. In 1954, he wrote a few pages on how you could use distributional properties to describe language and its structure. "Distribution" is just our scientific term for the set of contexts associated with a linguistic item. For instance, we speak of "complementary distributions" of phones whenever their contexts never overlap, such as (to a first approximation) the two "th" sounds in English. English speakers generally know whether a word with a "th" in it takes the [ð] sound of "this" or the [θ] sound of "thin", and barring a handful of marginal pairs like "thigh" and "thy", switching [ð] for [θ] doesn't change the meaning.

Distribution doesn't have to be specifically phonetic: we can apply it to any linguistic item. In that same article, Harris also explicitly discusses how distributional structure is linked to meaning, and more precisely, how the two can be expected to correlate. Here's the gist: suppose you just said the word "dog". It's therefore likely that you were talking about dogs; and from that, I, cunning linguist, can deduce that words related to dogs are going to appear in your speech: you are more likely to use words such as "canine", "barking" or "tail" than "pope", "quantum" or "asinine".

In short, the meaning of "dog" is correlated with the words that appear around it; that is to say, word meaning should correlate with word distribution. That's what we call the distributional hypothesis, it's what distributional semantics is founded on, and it's what I study in my thesis. This is in fact very similar to the idea we hit upon earlier by cycling through our spreadsheet, except that it lets us skip the whole hand-labelling process altogether. It does hinge on us being able to properly characterize the distribution of a word, but that's something the NLP community knows how to do. More on that in later installments, I guess?
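Just to give a rough taste of what "characterizing the distribution of a word" can look like, here's a minimal sketch over a made-up five-sentence corpus; real models use billions of words and cleverer weighting, and none of this is the specific machinery of my thesis. The recipe: count which words occur near which, then compare the resulting count vectors.

```python
from collections import Counter
from math import sqrt

# A made-up toy corpus; real models are trained on billions of words.
corpus = [
    "the dog barked at the cat",
    "the dog wagged its tail",
    "the cat chased its tail",
    "the cat meowed at the dog",
    "the pope blessed the quantum physicist",
]

def context_counts(target: str, window: int = 2) -> Counter:
    """Count the words appearing within `window` positions of `target`."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                counts.update(tokens[max(0, i - window):i])  # left context
                counts.update(tokens[i + 1:i + 1 + window])  # right context
    return counts

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norms = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norms if norms else 0.0

# The distributional hypothesis bets that similar distributions
# go with similar meanings:
print(cosine(context_counts("dog"), context_counts("cat")))   # 0.875
print(cosine(context_counts("dog"), context_counts("pope")))  # ~0.67
```

On five sentences the numbers mean very little, but note what's missing: no hand-labelling anywhere, just counting.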

There you have it! My thesis is about comparing distributional semantics models with dictionary definitions. I think the latter are less mysterious than the former, which is why I'm starting with those. Let me also stress a few key points: distributional semantics is by no means the only semantic theory we use in NLP and CL. It has its weaknesses, and we'll come to that (spoiler: that's where the squirrels and octopuses come in!). There are also many, many models that have been dubbed "distributional semantics", some of which are otherwise completely unrelated to one another.

Anyways, talk to you next week! I'm off to the library: I have to explain how I got their copy of the Cratylus so... dirty.