Some misconceptions about dictionaries

And now, something completely different: a six-hundred-word lecture on the dictionary.

Half of my thesis (the better half, really) is about distributional semantics vectors, which I've introduced over the past installments. The other thing I study is dictionary definitions. Let me start today by rectifying some misconceptions you may have when it comes to dictionaries.

To start with, let's get rid of the myth that dictionaries were invented during the Enlightenment. Dictionaries are very old. The Harrises and Diderots of 18th-century Europe were building upon a long and fairly well-established tradition. To give an idea of how old we're talking: you'll often hear that the oldest surviving dictionary is the Chinese Erya (爾雅), which dates back anywhere from the 6th century BC to the 3rd century BC. Most scholars will agree that this is the earliest document that we can recognize as a proper dictionary. But that's not all: we also have glossaries written in cuneiform from the early 2nd millennium BC, and lexica such as the Ἄτακτοι γλῶσσαι (literally, "Disorderly words") of Philitas of Cos, from the 4th century BC, which listed rare, archaic, dialectal or technical words.

Another notion worth clarifying is that of prescriptive vs. descriptive approaches to language. That distinction is kind of 101 linguistics class material, but let's quickly review it. Grammars and the like tell you "the right way" to use language; they prescribe how words should be used. Linguistics, on the other hand, makes no claim about how you should use words: linguists only describe language. As with most 101 classes, it's a good first introduction, but it oversimplifies some aspects of the debate. Sociolinguists, for instance, will readily distinguish usages based on how close to the "social best practice" they are; i.e., whether they fit the norm.

Going back to dictionaries, lexicographers (at least in modern times) are very much invested in using a descriptive approach: they document what usages exist "in the wild" when writing dictionary definitions. Likewise, there is a rather long-standing tradition of linguists working with, or as, lexicographers. John Rupert Firth, who was previously mentioned in this series of posts, worked on the Oxford English Dictionary and discussed at length the proper methodology for compiling definitions in "Linguistic Analysis as a Study of Meaning" (1952). Natalia Shvedova both took over the job of compiling definitions for the Russian Ozhegov dictionary after the death of the original author, Sergei Ozhegov, and wrote multiple monographs and essays on Russian syntax. A whole branch of linguistics, Meaning-Text theory, founded by Igor Mel'čuk, is closely tied to the project of producing "Explanatory Combinatorial Dictionaries".

The last point worth clarifying is that lexicographers are very much not opposed to bringing new technologies into the art of writing definitions. Lexicographers frequently use large corpora of texts to see whether their definitions describe actual word usage: this is made possible by the existence of technology to process and explore these large corpora, such as concordancers, computer programs that tabulate and display the contexts in which a given input word is attested. One concrete example of such software would be Sketch Engine. Another domain where dictionaries make use of modern technology is data storage. XML, a very popular electronic document format widely used today, was developed with the Oxford English Dictionary in mind.
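To make the idea of a concordancer concrete, here is a toy sketch in Python of a keyword-in-context display. It is only an illustration under simple assumptions: the function name `concordance`, the `width` parameter and the three-sentence corpus are all made up for this post, and real tools such as Sketch Engine work on far larger, linguistically annotated corpora and certainly do not boil down to this.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of `keyword`.

    Hypothetical helper written for illustration; not the API of any
    real concordancer.
    """
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        # Up to `width` characters of context on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that all hits line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines


# Made-up toy corpus: three attestations of "bank" in different senses.
corpus = (
    "The bank raised its interest rates. "
    "We had a picnic on the bank of the river. "
    "She works at a bank downtown."
)

for line in concordance(corpus, "bank"):
    print(line)
```

Each printed line shows one attestation of the word, with the hits aligned in a column so that the surrounding contexts are easy to scan and compare. This is the kind of view a lexicographer can use to check whether a definition covers the usages actually found in the corpus.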

To sum up: dictionaries are a very well-established type of lexical resource. Linguists frequently work with dictionaries, and lexicographers are very much open to innovation. It should therefore come as no surprise that there is a large body of work in NLP centered around using dictionaries. The plan is to look at these works next week!