In response to a recent thread on the LingPipe mailing list about using Wikipedia for a word-sense disambiguation corpus, I’d like to point out Rada Mihalcea’s 2007 NAACL paper:
- Mihalcea, Rada. 2007. Using Wikipedia for Automatic Word Sense Disambiguation. NAACL.
The idea’s very simple. A Wikipedia page represents a single sense of an ambiguous phrase, and both the senses and the training text can be extracted from Wikipedia’s link structure.
Rada maintains the SenseEval web pages and serves on the advisory committee for the bakeoffs, which have pretty much defined the field recently.
Leaning on Wikitext
The wikitext markup used for Wikipedia pages contains explicit disambiguation and mention-level information. Consider these examples from Rada’s paper:
- In 1834, Sumner was admitted to the [[bar (law)|bar]] at the age of twenty three, and entered private practice in Boston.
- It is danced in 3/4 time (like most waltzes), with the couple turning approx. 180 degrees every [[bar (music)|bar]].
The double brackets “[[ ]]” enclose links. The item to the left of the vertical bar is the disambiguated reference, e.g. “bar (law)” or “bar (music)”; the item to the right is the ambiguous word as it appears in the text, e.g. “bar”. When the page is rendered, only the text to the right of the vertical bar is displayed.
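To show how little machinery this takes, piped links like the ones above can be pulled out with a regular expression. This is just a sketch with made-up class and method names; real wikitext also has unpiped links, nested templates, and other wrinkles it ignores:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WikiLinks {
    // Matches the simple piped form [[target|surface]] only.
    private static final Pattern LINK =
        Pattern.compile("\\[\\[([^\\]|]+)\\|([^\\]]+)\\]\\]");

    // Returns {target, surface} pairs, e.g. {"bar (law)", "bar"}.
    public static List<String[]> extract(String wikitext) {
        List<String[]> links = new ArrayList<>();
        Matcher m = LINK.matcher(wikitext);
        while (m.find())
            links.add(new String[] { m.group(1), m.group(2) });
        return links;
    }
}
```

Each extracted pair gives you a training label (the target page) attached to an ambiguous surface word, which is exactly the supervision Rada exploits.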
The scale’s impressive: there were 1108 examples of “bar” used as link text, spread across 40 different senses, and that was a few years ago.
Rada extracted paragraphs around the mentions for training data. These are clearly marked in the Wikitext, so nothing fancy needs to be done for paragraph extraction (beyond the pain of Wikitext parsing itself).
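Since wikitext separates paragraphs with blank lines, the extraction can be as simple as splitting on blank lines and keeping the blocks that contain a piped link with the right surface text. Again a sketch with made-up names; real extraction has to cope with templates, headings, and markup inside paragraphs:

```java
import java.util.ArrayList;
import java.util.List;

public class ParagraphExtractor {
    // Keep the wikitext paragraphs containing a piped link whose surface
    // text is the ambiguous word, e.g. "...|bar]]" for the word "bar".
    public static List<String> paragraphsMentioning(String wikitext, String word) {
        List<String> result = new ArrayList<>();
        for (String para : wikitext.split("\\n\\s*\\n"))
            if (para.contains("|" + word + "]]"))
                result.add(para);
        return result;
    }
}
```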
What About Disambiguation Pages?
Wikipedia is full of disambiguation pages, which Wikipedia itself describes as:
Disambiguation in Wikipedia is the process of resolving the conflicts that occur when articles about two or more different topics could have the same “natural” page title. This category contains disambiguation pages: non-article pages containing links to other Wikipedia articles and disambiguation pages.
For instance, the Wikipedia disambiguation page for “bar” lists an astounding number of meanings for the word. There are 11 popular usages (e.g. a bar serving alcohol, a bar of gold), 5 math and science usages (e.g. the unit of pressure, a bar chart), 4 uses in law, and many other usages, some of which appear under other categories.
I’d have thought the pages linked to would be useful for training. But Rada didn’t use them, citing two reasons:
- a mismatch between disambiguation pages and actual usage, causing precision problems from links on a disambiguation page that are never referenced with the word, and recall problems from usages that don’t show up on the disambiguation page, and
- inconsistencies in the naming of disambiguation pages.
The first issue is still problematic, though the recall side of it seems to be improving. The second issue is also still problematic: although the disambiguation category helps find such pages, Wikipedia’s formatting is still more like natural language than a database.
As you might expect, the approach worked pretty well. Rada evaluated a naive Bayes classifier with features including contextual words and their part-of-speech tags.
Coding it in LingPipe
This is something you could easily replicate in LingPipe. You could use one of the naive Bayes classifiers, or any of the other classifiers we consider in our word-sense disambiguation tutorial, including K-nearest neighbors and character- or token-based language model classifiers.
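To make the classification step concrete, here’s a toy multinomial naive Bayes classifier with add-one smoothing over bag-of-words features. To be clear, this is not LingPipe’s implementation, just a self-contained illustration of the model Rada evaluated, with the class name and feature set made up for the example:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ToyNaiveBayes {
    private final Map<String, Integer> catCounts = new HashMap<>();
    private final Map<String, Map<String, Integer>> tokenCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    // One training instance: a sense label and the tokens around the mention.
    public void train(String cat, String[] tokens) {
        catCounts.merge(cat, 1, Integer::sum);
        Map<String, Integer> counts =
            tokenCounts.computeIfAbsent(cat, c -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            vocab.add(t);
        }
    }

    // Pick the sense maximizing log p(cat) + sum log p(token|cat),
    // with add-one (Laplace) smoothing on the token probabilities.
    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        int totalDocs = catCounts.values().stream().mapToInt(Integer::intValue).sum();
        for (String cat : catCounts.keySet()) {
            double score = Math.log(catCounts.get(cat) / (double) totalDocs);
            Map<String, Integer> counts = tokenCounts.get(cat);
            int catTotal = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String t : tokens)
                score += Math.log((counts.getOrDefault(t, 0) + 1.0)
                                  / (catTotal + vocab.size()));
            if (score > bestScore) {
                bestScore = score;
                best = cat;
            }
        }
        return best;
    }
}
```

In practice you’d plug in LingPipe’s tokenizers and classifiers from the tutorial rather than rolling your own, but the probabilistic core is no more than this.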
If you need part-of-speech tags, check out our part of speech tagging tutorial.
With naive Bayes and easily extracted unsupervised data, you could also use EM semi-supervised training, as described in our EM tutorial.
You could also use a discriminative linear classifier, as described in our logistic regression tutorial.
The hard part’s all the data munging for Wikipedia. If you build a parser in Java and want to share, I’d be happy to link to it from here or host it in our development sandbox.