Matthew Hurst (Latest Post) and Fernando Pereira (Latest Post) have been having an interesting discussion of the hype surrounding Powerset. I call it hype because of the New York Times’s breathless coverage (not worth reading) and the lack of detail in Powerset’s own description:
Powerset is a Silicon Valley company building a transformative consumer search engine based on natural language processing.
Our unique innovations in search are rooted in breakthrough technologies that take advantage of the structure and nuances of natural language. Using these advanced techniques, Powerset is building a large-scale search engine that breaks the confines of keyword search.
By making search more natural and intuitive, Powerset is fundamentally changing how we search the web, and delivering higher quality results.
Powerset’s search engine is currently under development. Please check back in the near future for more information about our technology and for signing up for our alpha.
I call what’s going on “hype” because I can’t tell what’s being proposed. The last thing I remember being hyped to this extent without being described was the Segway scooter.
So my question is: What does this “transformative consumer search engine” look like from a user’s perspective?
I’ll leave you with a pointer to my two comments on Matthew Hurst’s latest blog entry, NLP and Search: Free Your Mind. Matthew was suggesting that we can leverage redundancy of information and present answers, not documents (or passages). I’ll repeat my entries here:
I couldn’t agree more about doing confidence-based collective data extraction. Those who’ve participated in the DARPA bakeoffs over time have been doing just what Matthew suggests.
That’s why we built LingPipe to handle confidence-based extraction for tagging, chunking and classification. But we prefer to use an expectation-based form of inference rather than winner-take-all, which only makes sense for bakeoff entries. Most social networking and clustering software works just fine with expectations.
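To make the distinction concrete, here is a minimal sketch of the two aggregation styles. The mention data and confidence scores are entirely made up for illustration, and this is plain Python, not LingPipe’s actual (Java) API:

```python
from collections import defaultdict

# Hypothetical extractor output: for each mention in the corpus,
# a list of (candidate relation, confidence) analyses.
mentions = [
    [("born-in(Anderson, Fife)", 0.6), ("born-in(Anderson, Scotland)", 0.4)],
    [("born-in(Anderson, Scotland)", 0.7), ("born-in(Anderson, Croydon)", 0.3)],
    [("born-in(Anderson, Fife)", 0.5), ("born-in(Anderson, Scotland)", 0.5)],
]

def winner_take_all(mentions):
    """Count only each mention's single best analysis (bakeoff-style)."""
    counts = defaultdict(float)
    for candidates in mentions:
        best, _ = max(candidates, key=lambda c: c[1])
        counts[best] += 1.0
    return dict(counts)

def expectation_based(mentions):
    """Accumulate the full confidence distribution from every mention,
    so low-confidence analyses still contribute to the totals."""
    totals = defaultdict(float)
    for candidates in mentions:
        for relation, conf in candidates:
            totals[relation] += conf
    return dict(totals)
```

Note that on this toy data the two methods disagree: winner-take-all favors the Fife relation (it wins or ties within individual mentions), while summing expectations favors Scotland, which accumulates mass across all three mentions.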
In real life, at least for intelligence analysts and biologists, recall is the game. The reason is the long tail. Most relations are not mentioned dozens of times in various forms. In other words, you may have to deal with the equivalent of “youtube data mining”, because language is vague.
But it’s not just recall of relations. Intelligence analysts need to link back to sources to generate their reports. Biologists won’t trust current relation extraction programs to guide their research because even reasoning cumulatively, they’re not precise enough. Too much depends on context outside of the phrase, sentence or even paragraph/document in which a fact is stated.
Could anyone help someone like me with limited imagination? I’m trying to envision what the next generation of NLP-enabled search is going to look like from a user’s perspective.
Let’s say my wife and I are trying to settle a bet that arose over dinner, such as whether the lead singer of Jethro Tull was Scottish or English. I actually went to Google and used their newish pattern-based search:
Google QUERY: ian anderson was born in *
Here are the first few “answers”:
1. Paiseley, Scotland
3. Scotland in 1947
4. Fife in 1947
6. Croydon before England won the world cup
7. Williston, ND
8. Nottingham on June 29, 1948
Now which of those is the “right” answer? Well, first it depends on which Ian Anderson we’re talking about. There are 10 with their own Wikipedia pages.
The voting thing doesn’t help much here unless you happen to know that Dunfermline is in West Fife, which is in Scotland, which is in the United Kingdom. I’m confused about answer 1, because it’s “Paiseley”, which is spelled wrong. Wikipedia claims the answer is Dunfermline. Clearly the source is important in providing answers.
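Here is a toy sketch of what it would take for voting to work here: propagate each extracted answer up a geographic containment hierarchy before tallying, so that “Fife” and “Scotland” reinforce each other rather than splitting the vote. The containment table is hand-built for this one example (with spellings normalized), not a real gazetteer:

```python
# Toy containment table: place -> immediately containing place.
# Hand-built for this example; a real system would need a gazetteer.
contains = {
    "Paisley": "Scotland",
    "Dunfermline": "Fife",
    "Fife": "Scotland",
    "Scotland": "United Kingdom",
    "Croydon": "England",
    "Nottingham": "England",
    "England": "United Kingdom",
}

def ancestors(place):
    """Expand a place to itself plus everything that contains it."""
    chain = [place]
    while place in contains:
        place = contains[place]
        chain.append(place)
    return chain

def vote(extracted_places):
    """Tally one vote at every level of the hierarchy for each answer."""
    tally = {}
    for place in extracted_places:
        for p in ancestors(place):
            tally[p] = tally.get(p, 0) + 1
    return tally
```

Feeding in the UK-flavored answers above, e.g. `vote(["Paisley", "Scotland", "Fife", "Croydon"])`, gives Scotland three votes where a flat tally would give it one. But this only pushes the problem back a level: the containment facts themselves have to come from somewhere, and none of it sorts out which Ian Anderson each answer is about.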