Archive for the ‘Breck’s Blog’ Category

A day teaching at the Johns Hopkins Summer Institute

August 17, 2006

I went down to Johns Hopkins to teach information extraction for a day at the NAACL summer school. There were 28 students, with an hour-long
morning presentation and a 3.5 hour lab in the afternoon. The only constraint was that I return them in good condition and preferably a bit more learned in the ways of LingPipe and information extraction. The students ranged in experience from undergraduates to senior graduate students.

I decided that a good lab project would be for them to post-process the results returned by a search engine. So I loaded the
excellent open source search engine Lucene with 1300 FBIS articles from back in the TIDES days and set the problem of helping intelligence analysts sort through a day's worth of fresh intelligence about Iraq. Their task was to find better ways of presenting returned results. In the morning presentation I covered the basic input/output setup I was giving them and the source for using LingPipe to do sentence ranking with language models and extraction of named entities up to the level of coreference.

After the morning presentation, they broke up into six groups averaging four people each and hatched a plan over lunch. At 1:30 we started the lab, and Bob showed up to lend a helping hand. All the groups briefed Bob or me on what they were doing, and we helped them get started. Lots of interesting ideas were floated, and a steady hum built in the lab as we got working.

Once they got going I slipped out and procured a bottle of Moet Champagne (the real stuff, none of this California malarkey) as first prize. Bob noted that I was perhaps as interested in teaching them about quality wine as linguistics….

The whole session was a blur, but in the end we saw interesting applications using entity detection for node/link visualization, a few efforts linking locations to Google Maps (not very detailed in Iraq), and an effort to recognize sentences of future intent using tense.

We ended with votes after brief presentations, and a group of students slipped out in search of an ice bucket.

Lessons learned: 3.5 hours is not much time, and perhaps we should have structured things more; the project would have been much better set as a week-long effort. It is really fun to work with smart, motivated students. The task's limitations had more to do with project management than coding skills.

Thanks to Roy Tromble, our TA, and to Jason Eisner and David Yarowsky, who invited us.


A Nobel in Computational Linguistics?

July 31, 2006

How is that? Amongst the research geeks we see a BIG opportunity in
squeezing more information out of written human knowledge. Generally
that means the research literature, but it can extend to databases and
other “encodings” of information about the world. The squeezing
involves the transition from word-based search to concept-based search.

It’s a big deal and one that you personally care about if you think
you will be needing the serious attention of a doctor 15 years from
now. Making the leap will uncover a new world of therapies, treatments
and scientific understanding–at least that is the idea and it is well
worth exploring. It is a cure-for-cancer level achievement. As they
say at the Indy 500: “Researchers, start your graduate students”.

A few details, but I am going to keep this sketchy. Mr. Search Engine
does a pretty good job finding words in documents, but words are a
long way from finding every document in MEDLINE that mentions the
gene id 12, official name Serpina3. Why?

Not enough found: The concept for Serpina3 is expressed in
documents as ‘ACT’, ‘GIG24’, and ‘AACT’, amongst others, and
Mr. Search Engine misses these entirely. Attempts to help
Mr. Search Engine have pretty much failed up to now.

Too much found: The alias ‘ACT’ is highly ambiguous amongst
genes, and it collides with the word ‘act’ in more common
use. It is like searching for John Smith on the web;
Mr. Search Engine doesn’t even get the fact that there are
lots of different things in the world mentioned the same way.
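To make the two failure modes concrete, here is a minimal sketch of an alias-to-concept dictionary. The gene id (12) and the aliases ‘ACT’, ‘GIG24’, and ‘AACT’ for Serpina3 come from the post; the second gene id is a made-up stand-in for the other genes that share the ‘ACT’ alias, and the whole lookup table is illustrative, not a real resource.

```java
import java.util.*;

// Sketch of concept indexing: map surface aliases to concept ids so a
// query for one alias can retrieve documents that use another. The
// resulting ambiguity sets are exactly what a real system would have to
// disambiguate from context.
public class ConceptIndexSketch {

    // alias (lowercased) -> set of concept ids expressible that way
    static final Map<String, Set<String>> ALIASES = new HashMap<>();
    static {
        // "Not enough found": Serpina3 (gene id 12) surfaces under many names.
        for (String alias : new String[] { "serpina3", "act", "gig24", "aact" })
            ALIASES.computeIfAbsent(alias, k -> new HashSet<>()).add("gene:12");
        // "Too much found": 'ACT' is also an alias of other genes
        // (illustrative id here) and of the ordinary English word.
        ALIASES.computeIfAbsent("act", k -> new HashSet<>()).add("gene:999");
    }

    // Resolve a token to candidate concepts; more than one element means
    // the alias is ambiguous.
    static Set<String> resolve(String token) {
        return ALIASES.getOrDefault(token.toLowerCase(Locale.ROOT),
                                    Collections.emptySet());
    }

    public static void main(String[] args) {
        System.out.println(resolve("GIG24")); // keyword search misses this alias of gene 12
        System.out.println(resolve("ACT"));   // ambiguous between gene 12 and others
    }
}
```

A keyword index only ever sees the surface strings; the point of the dictionary is that both lookups above route through the same concept id for Serpina3.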

What is the payoff? Once you get concept indexing sorted out, then you
can start playing games very effectively with some old ideas floated
by Don Swanson in ’88, originally about migraines and dietary magnesium*. The
approach tries to find disease A with underlying causes B, and
then find treatments C which apply to B but are not yet known to apply
to disease A.
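The A-B-C scheme above is mechanical enough to sketch in a few lines. This is a toy version under obvious assumptions: the link tables would really come from mining something like MEDLINE for co-occurrences, and the terms below (other than Swanson's migraine/magnesium pair) are illustrative.

```java
import java.util.*;

// Minimal sketch of Swanson's ABC literature-based discovery: disease A
// is linked to underlying causes B, treatments C are linked to those B's,
// and any C not already linked directly to A is a discovery candidate.
public class SwansonABC {

    static Set<String> propose(String diseaseA,
                               Map<String, Set<String>> diseaseToCauses,
                               Map<String, Set<String>> causeToTreatments,
                               Map<String, Set<String>> knownTreatments) {
        Set<String> candidates = new TreeSet<>();
        for (String b : diseaseToCauses.getOrDefault(diseaseA, Set.of()))
            candidates.addAll(causeToTreatments.getOrDefault(b, Set.of()));
        // Drop treatments the literature already connects to A directly.
        candidates.removeAll(knownTreatments.getOrDefault(diseaseA, Set.of()));
        return candidates;
    }

    public static void main(String[] args) {
        // Swanson's example: migraine (A) <- magnesium deficiency (B) <-
        // dietary magnesium (C), with the direct A-C link unnoticed at the time.
        Map<String, Set<String>> ab = Map.of("migraine",
                Set.of("magnesium deficiency", "vascular spasm"));
        Map<String, Set<String>> bc = Map.of(
                "magnesium deficiency", Set.of("dietary magnesium"),
                "vascular spasm", Set.of("calcium channel blockers"));
        Map<String, Set<String>> known = Map.of("migraine",
                Set.of("calcium channel blockers"));
        System.out.println(propose("migraine", ab, bc, known));
        // -> [dietary magnesium]
    }
}
```

The weakness the post points at lives in how the tables get built: populate them by keyword matching and the B column fractures across aliases; populate them by concept and the joins start to work.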

Nice idea–the problem is that it is pretty seriously limited if the
A, B and C’s are limited to keyword lookup. Make those concept lookups
and Dr. Swanson’s approach will gain some serious traction. Once that
happens I see Dr. Swanson and the folks who solve the concept indexing
problem enjoying some quality time in Stockholm. I hope they invite me
along for the celebration dinner.


*-There are tons of other interesting ideas that would gain traction with
concept search as well. Swanson, however, is the first person I know of who
actually did something with it. Cite:

Swanson, D. R. (1988). Migraine and magnesium: eleven neglected
connections. Perspectives in Biology and Medicine, 31: 526–557.

Presentation at NY JavaSIG

May 30, 2006

Last Tuesday (May 23) I gave a 40-minute how-to talk on uses of linguistics in the “application stack,” featuring source-level details of how to get LingPipe jumping through hoops like “did you mean?” style query spell checking, sentence classification, and finally sentence detection and entity detection up to within-document coreference.
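For a flavor of the simplest hoop, here is a toy “did you mean?” sketch. LingPipe's actual spell checker is based on noisy-channel language models; the plain Levenshtein lookup below is just an illustrative stand-in, and the vocabulary is made up.

```java
import java.util.*;

// Toy "did you mean?" suggester: return the in-vocabulary term with the
// smallest edit distance to the query, or the query itself if nothing is
// within two edits.
public class DidYouMean {

    // Classic dynamic-programming Levenshtein distance.
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++)
            for (int j = 1; j <= b.length(); j++)
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                    d[i - 1][j - 1]
                        + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1));
        return d[a.length()][b.length()];
    }

    static String suggest(String query, Collection<String> vocabulary) {
        String best = query;
        int bestDist = 3; // only suggest within two edits
        for (String term : vocabulary) {
            int dist = editDistance(query, term);
            if (dist < bestDist) { bestDist = dist; best = term; }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> vocab = List.of("classification", "coreference", "linguistics");
        System.out.println(suggest("classificaiton", vocab)); // -> classification
    }
}
```

A noisy-channel model improves on this by weighting edits by how likely a typist is to make them and candidates by how likely they are as queries, rather than treating all edits equally.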

Slides at NYJava Sig Past Presentations

How did it go? Otis came up to me gesturing the universal sign of head-about-to-explode (hands pressing the sides of the skull in) and stuttered that half of the talk would have been more than enough. One person hated it in the NY SIG forum (“Should anyone waste their time listening to things like Linguistics+Software?”), but that prompted a bunch of useful responses.

Some quotes:

“Yes, I really enjoyed the speakers….While the Ling Pipe speaker was, in my mind, a quirky academic, I wouldn’t expect much less from a guy that has spent many, many hours thinking about how language is constructed and used and subsequently trying to encode that knowledge in software.”

Computational linguistics has definitely rotted my mind; quantifier scope ambiguities really have warped my sense of reality.

“I thought the linguistics toolkit – LingPipe – was very interesting. The topic is quite academic but the use of linguistics is growing in everyday applications with the increase of unstructured or semi-structured data in the form of email, chats and all types of electronic documents. However, you are right that this is definitely not a mainstream concern in the world of java these days.”

That is pretty much the problem we are trying to solve: making linguistics more of a mainstream concern, part of a competent developer's toolkit the way a DB would be.

“Personally I thought the LingPipe talk was fascinating. Imo, they need another layer over what Breck was describing, which was a sort of ‘system call’ layer. A set of higher level components for various use cases would go a long way. Seems like a very powerful library to extract semantics out of text.”

This comment had Bob and me talking for a while. Perhaps a single class with precompiled models and methods for the “standard” things is called for.
Maybe it will make it into the 2.5 release.
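As a purely hypothetical sketch of what that higher-level layer might look like: one facade class with canned methods for the standard tasks, hiding all model loading and configuration. Nothing here is LingPipe API; the regex and split-based “detectors” are placeholders where a real version would delegate to precompiled models.

```java
import java.util.*;
import java.util.regex.*;

// Hypothetical facade layer: one class, one method per standard task,
// no tuning knobs exposed. The implementations below are trivial
// placeholders standing in for real statistical models.
public class TextFacade {

    // Placeholder entity detector: runs of capitalized tokens.
    private static final Pattern ENTITY =
        Pattern.compile("\\b[A-Z][a-z]+(?: [A-Z][a-z]+)*");

    public List<String> entities(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = ENTITY.matcher(text);
        while (m.find()) out.add(m.group());
        return out;
    }

    // Placeholder sentence splitter on terminal punctuation.
    public List<String> sentences(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        TextFacade f = new TextFacade();
        System.out.println(f.entities("Breck taught at Johns Hopkins."));
        System.out.println(f.sentences("One. Two! Three?"));
    }
}
```

The design point is the shape of the API, not the internals: a developer who just wants entities out of text never touches tokenizers, chunkers, or model files.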

Giving the talk was a good experience. I spent a week writing it up and will spend more time tuning it. I am looking for other venues to give it, so get in touch if you have ideas.