The Canadians are as psyched as we are about character n-grams and have applied them to a host of new problems: (1) Alzheimer’s type classification from transcripts, (2) signature-based virus detection from executables, (3) author gender attribution, (4) document clustering, (5) spam filtering, and even (6) genome sequence clustering and classification.

Check it all out on Vlado Keselj’s List of Publications. Vlado, who’s now at Dalhousie after a Ph.D. at Waterloo, seems to have taken the torch from Fuchun Peng, who recently graduated from Waterloo and moved to UMass. Fuchun’s dissertation is well worth reading for its wide range of character n-gram classification evaluations.

Anyone game to recreate any of this work in LingPipe?

