An earlier post on licensing recently drew the following spam comment. I know it’s spam because of the links and the URL.
It makes faculty adage what humans can do with it. We’ve approved to beacon bright of that with LingPipe’s authorization — we artlessly can’t allow the attorneys to adapt our own arbitrary royalty-free license! It was advised to accept some AGPL-like restrictions (though we’d never heard of AGPL). At atomic with the (A)GPL, there are FAQs that I can about understand.
ELIZA, All Over Again
What’s cool is how they used ELIZA-like technologies to read a bit of the post and insert it into some boilerplate-type generation. There are so many crazy and disfluent legitimate comments that with a little more work, this would be hard to filter out automatically. Certainly the WordPress spam filter, Akismet, didn’t catch it, despite the embedded links.
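Here is a minimal sketch of how that kind of comment might be produced. It is a guess at the general recipe, not the spammer’s actual code. It assumes three steps: pick a sentence from the post by keyword, swap some words through a synonym table, and drop the result into boilerplate. The names `SYNONYMS`, `TEMPLATE`, `pick_sentence`, `garble`, and `make_comment` are all hypothetical, and the word swaps are guesses at what might yield a phrase like “At atomic with the (A)GPL”.

```python
import re

# Hypothetical synonym table; the swaps are guesses based on the spam above.
SYNONYMS = {
    "least": "atomic",
    "simply": "artlessly",
    "lawyers": "attorneys",
    "almost": "about",
}

# Hypothetical boilerplate; a real spammer would likely rotate through many.
TEMPLATE = "Great post! {snippet} Check out my site for more."


def pick_sentence(post, keyword):
    """Return the first sentence of the post containing the keyword, or None."""
    sentences = re.split(r"(?<=[.!?])\s+", post.strip())
    for sentence in sentences:
        if keyword.lower() in sentence.lower():
            return sentence
    return None


def garble(sentence):
    """Swap each word found in SYNONYMS, keeping initial capitalization."""
    def swap(match):
        word = match.group(0)
        replacement = SYNONYMS.get(word.lower())
        if replacement is None:
            return word
        return replacement.capitalize() if word[0].isupper() else replacement

    return re.sub(r"[A-Za-z]+", swap, sentence)


def make_comment(post, keyword):
    """Fill the boilerplate with a garbled sentence lifted from the post."""
    sentence = pick_sentence(post, keyword)
    if sentence is None:
        return TEMPLATE.format(snippet="")
    return TEMPLATE.format(snippet=garble(sentence))
```

Calling `make_comment(post_text, "GPL")` on a post would lift the first sentence mentioning the GPL, garble it, and wrap it in the template. The lifted text is what makes the comment look on-topic to a keyword-based filter.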
Black Hat NLP is Going to Get Worse
It would be really easy to improve on this technology with a little topic modeling, better word spotting (though they seem to do an OK job of that), and better language modeling for generation, plus better filtering à la modern machine translation systems.
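To make the filtering idea concrete, here is a rough sketch of reranking candidate sentences with a language model, loosely in the spirit of how translation systems pick among hypotheses. It assumes a toy add-one-smoothed bigram model trained on a handful of sentences; a real system would use a far larger model and corpus. The function names `train_bigram`, `log_prob`, and `most_fluent` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count unigrams and bigrams over whitespace-tokenized, lowercased sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def log_prob(sentence, unigrams, bigrams):
    """Add-one smoothed bigram log probability, averaged per transition."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    vocab = len(unigrams) + 1  # one extra slot for unseen words
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab))
    return total / (len(tokens) - 1)


def most_fluent(candidates, unigrams, bigrams):
    """Keep the candidate the model scores as most fluent."""
    return max(candidates, key=lambda c: log_prob(c, unigrams, bigrams))
```

A generator could produce several garbled variants of a sentence and keep only the one `most_fluent` prefers, which would weed out the clunkiest word swaps before posting.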
The really nasty applications of such light processing and random regeneration will be in auto-generating reviews, full social media content, and the like. It’ll sure complicate sentiment analysis at scale. You can just create blogs full of this stuff, link them all up like a good SEO practitioner, and off you go.