Natural Language Generation for Spam


We recently got the following spam comment on an earlier post about licensing. I know it’s spam because of the links and the URL.

It makes faculty adage what humans can do with it. We’ve approved to beacon bright of that with LingPipe’s authorization — we artlessly can’t allow the attorneys to adapt our own arbitrary royalty-free license! It was advised to accept some AGPL-like restrictions (though we’d never heard of AGPL). At atomic with the (A)GPL, there are FAQs that I can about understand.

ELIZA, All Over Again

What’s cool is how they used ELIZA-like techniques to read a bit of the post and insert it into generated boilerplate. There are so many crazy and disfluent legitimate comments that with a little more work, this would be hard to filter out automatically. Certainly the WordPress spam filter, Akismet, didn’t catch it, despite the embedded links.
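To make "ELIZA-like" concrete, here is a toy sketch of the idea: spot a salient word in the post and drop it into canned boilerplate. I don't know what the spammers actually ran; the templates, the stopword list, and the function names `pick_topic` and `eliza_comment` are all made up for illustration.

```python
import re

# Hypothetical boilerplate; the {topic} slot is filled with a word lifted from the post.
TEMPLATES = [
    "Great point about {topic}. I had not thought of it that way.",
    "I was just reading about {topic} elsewhere, thanks for the write-up.",
]

# A tiny illustrative stopword list; a real system would use a much larger one.
STOPWORDS = {"that", "this", "with", "have", "from", "there", "about", "their", "would"}


def pick_topic(post: str) -> str:
    """Return the most frequent non-stopword longer than three letters.

    Ties go to the word that appears first in the post.
    """
    counts: dict[str, int] = {}
    for word in re.findall(r"[a-z']+", post.lower()):
        if len(word) > 3 and word not in STOPWORDS:
            counts[word] = counts.get(word, 0) + 1
    if not counts:
        return "this"
    # dicts keep insertion order, so max() returns the earliest of any tied words
    return max(counts, key=counts.get)


def eliza_comment(post: str, template_index: int = 0) -> str:
    """Fill one boilerplate template with the word spotted in the post."""
    template = TEMPLATES[template_index % len(TEMPLATES)]
    return template.format(topic=pick_topic(post))
```

Even something this crude produces a comment that looks like it is about the post, which is roughly the effect the spam above achieves.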

Black Hat NLP is Going to Get Worse

It would be really easy to improve on this technology with a little topic modeling, better word spotting (though they seem to do an OK job of that), and better language modeling for generation, plus better filtering of the output à la modern machine translation systems.
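As a rough illustration of what "language modeling for generation" means at its simplest, here is a minimal bigram sketch. It assumes whitespace tokenization and no smoothing, and the names `train_bigrams` and `generate` are mine, not from any particular library. A real system would use longer n-grams and condition on a topic model.

```python
import random
from collections import defaultdict


def train_bigrams(text: str) -> dict[str, list[str]]:
    """Map each token to the list of tokens that followed it in the training text."""
    tokens = text.split()
    model: dict[str, list[str]] = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        model[prev].append(nxt)
    return model


def generate(model: dict[str, list[str]], start: str, max_words: int = 20, seed: int = 0) -> str:
    """Walk the bigram table from `start`, sampling each next word from observed followers."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    out = [start]
    while len(out) < max_words:
        followers = model.get(out[-1])
        if not followers:  # the last word was never followed by anything in training
            break
        # Repeated followers appear multiple times, so choice() samples by frequency.
        out.append(rng.choice(followers))
    return " ".join(out)
```

Every adjacent word pair in the output was seen in the training text, so a filter built on the same bigram statistics has very little to object to.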

The really nasty applications of such light processing and random regeneration will be in auto-generating reviews and even full social media content. It’ll sure complicate sentiment analysis at scale. You can just create blogs full of this stuff, link them all up like a good SEO practitioner, and off you go.

5 Responses to “Natural Language Generation for Spam”

  1. John Q Passerby Says:

    There’s an interesting thing in the book Anathem where spammers have so overrun their internet with false but grammatically correct and sensical statements, it is completely unusable without complex filtering software. When I read it, I thought that is basically where we are headed.

  2. Mark Risher Says:

    In addition to this ELIZA-scraped text, another tactic we see frequently is innocuous, hand-written text like “I totally agree; you should also check out [link].”

    In our experience, topic and language modeling can detect clumsy approaches, but there doesn’t seem to be a structural reason the post can’t be more fluent. Instead, a stateful approach that looks at the user in addition to the language itself has proven the most effective technique.

    • Bob Carpenter Says:

      Oh, we get gazillions of these, too. Things like “this page doesn’t display in IE” or “I love your style, what is it?”. A very few of those have turned out to be real.

      I think the NL generation will get better to the point where it’ll be hard to automatically filter. The problem is that the generation techniques use the same kind of model. So if you use a reasonable-sized n-gram to generate with a topic model for content, you’re going to be pretty hard to weed out. Not impossible, but requiring increasing amounts of cleverness.

      So indeed, we’ll need to go to other factors, like social factors. And of course the tried-and-true honeypot method where you put a blog full of “lorem ipsum” type stuff up and wait for spam.

      • Mark Risher Says:

        Yeah, the one that breaks it regardless of the n-gram is quoting verbatim snippets, e.g.

        >> I think the NL generation will get better to the point where it’ll be…
        Totally agree: check out [link]

        We are working on a WordPress plugin version of our comment spam and NLP filter specifically to address these problems; can we sign LingPipe up for the beta?
