Language Model Generated Injection Attacks: Cool/Disturbing LingPipe Application


Joshua Mason emailed us with a link to his recent ACM paper “English Shellcode” (written with a bunch of co-authors). Shellcode attacks attempt to seize control of a computer by masquerading as data. The standard defense is to look for tell-tale patterns in the data that reflect the syntax of assembly language instructions. It is sort of like spam filtering. The filter would have to reject strings that looked like:


which would not be too hard if you knew to expect language data.

Mason et al. changed the code generation process so that lots of variants of the injection are tried but filtered against a language model of English based on the text of Wikipedia and Project Gutenberg. The result is an injection attack that looks like:

“There is a major center of economic activity, such as Star Trek, including The Ed Sullivan Show. The former Soviet Union.”

This is way better than I would have thought possible and it is going to be very difficult to filter. It would be interesting to see how automatic essay grading software would score the above. It is gibberish, but sophisticated-sounding gibberish.

And it used LingPipe for the language processing.
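To make the "filtered against a language model" step concrete, here is a toy sketch in Python. It is not LingPipe and not the authors' pipeline: it assumes a character bigram model with add-one smoothing, and the function names (`train_char_bigram`, `avg_log_prob`, `most_english_like`) are made up for illustration. It only shows the ranking idea, which is to score each candidate string by how English-like it looks and keep the best one.

```python
import math
from collections import Counter


def train_char_bigram(corpus):
    """Count character contexts and bigrams in a training text (toy model)."""
    text = corpus.lower()
    contexts = Counter(text[:-1])            # every character that has a successor
    bigrams = Counter(zip(text, text[1:]))   # adjacent character pairs
    return contexts, bigrams


def avg_log_prob(s, model, alphabet_size=256):
    """Average per-character log probability of s under the bigram model.

    Uses add-one smoothing over an assumed alphabet of 256 symbols, so
    unseen character pairs get a small but nonzero probability.
    """
    contexts, bigrams = model
    s = s.lower()
    if len(s) < 2:
        return float("-inf")
    total = 0.0
    for a, b in zip(s, s[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + alphabet_size)
        total += math.log(p)
    return total / (len(s) - 1)


def most_english_like(candidates, model):
    """Return the candidate with the highest average log probability."""
    return max(candidates, key=lambda c: avg_log_prob(c, model))


if __name__ == "__main__":
    # A real model would be trained on a large corpus; this one-line
    # training text is only here to make the sketch runnable.
    model = train_char_bigram(
        "there is a major center of economic activity in the former soviet union. "
    )
    for text in ["the former soviet union", "xq9$#zzkv@@!~"]:
        print(repr(text), avg_log_prob(text, model))
```

The same score could in principle be used on the defensive side, to flag input that does not look like language when language is expected, which is exactly the check that English-looking output is designed to get past.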

I am a firm believer in the white hats publicizing exploits before black hats deploy them surreptitiously. This one could be a real problem, however.


2 Responses to “Language Model Generated Injection Attacks: Cool/Disturbing LingPipe Application”

  1. Dave Lewis Says:

    I too wouldn’t have guessed that it was possible to make code look this English-like. The authors suggest that considering syntactic or semantic information might help, but I’m inclined to think that any advances in that direction would be more useful for generating attacks than preventing them. They do finish by saying that the real need is to prevent externally controlled inputs from being executed in the first place. Indeed!

    • Breck Says:

      The key to the high quality of the output is that there are a bunch of conditional “jump n bytes” instructions whose opcodes are ASCII letters (p through z). Josh said in an email that he would not have even tried the approach without that flexibility. About 40% of the English attack text is being executed.

      Josh also said that the filters for the unsophisticated version do not exist currently so there is no reason to go to the English generation step. That is depressing. At least I know that if a firewall vendor comes calling about filtering for the simpler attack, more will need to be done.

      Cool paper in any case.
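
The "p through z" remark can be checked against the x86 opcode map: the one-byte short conditional jumps (Jcc rel8) occupy opcodes 0x70 through 0x7F, and the lowercase letters p through z are ASCII 0x70 through 0x7A. The small Python sketch below just tabulates that overlap. The helper names are made up for illustration, and counting such letters in a text says nothing about which bytes would actually be executed.

```python
# x86 short conditional jumps (Jcc rel8) whose opcode byte is also the
# ASCII code of a lowercase letter. Each opcode is followed by a one-byte
# signed displacement, i.e. "jump n bytes if the condition holds".
JCC_REL8 = {
    0x70: "JO",  0x71: "JNO", 0x72: "JB",  0x73: "JAE",
    0x74: "JE",  0x75: "JNE", 0x76: "JBE", 0x77: "JA",
    0x78: "JS",  0x79: "JNS", 0x7A: "JP",
}


def jump_letters():
    """Map each letter p..z to the mnemonic sharing its byte value."""
    return {chr(opcode): name for opcode, name in JCC_REL8.items()}


def count_jump_letters(text):
    """Count characters in text whose byte value is a Jcc rel8 opcode."""
    letters = jump_letters()
    return sum(1 for ch in text if ch in letters)


if __name__ == "__main__":
    for letter, mnemonic in sorted(jump_letters().items()):
        print(f"{letter!r} = 0x{ord(letter):02X} -> {mnemonic}")
```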
