I think that this kind of automated steganography demonstrates that legislation requiring ~~mandated backdoors~~, ahem, "responsible encryption", is completely nonsensical: we could encrypt our messages with something powerful and safe, then encode the ciphertext with something like this. Boom. Plausible deniability that you were sending anything other than plaintext. Sure, the plaintext is long-winded, verbose, and maybe a little nonsensical, but would a computer be able to reliably distinguish between encrypted-then-steganographically-concealed text and "real" plaintext?
> but would a computer be able to reliably distinguish between encrypted-then-steganographically-concealed text and "real" plaintext?
Computers can detect steganography with statistics.
That's part of what this project/paper is about: they created a model that not only produces text that passes as human-written, but is also significantly less detectable than previous encoding methods.
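To make "detect with statistics" concrete, here's a toy sketch of my own (not from the paper): compare the observed word frequencies in a suspect text against frequencies expected from a reference corpus with a chi-square statistic. Crude encoders skew these frequencies; an encoder that matches the language model's distribution, which is the paper's whole point, keeps the statistic small.

```python
# Toy frequency-based steganalysis sketch (my own, not the paper's method).
from collections import Counter

def chi_square(words, expected_freqs):
    """Chi-square statistic of observed word counts vs. expected frequencies."""
    n = len(words)
    observed = Counter(words)
    stat = 0.0
    for word, p in expected_freqs.items():
        expected = n * p                      # count we'd expect from the corpus
        stat += (observed.get(word, 0) - expected) ** 2 / expected
    return stat

reference = {"a": 0.5, "b": 0.5}              # hypothetical corpus frequencies
print(chi_square(["a", "b", "a", "b"], reference))  # matches the corpus: 0.0
print(chi_square(["a", "a", "a", "a"], reference))  # skewed by a crude encoder: 4.0
```

A real steganalyzer would use n-gram or language-model statistics rather than unigrams, but the principle is the same: the more the cover text's distribution deviates from "normal" text, the bigger the statistic.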
But if you use a different LM context, one that is not so easily fact-checked, then you won't realize that it is nonsensical.
I used this LM Context:
"This message contains confidential information and is intended only for the individual named. If you are not the named addressee you should not disseminate, distribute or copy this E-mail. Please notify the sender immediately by E-mail if you have received this E-mail by mistake and delete this E-mail from your system."
To encrypt this fake but real-looking PII:
"Jeffery,Gourlay,2/2/55,495-24-1236"
And produced this output cover text:
"Disclaimer: An individual may be reminded of this email at any time by telephone from a different address, e.g. e-mail. However, online resources and other media (e.g. e-mailed stories, articles, etc.) that include original content, are not covered under this policy. The information contained herein"
Malware or a bad actor could transmit this past enterprise InfoSec controls all day without easy detection.
If I understand correctly, this approach can generate steganographic output that is virtually indistinguishable from the output generated by the language model used. Quoting from section 4.3 of the paper:
"Most striking, arithmetic coding with the unmodulated language model induces a q distribution with a KL of 4e-8 nats. This indicates that, consistent with theory (Sallee, 2004), arithmetic coding enables generative steganography matching the exact distribution of the language model used."[a]
Regardless of theory, I can't help but think: holy cow, the KL divergence between the distribution of tokens in the steganographic output and that of a GPT-2 model is a minuscule 0.00000004 nats!
Did I understand correctly? Given that language models are getting even larger and the distribution of their output is getting harder and harder to distinguish from that of human output[b], I find this incredibly powerful.
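To convince myself how arithmetic coding can hide bits in sampling decisions at all, I wrote a toy version. This is my own drastic simplification, not the paper's implementation: a real system would use the neural LM's conditional token distribution at every step, while this uses one fixed toy distribution and exact rationals.

```python
# Toy generative arithmetic-coding steganography (my own simplification of
# the idea; the paper conditions a real LM at each step, this does not).
import math
from fractions import Fraction

MODEL = [("the", Fraction(1, 2)), ("a", Fraction(1, 4)), ("of", Fraction(1, 4))]

def encode(bits):
    """Treat the secret bits as a binary fraction x in [0, 1); partition
    [0, 1) among tokens by probability, emit the token whose slice contains
    x, rescale x into that slice, and repeat until the accumulated interval
    is narrow enough to pin down every bit."""
    nbits = len(bits)
    x = Fraction(int("".join(map(str, bits)), 2), 2 ** nbits)
    tokens, width = [], Fraction(1)
    while width > Fraction(1, 2 ** nbits):
        lo = Fraction(0)
        for tok, p in MODEL:
            if lo <= x < lo + p:
                tokens.append(tok)
                x = (x - lo) / p      # zoom into the chosen slice
                width *= p
                break
            lo += p
    return tokens

def decode(tokens, nbits):
    """Replay the token choices to narrow [lo, lo + width), then read off
    the unique nbits-bit binary fraction inside that interval."""
    lo, width = Fraction(0), Fraction(1)
    for tok in tokens:
        off = Fraction(0)
        for t, p in MODEL:
            if t == tok:
                lo += off * width
                width *= p
                break
            off += p
    k = math.ceil(lo * 2 ** nbits)    # the one multiple of 2^-nbits in range
    return [int(b) for b in format(k, f"0{nbits}b")]

secret = [1, 0, 1]
cover = encode(secret)                # a token sequence sampled "by" the model
assert decode(cover, len(secret)) == secret
```

The key property, and I believe the reason the KL divergence can be near zero, is that the token chosen at each step is distributed exactly as the model's probabilities dictate when the hidden bits are uniformly random, so the cover text is statistically just a sample from the model.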
Anyway, congratulations and thank you for sharing on HN!
On https://steganography.live/encrypt leave the LM context intact and type your secret message into the secret message box. Then click Encrypt at the bottom.
Select the output cover text and copy it to your clipboard.
The steganographic output with the default LM context would make for some excellent alternate-historical writing prompts:
----
On January 25, 1798, Washington was elected President of the Confederate States of America. He later led the United States in the battle of Fort Sumter, North Carolina, in which he secured the passage of the Union Army, defeated the enemy and secured its last major victory. Washington was awarded the Order of the Pacific on December 27, 1798. His wife, Bea, was born in 1833 and settled in the Virginia river valley at East Mary Street, Bowersville. In 1855, Bea had married. Shortly thereafter, General John F. Washington was elected Governor of Virginia and commissioned to serve a year in captivity. He became enthralled with the Union and persuaded Henry P. Morgan, to meet his need for volunteers. Washington volunteered with the Federal troops using a French silver coin and with their aid the Americans located and captured the greatest concentration of troops there. After her capture, Bea was sent to Washington's camp in Richmond, Richmond and had a farewell ring to the American flag. It was this ring that Washington had set aside for himself with his death and a letter from his wife, who was
----
EDIT: decrypting that message twice (i.e. decrypting the result of decrypting the above message) apparently "works" and produces some interesting results. It gets even more wacky as I continue decrypting more and more outputs.
Type the text to encode with in the top box and the text to encode in the bottom box, then press the button at the bottom. Easiest is to just click a preset on the right, like the Gettysburg Address or the Apology by Plato, to fill it with text; but if you fill it in yourself, you need some words that follow other words for it to work, like at a minimum "a a b a".
I couldn't figure out how to combine this kind of model with wet paper codes, which is too bad since wet paper codes are really the only known way to resist an attacker with an equally good statistical model.
The closest I came was putting the text in a Gray-coded word-per-token form, then using GPT-2 as the error metric in the encoder, but the resulting bitrate was very low.
Unlike the OP, my Plainsight algorithm is 100% invertible by construction, and accepts binary input. (I verified the inversion process with "roundtrip fuzzing", a technique I still use today.)
Plainsight uses each bit of the input message to generate tokens. Bits are used to decide how to traverse a Huffman-style n-gram tree, weighted by frequency. This tree of n-grams is the model used in both the encoding and decoding steps. The drawbacks to my method are that the output 1) can be verbose and 2) does not convince a human that it's plausible, except for short messages.
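The core idea can be sketched like this. This is a from-memory simplification, not the actual Plainsight code: it spends one message bit choosing between the two most frequent successors in a bigram model, whereas real Plainsight walks a Huffman-style tree over all candidates (and so gets a better bitrate).

```python
# Simplified bit-driven cover-text generation in the spirit of Plainsight
# (my own sketch, not the real implementation, which uses Huffman coding
# over frequency-weighted n-gram trees).
from collections import Counter, defaultdict

def build_bigrams(corpus_words):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for a, b in zip(corpus_words, corpus_words[1:]):
        model[a][b] += 1
    return model

def encode(bits, model, start):
    """Consume one message bit wherever the context offers a two-way choice."""
    out, i, word = [start], 0, start
    while i < len(bits):
        cands = [w for w, _ in model[word].most_common(2)]
        if not cands:
            raise ValueError("dead end in model")
        if len(cands) < 2:
            word = cands[0]           # no choice here, so no bit is spent
        else:
            word = cands[bits[i]]     # one bit selects among the top two
            i += 1
        out.append(word)
    return out

def decode(words, model):
    """Recover the bits by replaying which candidate each word was."""
    bits = []
    for a, b in zip(words, words[1:]):
        cands = [w for w, _ in model[a].most_common(2)]
        if len(cands) == 2:
            bits.append(cands.index(b))
    return bits

model = build_bigrams("a a b a a b a a a b b a".split())
secret = [1, 0, 1, 1]
cover = encode(secret, model, "a")
assert decode(cover, model) == secret  # round-trip, as verified by fuzzing
```

This also shows why the corpus needs "some words that follow other words": if no context ever has two successors, there is nowhere to hide a bit.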
Stegasuras has orders-of-magnitude better output, and seems to solve the problems I couldn't solve eight years ago. I would venture that their new result has as much to do with advances in language modeling as with the particulars of their encoding and decoding algorithms.
I'll also note that I'm glad these researchers were able to use grant money to do this work. As a non-academic, I applied for an AI Grant to support me in upgrading Plainsight to use deep learning, but I was turned away at the time.
Finally, one of the ideas I picked up back then is that spam can be used to contain secret messages. Send enough gibberish to enough people, with your intended recipient included, and you'll look like a spammer--not a spy:
$ wget https://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
$ tar -jxvf 20030228_spam.tar.bz2
$ cat spam/0* > spam-corpus.txt
$ echo "The Magic Words are Squeamish Ossifrage" | plainsight -m encipher -f spam-corpus.txt > spam_ciphertext
$ cat spam_ciphertext
(8.11.6/8.11.6) 3 (Normal) Internet can send e-mails until to transfer 26 10 [127.0.0.1]
also include address from the most logical, mail business for your Car have a many our
portals ESMTP Thu, 29 1.0 this letter on internet, <a style=3D"color: 0px; text/plain;
cellspacing=3D"0" how quoted-printable about receiving you would like width=3D"15%"
width=3D"15%" border="0" width="511" Date: Tue, 27 Thu, 19 26 because
zzzz@localhost.spamassassin.taint.org for
$ cat spam_ciphertext | plainsight -m decipher -f spam-corpus.txt
Adding models:
Model: spam-corpus.txt added in 2.57s (context == 2)
input is "<stdin>", output is "<stdout>"
deciphering: 100% | 543.84 B/s | Time: 0:00:00
The Magic Words are Squeamish Ossifrage
Anyway, really cool work.