melannen ([personal profile] melannen) wrote 2003-06-04 11:35 pm

OPQRS TUVSW XSXYZ AOBSU

It is *utterly* mindwarping to go straight from Neal Stephenson to Brother Cadfael. The mind-warpingest bit is that there is so little *difference*. I got half-a-chapter further in St. Peter's Fair before some little reference to the Crusades or something shocked me into realizing I wasn't on Qhwglm anymore. Then I had to keep reminding myself. I am sure there is some deep insight here.

Oh yes-- I am now finished with Cryptonomicon, and well submerged back in the heady days of my youth when the NSA was giving me a crash course in cryptanalysis. When I used a real Enigma machine, decrypted VENONA intercepts, and asked impertinent questions with classified answers. Before I got old, past my mathematical prime (which in this case would have to be 17).

So this has left me wondering why nothing resembling my own pet cryptosystem is ever mentioned in discussions of such things.
In my understanding, there are several basic types of "encryption":
hiding messages by actually *hiding* the physical message: steganography;
hiding messages by replacement on the level of languages, à la Windtalkers;
hiding on the level of words (or phrases, etc): encoding;
hiding on the level of letters (or bits, etc): enciphering.

But it seems to me there's a level between codes and ciphers that's never exploited: hiding on the level of *phonemes*. That is, encryption by transliteration. This is the system I've always used for medium-security data. It works like this:
1. Learn or design a phonemic, or at least non-Latin, alphabet, such as Shavian. (With a bit of practice, I can pick one up in a week or so of use, with the result that I can now pronounce Shavian, Quenya, Norse runes, Greek, Cyrillic, tap code, gift-shop Hieroglyphics, a fair amount of ASCII, my father's handwriting, and two that are entirely of my own invention. If I can do this, I imagine most people can.)
2. Transliterate your message into this alphabet. The fun thing about transliteration is that it's fairly chaotic: a message of sufficient length, transliterated by n people, will produce n different results. Yet any of those results, read by any of the people, can be fairly easily decrypted back to the original.
3. Arbitrarily conflate or divide characters in your alphabet until you have 26 or 36 or 35 or some other number which can be mistaken for your native alphabet-- in other words, a single character, say #, now stands for both "hw" and "ee".
4. Replace the transliterated letters with Latin letters by one-to-one substitution;
5. Possibly further encrypt the result.
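A minimal Python sketch of steps 3 and 4, under loud assumptions: the phoneme inventory, the conflation table, and the function names are all hypothetical, invented here for illustration, and the message is taken to be already transliterated by hand (steps 1 and 2). The seeded `random.Random` only makes the toy key reproducible; it is not a secure way to pick a key.

```python
import random
import string

# Hypothetical toy phoneme inventory; a real alphabet such as Shavian has more symbols.
PHONEMES = ["th", "dh", "hw", "ee", "uh", "ih", "t", "d", "h", "k", "a", "s"]

# Step 3: arbitrary conflation. Each key phoneme is written with the same
# character as its value, so one character stands for both "hw" and "ee".
CONFLATE = {"hw": "ee", "dh": "th"}


def conflate(phonemes):
    """Collapse each phoneme onto the character class that represents it."""
    return [CONFLATE.get(p, p) for p in phonemes]


def make_key(seed):
    """Step 4: assign one distinct Latin letter to each character class."""
    classes = sorted({CONFLATE.get(p, p) for p in PHONEMES})
    letters = list(string.ascii_uppercase)
    random.Random(seed).shuffle(letters)
    return dict(zip(classes, letters))


def encrypt(phonemes, key):
    """Conflate, then substitute one Latin letter per character class."""
    return "".join(key[p] for p in conflate(phonemes))


def decrypt_candidates(ciphertext, key):
    """Return, for each ciphertext letter, every phoneme it could stand for."""
    inverse = {
        letter: [p for p in PHONEMES if CONFLATE.get(p, p) == cls]
        for cls, letter in key.items()
    }
    return [inverse[c] for c in ciphertext]
```

Decryption here only returns candidate phonemes per letter; as in the description above, it is the human reader who resolves a conflated character from context.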

This has the following advantages:
With a little practice, such a message can be written and read at the same speed as pre-standardized-spelling English, requiring no heavy math or fancy supplies, a prime concern in cases where PGP-level stuff is unavailable;
Most code-breaking algorithms, even, as far as I'm aware, fancy computer-based ones, depend on letter-frequency analyses, looking for 'cribs' (words the codebreaker is fairly sure will appear in the message), and words that appear more than once in a message. My transliteration scheme short-circuits them all-- for example, in a given message the word 'the' might appear several times: the most common word, letters, letter pairs, and letter trio in the language, a codebreaker's bonanza. In the transliterated message, it might appear as 'htuh', 'hdee', 'the', 'teh', 'þiy', variations thereof, etc, totally screwing a letter-based analysis, yet perfectly intelligible to an accustomed reader.
An attempt to compensate for this requires knowing not letter frequencies but phoneme frequencies-- and phoneme frequencies vary from person to person based on individual dialect peculiarities-- and the way an individual transliterates said dialect would vary further-- so it seems like for a brute-force statistical approach, you'd need frequency counts for each individual coder. Yet it can be *read*, by anyone who knows the system.
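The repeated-word claim can be illustrated with a small Python sketch. It only counts how often the most common word recurs before and after respelling 'the' with the variants listed above. The helper names are hypothetical, and the fixed rotation through the variants is an assumption made so the result is reproducible; a real writer would vary unpredictably. It says nothing about how the scheme holds up against serious cryptanalysis.

```python
from collections import Counter
import itertools

# The variant spellings of 'the' given in the post.
VARIANTS = ["htuh", "hdee", "the", "teh", "þiy"]


def respell(words, variants=VARIANTS):
    """Replace each 'the' with the next variant in a fixed rotation (an assumption)."""
    rotation = itertools.cycle(variants)
    return [next(rotation) if w == "the" else w for w in words]


def top_word_share(words):
    """Return the most frequent word and the fraction of the message it makes up."""
    word, count = Counter(words).most_common(1)[0]
    return word, count / len(words)


message = "the cat saw the dog and the bird and the fish near the tree".split()
print(top_word_share(message))           # 'the' dominates the plain message
print(top_word_share(respell(message)))  # no spelling of 'the' repeats any more
```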

Disadvantages:
Depends in part on security-by-obscurity, which I have just blown wide open, but Stephenson convinced me it was necessary;
I have no idea about the fudge factors on computerized cryptanalysis, or how that would apply to my scheme; it's possible all the above about spelling was bullshit, and they could plow through it almost like rot-13;
Nobody else ever mentions it, which creates the possibility I'm totally overlooking something.

My response to the second above is that I'm working on the assumption this would be certainly crackable with time and effort. But I only propose it for situations where time and effort can't or won't be put into play; if you're facing *that*, use modern computer encryption. Or good steganography. But for passing notes in class; or arranging a meeting that will occur in the next few hours; or planning to assassinate the third guy in the airplane; or hundreds of other situations, the speed-of-encryption vs speed-of-breaking equation appears to come out positive.

The third is most worrying: the wimpy girly part of my mind says it's because this system is in fact clearly and totally useless; the paranoid part says it's because it is, in fact, in use, by the cabal that secretly controls (and incidentally includes all the crypto geeks, who therefore won't talk about it); the cynical part says it's because a pen-and-paper, non-mathematics-friendly system is neither clean nor macho enough; the conflict-avoiding part says it's probably because pen-and-paper non-mathematics-friendly probably-eventually-breakable systems, while useful in *my* contexts, aren't particularly useful to the vast majority of encryption-users. Those with money.

So I've destroyed the obscurity (and incidentally probably given my sister a free road to my diary should she care (I'm betting she doesn't-- security-by-monotony)) in hopes that one of you will either poke holes until it sinks, point me to somewhere that has an actual rigorous analysis of the system, or at least suggest a good place to ask.
(If I was really smart I'd just ask my friendly local computerscientist-mathematician-linguist. Or failing that crosspost this to e2. But I am not yet confident enough to do either.)

(Anonymous) 2003-06-05 02:15 am (UTC)
What the hell is 'þiy'?

--C

[identity profile] alfedenzo.livejournal.com 2003-06-05 05:54 am (UTC)
First, as you pointed out, this is security through obscurity. While it doesn't suffer from some of the same problems as a one-time pad (i.e. reusing it doesn't compromise the security quite as much), a one-time pad, when properly generated (i.e. no pastors' wives), used only once and promptly destroyed completely, is unbreakable, while the transliteration system is not.

Part of the problem is that you've still got English there, and some sets of phonemes a) are going to show up a lot (e.g. the, a, I, etc.), and b) are unlikely to vary between writers beyond a certain margin, even with the extended set of characters. Another part of the problem is that unless you're rapidly changing alphabets (and even then), an intercepting Eve could learn the character set by observing messages passing her, and then quickly and easily break any further messages.

[identity profile] melannen.livejournal.com 2003-06-05 10:55 pm (UTC)
þ is the Anglo-Saxon character 'thorn', which represents the soft (voiceless) th phoneme, as in the word 'thin'.

It is very groovy.

[identity profile] melannen.livejournal.com 2003-06-05 11:31 pm (UTC)
First off, don't go hating on Mrs. Tenney! I can say from long experience that pulling truly random bingo balls is one of the most difficult, and mentally exhausting, tasks one can be asked to do. And a minister's wife probably has as much experience with it as anyone. As of the last time I was actually really into this-- say, four years ago-- it was still acknowledged that generating randomness, or even *defining* it, is one of the hardest tasks in cryptology.

The problem with one-time pads isn't so much reuse as logistics; a reused pad is still damn hard to break, just no longer impossible. (Plus you can still only read the messages on that particular pad.) (I know this, as one of the people who taught my crypto class had worked on VENONA.) The problem is that in order to encrypt or decrypt you need a copy of the pad, so there are all these copies floating around which it is possible for the enemy to physically intercept. Especially if you want to encrypt anything for indefinite periods of time, in which case you can't just burn the pads after use. The Unabomber used cryptographically perfect OTPs on his journals; but he had to store the pads near the journals if he ever wanted to read them again, so decryption was trivial.
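Why reuse moves a pad from impossible to merely hard can be shown with a small Python sketch. It assumes a modern byte-wise XOR pad, which is not the procedure used in the historical traffic, and the function names are hypothetical. When two messages share a pad, XORing the two ciphertexts cancels the pad and leaves the XOR of the two plaintexts, so a correct guess at one message exposes the other.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_both_with_one_pad(p1: bytes, p2: bytes):
    """Encrypt two equal-length messages with the SAME random pad (the mistake)."""
    pad = secrets.token_bytes(len(p1))
    return xor_bytes(p1, pad), xor_bytes(p2, pad)


p1 = b"MEET AT NOON"
p2 = b"BURN THE PAD"
c1, c2 = encrypt_both_with_one_pad(p1, p2)

# The pad cancels out: this equals p1 XOR p2 and no longer depends on the pad.
leak = xor_bytes(c1, c2)

# An eavesdropper who correctly guesses p1 (a crib) recovers p2 without the pad.
print(xor_bytes(leak, p1))
```

The hard part in practice is guessing the crib, which is where the "still damn hard" comes from; only messages on the reused pad are exposed.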

The point of my system is that "the", "a", "i", etc, are not in fact phonemes. Think about all the different ways people pronounce "the". When I'm transliterating out of one of my alphabets, I often find I've spelled it several different ways without noticing. Between different people, with different regional and family dialects, there'd be many, many variations. I admit I don't know how many, or how statistically significant they'd be-- the idea is I don't know that anyone else does, either, and it'd be a great deal of work to find out.

Actually, while I was fooling around with Solitaire encryption last night, I realized that a pronunciation-based system is mentioned in passing-- very much in passing-- in Cryptonomicon, so perhaps at least it's not a *completely* stupid idea.
And now that my memory's jogged itself, something to this effect may have been used for the Voynich Manuscript, which has stood up to 400 years of analysis.

Ooh, that reminds me to work on my "face in the frost" cipher some more.

I really need to go to bed soon.