This is part 1 of a series of articles I am planning to write. In this part I will outline the problem, propose a solution, and define the requirements.

Remembering binary data or even hexadecimal sequences is hard, although it is useful. One application is reading your PGP key’s fingerprint out loud, which is why the PGP word list [^1] was created. Looking up the word matching each byte in the list is feasible, yet cumbersome, so clever people invented tools such as mnencode/mndecode [^2] to make life easier.

Application example of mnencode:

$ md5sum mn_wordlist.c | cut -d ' ' -f 1
3a8a77c13b792659ba7440d907bb4daa
$ md5sum mn_wordlist.c | cut -d ' ' -f 1 | mnencode 
index ranger rebel. planet school gondola. studio chamber jazz
mercy antonio jazz. gibson picture havana. million gemini jeep
bagel observe report. gordon shave record. airline
$ md5sum mn_wordlist.c | cut -d ' ' -f 1 | mnencode | mndecode 
3a8a77c13b792659ba7440d907bb4daa
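
For context, the idea behind the PGP word list is simple: each byte value indexes into a 256-word table (the real list actually uses two such tables, alternating by byte position, to catch transposed words; mnencode uses a denser scheme that packs more than one byte into a word on average). A minimal sketch with a placeholder table:

WORDS = ['word%03d' % i for i in range(256)]  # placeholder; the real table holds 256 distinct, pronounceable words

def bytes_to_words(data):
    return ' '.join(WORDS[b] for b in data)

def words_to_bytes(text):
    return bytes(WORDS.index(w) for w in text.split())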

Yes, I know, theoretically you could also use the NATO spelling alphabet, but a phrase like three alpha eight alpha seven seven charlie one... is not only longer, but also not easily remembered. And in any case, you’d only be able to encode digits and letters, not the whole range of a byte.

While you could use base32:

$ echo -n 3a8a77c13b792659ba7440d907bb4daa | base32
GNQTQYJXG5RTCM3CG44TENRVHFRGCNZUGQYGIOJQG5RGENDEMFQQ====

I wouldn’t claim that this is something you’d really like to do either.

But I also do not like “index ranger rebel. planet school gondola. studio chamber jazz”. What if, instead of random words, the encoder formed proper sentences? That’s what I’m trying to achieve.

To make it simple, my current requirements are:

  • The sentence should be in the form ADJECTIVE NOUN VERB ADVERB, such as ‘english ships sink easily’.
  • I shall have at least 256 words of each type, so that one sentence translates to a 4-byte sequence (see the sketch after this list).
  • For simplicity I will only use the singular form of all nouns for now.
  • Present tense for verbs only.
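
To make the mapping concrete, here is a minimal sketch of how I imagine the encoding could work (an assumption for illustration only; the actual word lists and the mapping are the subject of the next part):

def encode_sentence(chunk, adjectives, nouns, verbs, adverbs):
    # Each list is assumed to hold at least 256 words,
    # so each of the 4 bytes selects exactly one word.
    a, n, v, r = chunk
    return ' '.join([adjectives[a], nouns[n], verbs[v], adverbs[r]])

def decode_sentence(sentence, adjectives, nouns, verbs, adverbs):
    # Invert the mapping by looking up each word's index in its list.
    a, n, v, r = sentence.split()
    return bytes([adjectives.index(a), nouns.index(n),
                  verbs.index(v), adverbs.index(r)])

Longer inputs would simply be split into 4-byte chunks, one sentence each.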

I might make up new requirements as I go, but that’s it for now.

To get started, I downloaded a list of English words (in fact, I do not even remember where I got this word list from), imported it into a Python list and used the Hanover Tagger [^3] to classify the words:

from HanTa import HanoverTagger as ht

# 'en' is the downloaded word list as a Python list; the filename below
# is only a placeholder for wherever your word list lives.
en = open('english_words.txt').read().split()

tagger_en = ht.HanoverTagger('morphmodel_en.pgz')

# analyze() returns a (lemma, POS tag) pair for every word
en_pos = [tagger_en.analyze(i) for i in en]
en_pos[100:103]
[('earthlier', 'NN'), ('earthly', 'JJ'), ('ethel', 'NP')]

I have now classified all the words from the huge list by their types. For convenience I will create four new lists, one for each word type I’m interested in.

The tags for the word types I’m interested in are:

  • Adjective: JJ
  • Noun, singular: NN
  • Verb, present tense: VB (strictly, VB is the Penn Treebank tag for the base form, which reads as present tense here)
  • Adverb: RB

en_JJ = [i[0] for i in en_pos if i[1] == 'JJ']
en_NN = [i[0] for i in en_pos if i[1] == 'NN']
en_VB = [i[0] for i in en_pos if i[1] == 'VB']
en_RB = [i[0] for i in en_pos if i[1] == 'RB']
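
A quick sanity check against requirement two: each list should hold at least 256 words (the actual counts depend on the word list you start from, so I won’t claim any numbers here):

for name, words in [('JJ', en_JJ), ('NN', en_NN), ('VB', en_VB), ('RB', en_RB)]:
    print(name, len(words), 'ok' if len(words) >= 256 else 'too short')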

Now we have the full corpus to create our first sentence! Behold:

print(en_JJ[42] + " " + en_NN[42] + " " + en_VB[42] + " " + en_RB[42])
ablative authenticity overfeed absently

So much for this part. Comparing the result against our requirements:

  • The sentence should be in the form ADJECTIVE NOUN VERB ADVERB, such as ‘english ships sink easily’.
  • I shall have at least 256 words of each type, so that one sentence translates to a 4-byte sequence.
  • For simplicity I will only use the singular case of all nouns for now.
  • Present tense for verbs only.

So, three out of four requirements met. Not bad for perhaps 30 minutes of work.

In the next post I will create the byte-value-to-word mapping and whip up some new requirements. I can already think of a few, such as maximizing phonetic difference between words and limiting word length (or does anyone want ‘prestidigitation’ as a word?).