This is part 2 of a series of articles I'm currently writing. In this post I will define the mapping from a byte to a word for all word types and get the encoder and decoder up and running. For part 1 of the series, go and read A PGP Words variant creating natural sentences - part 1.

In the last part I was able to generate a grammatically correct sentence according to my requirements.

print(en_JJ[42] + " " + en_NN[42] + " " + en_VB[42] + " " + en_RB[42])
ablative authenticity overfeed absently

What’s now missing is:

  • I shall have at least 256 words of each type, so that one sentence translates to a 4-byte sequence.

Since I already have all our adjectives, nouns, verbs and adverbs in distinct lists, I now need to pick 256 candidates from every one of these lists and oh Loki, these lists are huge:

In [76]: len(en_JJ), len(en_NN), len(en_RB), len(en_VB)
Out[76]: (8968, 27821, 4096, 3089)

That's a lot of words, but let's just grab 256 from each list.

In [77]: JJ = random.sample(en_JJ, 256)
In [78]: NN = random.sample(en_NN, 256)
In [79]: RB = random.sample(en_RB, 256)
In [80]: VB = random.sample(en_VB, 256)
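One thing worth noting here: `random.sample` draws a different 256-word subset on every run, but encoder and decoder must agree on the exact same map. A minimal sketch of making the sampling reproducible, using a stand-in list in place of the real `en_JJ` from part 1 and an arbitrary seed of my own choosing:

```python
import random

# Stand-in for the real en_JJ list built in part 1.
en_JJ = [f"adj{i:04d}" for i in range(8968)]

# Seeding before sampling makes the byte-to-word map reproducible,
# so encoder and decoder can rebuild the identical 256-word list.
random.seed(42)  # arbitrary illustrative seed, not from the post
JJ = random.sample(en_JJ, 256)

# Same seed, same sample - the map is stable across runs.
random.seed(42)
assert JJ == random.sample(en_JJ, 256)
```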

Now say I have a byte, such as ‘A’ - which is 65 in ASCII. I can simply treat the lists as arrays and get the word representation. As an example, I’m gonna pick the JJ-list:

In [90]: ord('A')
Out[90]: 65

In [91]: JJ[65]
Out[91]: 'intense'

In [92]: JJ.index('intense')
Out[92]: 65

Easy peasy!
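That lookup round-trips for every byte value, as long as the 256 sampled words are distinct (which `random.sample` guarantees). A quick sanity check, with a made-up stand-in list instead of the sampled JJ:

```python
# Stand-in for the sampled JJ list: any 256 distinct words will do.
JJ = [f"word{i:03d}" for i in range(256)]

# Every byte value maps to a word and back via its list position.
assert all(JJ.index(JJ[b]) == b for b in range(256))
```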

Now the whole thing needs to be mangled together. I will just do a quick PoC with no streaming, not even a proper function, no file reading, no nothing, just a quick hack to show how it can be done. But one thing I’m gonna take into account right away: for the input I will only accept byte arrays, which will save us a lot of headache in the future.

Also because I’m lazy, I’m only gonna use the adjective array for now (JJ).

In [95]: bla = b'This is a test.'

In [98]: for i in bla:
    ...:     print(JJ[i], end=' ')
    ...: 
celtic enormous alembic noninfectious noncorrosive alembic noninfectious noncorrosive carboniferous noncorrosive deplete unclothe noninfectious deplete suppressive

In [99]: sentence = 'celtic enormous alembic noninfectious noncorrosive alembic noninfectious noncorrosive carboniferous noncorrosive deplete unclothe noninfectious deplete suppressive'.split(' ')

In [111]: for i in sentence:
     ...:     print(chr(JJ.index(i)), end='')
     ...: 
This is a test.
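The two loops above can be wrapped into a throwaway encode/decode pair, roughly like this (function names are mine, and the stand-in JJ list replaces the sampled one):

```python
# Stand-in 256-word list; the real JJ comes from random.sample above.
JJ = [f"jj{i:03d}" for i in range(256)]

def encode_jj(data: bytes) -> str:
    # One word per input byte, space-separated.
    return ' '.join(JJ[b] for b in data)

def decode_jj(text: str) -> bytes:
    # The word's list position is the original byte value.
    return bytes(JJ.index(w) for w in text.split(' '))

assert decode_jj(encode_jj(b'This is a test.')) == b'This is a test.'
```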

OK! So in principle, it seems to work, although it’s not really what I wanted, so the requirements are not really met yet.

  • I shall have at least 256 words of each type, so that one sentence translates to a 4-byte sequence.

So I now have four maps already, but it’s not the sentence I wanted per the definition of ADJECTIVE NOUN VERB ADVERB. Also, I am relying on characters here, something I really shouldn’t do.

But one step after another. Now let’s tackle the “proper sentence”-part. Given:

def encode(instream):
    for i in range(0, len(instream)-1, 4):
        print(JJ[instream[i]], end=' ')
        print(NN[instream[i+1]], end=' ')
        print(VB[instream[i+2]], end=' ')
        print(RB[instream[i+3]], end='. ')

Then:

In [121]: encode(b'This is a test.')
celtic belligerence zoom daintily. noncorrosive myopia ferret institutionally. carboniferous acquiescent contemplate unemotionally. noninfectious lurker whack ---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[121], line 1
----> 1 encode(b'This is a test.')

Cell In[120], line 6, in encode(instream)
      4 print(NN[instream[i+1]], end=' ')
      5 print(VB[instream[i+2]], end=' ')
----> 6 print(RB[instream[i+3]], end='. ')

IndexError: index out of range

Oi! Because I had the strict sentence-assembly rule of ADJECTIVE NOUN VERB ADVERB, our input stream’s length can only be an integer multiple of 4, but we only have 15 bytes. So what now…

ANYWAY, let’s try binary and fix the bug later!

In [122]: encode(bytearray.fromhex('deadbeefbabe'))
spinsterish witchery mow laterally. bony oversensitiveness ----...and error

But the encoding part works in principle. To fix this, we have a few possibilities. We could introduce padding words: something which can be ignored by the decoder and must be inserted by the encoder when it tries to read beyond the end of the array.

Or we make up more sentence patterns. Or we simply do nothing, catch the exception and stop, and tell the decoder to stop at the end, no matter if it’s a valid sentence or not.

Because I’m lazy, I’m simply going for a fixed padding word, something which doesn’t show up in any dictionary; I’m gonna take =, which is, in Morse telegraphy, BT or “break”. This will only show up in the encoded part, and will be discarded when decoding.
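Just to sketch that idea (the function name is mine, not something from this post - the proper implementation comes later):

```python
PAD = '='  # padding word; never appears in any of the word lists

def pad_sentence(words, n=4):
    # Append PAD until the sentence has a multiple-of-n word count;
    # the decoder simply skips PAD when mapping words back to bytes.
    return words + [PAD] * (-len(words) % n)
```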

Also, another thing I just realized: I always append the character . after the last word of each sentence; when tackling decoding, I must make sure to strip this or my lookup in the map will fail.
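A one-liner sketch of that cleanup (again, the name is mine):

```python
def clean(word):
    # Strip a trailing '.' so 'daintily.' matches 'daintily' in the map.
    return word.rstrip('.')

assert clean('daintily.') == 'daintily'
assert clean('daintily') == 'daintily'  # words without a period pass through
```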

But first: The missing requirement is met!

  • I shall have at least 256 words of each type, so that one sentence translates to a 4-byte sequence.

But now I need to add a few more. These are not business requirements as such, but technical requirements so that I don’t screw it up.

  • When the number of words in a sentence is not a multiple of 4, pad the missing words with =.
  • Mind the stray . characters at the end of each full sentence.
  • Stay lazy until it becomes an impediment.

Tackle padding first. In my next post!