A PGP Words variant creating natural sentences - part 3
This is part 3 of a series of articles I am currently writing. In this post I show the first fully working prototype, which encodes arbitrary binary data into sentences and can perform the reverse operation.
- Part 1: A PGP Words variant creating natural sentences - part 1
- Part 2: A PGP Words variant creating natural sentences - part 2
In the previous part of this series I showed that encoding and decoding work in principle, but I ran into problems when the number of words is not divisible by 4, since a sentence in my simple grammar structure always has 4 words.
So I adapted the for-loops a little, introduced padding, and had success!
In [228]: encode(b'This is a test.')
Out[228]: 'perfunctory sinkhole vandalize rhetorically. oxidative randier dislodge abhorrently. fluffed poncho sensationalize indoors. toxic epee stomp = = = '
In [229]: decode(_)
Out[229]: b'This is a test.'
Without further ado, the implementation looks like this.
def encode(instream):
    # check if input is a bytes or bytearray object
    assert type(instream) == bytes or type(instream) == bytearray, TypeError("input must be a bytearray")
    # prepare output string
    o = ''
    remainder = len(instream) % 4
    # each byte indexes one of four 256-word lists, cycling
    # adjective, noun, verb, adverb; the adverb closes a sentence
    for i in range(0, len(instream)):
        if i % 4 == 0:
            o += JJ[instream[i]] + ' '
        elif i % 4 == 1:
            o += NN[instream[i]] + ' '
        elif i % 4 == 2:
            o += VB[instream[i]] + ' '
        else:
            o += RB[instream[i]] + '. '
    # append one '=' marker per word in the incomplete final sentence
    for i in range(0, remainder):
        o += '= '
    return o
def decode(instream):
    # check if input is a string
    assert type(instream) == str, TypeError("input must be a string")
    # prepare output bytearray
    out = b''
    # split by words
    words = instream.split(' ')
    # get rid of padding and empty elements
    words = [i for i in words if i != '=' and i != '']
    for i in range(0, len(words)):
        w = words[i]
        # to_bytes() without arguments (length 1, big-endian) needs Python >= 3.11
        if i % 4 == 0:
            out += JJ.index(w).to_bytes()
        elif i % 4 == 1:
            out += NN.index(w).to_bytes()
        elif i % 4 == 2:
            out += VB.index(w).to_bytes()
        else:
            # strip the sentence-ending period before the lookup
            out += RB.index(w.rstrip('.')).to_bytes()
    return out
More toying around:
In [351]: encode(b'The way to get started is to quit talking and begin doing. -Walt Disney')
Out[351]: 'perfunctory sinkhole snuggle certainly. islamophobic bullet ruck certainly. raw saxophonist unwind hooky. confident epee unwind indefinably. raw bullet emend independently. confident twister unwind humanly. toxic poncho sensationalize immaterially. oxidative preform pummel humanly. raw poncho sensationalize gossipy. nontechnical foundling vandalize illogically. sickbed poncho eject illogically. electrifying poncho dispossess holly. sickbed randier elongate certainly. electrifying saxophonist vandalize illogically. sickbed egoism unwind cozily. unsightly bullet wield independently. oxidative sibylline vandalize indefinably. workaholic mishearing ruck = = = '
In [352]: decode('perfunctory sinkhole snuggle certainly. islamophobic bullet ruck certainly. raw saxophonist unwind hooky. confident epee unw
...: ind indefinably. raw bullet emend independently. confident twister unwind humanly. toxic poncho sensationalize immaterially. oxidati
...: ve preform pummel humanly. raw poncho sensationalize gossipy. nontechnical foundling vandalize illogically. sickbed poncho eject ill
...: ogically. electrifying poncho dispossess holly. sickbed randier elongate certainly. electrifying saxophonist vandalize illogically.
...: sickbed egoism unwind cozily. unsightly bullet wield independently. oxidative sibylline vandalize indefinably. workaholic mishearing
...: ruck = = = ')
Out[352]: b'The way to get started is to quit talking and begin doing. -Walt Disney'
In [353]: encode(bytearray.fromhex('deadbeef01020304'))
Out[353]: 'peacemaking micro darken unnecessarily. anglophile creativeness purl affectedly. '
In [354]: decode('peacemaking micro darken unnecessarily. anglophile creativeness purl affectedly. ')
Out[354]: b'\xde\xad\xbe\xef\x01\x02\x03\x04'
But does it work with binary data and in all edge cases? I can imagine the following edge cases (a sketch of them as assertions follows the list).
- Random bytes not being handled correctly; what about 0x00 or 0xff?
- What if I pass in an empty byte array?
- Byte arrays of length 1, 2, or 3, which are not integer multiples of four?
- What if the byte array is longer than max int?
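All but the last case are easy to write down as round-trip assertions. A minimal sketch, assuming the encode and decode functions from above are in scope:

# round-trip checks for the suspected edge cases
assert decode(encode(bytearray.fromhex('00ff'))) == b'\x00\xff'  # 0x00 and 0xff
assert decode(encode(bytes())) == b''                            # empty input
for n in (1, 2, 3):                                              # lengths not divisible by 4
    payload = bytes(range(n))
    assert decode(encode(payload)) == payload

The interactive versions of the same tests follow.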
Test 1: 0x00ff
In [236]: encode(bytearray.fromhex('00ff'))
Out[236]: 'contumelious nunnery = = '
In [237]: decode(encode(bytearray.fromhex('00ff')))
Out[237]: b'\x00\xff'
Test 2: Empty bytes or byte array:
In [240]: decode(encode(bytearray()))
Out[240]: b''
In [241]: decode(encode(bytes()))
Out[241]: b''
Test 3: Lengths 1, 2 and 3, which are not multiples of four
In [242]: decode(encode(b'1'))
Out[242]: b'1'
In [243]: decode(encode(b'12'))
Out[243]: b'12'
In [244]: decode(encode(b'123'))
Out[244]: b'123'
Longer than max int: Well, in Python integers are unbounded, but there is still the machine word size, which on my machine is:
In [246]: sys.maxsize
Out[246]: 9223372036854775807
Nope, I’m not gonna test that.
But regarding arbitrary binary data, we could still do some fuzzing, just because I can and I should. So let's whip up a simple function generating random byte arrays of arbitrary length and see what happens.
Let gen_random_bytes be the function:
import random

def gen_random_bytes(length=16):
    o = b''
    # note: range(0, length-1) yields length-1 bytes, one fewer than asked for
    for i in range(0, length-1):
        o += random.randint(0, 255).to_bytes()
    return o
And toy around with it.
In [250]: t = gen_random_bytes(); t
Out[250]: b'\x11|`;=\xa8\t\n\xbd\xd6\x1e\xbc\xce\xac#'
In [251]: decode(encode(t)) == t
Out[251]: True
More…
In [269]: t = gen_random_bytes(2048)
In [270]: e = encode(t)
In [271]: d = decode(e)
In [272]: d == t
Out[272]: False
Oi! What’s going on here… I always suspect a length mismatch first, or the first and last byte being different, but…
In [303]: t[0], d[0]
Out[303]: (89, 89)
In [304]: t[-1], d[-1]
Out[304]: (173, 173)
In [305]: len(t), len(d)
Out[305]: (2047, 2047)
OK, byte-by-byte comparison:
In [307]: for i in range(0, len(t)-1):
...: if t[i] != d[i]:
...: print("mismatch at offset {}; in t: {}, in d: {}".format(i, t[i], d[i]))
...:
mismatch at offset 75; in t: 242, in d: 119
mismatch at offset 699; in t: 242, in d: 119
mismatch at offset 823; in t: 242, in d: 119
What? Now this is kind of unexpected… So 242 encodes to a word which is decoded to 119? In what dictionary does it show up?
In [308]: 75 % 4, 699 % 4, 823 % 4
Out[308]: (3, 3, 3)
So it’s when i % 4 == 3, which is the final else clause in encode, and that takes care of encoding the adverbs from the RB array. Let’s check what’s in the array at these positions:
In [309]: RB[242]
Out[309]: 'y'
In [310]: RB[119]
Out[310]: 'y'
Hahahahaha! OK, so we have duplicate words in the same list, so the bijection is broken :-) Remember back in the last part, where I simply did RB = random.sample(en_RB, 256) to create a dictionary from all adverbs? I did not check for duplicates.
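As an aside, such collisions are easy to spot with the standard library. A quick sketch, assuming RB is the 256-entry adverb list:

from collections import Counter

# list every word that occupies more than one slot in RB;
# any entry here breaks decode()'s list.index() inversion
dupes = {w: n for w, n in Counter(RB).items() if n > 1}
print(dupes)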
But this bug needs to be fixed properly later; for now I will simply regenerate the list in the hope it's gonna fix itself :-)
In [320]: RB = random.sample(en_RB, 256)
In [320]: RB.sort()
In [321]: RB
....lots of stuff....
'withal',
'wittily',
'y',
'y',
'y']
Oh noes! I made it worse! How often does it even show up?
In [344]: [(i, en_RB[i]) for i in range(0, len(en_RB)) if en_RB[i] == 'y']
Out[344]:
[(1051, 'y'),
(1141, 'y'),
(1214, 'y'),
(1241, 'y'),
(1308, 'y'),
...
In [345]: len([(i, en_RB[i]) for i in range(0, len(en_RB)) if en_RB[i] == 'y'])
Out[345]: 65
Ugh… OK. You know what: It’s Saturday, and I’m gonna continue on a later day :-)
So. What did I achieve:
- When the number of words in a sentence is not a multiple of 4, pad the missing words with =.
- Mind the stray . characters at the end of each full sentence.
- Stay lazy until it becomes an impediment.
So the staying lazy part was not achieved because I found out that the quality of my word list isn’t good enough!
So I need to write down a few more requirements:
- Make sure all word lists contain only unique words (a sketch follows this list)
- Repeat the tests and prove that round-tripping works for arbitrary byte sequences
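A minimal sketch of how both requirements could be checked, assuming the word lists JJ, NN, VB, RB and the functions from above are in scope (check_unique and fuzz_roundtrip are hypothetical names):

import random

def check_unique(*wordlists):
    # each list must map the 256 byte values to 256 distinct words,
    # otherwise list.index() cannot invert the encoding
    for wl in wordlists:
        assert len(wl) == 256
        assert len(set(wl)) == 256, "duplicate words break the bijection"

def fuzz_roundtrip(rounds=1000):
    # random byte strings of random length must survive a round trip
    for _ in range(rounds):
        t = bytes(random.randint(0, 255) for _ in range(random.randint(0, 64)))
        assert decode(encode(t)) == t, t

check_unique(JJ, NN, VB, RB)
fuzz_roundtrip()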
Also, as a stretch goal: I figured out that many words in the list are arcane or pretty uncommon. I might replace the word list altogether. I looked into several other word lists, and I might use the corpus “Simple English Wikipedia” by the University of Leipzig [^1].
New stretch goals:
- Replace the corpus with Simple English Wikipedia by the Uni Leipzig
- For every word class, choose the most popular ones to avoid having arcane words (see the sketch below)
- Manually inspect every list for duplicates or false classifications. I've seen ambiguous words classified as adverbs that were in fact nouns.
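A hypothetical sketch of that selection step, assuming the corpus has been exported to a tab-separated file of word and frequency; the file name and format are my assumptions, not the actual layout of the Leipzig download:

def top_words(path, n=256):
    # read "word<TAB>frequency" lines and keep the n most frequent words;
    # collecting into a dict also guarantees the result has no duplicates
    freq = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            word, count = line.rstrip('\n').split('\t')
            freq[word] = freq.get(word, 0) + int(count)
    return sorted(freq, key=freq.get, reverse=True)[:n]

# e.g. RB = top_words('simple_wikipedia_adverbs.tsv')  # hypothetical file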
Read you next time!
[^1]: D. Goldhahn, T. Eckart & U. Quasthoff: Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In: Proceedings of the 8th International Language Resources and Evaluation (LREC'12), 2012. URL: https://wortschatz.uni-leipzig.de/en/download/