A PGP Words variant creating natural sentences - part 4
This is part 4 of a series of articles I am currently writing. As we saw in the last part, while I was able to encode and decode arbitrary binary data into and from grammatically correct English sentences, my ‘corpora’ (word lists) are of questionable quality. In this issue I will try to generate better word lists, eliminating the problem of duplicate words, and hopefully pick easier words.
- Part 1: A PGP Words variant creating natural sentences - part 1
- Part 2: A PGP Words variant creating natural sentences - part 2
- Part 3: A PGP Words variant creating natural sentences - part 3
As we saw in the last posting, I had the problem that words showed up multiple times in a word list. I only noticed this in one list - the adverb list RB - but I can’t rule out that the same happens in other lists.
Let’s first try to quantify how often the problem occurs. For this, I will count the occurrences of every word in all lists. If every word occurs only once, the list is good.
This is a common task I do very often when analyzing cryptography
puzzles or when trying to crack historic cryptograms, so let me show the
helper function, gen_hist, I’m using.
def gen_hist(s):
    """Accepts a flat list and returns a histogram as a dictionary."""
    hist = {}
    for i in s:
        if i not in hist:
            hist[i] = 1
        else:
            hist[i] += 1
    return hist
Let me give you a very simple example:
In [17]: gen_hist([1, 2, 2, 3, 3, 3, 4, 5, 6, 6, 7])
Out[17]: {1: 1, 2: 2, 3: 3, 4: 1, 5: 1, 6: 2, 7: 1}
For a list of words:
In [21]: gen_hist(["Hello", "Hello", "World", "why", "why", "is", "it", "bloody", "bloody", "raining", "again"])
Out[21]:
{'Hello': 2,
'World': 1,
'why': 2,
'is': 1,
'it': 1,
'bloody': 2,
'raining': 1,
'again': 1}
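A quick aside: the same histogram can also be built with collections.Counter from Python’s standard library - a minimal sketch:

from collections import Counter

# Counter is a dict subclass mapping elements to their counts,
# just like the dictionary gen_hist returns
Counter(["Hello", "Hello", "World", "why", "why"])
# Counter({'Hello': 2, 'why': 2, 'World': 1})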
Now let’s do the following:
- create a histogram of all four lists, JJ, NN, RB and VB
- show all words which show up more than once
In Python, it’s a one-liner - perhaps a little hard to read, but just try it out yourself with a few variants and you’ll get the hang of it:
In [54]: {word: count for word, count in sorted(gen_hist(JJ).items(), key=lambda item: item[1]) if count > 1}
Out[54]: {}
In [55]: {word: count for word, count in sorted(gen_hist(NN).items(), key=lambda item: item[1]) if count > 1}
Out[55]: {}
In [56]: {word: count for word, count in sorted(gen_hist(RB).items(), key=lambda item: item[1]) if count > 1}
Out[56]: {'y': 2}
In [57]: {word: count for word, count in sorted(gen_hist(VB).items(), key=lambda item: item[1]) if count > 1}
Out[57]: {}
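If the one-liner is too dense for your taste, the same thing can be wrapped into a small named helper - a sketch (find_duplicates is my own naming):

def find_duplicates(words):
    """Return {word: count} for every word occurring more than once."""
    return {w: c for w, c in gen_hist(words).items() if c > 1}

Note that the sorted(..., key=lambda item: item[1]) in the one-liner only orders the output by count; it is not needed for finding the duplicates.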
So, just like the old saying “It’s always DNS!”, here it’s always the adverbs - and the ‘y’, though I have no idea why it’s showing up in the first place. Instead of sampling again from the original list while eliminating duplicates, I will fix the original list itself.
Let’s identify all the duplicates there!
In [59]: {word: count for word, count in sorted(gen_hist(en_RB).items(), key=lambda item: item[1]) if count > 1}
Out[59]:
{'not': 2,
'meany': 2,
'now': 2,
'rather': 2,
'ruby': 2,
'smokey': 2,
'snoopy': 2,
'downy': 2,
'ly': 7,
'y': 65}
Tbh, I have no clue why ‘ly’ and ‘y’ show up in this list, but I have a feeling that these are helper tokens (suffixes) the Hannover Tagger uses to identify adverbs derived from a stem (‘sufficient-ly’, you get the idea). Let’s simply drop the ones which look suspicious. I would say ‘y’ and ‘ly’ can be dropped, but ‘not’ and ‘now’ should stay (‘English ships sink now/not’).
In [71]: new_en_RB = [word for word, count in sorted(gen_hist(en_RB).items(), key=lambda item: item[1]) if count == 1]
In [72]: {word: count for word, count in sorted(gen_hist(new_en_RB).items(), key=lambda item: item[1]) if count > 1}
Out[72]: {}
Ha! But now I have lost the words I wanted to keep. Luckily there are not many, so I simply add them again.
In [73]: new_en_RB += ['not', 'meany', 'now', 'rather', 'ruby', 'smokey', 'snoopy', 'downy']
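For the record, the whole drop-and-re-add dance could also be done in one step with a set - a sketch:

# keep exactly one copy of every word, then subtract the suspicious tokens
new_en_RB = sorted(set(en_RB) - {'y', 'ly'})

But the histogram route shows more explicitly what is going on.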
Boom. Now, finally, pick a new adverb list with random words and check:
In [76]: RB = random.sample(new_en_RB, 256)
In [77]: {word: count for word, count in sorted(gen_hist(RB).items(), key=lambda item: item[1]) if count > 1}
Out[77]: {}
Great! So now all the tests should pass!
Let’s get the binary fuzzer up again:
import random

def gen_random_bytes(length=16):
    """Return `length` random bytes."""
    o = b''
    for i in range(length):
        o += random.randint(0, 255).to_bytes(1, 'big')
    return o
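By the way, for a job like this the standard library can also generate the bytes directly - random.randbytes() exists since Python 3.9, and os.urandom() everywhere:

import os
import random

r1 = random.randbytes(16)   # Python 3.9+, PRNG-based
r2 = os.urandom(16)         # any Python 3, OS entropy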
Let’s try this:
In [82]: r = gen_random_bytes(2048)
In [83]: r == decode(encode(r))
Out[83]: True
In [84]: r = gen_random_bytes(100000)
In [85]: r == decode(encode(r))
Out[85]: True
Toot, toot! But then I thought of something else entirely. While generating lots
of random input values (‘fuzzing’) tests for unexpected behavior, I should also try
all possible values 0-255 in all four dictionaries. So I should create a
list of inputs of the form ${A}${B}${C}${D}, with A, B, C and D each running through all
values from 0-255. That makes sure that every word in every list has been
used at least once.
However, we do not need to generate $2^{32}$ inputs; 1024 will be enough, since we can go octet by octet, simply setting the other three to 0.
To do this, I will use a really dirt-cheap trick using NumPy, the identity matrix and a simple loop. Why? Because I like to abuse useful things for purposes nobody ever thought about and because my brain works funny.
In [118]: import numpy as np
In [119]: t = np.identity(4, dtype=np.uint8)
In [120]: t
Out[120]:
array([[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1]], dtype=uint8)
In [121]: bytes(t[1])
Out[121]: b'\x00\x01\x00\x00'
In [122]: bytes(t[1]*0)
Out[122]: b'\x00\x00\x00\x00'
In [123]: bytes(t[1]*255)
Out[123]: b'\x00\xff\x00\x00'
See the point? So:
In [142]: test_pattern = []
In [143]: for pos in range(0,4):
...: for i in range(0,256):
...: cur_pattern = bytes(t[pos] * i)
...: test_pattern += [ cur_pattern ]
...:
In [145]: test_pattern[0]
Out[145]: b'\x00\x00\x00\x00'
In [146]: test_pattern[255]
Out[146]: b'\xff\x00\x00\x00'
In [147]: test_pattern[256]
Out[147]: b'\x00\x00\x00\x00'
In [148]: test_pattern[257]
Out[148]: b'\x00\x01\x00\x00'
In [149]: test_pattern[1023]
Out[149]: b'\x00\x00\x00\xff'
There! A full list of all test patterns! Now please don’t moan. I know there are a gazillion other ways to generate it - bit-shifting, for-loops and whatnot. But I like the idea of multiplying an identity matrix with a scalar to get a 32-bit test pattern!
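For comparison, here is what the boring pure-Python equivalent would look like - a sketch producing the same 1024 patterns in the same order:

# for each position, run one octet through 0-255 and keep the rest at 0
test_pattern = [
    bytes(i if p == pos else 0 for p in range(4))
    for pos in range(4)
    for i in range(256)
]

But where is the fun in that?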
Now run the tests:
In [150]: for i in test_pattern:
...: if i != decode(encode(i)):
...: print("Error for pattern {}".format(i))
...:
In [151]:
OK, we’re good!
Enough for today. I eliminated the duplicates and found a really odd way to generate test patterns; I stayed lazy and did not work on any stretch goals. But let’s wrap it up:
- Stay lazy until it becomes an impediment.
- Make sure all word lists have only unique words.
- Repeat the tests and prove that it works for all byte sequences.
The gist: I now have fully and correctly working encoding and decoding logic. I can translate arbitrary binary data to English sentences and back again.
But during development I figured out two things: I do not even need my padding scheme, as my decoder simply drops it. Additionally, I really need to refactor the code so that I can properly publish it. So:
- remove the padding scheme, as it’s not necessary
- refactor the code; it shall be a self-sufficient script accepting input from a file or stdin, and writing to stdout or a file (a sketch follows below)
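To make the plan concrete, here is a rough sketch of what such a script could look like - assuming encode() maps bytes to a string of sentences and decode() maps such a string back to bytes (the option names are my own placeholders):

#!/usr/bin/env python3
"""Rough sketch only - encode()/decode() are the functions from the
previous parts and are assumed to be importable."""
import argparse
import sys

def main():
    parser = argparse.ArgumentParser(
        description="Translate binary data to English sentences and back")
    parser.add_argument("-d", "--decode", action="store_true",
                        help="decode sentences back to binary")
    parser.add_argument("infile", nargs="?",
                        type=argparse.FileType("rb"), default=sys.stdin.buffer)
    parser.add_argument("-o", "--outfile",
                        type=argparse.FileType("wb"), default=sys.stdout.buffer)
    args = parser.parse_args()
    data = args.infile.read()
    if args.decode:
        args.outfile.write(decode(data.decode("utf-8")))
    else:
        args.outfile.write(encode(data).encode("utf-8"))

if __name__ == "__main__":
    main()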
Stretch goals, none reached:
- Replace the corpus with Simple English by the Uni Leipzig.
- For every word class, choose the most popular ones to avoid having arcane words.
- Manually inspect every list for duplicates or false classifications. I’ve seen ambiguous words classified as adverbs which were in fact nouns.
And while I was waiting to fall asleep last night I was pondering… Why stop here? Why not create limericks? Hahaha, I don’t know… Let’s see :-)
Until next time!