Pages

CodeEval Burrows-Wheeler transform

The CodeEval problem #184 asks us to generate the original text from an input line representing the result of a Burrows-Wheeler transformation.

The CodeEval page explains in a few words what is that transformation and what it useful for. Even if could look a bit unclear at first, it is quite easy to write a function that operates the transformation. Here is my python 3 code for it:
def bw_transform(s):
    size = len(s)
    rots = sorted([s[i:size] + s[0:i] for i in range(size)])
    result = [rot[-1] for rot in rots]
    return ''.join(result)
The input line is rotated by one char at time, and all these rotations are stored in a list, that get sorted. Then I extract the last character in each rotation to another list that, joined, gives the result.

However our job is to reverse the result of the transformation, and that leads to a piece of code more complicated. Actually, I'm not sure the fuction I get could be considered a good result. It gets accepted by CodeEval, though.

Here is the test case I used to help me getting through the job:
def test_provided_1(self):
    self.assertEqual('Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo$',
        solution('oooooooo$  ffffffff     ffffffffuuuuuuuuaaaaaaaallllllllbbBbbBBb'))

def test_provided_2(self):
    self.assertEqual('James while John had had had had had had had had had had had a better effect on the teacher$',
        solution('edarddddddddddntensr$  ehhhhhhhhhhhJ aeaaaaaaaaaaalhtf thmbfe           tcwohiahoJ eeec t e '))

def test_provided_3(self):
    self.assertEqual('Neko no ko koneko, shishi no ko kojishi$', solution('ooooio,io$Nnssshhhjo  ee  o  nnkkkkkkii '))

def test_extra(self):
    self.assertEqual('easy-peasy$', solution('yyeep$-aass'))
As you can see also from the tests, in this implementation of the Burrows-Wheeler algorithm there is the trick of adding a marker, here a dollar sign, at the end of the line. That's why I defined a constant in my script:
EOL = '$'
Then I create a matrix, initialized as a list of empty lists, then, for each element, I repeatedly inserted at the beginning of each rotation each character in line.
But notice that, after each run of the inner for loop, I sort the rotations:
def solution(line):
    rots = [[] for _ in line]
    for _ in line:
        for ch, rot in zip(line, rots):
            rot.insert(0, ch)
        rots.sort()
So, in the end, rots contains all the rotations of the original string.

Now I should find out the right rotation. But that's easy, since I used the EOL marker to identify it.
for candidate in rots:
    if candidate[-1] == EOL:
        return ''.join(candidate)
It is just a matter of selecting the rotation that has it as its last character.

CodeEval took quite a longish time to give its positive response on this solution. I think I should work more on it to get a better result. Anyway, I put the test case and the python script I have written on GitHub.

No comments:

Post a Comment