Pages

CodeEval Longest Common Subsequence

The CodeEval problem #6 is commonly used as an example to introduce to Dynamic Programming.
We have two strings, we want to know how many and which characters they have in common.

On CodeEval we are given a sample, I added a second test, shorter, that helped me better to develop my python script:
def test_provided_1(self):
    self.assertEqual('MJAU', solution('XMJYAUZ;MZJAWXU'))

def test_simple(self):
    self.assertEqual('HI', solution('AHOI;HI'))
Solving this problem with DP comes quite naturally. I have two strings, I put it one vertically, the other horizontally, I leave top row and left column set to zero, just to simplify the calculations, and then I scan the matrix from top-left to right, row by row, since I get to the right-bottom corner.

Cell by cell, I check if the characters in the two strings match or not. In case of matching, I get the value of the cell one step to the left, one step over, and increase it by one. Otherwise I get the values from the cell over, the cell to the left, and I keep the highest value between the two.

Here is the matrix for my simple test above:
-  -  H  I
- [0, 0, 0]
A [0, 0, 0]
H [0, 1, 1]
O [0, 1, 1]
I [0, 1, 2]
The bottom right cell tells us that there are two matches. The algorithm I am using makes a bit more laborious to get which are them. Well, not in this case, since the shorter string has size two, we already know that "HI" is the longest common subsequence. But generally speaking, we have to backtrack the job we have done, starting from the bottom-right, observing where there is a change of value in a cell compared to its neighbors, moving up to find the next border inside the matrix.

The resulting python is quite terse, so I think could help to better understand the thing.

First part, generating the matrix. In ver and hor I put the two strings. Think that ver is "AHOI" and hor is "HI", from the example above.
t = [[0 for j in range(len(hor)+1)] for i in range(len(ver)+1)]  # 1
for i in range(len(ver)):  # 2
    for j in range(len(hor)):
        t[i + 1][j + 1] = t[i][j] + 1 if ver[i] == hor[j] else max(t[i+1][j], t[i][j+1])
1. Create an all-zero bidimensional list sized +1 of the respective referenced string.
2. Loop on all the cells, and apply the algorithm that I described above. Notice that the indexes i,j are 0-based, so they are OK when referring to the strings in ver and hor. When working on the matrix, however, I adjust them adding one to refer to the current cell, row and column. I guess this is the most confusing part. Maybe following the debugger stepping in the code could help a better understanding.

Now I prepare for backtracking. Let's set a few variables:
i, j = len(ver), len(hor)  # 1
result = [None] * t[i][j]  # 2
cur = -1  # 3
1. I let i and j refer to the bottom-right cell.
2. I am going to put the result in this list. I know the size, it is stored in the bottom-left cell, I initialize each element to None, missing any better default.
3. Since I am backtracking. I am going to put the first character in the last position, and so on.

And this is the backtracking loop. Notice that I changed the convention for i, j. Here it looked to me more natural using i and j without adjustment to refer to the cells in the matrix.
while i > 0 and j > 0:
    if t[i][j] == t[i-1][j]:  # 1
        i -= 1
    elif t[i][j] == t[i][j-1]:  # 2
        j -= 1
    else:
        result[cur] = ver[i-1]  # 3
        cur -= 1
        i -= 1
        j -= 1
1. Compare the current cell with the one above. If they are the same, I can move upward.
2. I couldn't move upward, I try to move leftward.
3. I am on the verge of changing values. This means that the letter in the string (pay attention to the index, adjusted by subtracting one!) is part of the subsequence. Store it in the result, and move on.

Finally we have just to convert the result from list to a string, the usual pythonic way is a join on an empty string.

The script has been accepted with full marks. It was just a bit slow, and it consumed lot of memory. So I decided to port the code to C++, task that was easy and rewarding. Just a few minor changes, besides the obvious differences due to the languages grammars, for instance, I found easier in the backtracking to write the result backward and the reverse it by the STL function.

For full reference, I put the unit test and the python script on GitHub.

No comments:

Post a Comment