This Thread: CodeEval Levenshtein Distance

We are given in input a short list of words, a separator, then another huge list of words. For each word in the first list, we have to return the numbers of elements in its "social network" from the second list. A word is "friend" of another one if their Levenshtein distance is one. Meaning, they have just one letter difference, as "aa" with "aaa", or "aaa" with "aba". Pay attention that the example provided by CodeEval, when tested against the file they pushed on GitHub, is slightly wrong.

The problem is conceptually simple, its main issue is about performance. Let's write firstly a straightforward solution that scores poorly even when implemented in C++, and don't even get accepted in my python implementation, due to its slowness.

The definition of friendship given in this problem should let us think about graphs and cliques. Fortunately we are not asked to get the maximum clique in the network, but only the maximum clique for the few specified nodes. So we should expect "just" an Big-Oh N squared time complexity on the number of elements in the full social network that, unfortunately, we know to be large.

Still, on top on that N x N, we should consider the cost of calculating the Levenshtein distance for any couple of words. This problem leads almost naturally to a Dynamic Programming solution, however, also in this case we should expect an expensive temporary complexity, give or take in the order of m x n, where m and n are the lengths of the two words.

A Levenshtein distance of one

But, wait a minute, really we are not interested in the Levenshtein distance, just on determine if two words have it equals to one. This is a much more simple problem, and we can even break it in two parts. If the two words have the same size, we could just check if there is one and only one different character in them, i.e., we can pass from one string to the other just with one char swap:

def is_swap(lhs, rhs):
    difference = False
    for left, right in zip(lhs, rhs):  # 1
        if left != right:  # 2
            if difference:
                return False
            difference = True
    return difference  # 3

1. In production code, I would have asserted the precondition len(lhs) == len(rhs), here I just rely on the caller to do it. Assuming the same size for lhs and rhs, I zip them, extracting the characters at the same position and putting them in the cycle variables, left and right.
2. If left is not equals to right, I raise the difference flag. But if it was already raised, I know the difference is too strong, and I can stop checking.
3. If there is one and only one difference I would return True.

Other part of my simplified Levenshtein check. If I have two strings with difference in their size of one, the longest one should be the same as the shorter, but having one character more.

def is_add(shorter, longer):
    if len(shorter) > len(longer):  # 1
        shorter, longer = longer, shorter

    x = 0  # 2
    i = 0  # 3
    while i < len(shorter):
        if shorter[i] != longer[i+x]:  # 4
            if x == 1:
                return False
            x += 1
        else:
            i += 1

    return True if x < 2 else False  # 5

1. I assume the caller ensures the precondition on the parameter size, but doesn't verify which is which. This simplify the caller code, making this function a bit less readable. Anyway, the point of this check is ensuring that the string named shorter refers to the actual shorter one in input. 2. Counter for the extra characters found in the longer string. 3. Loop variable on the shorter string. The longer one will be ahead of x positions. 4. I found a difference in the strings. Add the difference counter, but if I have already seen a difference, in that case I know the two strings are too different to be friends. 5. I return True if here x is both 0 or 1. Zero means a case like "aa" vs. "aax". Each comparison in the length of shorter succeeded, however longer has a character more. One means the difference is in the middle. More than one difference is trapped in the while loop. Setting up the graph My first idea was to generate the social network graph, and then looking on it for the friends of the requested nodes. Saying that words is a dictionary, each word I get in input would be pushed in it as key, initializing its value with an empty set:

for test in file:
    words[test.rstrip('\n')] = set()

Then I call a utility function, check_friends(), to push each friend in its place - seeing it as a graph, the node is in the key, the edges are in the value.

def check_friends():
    for lhs in words.keys():  # 1
        for rhs in words.keys():
            if (len(lhs) == len(rhs) and is_swap(lhs, rhs) or  # 2
                    abs(len(lhs) - len(rhs)) == 1 and is_add(lhs, rhs)):  # 3
                words[lhs].add(rhs)
                words[rhs].add(lhs)

1. The infamous N x N loop. I check each word against all the other ones. Actually, I let is_swap also to detect if the word is tested against itself. 2. Same size, let is_swap() decide if the words are friends. 3. One character difference, ask to is_add() for friendship. Getting the clique Checking if a given word is in the full social network is easy and fast, being stored in a dictionary. Then, I apply a Breadth-first search (BFS) algorithm to navigate the graph and collecting all the friends and friends of friends:

def solution(word):
    results = []  # 1
    visited = set()  # 2

    net = deque()  # 3
    if word not in words.keys():  # 4
        return 0
    net.append(word)  # 5
    visited.add(word)

    while net:  # 6
        friend = net.popleft()
        fofs = words[friend]
        for fof in fofs:  # 7
            if fof not in visited:
                net.append(fof)
                visited.add(fof)
        results.append(friend)  # 8

    return len(results)

1. Here I am going to push all the elements in the word social network. 2. This is to avoid infinite looping. It's equivalent of marking a node as visited. 3. In this queue I push the node in the personal net for the current node that I am checking. 4. It is not officially stated in the problem that the caller could ask only for a word in the list, so I added this control. In any case it is cheap. 5. Initialize the algorithm before starting. The passed word is ready to be processed, and it is marked as visited. 6. While there are nodes to be processed, loop. 7. Push each friend of the current friend in the net, mark it as visited. 8. Push the current friend in the results set. This approach works alright, and I'm quite happy with it. However, for the way the problem is conceived, it is too slow. We need to do something to improve performances. It is clear where the bottleneck is. Generating the graph for the full social network is obviously very expensive. Do we really need to do it? Usually it would be a good idea, but here we know that we need to calculate the social network just for a handful of words. So, let's avoid to do it. A slimmer algorithm First change, word is not anymore a dictionary, but just a set. This means there is no value associated to each element, when I read a word I just push it in the set, and that's it.

for test in file:
    words.add(test.rstrip('\n'))

Also, no need to call check_friends(), instead, I do almost the same job in a get_friends() function, but only for the relevant items:

def get_friends(item):
    result = []
    for word in words:
        if (len(item) == len(word) and is_swap(item, word) or
                abs(len(item) - len(word)) == 1 and is_add(item, word)):
            result.append(word)
    return result

The changes in solution() are minimal:

def solution(word):
    results = set()
    visited = set()

    net = deque()
    if word not in words:
        return 0
    net.append(word)
    visited.add(word)

    while net:
        friend = net.popleft()
        fofs = get_friends(friend)
        for fof in fofs:
            if fof not in visited:
                net.append(fof)
                visited.add(fof)
        results.add(friend)

    return len(results)

And now CodeEval accepts happily the code. I pushed both my python solutions to GitHub, the complete but slow one, and the sleeker one, tailored on the problem requisites.

This Thread

Pages

CodeEval Levenshtein Distance

No comments:

Post a Comment

Labels

Recent comments

Tutorials - Examples - Books

Have a look at these blogs ...