Hackerrank Tries: Contacts

We are asked to provide a way to store contacts and check how many of them starts with a given prefix. As the name of this HackerRank problem suggests, we are gently pushed in the direction of implementing a trie. The issue is that one of the test cases that the solution has to pass to be accepted - #12 - is a bit too heavy for a plain python implementation.

My initial idea was to create a trie based on this node structure.
class Node:
    def __init__(self):
        self.count = 1
        self.children = {}

trie = Node()
An empty trie is just a node containing an empty dictionary of children.
If I add a new item on it, say a contact named 'x', I create a children with key 'x' and as value a Node created by the above constructor.
If I add a second item with the same starting letter, say 'xy, I increase the counter for the 'x' element and then create a 'y' element as its child.

I need two functions to implement the required interface, add() and find():
def add(node, data):
    pass


def find(node, data):
    pass
Given this skeleton, I written a python test unit based on the sample given by HackerRank:
def test_provided_1(self):
    add(trie, 'hack')
    add(trie, 'hackerrank')
    self.assertEqual(2, find(trie, 'hac'))

def test_provided_2(self):
    add(trie, 'hack')
    add(trie, 'hackerrank')
    self.assertEqual(0, find(trie, 'hak'))
I played a bit around, in the end I came out with this implementation:
def add(node, name):
    for letter in name:
        sub = node.children.get(letter)  # 1
        if sub:
            sub.count += 1
        else:
            sub = node.children[letter] = Node()  # 2
        node = sub  # 3
1. I am looping on each letter of the passed contact name, if the children dictionary in this level contains the relative letter, I access that node and increase its counter.
2. Otherwise I create a new node in the dictionary using as key the current letter.
3. Then I follow the recursive trie structure passing to the node in the level below.

Finding the counter associated to a prefix in a trie is just a matter of following its path in the tree.
def find(node, data):
    for letter in data:
        sub = node.children.get(letter)  # 1
        if not sub:
            return 0
        node = sub
    return node.count  # 2
1. If we don't find a letter in the path we are following in a matching level of the tree, there is no contact with such prefix. Otherwise we get the node among the children and pass to the level below.
2. If we have not intercepted a missing letter, the counter associated to the current node is the answer we want pass back to the caller.

This works fine, it is just a bit slow, with all the nodes that we create for each name we push in the trie.

We need a faster solution. A way would be changing programming language, and it looks very easy to convert this code in plain C. However, I had an idea that could make the algorithm faster.

Basically, I want to reduce the number of nodes created. When I insert the first name, for instance 'hack', I don't know if I really need to split it letter by letter. Maybe it will be the only contact starting by 'h'. So, I could keep it in a single chunk. Just one node instead of four. Good deal.
I pay it when I insert 'hackerrank', here I am forced to split 'hack' letter by letter (actually, I could have been more conservative, and keep the common prefix 'hack' in a single node. I kept it as a note. If the simpler split-all technique I went for was not enough, I would have tried this other approach). Still I let the tail of the new name, 'errank', in a single node.

Let's try it. No pun intended.

I have to store the actual piece of data in any Node:
class Node:
    def __init__(self, data=None):
        self.count = 1
        self.data = data
        self.children = {}

trie = Node()
Adding gets more complicated, since I have to think about splitting:
def add(node, name):
    for i in range(len(name)):  # 1
        sub = node.children.get(name[i])
        if not sub:
            node.children[name[i]] = Node(name[i:])  # 2
            return

        sub.count += 1
        if sub.data == name[i:]:  # 3
            return
        elif len(sub.data) > 1:  # 4
            split = sub.data[1:]
            sub.data = sub.data[0]
            sub.children[split[0]] = Node(split)
        node = sub  # 5
1. I can't use anymore the handy for-each loop, I must fallback to a plain for in range one since I need the current position in the word to split it when required.
2. When I detect a missing node I created it. The node is not associated anymore to a single letter in the name, but to the full chunk starting from the current position to the end of the word.
3. When I get a match, things are getting a bit more complicated. But not in this specific case, when the tail of the current word is a match with the content of the current node. Job done!
4. Here is the real extra work. The current node contains a chunk of two or more letters, and they don't match with the chunk from the new name. I need to split, I don't do any check on what I am splitting, I simply slice off one letter from the current chunk, let the first one in this node and put the rest in the child relative to the first letter of the split.
5. Back on the original track. I move to the node in the level below, and I am ready to the next step.

Finding stays simple, just a minor change:
def find(node, data):
    for i in range(len(data)):  # 1
        node = node.children.get(data[i])
        if not node:
            return 0
        if node.data.startswith(data[i:]):  # 2
            break
    return node.count
1. We need to know the position in the current word, here too.
2. We find the node when the data in it contains the current chunk of the new name, starting from its current position to its end.

Now the algorithm is fast enough so that also this python implementation gets fully accepted. I pushed on GitHub both versions, the "plain" one and the "chunk" one, and the relative unit tests.

No comments:

Post a Comment