This Thread: 2013

Dominator by golden leader

I want to check if in a collection of integers there is a value that occurs more than half the times. I have already given a solution to this problem in the previous post, but I need something smarter, linear in time and constant in space complexity.

To achieve this result I am about to implement an algorithm based on the discussion that you can read in a paper available on Codility. Look at the end of that document, for a Python function named goldenLeader.

In a few words, the idea is that if a dominator exists, we could discard couples of different elements, and in the end, we should ends up with having spared at least one dominator element. If no dominator exists, we'll find out we have throw away all the elements looking for it.

However, our function should not return just the dominant value, but an index of the original vector that refers to an element with such a value. So we need to twist a bit that algorithm.

typedef std::pair<unsigned, int> ValInd; // 1

int solution(const std::vector<int>& input)
{
    int count = 0; // 2
    ValInd candidate; // 3
    for (unsigned i = 0; i < input.size(); ++i)
    {
        if (count == 0) // 4
        {
            count = 1;
            candidate.first = input[i];
            candidate.second = i;
        }
        else // 5
        {
            count += (candidate.first == input[i]) ? 1 : -1;
        }
    }

    if (count == 0) // 6
        return -1;

    if (std::count(input.begin(), input.end(), candidate.first) > (int) input.size() / 2)
        return candidate.second;

    return -1;
}

1. I need to remember the current value and the index where I have found it.
2. I am going to count how many elements, supposedly being dominators, I have found that are not yet matched with non-dominator. I could have pushed them on a stack, but it is obviously cheaper doing in this way.
3. Here I am going to store the current candidate.
4. No previous candidate survived the looping, the current element is chosen.
5. I have a candidate, I compare it against the current element. If they have the same value, I increase the counter, otherwise I decrease it.
6. If there is no pending candidate, I am sure there is no dominant value. Otherwise I count how many elements with the candidate value are in the input collection. If they are enough, bingo.

Go to the full post

Dominator

We have a collection of integers (whichever 32 bit signed int is possible, and up to one million of them), we want to know if there is a dominant value in it. Meaning, a value that occurs more than half the times. If so, the index of one of such item (anyone would do) should be returned. Otherwise a -1 would mean "no dominant value".

To make things a bit more spicy, linear time and constant space complexity is required.

I found this problem in the Codility train page, section Leader, under the nickname Dominator. There you would find more details and the free words description of the test case I have here translated for C++ and GoogleTest:

int solution(const std::vector<int>& input);

TEST(Domi, Given)
{
    int h[] = { 3, 4, 3, 2, 3, -1, 3, 3 };
    std::vector<int> input(h, h + 8);

    ASSERT_EQ(3, input[solution(input)]);
}

More than half the elements in the collection have value 3, so I expect as output the zero-based index of an element whose value is 3.

Not considering the space complexity requirement, is not difficult to find an (almost) suitable solution.

What I would do, is creating a map where the values of the input vector are associated to a counter that keep track of their sequence. Or better, since we want to know also the index of one of such elements, besides the counter I am going to store also the first index I get.

Then it would be just a matter of scanning the map.

Here is a possible C++98 implementation:

typedef std::pair<unsigned, unsigned> CountInd; // 1
typedef std::map<unsigned, CountInd> Counter; // 2

int solution(const std::vector<int>& input)
{
    Counter counter;
    for (unsigned i = 0; i < input.size(); ++i)
    {
        Counter::iterator pos = counter.find(input[i]);
        if (pos != counter.end()) // 3
            ++pos->second.first;
        else
            counter[input[i]] = std::pair<unsigned, unsigned>(1, i);
    }

    for (Counter::const_iterator it = counter.begin(); it != counter.end(); ++it)
    {
        if (it->second.first > input.size() / 2)
        {
            return it->second.second;
        }
    }

    return -1;
}

1. The value in the map would be a pair of frequency and index in the original vector.
2. I don't care about keeping the elements in the map ordered. If I were allowed to use C++11, the STL unordered_map would have been a much better choice. I am paying in term of time complexity something that I am not using. On the other hand, defining a custom hash map for this problem would be kind of an overkilling.
3. If I have already added an element in the map for the current value, I have just to increase its counter. Otherwise I add a new element where the key is the current input value, and the value is a pair having as first the current counter for this element (one), and as second the index to the current element in the input vector.

This solution scores full marks on Codility, but we know that is not the right answer. Time complexity is so close to be linear that their tested didn't detect the logarithmic multiplicator caused by the STL map. The real issue is the space complexity (even though not detected, too) that is linear instead of constant.

To achieve this result, we could apply an astute algorithm that you could see described in this linked document from Codility, where is given in a Python implementation in a function called goldenLeader.

I am going to use this hint to refactor my solution.

Go to the full post

Fish eat fish - linear solution

A bunch of cannibal fishes move up and downstream in a narrow river following the rules that I explained in the previous post. There I also showed a few test cases (C++ and GoogleTest) and a suboptimal first solution. Here I present a better algorithm that achieves an asymptotic linear time complexity even in the worst case.

The previously shown solution has the issue of repeating on and on checks on the same couple of fishes. We can avoid this delaying the comparison until it becomes a strict necessity.

The trick that I use here is avoid doing anything with any fish going downstream but storing them in a stack. The real job is done for each fish going upstream. If there is no one coming from the other direction, it survives for sure, otherwise it should decide its fate with a comparison against all the fishes previously stored in the stack, starting from the one closest to it (that's why a stack is a good choice as container).

Here is how I implemented this algorithm:

int solution(std::vector<int>& sizes, std::vector<int>& directions)
{
    int count = 0; // 1
    std::stack<int> goingDown; // 2
    for (unsigned i = 0; i < sizes.size(); ++i)
    {
        if (directions[i] == 1) // 3
        {
            goingDown.push(sizes[i]);
            continue;
        }

        if (goingDown.empty()) // 4
        {
            ++count;
            continue;
        }

        while (!goingDown.empty()) // 5
        {
            if (sizes[i] > goingDown.top()) // 6
                goingDown.pop();
            else
                break;
        }
        if (goingDown.empty()) // 7
            ++count;
    }
    return count + goingDown.size(); // 8
}

1. I don't need anymore to store the state of each fish, I could just keep track of the number of fishes going upstream that survived.
2. Here I push each fish that is going downstream.
3. The current fish is going downstream, I just have to push it in the stack and go to the next iteration.
4. The current fish is going upstream. If there is no previous fish moving in the other direction, he made it. I increase the counter and go to check the next one.
5. Otherwise, the current fish has to prove to be bigger than any fish in the stack to survive.
6. If the current fish is bigger that the current top in the stack, pop it (that is, eat it). Otherwise, it is the one to be eaten, and I should stop looping on the stack.
7. If the stack of downwards-going fishes is empty, this means that the current fish has eaten them all up (or it was so lucky to find no traffic). So I can increase the counter.
8. In "count" I have the number of all the upwards-going fishes that made it. I just have to add to it the numbers of the downwards-going fishes to get their current total number.

The C++ source code for both solutions and relative test cases is on github.

Go to the full post

Fish eat fish

We have a bunch of fishes (there could be up to 100,000) in a very narrow river, each of them with a different weight (represented by a non negative integer up to one billion). Any single fish move upstream or downstream (0 means up, 1 down). If two fishes meet (one moving up, the other moving down) the smaller one is eaten by the bigger. How many fishes are going to survive?

You can find this problem in the Codility train page, under the Stacks and Queues section, nickname Fish.

To better think about a solution, I have taken the one proposed in the problem description and converted it to a GoogleTest for C++, then adding the ones that looked interesting to me. Here is a selection of them:

int solution(std::vector<int>& sizes, std::vector<int>& directions);

TEST(Fish, Given) // 1
{
    int a[] = { 4, 3, 2, 1, 5 };
    std::vector<int> sizes(a, a + 5);

    int b[] = { 0, 1, 0, 0, 0 };
    std::vector<int> directions(b, b + 5);

    ASSERT_EQ(2, solution(sizes, directions));
}

TEST(Fish, SameDirection) // 2
{
    int a[] = { 4, 3, 2, 1, 5 };
    std::vector<int> sizes(a, a + 5);

    std::vector<int> directions(5);

    ASSERT_EQ(5, solution(sizes, directions));
}

TEST(Fish, Mixed) // 3
{
    int a[] = { 4, 3, 2, 1, 5 };
    std::vector<int> sizes(a, a + 5);
    int b[] = { 0, 1, 0, 1, 0 };
    std::vector<int> directions(b, b + 5);

    ASSERT_EQ(2, solution(sizes, directions));
}

TEST(Fish, NoMeeting) // 4
{
    int a[] = { 4, 3, 2, 1, 5 };
    std::vector<int> sizes(a, a + 5);
    int b[] = { 0, 0, 0, 1, 1 };
    std::vector<int> directions(b, b + 5);

    ASSERT_EQ(5, solution(sizes, directions));
}

TEST(Fish, Pilot) // 5
{
    int a[] = { 1, 5, 3, 4, 2 };
    std::vector<int> sizes(a, a + 5);
    int b[] = { 1, 1, 0, 0, 0 };
    std::vector<int> directions(b, b + 5);

    ASSERT_EQ(2, solution(sizes, directions));
}

1. All the fish is going upstream but the second one, that weights 3. It is going to eat the third and forth, but it is going to be eaten by the fifth. In the end just two fishes are going to survive, the first and the last one.
2. The fish is compactly moving upstream, all of them should survive.
3. Mixed situation, only two fishes are going to survive.
4. The school is parted in two groups that are not meant to cross.
5. This is the one I found most interesting. The first two fishes are moving downwards, against the other three ones. The first one is so thin that would usually have no chance to survive, but it has just after it a big brother that is going to shield it to any attack. So, against all odds, it survives too.

Then I have written my first solution, based on the idea of checking each element against any other fish it is going to meet, keep track of their status in a vector of boolean and then sum up all the surviving fishes.

int solution(std::vector<int>& sizes, std::vector<int>& directions)
{
    std::vector<bool> alive(sizes.size(), true); // 1

    for (unsigned i = 0; i < sizes.size(); ++i) // 2
    {
        if (directions[i] == 0) // 3
        {
            for (int j = i - 1; j >= 0; --j)
            {
                if (directions[j] == 1)
                    alive[sizes[i] < sizes[j] ? i : j] = false;
                if (sizes[i] < sizes[j])
                    break;
            }
        }
        else // 4
        {
            for (unsigned j = i + 1; j < sizes.size(); ++j)
            {
                if (directions[j] == 0)
                    alive[sizes[i] < sizes[j] ? i : j] = false;
                if (sizes[i] < sizes[j])
                    break;
            }
        }
    }

    return std::accumulate(alive.begin(), alive.end(), 0); // 5
}

1. Initially, are the fishes are alive.
2. I am checking any fish against the ones that are upwards or downwards, accordingly to its moving direction.
3. My current fish is moving upstream. Let's compare it against each fish on its left, starting from the closest one. I break the loop in case it gets eaten by a bigger fish.
4. Same as above, but the current fish has to fight against all the other guys downstream.
5. Finally it is just a matter of counting how many of them are still alive.

Normally this solution works fine, however in the worst case scenario the internal loop involves a number of fish in the order of the collection size, leading to an O(N**2) time complexity that makes this algorithm unacceptably slow.

Think for instance to this innocuous-looking test case:

TEST(Fish, Big)
{
    std::vector<int> sizes(100000);
    for(unsigned i = 0; i < sizes.size(); ++i)
        sizes[i] = i + 1;
    std::vector<int> directions(100000);

    ASSERT_EQ(100000, solution(sizes, directions));
}

One hundred thousand fishes, all of them going in the same direction. A human being could give the answer in the blink of an eye. The above solution, however, painfully check all the right elements to any fish in the school.

Can we do better than this? Yes. As you would see in the next post, it is just a matter of adding a bit of intelligence to the algorithm to get, even in the worst cases, a linear(ish) time complexity.

Go to the full post

Narcissistic numbers

You could find a thorough description of narcissistic number, also known as Armstrong number, perfect digital invariant, plus perfect number, on Wolfram Mathworld. In a few words, a narcissistic number is an n-digit numbers that equals the sum of all its digits at the nth power.

We want to write a function that checks if a given number respects this definition.

It is easy to see as any positive one-digit number is narcissistic, and the Wolfram page I linked above provides a list of other narcissistic numbers. I used them to write these test cases (in C++ for GoogleTest):

bool isNarcissistic(const std::string& input);

TEST(Narcissus, Simple)
{
  ASSERT_TRUE(isNarcissistic("2"));
  ASSERT_FALSE(isNarcissistic("42"));
}

TEST(Narcissus, Three)
{
  ASSERT_TRUE(isNarcissistic("153"));
  ASSERT_TRUE(isNarcissistic("370"));
  ASSERT_TRUE(isNarcissistic("371"));
  ASSERT_TRUE(isNarcissistic("407"));
  ASSERT_FALSE(isNarcissistic("497"));
}

TEST(Narcissus, Eight)
{
  ASSERT_TRUE(isNarcissistic("24678050"));
  ASSERT_TRUE(isNarcissistic("24678051"));
  ASSERT_TRUE(isNarcissistic("88593477"));
  ASSERT_FALSE(isNarcissistic("88583475"));
}

Here is a possible implementation for such function:

bool isNarcissistic(const std::string& input)
{
  assert(!input.empty()); // 1
  if(input.size() == 1) // 2
    return input[0] != '0';

  int powered = 0; // 3
  for(unsigned i = 0; i < input.size(); ++i)
    powered += std::pow((input[i] - '0'), input.size());

  return powered == std::atoi(input.c_str()); // 4
}

1. This should never happen. It should be user code responsibility ensure that the passed input parameter is not empty. Production code should also ensure that the input strings contains only decimal digits.
2. All one-digit numbers are accepted, with the noteworthy exception of zero.
3. Each digit in the number are elevated to the expected power (third, if there are three digits in the number) and the result is summed up.
4. Compare the original value with the calculated one, and return true if they are the same.

Go to the full post

Self-descriptive numbers

Write function that check if its input represents a self-descriptive base-10 integer.

A number is said self-descriptive if its digit at the i-th position is the counter for the i-digit in the number itself.
21200 is self-descriptive, since it has 2 zeros, 1 ones, 2 twos, 0 threes, and 0 fours.

Here is a possible C++ implementation:

bool isSelfDescr(const std::string& input)
{
  if(input.empty() || input.size() > 10) // 1
    return false;

  for(unsigned i = 0; i < input.size(); ++i) // 2
    if(std::count(input.begin(), input.end(), '0' + i) != input[i] - '0')
      return false;

  return true;
}

1. I am expecting a 10-based number, that means it can't have more than 10 digits - and it shouldn't be empty.
2. For each digit in the number, I count how many of them are in the string, using the STL algorithm count(), and I compare the result against the number in that position.

Go to the full post

Happy numbers

In the Doctor Who episode "42" (the seventh of the third modern series), the tenth Doctor explains that a (natural positive) number that reduces to one when you sum up the square of its digits and continue iterating it until it yields one is said to be happy. If this description doesn't look so descriptive to you, there is a page about happy numbers on wikipedia that looks interesting.

We can check if a number is happy with a C++ function like this one:

bool isHappy(unsigned value)
{
  std::set<unsigned> numbers; // 1
  numbers.insert(1); // 2
  while(numbers.find(value) == numbers.end()) // 3
  {
    numbers.insert(value); // 4
    int next = 0;
    while(value > 0)
    {
      int digit = value % 10;
      next += std::pow(digit, 2); // 5
      value /= 10;
    }
    value = next;
  }

  return value == 1; // 6
}

1. A sad number would enter in a hopeless infinite loop. To detect it, we should keep memory of the states it has already passed, so that we can check them.
2. We could stop transforming the input number when we get a value that we have already seen (sad), of when we get 1 (happy).
3. If the current value is not in the buffering set, we iterate once again.
4. Add the current value to the "already seen" set, and calculate the next state.
5. Each digit is squared and added.
6. If value has reached 1, we have a happy number.

Go to the full post

String tolower()

The problem here is converting an entire C++ string to lower (or upper) case.

The raw part of the job is done by the toupper/tolower ctype functions in the standard library. The point is that those functions work on a single character, we should find a way of iterating on the entire string.

We could just loop from the beginning to the end of the string, applying tolower() to each element in the sequence. Something like this:

for(std::string::iterator it = line.begin(); it != line.end(); ++it)
  *it = std::tolower(*it);

This implementation is alright, still we could avoid to explicitly do some work that nothing add as a value to our code, as declaring the iterator and a for-loop block.

We could use instead the STL transform() algorithm, that hides the for-block and its loop variable, asking as input the begin-end iterators for the input sequence, the begin iterator to the output one, and the operation it has to apply to transform it.

Here we are doing an in-place transformation, so input and output are the same.

Nice and plain. There is a tiny nuisance, though. There could be a name-clash on tolower/toupper, since besides the ctype version, there is also a localized version, defined in the locale include. So we can't simply say to the compiler we want to use the standard tolower function, we need to specify which tolower is needed. In our case, the one that expects an int as both input parameter and return value:

std::transform(line.begin(), line.end(), line.begin(), static_cast<int(*)(int)>(std::tolower));

Go to the full post

Bit check

Given an unsigned int and two numbers representing the one-based index of bits (from the least significative one) in that word, write a function that returns true if those bits have the same value.

This problem is based on the CodeEval Bit Positions challenge.

A test case (C++ for GoogleTest) should be enough to clarify the requirements:

TEST(BiPo, CaseGiven)
{
  EXPECT_TRUE(solution(86, 2, 3)); // 1
  EXPECT_FALSE(solution(125, 1, 2)); // 2
}

1. 86 base ten is 1010110 base two. Its second and third bit from left are both set to 1. We expect our function to return true.
2. 125 base ten is 1111101 base two. Its first bit is 1 and the second is 0. Therefore the function should return false.

Here is my C++ implementation:

bool solution(unsigned word, int p1, int p2)
{
  bool one = word & static_cast<int>(std::pow(2, p1-1)); // 1
  bool two = word & static_cast<int>(std::pow(2, p2-1));

  return !((one) ^ (two)); // 2
}

1. I convert the passed index in the associated binary value. If I had 3 in input, this mean that I want to check the third bit, whose binary value is 100, a decimal four. Then I apply a bitwise-and (the operator &) to the input word and the converted bit value. Its result will be true if the input word value has that bit up.
2. I want to return true if the selected bits (one and two) have the same value. To do that I apply a bitwise-exclusive-or (operator ^) to the bit flags. It would return true if the two flags are mutually exclusive (if one is true, the other is false), that is the exact opposite of what I want the function return. So it is just a matter of negating it (operator !).

The original CodeEval problem asks to output a string ("false" or "true") accordingly to the boolean value we get. This could be easily achieved through the ios boolalpha manipulator:

std::cout << std::boolalpha << solution(n, p1, p2) << std::endl;

Go to the full post

Writing backwards

We want to write a function that reverse the words in input. Just the order of words is reverted, not their actual structure. If we get in input "hello" and "world" we want in output "world hello" and not "dlrow olleh". Notice that there is a blank between each couple of words, but not at the end of the sentence.

This problem is based on the CodeEval Reverse world challenge.

A few test cases (C++11 for GTest as xUnit framework) would clarify what I'd expect from that function:

TEST(ReWo, Given) // 1
{
  ASSERT_EQ(solution( { "Hello", "World" } ), "World Hello");
}

TEST(ReWo, OneWord) // 2
{
  ASSERT_EQ(solution( { "Hello" } ), "Hello");
}

TEST(ReWo, Empty) // 3
{
  ASSERT_DEATH(solution( { } ),"Assertion .* failed.");
}

1. Vanilla case, the words in input are combined in a string from the last one up.
2. A single word in input should not break the algorithm.
3. It is a caller responsibility to ensure that at least a word is passed. To make this requirement explicit in the code, I want my function to assert the input being not empty. I could have been less strict, and simply return an empty string, but I wanted to show how to use the GoogleTest ASSERT_DEATH macro. More on the same topic on a previous post.

Here is how implemented in plain C++98 the function:

std::string solution(const std::vector<std::string>& input)
{
  assert(!input.empty()); // 1

  std::ostringstream oss; // 2
  for(std::vector<std::string>::const_reverse_iterator it = input.rbegin(); it != input.rend() - 1; ++it)
    oss << *it << ' '; // 3
  oss << input[0]; // 4

  return oss.str(); // 5
}

1. The input vector can't be empty.
2. I cared more about readability than performances here. Otherwise I would have be tempted to use a plain STL string, reserving to it enough memory on creation (summing up all the string sizes plus the number of blanks).
3. Put to the string stream all the elements in the vector followed by a blank, starting from the last one (reverse begin) up the the second (the one before the reverse end). We have asserted that the vector has at least an element, so know that this loop works fine. In case there is only one string in input, this block is simply skipped.
4. The first element in input (possibly the only one), is put to the string stream as last token. No trailing blank, as required.
5. Extract the resulting string from the stream and return it.

Go to the full post

Repeat pattern

Given a vector containing an arbitrary numbers of integers, check if there any repeat pattern in it.

I found this problem in CodeEval, where is marked with the slightly misleading name of Detecting Cycles (usually we talk about detecting cycles in the domain of graphs, here we have to do with a plain vector). There you could also find a test case that helps to clarify what our function has to do. I have rewritten it in C++11 for the GoogleTest framework:

std::vector<int> solution(const std::vector<int>& input);

TEST(DeCy, Given)
{
  std::vector<int> result = solution( {2, 0, 6, 3, 1, 6, 3, 1, 6, 3, 1} );

  ASSERT_EQ(3, result.size());
  ASSERT_EQ(6, result[0]);
  ASSERT_EQ(3, result[1]);
  ASSERT_EQ(1, result[2]);
}

As you see, here there is a repeat pattern of three integers (3, 1, 6). Our function should detect it and return it.

Our input could be much more complicated, so I have written a few more test cases, but I spare you. However you should notice that the minimal case is when the pattern consist of just a number that appear twice.

I have interpreted the problem request in the most natural way, so that I can split it in two logical parts. Firstly I check if there is at least a repeated integer in the sequence. If I find it, I check how long is the pattern, filling a vector with each matching element.

This algorithm scales terribly, in the worst case it has a cubic (!) asymptotically time complexity. Any improvement suggestion is welcomed. In the meantime, here is my C++98 implementation:

std::vector<int> solution(const std::vector<int>& input)
{
  if(input.size() < 2)
    return std::vector<int>(); // 1

  for(std::vector<int>::const_iterator first = input.begin(); first != input.end() - 1; ++first) // 2
  {
    for(std::vector<int>::const_iterator second = first + 1; second != input.end(); ++second) // 3
    {
      if(*first != *second)
        continue;

      int len = second - first; // 4
      if(second + len > input.end())
        len = input.end() - second;

      for(int i = 0; i < len; ++i)
      {
        if(*(first+i) != *(second+i)) // 5
          len = i;
      }

      return std::vector<int>(first, first + len);
    }
  }

  return std::vector<int>();
}

1. The input sequence has one or no element, there is not much to do besides returning an empty vector.
2. Loop on all the elements (but the last one).
3. Loop on the elements following the first iterator.
4. The first and second characters match, calculate their distance. The pattern couldn't be bigger than that. Wait a minute, we should also consider the vector size, that could put a lower limit the the pattern length.
5. Find the actual pattern size, if less than the theoretical maximum length.

Go to the full post

Matching many types of brackets

We have a string in input that could be quite longish (200K). It contains just a bunch of open and close brackets, in three possible varieties, round, squared, curly. We want to check if they are properly nested.

I have already discussed a similar problem in the past. In that case the string contained just one type of open/close parenthesis, so the job was even simpler.

I found this version of the problem in the Codility train page, section Stacks and Queues, under the nickname Brackets.

There you could find a couple of test cases, that I write here in C++ for the GoogleTest framework:

int solution(const std::string& input);

TEST(Brac, Given1) // 1
{
    ASSERT_EQ(1, solution("{[()()]}"));
}

TEST(Brac, Given2) // 2
{
    ASSERT_EQ(0, solution("([)()]"));
}

1. This is an example of a correctly nested string.
2. This is not. In third position it is not expected a closing round bracket, since there is a pending squared one before.

The same considerations I made for the simpler version of this problem apply here. The only variation is that I have to use a proper stack to keep memory of the pending open brackets.

Here is my C++98 implementation:

int solution(const std::string& input)
{
    if(input.size() % 2) // 1
      return 0;

    std::stack buffer;

    for(std::string::const_iterator it = input.begin(); it != input.end(); ++it)
    {
        if(*it == '{' || *it == '[' || *it == '(') // 2
            buffer.push(*it);
        else // 3
        {
            if(buffer.empty())
                return 0;

            char expected = *it == '}' ? '{': *it == ']' ? '[' : '(';
            if(buffer.top() == expected)
                buffer.pop();
            else
                return 0;
        }
    }

    if(buffer.empty()) // 4
        return 1;
    return 0;
}

1. There is no need of doing any check further in case of an uneven number of brackets.
2. If the current character is an open bracket, I put it in the stack.
3. Otherwise (assuming the input source is reliable) I expect it is a closing bracket that should match with a corresponding open one. If the stack is empty, the input string is corrupted, otherwise I determine which kind of bracket I expect, compare it with the element at the top of the stack, and pop it (if matches) or return the "no good" value.
4. Check a last time the stack, it should be empty now, otherwise we have some open bracket that we failed to close.

Go to the full post

Selecting the longest lines

We have in input a few strings, we want to output a given number of them, ordered by their size.

This is such a simple problem that I couldn't think of any interesting test case about. You can find it in the CodeEval open challenge under the name Longest Lines.

At its core, the solution to this problem is nothing more than a couple of C++ statements.

Assuming the number of lines we want to output is stored in an int variable named nr, and the vector of strings named lines keep the data, it is just a matter of sort the vector by lines size and return its first nr elements:

std::sort(lines.begin(), lines.end(), StringSizeCmp());
for(int i = 0; i < nr; ++i)
  std::cout << lines[i] << std::endl;

Right. But I forgot to tell you who is that StringSizeCmp guy. It is a simple functor that defines how sort() should decide which element comes first:

struct StringSizeCmp : public std::binary_function<std::string, std::string, bool>
{
  bool operator()(const std::string& lhs, const std::string& rhs) const
  {
    return lhs.size() > rhs.size();
  }
};

As clearly stated by inheritance, StringSizeCmp is a binary function that gets a couple of strings in and gives a boolean out. The comparison is here solved accordingly to the input sizes.

C++11 makes it even simpler, by means of lambda function. We could combine the predicate definition and its usage in a single line:

std::sort(lines.begin(), lines.end(),
    [](const std::string& lhs, const std::string& rhs) {
  return lhs.size() > rhs.size();
});

Go to the full post

Summing up primes

Which is the sum of the first one thousand prime numbers?

This question is (also) a CodeEval problem named Sum of Primes.

Even though it looks similar to the previous CodeEval problem I have seen, there's something that makes it more interesting:

int solution()
{
  std::vector<int> primes; // 1
  primes.reserve(1000);
  primes.push_back(2);
  primes.push_back(3);

  for(int i = 5; primes.size() < 1000; i+=2) // 2
  {
    bool isPrime = true;
    for(unsigned j = 1; j < primes.size() && (std::pow(primes[j], 2) <= i); ++j) // 3
    {
      if(i % primes[j] == 0)
      {
        isPrime = false;
        break;
      }
    }

    if(isPrime)
      primes.push_back(i);
  }

  return std::accumulate(primes.begin(), primes.end(), 0); // 4
}

1. I am going to put the generated primes in this vector.
2. I have already pushed the first two elements by hands, now I push all the other ones. I start checking 5, and then I step to the next odd number.
3. To check if a number is prime, it is enough to ensure that no other prime number is among its divisors. Moreover, we can stop checking when we reach a tentative divisor that, when squared, is bigger than the candidate prime.
4. Finally, just sum all the prime numbers up.

Some time ago, I have written about a similar but more general problem, how to check if a given number is prime.

Go to the full post

Prime and palindrome

Write a program that output the biggest number below 1000 that is prime and palindrome.

You could test your solution on CodeEval, where this very simple problem is named Prime Palindrome.

Its cleanest solution would be, for instance in Python, something like that:

print 929

However, if you are answering this question on an interview, you'd probably better think to a less concise solution.

For instance you could write a for-loop like this (here I used C++, but you could easily adapt it to about any programming language):

for(int i = 989; i > 100; i -= 2) // 1
{
  if(isPalindrome(i) && isPrime(i)) // 2
  {
    std::cout << i << std::endl;
    break;
  }
}

1. I start looping from the highest available palindromic number, after 999 that is obviously not prime, down to 100, since it is easy to see that at least 101 is both a palindrome and prime, stepping by two, given that I don't want to check even numbers.
2. If the current number is both palindrome and prime, print it and break the loop.

Checking our number for its palindromicity is trivial. It is just a matter of comparing its first and last cipher.

A bit more interesting is the check on its primality, please follow the link to a previous post I have written on the matter. Here we could use an even simpler algorithm, since we know that our biggest number to check is less than 1000, still the structure of the algorithm stay the same.

Go to the full post

Max product of three

We have a vector of something in between three and 100,000 integers in the range [−1,000 ... 1,000]. We want to get the maximum product of three elements among all the possible choices.

I found this problem in the Codility train page, section Sorting, under the nickname Max-product-of-three.

It looks very simple, and actually it is. Just be careful in considering the properties of multiplication.

A few test cases (written in C++ for the GoogleTest framework) would help to implement correctly the solution:

int solution(std::vector<int>& input);

TEST(MPoT, Given) // 1
{
    std::vector<int> input;
    input.push_back(-3);
    input.push_back(1);
    input.push_back(2);
    input.push_back(-2);
    input.push_back(5);
    input.push_back(6);

    ASSERT_EQ(60, solution(input));
}

TEST(MPoT, AllNeg) // 2
{
    std::vector<int> input;
    input.push_back(-2);
    input.push_back(-3);
    input.push_back(-4);
    input.push_back(-1);

    ASSERT_EQ(-6, solution(input));
}

TEST(MPoT, SomeNeg) // 3
{
    std::vector<int> input;
    input.push_back(0);
    input.push_back(-3);
    input.push_back(-4);
    input.push_back(1);
    input.push_back(2);

    ASSERT_EQ(24, solution(input));
}

1. This test case is part of the Codility's problem description.
2. What if all the input elements are negative.
3. More interestingly, when a mix of positive and negative numbers are in the game, the weight of the biggest negative ones should be considered.

The first idea that I guess anyone would have is getting the three biggest elements and multiplying them. That works fine in the first two test cases, but fails spectacularly in the third one. We (or at least, I) have forget to consider the case when we have some big negative values.

Remembering that also negative values could contribute to the solution, as showed in the third test case, I came up with this piece of code:

int solution(std::vector<int>& input)
{
    std::sort(input.begin(), input.end(), std::greater<int>()); // 1
    int alpha = input[0] * input[1] * input[2]; // 2
    int beta = *input.rbegin() * *(input.rbegin()+1) * input[0]; // 3

    return alpha > beta ? alpha : beta; // 4
}

1. Actually, there is not any real advantage in sorting the array in descending order. Notice that, as required by the Codility conditions, I am modifying the input vector. This is usually not considered a bright idea.
2. Having sorted the vector from the biggest element downward, I multiply the three elements more to the left. Often this would give the result we are looking for.
3. We could have the case where taking the two most negative numbers in the sequence have a higher relevance of the second and third positive numbers. In this case beta would be bigger than alpha.
4. Now it is just a matter of comparing alpha and beta, and returning the biggest one.

Go to the full post

Triangle triplet

In a triangle, the sum of the catheti is bigger than the hypotenuse. Let's call "triangle triplet" a group of three integer numbers that represents the length sides of a triangle. We want to check if in an array of integers, that could longish (one million elements), there is at least one of such triplets.

You can find this problem in the Codility train page, section Sorting, under the nickname Triangle, where you can find also an extra requirement, the values in input could be also negative (even that doesn't make much sense, accordingly to the definition I gave), and could cover the complete range of signed 32 bit integers in C/C++.

Codility provides also a couple of test cases, described in informal language. I felt that at least a third one was missing. Here they are, written in C++ for the googletest xUnit testing framework:

TEST(Tria, Given) // 1
{
    std::vector<int> input;
    input.push_back(10);
    input.push_back(2);
    input.push_back(5);
    input.push_back(1);
    input.push_back(8);
    input.push_back(20);

    ASSERT_EQ(1, solution(input));
}

TEST(Tria, Given2) // 2
{
    std::vector<int> input;
    input.push_back(10);
    input.push_back(50);
    input.push_back(5);
    input.push_back(1);

    ASSERT_EQ(0, solution(input));
}

TEST(Tria, MaxInt) // 3
{
    std::vector<int> input;
    input.push_back(INT_MAX);
    input.push_back(INT_MAX);
    input.push_back(INT_MAX);

    ASSERT_EQ(1, solution(input));
}

1. Example of a sequence that we should accept.
2. This input sequence does not contain any triangle triplet.
3. The input represents an equilateral triangle. However it is so huge that we could overflow, if our code is not designed carefully.

The idea of the algorithm I implemented is quite simple. Get the three smallest values in input, if the sum of the first two is less than the third one we have got our positive solution. Otherwise we know that the smallest number is too small to be part of any good triplet. Discard it and try again with the new smallest number in the lot. Go on till all the triplets are checked.

This was my implementation at the time of writing this post, in the meantime, codility has upgraded its compiler to get C++11, my code here depends on int type actual definition, and so now it fails in a extreme condition. Please, have a look to the newer post I have written about it for a more robust solution:

int solution(const std::vector<int>& input)
{
    if(input.size() < 3) // 1
        return 0;

    std::vector<int> buffer(input.begin(), input.end()); // 2
    std::sort(buffer.begin(), buffer.end());

    for(unsigned i = 0; i < buffer.size() - 2; ++i) // 3
    {
        if(static_cast<int64_t>(buffer[i]) + buffer[i+1] > buffer[i+2])
            return 1;
    }

    return 0;
}

1. Trivial case.
2. Performance-wise, would make no sense to operate on an unsorted collection. Being the input constant, create a copy and sort it.
3. Apply the above described algorithm. To avoid overflow, we should store the sum of the tentative catheti in a 64 bit integer. The simplest way to do that is casting one of them to int64_t.

Go to the full post

Minimal element in subsequences

We have a non-empty string, that could contain up to 100,000 characters (only the 'A','C','G','T' ones). The user would specify a few ranges, up to 50 thousand of them, in the string, and he expects as output the indices relative to the lowest letter in that ranges.

The problem itself is not complicated, it just asks us to check carefully the requirements. You can found it in the Codility train page, prefix sums section, under the name Genomic-range-query.

We can expect only ACGT letters in input, because the idea is that it represents a DNA sequence. Pay attention to the fact that the user specifies the subsequence ranges as zero-based indices, and he expects the output as one-based indices. If 'A' is the answer, we have to output 1, and so on.

As tradition in these kind of problems, there is no stress at all in error handling. If this function should be used in a real world environment, we should check thoroughly the input. Here we can just assume anything is fine.

There is a test case in the problem description, I added a few more of them to help me understanding better how to design the code. Here are the original one and a couple of new ones, written in C++ for the xUnit GoogleTest framework:

std::vector<int> solution(const std::string& dna, // 0
        const std::vector<int>& begs, const std::vector<int>& ends);

TEST(GeRa, Simple) // 1
{
    std::string input("ACGT");

    std::vector<int> P;
    P.push_back(0);
    P.push_back(1);
    P.push_back(2);
    P.push_back(3);

    std::vector<int> Q;
    Q.push_back(0);
    Q.push_back(1);
    Q.push_back(2);
    Q.push_back(3);

    std::vector<int> output = solution(input, P, Q);
    ASSERT_EQ(4, output.size());

    EXPECT_EQ(1, output[0]);
    EXPECT_EQ(2, output[1]);
    EXPECT_EQ(3, output[2]);
    EXPECT_EQ(4, output[3]);
}

TEST(GeRa, Given) // 2
{
    std::string input("GACACCATA");

    std::vector<int> P;
    P.push_back(0);
    P.push_back(0);
    P.push_back(4);
    P.push_back(7);

    std::vector<int> Q;
    Q.push_back(8);
    Q.push_back(2);
    Q.push_back(5);
    Q.push_back(7);

    std::vector<int> output = solution(input, P, Q);
    ASSERT_EQ(4, output.size());

    ASSERT_EQ(1, output[0]);
    ASSERT_EQ(1, output[1]);
    ASSERT_EQ(2, output[2]);
    ASSERT_EQ(4, output[3]);
}

TEST(GeRa, Huge) // 3
{
    std::string input(100000, 'A');

    std::vector<int> P(50000);

    std::vector<int> Q(50000, input.size() - 1);

    std::vector<int> output = solution(input, P, Q);
    ASSERT_EQ(50000, output.size());

    for(unsigned i = 0; i < output.size(); ++i)
        EXPECT_EQ(1, output[i]);
}

0. Codility asks for a slightly different function interface, no parameter is marked there as const.
1. On a very short sequence ("ACGT"), I want to get the lower element on each single element subsequence, (0, 0), (1, 1), (2, 2), (3, 3). The expected result is a four-sized int vector containing 1, 2, 3, 4 (meaning 'A', 'C', 'G', 'T').
2. The Codility given test. From the input sequence "GACACCATA", we should return 1 for (0, 8), 1 for (0, 2), 2 for (4, 5), 4 for (7, 7).
3. This test case is meant to let the developer know how good is the chosen algorithm in term of time complexity. The input sequence is as long as possible, and we have to check the biggest possible number of subsequences that are all covering the entire range. I don't care much of the actual data, so in the input string I have only 'A', consequently as a result a 50,000 sized vector containing only 1's is expected.

Simple but slow solution

A natural solution would be require to check all the subintervals, looping from begin to end, looking for the lowest element.

Its obvious disadvantage is that it would be implemented with a for-in-a-for loop, that in the worst case would lead to an O(N*M) complexity, where N is the sequence size and M is the number of subsequences we need to check.

Here is a first naive implementation:

std::vector<int> solution(const std::string& dna,
        const std::vector<int>& begs, const std::vector<int>& ends)
{
    assert(begs.size() == ends.size()); // 1

    std::vector<int> result(begs.size()); // 2
    for (unsigned i = 0; i < begs.size(); ++i) // 3
    {
        char nucleotide = 'T'; // 4
        for (int j = begs[i]; j <= ends[i]; ++j) // 5
            if (dna[j] < nucleotide) // 6
                nucleotide = dna[j];

        result[i] = nucleotide == 'A' ? 1 : // 7
                    nucleotide == 'C' ? 2 :
                    nucleotide == 'G' ? 3 : 4;
    }
    return result;
}

1. Even if it is not required, I feel too bad not adding at least this minimal check on the input. If you are less paranoid than me, you can happily skip this line.
2. The output is going to be generated in this vector.
3. Check all the intervals.
4. Worst case, the current subsequence has a minimal nucleotide 'T'.
5. Loop on all the elements in the subsequence.
6. If the current element is less than the previous recorded nucleotide, it becomes the new minimal one.
7. I need to convert the selected nucleotide in its representative code.

There are a few improvements that we could operate on this piece of code, for instance, we can break the loop on (5) if we find that the current nucleotide is an 'A', because we have already found the best solution. However, this won't improve our worst case time complexity, it would only help its best and average cases.

We need to look for an altogether different approach.

Using more space to save time

I would like to get the minimal nucleotide in a subinterval just checking its intervals. To achieve this, I can use the partial sum algorithm, and check the difference between the values before entering and at the end of the interval.

A minor issue is that we have to keep track of four different values (A,C,G,T) and not a single variable. It is easy to overcome it, for example creating a vector for each nucleotide.

This generates a small nuisance, I need to convert the character that represent any nucleotide in its index in the vectors of partial sums. This is done with something similar to what I did at the point 7 above, but here I need a zero-based index. To keep the code readable, I created a tiny function to do this mapping:

int getType(char nucleotide)
{
    switch (nucleotide)
    {
    case 'A':
        return 0;
    case 'C':
        return 1;
    case 'G':
        return 2;
    default:
        return 3;
    }
}

What I am going to do is calculating the partial sum for each nucleotide, and then use them on each passed interval to get which is the minimal one present there.

Here is a possibile solution:

std::vector<int> solution(const std::string& dna, const std::vector<int>& begs,
        const std::vector<int>& ends)
{
    assert(begs.size() == ends.size());

    std::vector<std::vector<int> > psums(4); // 1
    for (int i = 0; i < 4; ++i)
        psums[i].resize(dna.size());

    for (unsigned i = 0; i < dna.size(); ++i) // 2
        psums[getType(dna[i])][i] = 1;

    for (int i = 0; i < 4; ++i) // 3
        std::partial_sum(psums[i].begin(), psums[i].end(), psums[i].begin());

    std::vector<int> result(begs.size());
    for (unsigned i = 0; i < begs.size(); ++i) // 4
    {
        int type = 3; // 5
        for (unsigned j = 0; j < 3; ++j) // 6
        {
            int left = begs[i] > 0 ? psums[j][begs[i] - 1] : 0; // 7
            int right = psums[j][ends[i]]; // 8
            if (right != left) // 9
            {
                type = j;
                break;
            }
        }

        result[i] = type + 1; // 10
    }
    return result;
}

1. I create four int vectors, same size of the input DNA sequence. They are initialized with the default int value, that is zero.
2. Scan the DNA sequence. Detect the nucleotide type (0..3) and put in the relative partial sum vector a 1 to mark its presence there.
3. Calculate the partial sum for each nucleotide.
4. Loop on all the subintervals.
5. As in the previous implementation, we already know that the worst case is 'T'. Let's check if there is any "lower" nucleotide. By the way, this also means that we don't really need the fourth psums vector. We could completely get rid of it and save some space.
6. Check of A, C, G nucleotides in this subinterval.
7. This line is a bit tricky. I need to know how many of the current nucleotide (j) have been already seen in the sequence before entering in the subsequence. This information is stored in the element to the immediate left of begs[i]. With one exception, when begs[i] is zero, meaning we are at the beginning of the full sequence. In that case we should use 0 instead, since obviously there were no j-nucleotide at that point.
8. No problem for the right element, just fetch the j-vector of psums, and fetch its ends[i] component.
9. If there is a variation between the final partial sum and the initial one, at least one j-nucleotide is in that interval. So we can set it as the minimal one and stop looping.
10. Last tweak. I stored the nucleotide type as a zero-based index, I need to increase it to match the user expectations.

Go to the full post

Fizz buzz

You won't believe it, but I had no idea what the fizz buzz game was. According to Wikipedia, it's a popular way in (some) English speaking countries to teach children what division is. And checking on the web, I have got the impression that it is also a not so uncommon problem asked during developer interviews. It is so simple, that I'd say it make some sense just for junior positions.

I bumped into it looking at the CodeEval's problems, where it is presented in a slightly more generalized way. We want to list all the numbers from 1 to a given maximum, but replacing the ones that could be divided by two taboo factors, fizz and buzz, with placeholders.

A couple of test cases (written in C++ for GoogleTest), should clarify the requisites.

std::string solution(int fizz, int buzz, int n);

TEST(FiBu, CaseGiven1) // 1
{
  std::string result = solution(3, 5, 10);

  ASSERT_EQ(result, "1 2 F 4 B F 7 8 F B");
}

TEST(FiBu, CaseGiven2) // 2
{
  std::string result = solution(2, 7, 15);

  ASSERT_EQ(result, "1 F 3 F 5 F B F 9 F 11 F 13 FB 15");
}

1. We want to analyze the numbers in the [1..10] interval, any number that has 3 among its factors should be replaced by a F, and 5 by B. Notice that there is no trailing blank after the last generated element.
2. Here the interval is [1..15], the factor 2 should lead to F, and 7 to B. 14 could be divided by both 2 and 7, so it should be replaced by FB.

There is not much to think about before writing the code. The main issue is how we can check if a given number has fizz (or buzz) among its factors. In C, C++ and related languages we normally use the arithmetic operator % (called "modulus" or "modulo") that returns the remainder of the integer division between the two numbers. When it returns zero, we have an integer divisor.

Here is a possible solution:

std::string solution(int fizz, int buzz, int n)
{
  std::ostringstream oss;
  for(int i = 1; i <= n; ++ i)
  {
    bool fb = false; // 1

    if(i % fizz == 0) // 2
    {
      oss << 'F';
      fb = true;
    }
    if(i % buzz == 0) // 3
    {
      oss << 'B';
      fb = true;
    }

    if(!fb) // 4
      oss << i;

    if(i != n) // 5
      oss << ' ';
  }

  return oss.str();
}

1. Flag for fizz or buzz detection.
2. If the remainder of the division of the current number by fizz is zero, it is one of its factors.
3. Same for buzz.
4. If nor fizz nor buzz has been detected, the current number in used.
5. I don't want a trailing blank at the end of the string, so I ensure I am not in the last iteration.

Go to the full post

Counting couples

We have a vector that contains up to one hundred thousand zeros and ones. We want to count the total number of ones that follows each zero.

This problem is better (?) described in the Passing-cars Codility test, part of their train page, prefix sums section.

In my opinion a few test cases would be a better way to clarify it. They are written for C++ with the Google Test framework, however I guess you can easily adapt to your preferred language/xUnit environment:

TEST(PaCa, Given) // 1
{
    std::vector<int> input;
    input.push_back(0);
    input.push_back(1);
    input.push_back(0);
    input.push_back(1);
    input.push_back(1);

    ASSERT_EQ(5, solution(input));
}

TEST(PaCa, Minimal1) // 2
{
    std::vector<int> input(2);
    input[0] = 1;

    ASSERT_EQ(0, solution(input));
}

TEST(PaCa, Huge3) // 3
{
    std::vector<int> input(50000);
    std::vector<int> more(50000, 1);
    input.insert(input.end(), more.begin(), more.end());

    ASSERT_EQ(-1, solution(input));
}

1. This is the test case provided by Codility. Our input is { 0, 1, 0, 1, 1 }. The first zero is followed by three ones, the second zero by another couple. Expected result is five.
2. A first minimal test case I have written (I spare you the other ones). Having in input { 1, 0 }, there is no one following the only zero available. Expected result is zero.
3. A test case that works on the biggest possible input. The first half of the elements are all set to zero, the second half includes only ones. Each zero is followed by 50,000 ones. Having 50,000 zeros, the expected result is 2,500,000,000. Two billion and an half. A clause in the problems says that if we have an output bigger than one billion we should return a sort of error code, minus one.

Inefficient solution

Anyone should find easily this solution. It works fine for small inputs, but it has a O(N**2) time complexity that makes less usable as the input size grows.

int solution(std::vector<int>& input) // 1
{
    if(input.size() < 2) // 2
        return 0;

    unsigned couples = 0; // 3
    for(unsigned i = 0; i < input.size() - 1; ++i) // 4
    {
        if(input[i]) // 5
            continue;

        for(unsigned j = i + 1; j < input.size(); ++j) // 6
        {
            if(input[j]) // 7
            {
                ++couples;
            }
        }
    }

    return couples > 1000000000 ? -1 : couples; // 8
}

1. Usually we would expect the input parameter to be passed as const reference. And here there is no real reason to to pass it as a non-const one, beside the fact that is a problem requirement. We'll see in the third proposed solution how this could be useful to save space complexity.
2. Trivial cases. In real life code, we should have also checked for size too big than expected - maybe throwing an exception.
3. The worst case tested in Huge3 tells us that a plain 32 bit signed int could be not enough to keep the number of couples. And for Codility compiler "int" means "32 bit int". The unsigned version of int, is more than enough for our purposes.
4. Loop on all the input elements but the last one. We are looking for zeros, and we don't care if the last element is one of them, since it can't be obviously followed by anything.
5. If the element is "not zero", it is not what we are looking for, get to the next iteration.
6. Loop on all the following elements looking for ones. Being an asymptotically O(N) loop in another O(N) loop, this is the weak instruction in this algorithm. We'll have to move it someway out of here.
7. If the current element is "not zero", we have found another couple. I could have put here the test on the one billionth couple. If you expect to have many huge inputs leading to a "-1" solution, that could be a better choice. I assumed instead that it is a relatively rare case, and I decided not to pay the price for this repeated check, and move it to (8). Another effect of having the check here, is that I could have given to couples the "plain int" status.
8. If the number of couples is too big, return the error value, otherwise implicitly cast couples to plain int (and I know I can do it) and return it.

Time (and space) linear solution

We have already seen how in previous problems how to trade time for space complexity. The idea here is creating a temporary buffer, same size of the input vector, where we are going to store the sum of all the ones to the right of each zero. Then it will only be a matter of adding the values we are interested in.

Here it comes handy the C++ STL numeric algorithm partial_sum(), that does exactly what we need, once that we pay attention to the fact that we have to look at the reversed input, starting from its rightmost element up to the leftmost one.

Here is the refactored solution:

int solution(std::vector<int>& input)
{
    if(input.size() < 2)
        return 0;

    std::vector<int> buffer(input.size());
    std::partial_sum(input.rbegin(), input.rend(), buffer.rbegin()); // 1

    unsigned couples = 0;
    for(unsigned i = 0; i < input.size() - 1; ++i) // 2
    {
        if(input[i] == 0)
        {
            couples += buffer[i];
        }
    }

    return couples > 1000000000 ? -1 : couples;
}

1. Scan the input vector in reverse order, and put the calculated partial sum likewise in the local buffer.
2. Having extracted the internal loop to place it in (1), the original eternal loop gets much simpler. If the current input element is a zero, I should add the current partial sum (in buffer) to the value I am going to return.

This is a good solution, and it is going to get an 100% by Codility. Still it shouldn't be accepted, if you read carefully their requirements, where it is stated that a O(1) worst-case space complexity is expected.

Dirtier but space constant

Noticing that we are allowed to modify in-place the input (usually not a good idea, since it could cause unpleasant surprises to the caller), we could think a way to combine input and buffer behavior in a single place. Again, this is usually not a good idea. We are violating the single responsibility principle / separation of concerns, and this is going to make our code less readable, and hence less robust to changes. However, let's assume we are in a case where memory is at a premium, and refactor a second time the code, this time to get rid of the local buffer.

We could think to twist the partial sum algorithm so that we store in-place the calculated prefix sum only if the current value is zero, otherwise we wipe out the value in input. The change in the algorithm is so radical that we can't reuse the STL function, but we have to write an our variation.

int solution(std::vector<int>& input)
{
    if(input.size() < 2)
        return 0;

    int sum = 0; // 1
    for(std::vector<int>::reverse_iterator it = input.rbegin(); it != input.rend(); ++it)
    {
        sum += *it; // 2
        *it = *it == 0 ? sum : 0; // 3
    }

    unsigned couples = std::accumulate(input.begin(), input.end(), static_cast(0)); // 4
    return couples > 1000000000 ? -1 : couples;
}

1. Keep track of the current partial sum value.
2. Add the current sequence value to the partial sum.
3. Here is the tricky bit. If the current value is zero, we put the partial sum in. Otherwise we flag it as uninteresting.
4. We need to sum up all the partial sums stored in the input vector. We are not interested in the uninteresting values, but since we have marked them with a zero, we could just sum all the elements in the vector. I'm using the STL function accumulate(), notice that I have explicitly said to it that it as to use as starting value zero as an unsigned value.

The other way round

As Wen-Kai suggests in his comment to this post, we get the same result counting the zeros instead. Think how each 1-element assumes a different weight accordingly to the number of zeros that are before it, since each of them is going to count it for its own sum.

If we perform a slightly customized version of partial sum on zeros, storing the results where the original one-elements were, we get a specular behavior of the previous solution:

int solution(std::vector<int>& input)
{
  int sum = 0; // 1
  for(std::vector<int>::iterator it = input.begin(); it != input.end(); ++it)
  {
      if(*it == 0) // 2
        ++sum;
      else
        *it = sum; // 3
  }

  unsigned couples = std::accumulate(input.begin(), input.end(), static_cast<unsigned>(0)); // 5
  return couples > 1000000000 ? -1 : couples;
}

1. Now "sum" is the number of zeros that we have already scanned.
2. A new zero detected.
3. The weight of the current element grows to keep track of the zeros that are before it.
4. As before, now is just a matter of summing up the partial results.

You can get a sleeker code combining the partial sum loop with the accumulate one, following the Wen-Kai suggestion here below.

Go to the full post

Partial sum

Given a numeric sequence, its partial summation is another sequence where each element represents the sum of all the elements of the original sequence till that point. So, given the sequence { 1, 1, 1, ... }, its partial sum is { 1, 2, 3, 4, ... }. It is also known as prefix sum, prefix reduction, scan, or cumulative sum.

There is a C++ STL function that implements this concept, std::partial_sum(), available in two overloads. One is a proper partial sum implementation, the other is a generic one, that lets the caller to specify which kind of operation applying to transform the original sequence.

I have written a few test cases (for GoogleTest) that should clarify better its usage.

Typical case

I have an input sequence, I want to get its partial sum in a new container:

TEST(TestParSum, CaseStandard)
{
  std::vector<int> input { 1, 3, 1, 4, 2, 3, 5, 4 }; // 1
  std::vector<int> output(input.size()); // 2
  std::partial_sum(input.begin(), input.end(), output.begin()); // 3

  ASSERT_EQ(8, output.size());
  ASSERT_EQ(1, output[0]);
  ASSERT_EQ(4, output[1]);
  ASSERT_EQ(5, output[2]);
  ASSERT_EQ(9, output[3]);
  ASSERT_EQ(11, output[4]);
  ASSERT_EQ(14, output[5]);
  ASSERT_EQ(19, output[6]);
  ASSERT_EQ(23, output[7]);
}

1. The original container. I have used the handy C++11 list initalizer to set it up on construction.
2. The generated result will be stored in this container. It should have the same size of the input sequence.
3. Calculate the partial sum for the passed sequence, putting the result starting from the beginning of the output container.

In place generation

The standard partial_sum() function is designed in such a way that it could overwrite the original data:

TEST(TestParSum, CaseOverwrite)
{
  std::vector<int> data { 1, 3, 1, 4, 2, 3, 5, 4 };
  std::partial_sum(data.begin(), data.end(), data.begin()); // 1

  ASSERT_EQ(8, data.size());
  ASSERT_EQ(1, data[0]);
  ASSERT_EQ(4, data[1]);
  ASSERT_EQ(5, data[2]);
  ASSERT_EQ(9, data[3]);
  ASSERT_EQ(11, data[4]);
  ASSERT_EQ(14, data[5]);
  ASSERT_EQ(19, data[6]);
  ASSERT_EQ(23, data[7]);
}

1. The output iterator is the same that the input starter iterator. We are about to lose the original data, but we are saving some space in memory.

Not only summations

We could need something similar to a partial sum, with the variation that the operator applied is not an addition:

TEST(TestParSum, CaseMultiply)
{
  std::vector<int> data { 1, 3, 1, 4, 2, 3, 5, 4 };
  std::partial_sum(data.begin(), data.end(), data.begin(),
    std::multiplies<int>()); // 1

  ASSERT_EQ(8, data.size());
  ASSERT_EQ(1, data[0]);
  ASSERT_EQ(3, data[1]);
  ASSERT_EQ(3, data[2]);
  ASSERT_EQ(12, data[3]);
  ASSERT_EQ(24, data[4]);
  ASSERT_EQ(72, data[5]);
  ASSERT_EQ(360, data[6]);
  ASSERT_EQ(1440, data[7]);
}

1. Instead of summing, I want to multiply the values. So I pass the STL multiplies functor as last parameter. Nothing else changes from a "normal" partial_sum() call.

Even more generalized

Obviously, we are not restricted to use STL arithmetic functors as a binary operation. We could use our own specifically tailored functor or, if our compiler is C++11 compliant, lambda function.

Here I just rewrite the previous test:

TEST(TestParSum, CaseMultiplyLambda)
{
  std::vector<int> data { 1, 3, 1, 4, 2, 3, 5, 4 };
  std::partial_sum(data.begin(), data.end(), data.begin(),
    [](int a, int b) { return a*b; });

  ASSERT_EQ(8, data.size());
  ASSERT_EQ(1, data[0]);
  ASSERT_EQ(3, data[1]);
  ASSERT_EQ(3, data[2]);
  ASSERT_EQ(12, data[3]);
  ASSERT_EQ(24, data[4]);
  ASSERT_EQ(72, data[5]);
  ASSERT_EQ(360, data[6]);
  ASSERT_EQ(1440, data[7]);
}

Go to the full post

Simple finite state automaton

This problem is about writing a trivial finite state machine that gets in input the number of elements that it should work with, and a vector of integers that represents the operations we want it to execute on them. As output we expect a vector containing its final state.

It comes from the Codility Train problems' collection, the section about Counting Elements, identified by the codename Max-Counters.

Our function has to create a vector of integers initialized to zero and then, accordingly to the the values specified in the input vector, increase by one the specified element, or set all of them to the current maximum value.

A test case (written in C++ and the Google Test framework) should clarify the expected behavior:

std::vector<int> solution(int N, std::vector<int> & input)

TEST(MaCo, Given)
{
    std::vector<int> input;
    input.push_back(3);
    input.push_back(4);
    input.push_back(4);
    input.push_back(6);
    input.push_back(1);
    input.push_back(4);
    input.push_back(4);

    std::vector<int> output = solution(5, input);
    ASSERT_EQ(5, output.size());

    ASSERT_EQ(3, output[0]);
    ASSERT_EQ(2, output[1]);
    ASSERT_EQ(2, output[2]);
    ASSERT_EQ(4, output[3]);
    ASSERT_EQ(2, output[4]);
}

The first parameter we pass to our function is a five. Meaning we want it to work on an int vector of five elements.
The second parameter is the vector containing the (sort of) program we want our function to execute. Each integer represent an instruction:
- 3: increase the third element, that should store now 1.
- 4: increase the fourth element, set to 1.
- 4: increase again the fourth element, now it is 2.
- 6: assign the top value (2) to all the elements.
- 1: increase the first element to 3.
- 4: increase the fourth element to 3.
- 4: increase the fourth element to 4.
The resulting vector should be { 3, 2, 2, 4, 2 }.

As it always happens in this kind of problem, we can avoid any error handling, trusting the user to pass correct input. Besides, we expect to work with reasonable small values (100,000 is the biggest integer we should see in input).

When you get what the problem is asking, you should be already close to get the solution. I could figure out rapidly a simple solution without the need of writing other test cases (even if I wouldn't recommend you to follow my example, and to play more safely instead, spending some time to improve the testing).

Inefficient solution

The codility evaluation for this solution is 77%. It works fine for small data input, it gets far too slow as N and M (the size of the input vector) grows. We should spot immediately that it has an O(N*M) time complexity:

vector<int> solution(int N, vector<int> & input)
{
    std::vector<int> result(N);

    int highest = 0; // 1
    for(int i = 0; i < input.size(); ++i)
    {
        int value = input[i];
        if(value > 0 && value <= N) // 2
        {
            int newValue = ++result[value-1];
            if(highest < newValue)
                highest = newValue;
        }
        else if(value == N + 1) // 3
        {
            for(int j = 0; j < result.size(); ++j) // 4
                result[j] = highest;
        }
    }
    
    return result;
}

1. The current highest value in the vector of results. It is going to be used for the "leveling" operation.
2. If the current value in the input vector is in [1..N-1], I should apply the "increase the specified element value" operation. If in this way I get a new highest value, I keep track of the change.
3. A less paranoid developer would have probably used a plain "else" here. I couldn't help to add a minimal error handling, checking if the value is actually the expected one before applying the leveling. In this way any unexpected input value would lead to a no-op.
4. This is the obvious weak point in the algorithm I have chosen. A loop in a loop that we should avoid.

Linear solution

We have immediately spot the first implementation issue, what we want to do is moving the loop for applying the leveling outside the loop that checks the operations we have to perform on our data.

If you think about it, this is nothing complicated. We just have to use another integer to remember which is the value we have used for the latest leveling operation.

The refactoring is not a complicated task:

std::vector<int> solution(int N, std::vector<int>& input)
{
    std::vector<int> result(N);

    int highest = 0;
    int watermark = 0; // 1

    for(unsigned i = 0; i < input.size(); ++i)
    {
        int index = input[i] - 1; // 2
        if(index >= 0 && index < N)
        {
            result[index] = std::max(watermark, result[index]) + 1; // 3
            highest = std::max(highest, result[index]); // 4
        }
        else if(index == N)
        {
            watermark = highest; // 5
        }
    }

    for(unsigned i = 0; i < result.size(); ++i) // 6
    {
        result[i] = std::max(result[i], watermark);
    }

    return result;
}

1. I think to the leveling operation like a sort of flooding. This variable keeps track of the level reached by the water.
2. It gets complicated to adjust the index as perceived by the user (based one) and as managed internally (based zero) in the following code, so I added an utility variable that should lead to a more pleasant reading.
3. The new value for the current element is determined by it previous value and the watermark. I get the highest value, and I increase it.
4. I could have left the if-check as in the original version, choose which one you feel is more readable.
5. This is the core of the change. I don't loop anymore, I just remember the new level. This leads to the nuisance in (3) but saves us load of time.
6. No more loop in a loop. However, we should pay attention to apply the watermark value only when required. We want to only level up the elements.

Go to the full post

Looking for a sequence in a vector

We want to write a function that checks if a vector of integers contains the natural sequence from 1 to a given value. If so, we should return the index of the highest element in the vector we need to complete the sequence. We are interested only in positive integer numbers, and we won't have to manage anything bigger than 100,000.

This same problem is expressed in a more colorful way in the Frog-River-One exercise in the Codility train page, Counting Elements section. There you could find the function prototype and the description of a test case, that I have ported to GoogleTest, preparing myself to write a C++ solution:

int solution(int x, const std::vector<int>& input);

TEST(TestFro, CaseSample)
{
  ASSERT_EQ(6, solution(5, { 1, 3, 1, 4, 2, 3, 5, 4 }));
}

Notice that I written the test case for a C++11 compiler (GCC 4.8.x) and not for the C++98 used by Codility. In this way I could use the handy list initializer constructor for STL vectors, that makes the code simpler and more readable. For this reason, I have also added a const attribute to the vector reference parameter in the function prototype. All this looks more natural to me, and shouldn't be a big issue for you to adapt it to the original requirements.

The problem looked quite straightforward to me, so I didn't spend anymore time writing more test cases. Beware that this could easily be a fatal mistake.

Inefficient solution

You should spot that this approach is flawed just thinking to it. It is obviously too expensive to be useful for other than trivial cases.

I mimicked what I would naturally do in a real life case. I'd check one by one all the numbers in the sequence, ensuring they are available, keeping track of which one is the rightmost element:

int solution(int x, const std::vector<int>& input)
{
  int result = -1; // 1
  for(int i = 1; i <= x; ++i) // 2
  {
    bool found = false;
    for(unsigned j = 0; j < input.size(); ++j) // 3
    {
      if(input[j] == i) // 4
      {
        if(result < static_cast<int>(j))
          result = j;
        found = true;
        break;
      }
    }
    if(!found) // 5
      return -1;
  }

  return result;
}

1. Initialize the result to the "not found" value.
2. Loop on all the sequence value.
3. Loop on the vector. This loop-in-the-loop makes this piece of code weak. In the worst case the time complexity is in the realm of the Big Oh N Squared family.
4. I am looking for the leftmost "i" elements, when I find it, I check if it is the current rightmost element of the sequence I have currently found, if so, I keep track of it.
5. If a value of the sequence is missing, there is no need of going on checking for the others.

Linear solution

A typical way of reducing the time complexity of an algorithm is buying time with space. And this second solution does just that. Instead of having a loop in a loop, I have one after the other, using a buffer to store the results of the first loop to make them available to the second one:

int solution(int x, const std::vector<int>& input)
{
  std::vector<int> buffer(x, -1); // 1

  for(unsigned i = 0; i < input.size(); ++i) // 2
  {
    unsigned pos = input[i] - 1; // 3
    if(x < (int)pos)
      continue;

    if(buffer[pos] == -1)
      buffer[pos] = i;
  }

  int time = -1;
  for(unsigned i = 0; i < buffer.size(); ++i) // 4
  {
    if(buffer[i] == -1)
      return -1;
    if(buffer[i] > time)
      time = buffer[i];
  }

  return time;
}

1. I'm looking for the natural sequence ranging from 1 to x, so I need a vector sized x. Each element is initialized to -1, meaning that the associated value has not been found yet.
2. First loop, I'm checking all the values in input.
3. I convert the currently checked value in input to an index for the buffer. If the value is out of range, we skip it. Otherwise, if the current value has not already been found, we keep track of its position.
4. Second loop, I'm checking all the value in the buffer. If I found a value not set, I return error. Otherwise I keep track of the highest value, and the end I return it.

[After one year, I came back to this problem. As often happens, I devised another solution. Maybe it would look more intuitive to you.]

Go to the full post

Is it a permutation?

We want to write a function that gets in input a vector and check if it contains a permutation of the sequence of natural integers starting with 1.

This is an input that we should accept: { 4, 1, 3, 2 }
And this one should be rejected: { 4, 1, 3 }

You could find this problem in the Codility Train page, section Counting Elements, codename Perm-Check. You can submit you solution in one of a few different supported programming languages, to check it against their acceptance test.

My solution is written in C++98 (alas C++11 is not supported) with test cases for GoogleTest.

Test cases

A couple of test cases are given in the problem presentation, I just added a couple of trivial ones more:

int solution(std::vector<int>& input); // 1

TEST(PeCe, GivenGood) // 2
{
    std::vector<int> input;
    input.push_back(4);
    input.push_back(1);
    input.push_back(3);
    input.push_back(2);

    EXPECT_EQ(1, solution(input));
}

TEST(PeCe, GivenBad) // 3
{
    std::vector<int> input;
    input.push_back(4);
    input.push_back(1);
    input.push_back(3);

    EXPECT_EQ(0, solution(input));
}

TEST(PeCe, OneGood) // 4
{
    std::vector<int> input(1, 1);

    EXPECT_EQ(1, solution(input));
}

TEST(PeCe, OneBad) // 5
{
    std::vector<int> input(1, 42);

    EXPECT_EQ(0, solution(input));
}

TEST(PeCe, OneBigBad) // 6
{
    std::vector<int> input(1, 1000000000);

    EXPECT_EQ(0, solution(input));
}

1. The function prototype we have to implement. For some reason, instead of returning a boolean, it returns an integer that would act as a K&R C boolean emulation, 0 means false, 1 true.
2. First given test, it should detect the permutation.
3. Second given test, no permutation in input.
4. Simplest possible good case.
5. Simplest possible bad case.
6. A curious case. We should consider the case we have huge integers in input. Up to one billion, actually. This is a bit strange, since the max expected size for the vector is just a modest 100 thousand, however we should expect some trickery in the Codility acceptance test based on this point.

A too expensive solution

What we can think is repeatedly scan the input up looking for all the sequence values. If we don't find an expected one, we return failure, otherwise the input is accepted. We should smell immediately something bad. Looping on all the N elements of a container, checking for N values, leads to a Big Oh N Squared worst case time complexity, that we should avoid.

In any case, here it is:

int solution(std::vector<int>& input)
{
    for(unsigned i = 1; i <= input.size(); ++i) // 1
    {
        bool found = false;
        for(unsigned j = 0; j < input.size(); ++j) // 2
        {
            if(static_cast<unsigned>(input[j]) == i) // 3
            {
                found = true;
                break;
            }
        }
        if(!found)
            return 0;
    }

    return 1;
}

1. Our scrambled sequence should contains all the integers from one up to the number of elements in input.
2. Let's look for the current value.
3. The vector in input should have been parametrized for unsigned values, Codility decided otherwise, so here I say explicitly to the compiler not to worry about comparison with what it perceives as objects of different types. Trust me, they are actually both unsigned ints.

The code works fine for small input, but it rapidly gets too slow to be acceptable when the input size grows.

Linear solution

Let's divide the job in two consecutive steps, and use a buffer to store the temporary result. In this way, instead of having one loop nested in another one, we'll have one loop after the other, reducing the time complexity to O(N). We pay this improvement increasing the algorithm space complexity, but we happy to pay this price:

int solution(std::vector<int>& input)
{
    std::vector<bool> buffer(input.size()); // 1

    for(unsigned i = 0; i < input.size(); ++i) // 2
    {
        unsigned value = input[i]-1;
        if(value < input.size()) // 3
            buffer[value] = true;
        else
            return 0;
    }

    return std::count(buffer.begin(), buffer.end(), false) == 0 ? 1 : 0; // 4
}

1. Flags for the expected values in the input vector. The elements of a container are initialized with the default value for the underlying type. So in this moment here we have a bunch of false elements.
2. Loop on all the input values.
3. The OneBigBad test case we have seen above was written to help me to remember to write this check. If input contains (at least) a value that is bigger than expected, we know that we have to reject it. Only if the value is in the correct range we set its flag to true. Moreover, notice that in the line above I have also silently converted the value from signed to unsigned, if a rouge input included a negative value, it has been converted there to a huge positive numbers, and here the anomaly is caught. Better would have been to impose that the input vector was of unsigned elements.
4. I have hidden the second loop that uses the buffer partial result in this call to the STL count() function. We expect all no flag set to false anymore. If this is what happens, we can return success.

Sleeker and (maybe) faster

Actually, as suggested by Karsten, see his comments below, there is no actual need of the final buffer checking, if we have already carefully checked each value in input. The advantage of moving the check here is that we can fast fail as soon as we detect a duplicate element in input:

int solution(std::vector<int>& input)
{
    std::vector<bool> buffer(input.size());

    for(unsigned i = 0; i < input.size(); ++i)
    {
        unsigned value = input[i]-1;
        if(value < input.size() && buffer[value] == false) // 1
            buffer[value] = true;
        else
            return 0;
    }

    return 1; // 2
}

1. If we see that the flag has been already set, it means that we have spotted a duplicate, so we can reject the current input.
2. If we get here, each element in input has been accepted, we just have to return success.

Go to the full post

Equilibrium in a vector

We have an array of integer in input, whose size we know for sure being in the range [2..100,000] and each element is in [−1,000..1,000]. We'd like to split it in two parts, so that the sum of the elements on both sides is as close as possible. For some weird reason, we are not asked to return the index for which such condition is fulfilled, but the minimal difference, as an absolute value, between left and right sum.

You could find this problem in the Codility Train page, in the Time Complexity section, under the nickname Tape-Equilibrium. You could submit there your solution for evaluation in one of the many languages supported. C++11 is not one of them, so I have written my code for C++98.

[Good news, now Codility supports C++11. I have refreshed the code accordingly. Please follow the link for details.]

Notice that the number of elements in the vector and the value for each element is such that we can happily work with plain integers, being the maximum sum we could get something about a mere hundred millions. The real issue in this problem is time complexity, we should strive for a linear solution, if we want to get a 100% score.

Firstly, some test cases (written for GoogleTest):

int equilibrium(std::vector<int>& input); // 1

TEST(TapEq, Given) // 2
{
    std::vector<int> input;
    input.push_back(3);
    input.push_back(1);
    input.push_back(2);
    input.push_back(4);
    input.push_back(3);

    EXPECT_EQ(1, equilibrium(input));
}

TEST(TapEq, TwoBig) // 3
{
    std::vector<int> input;
    input.push_back(-1000);
    input.push_back(1000);

    EXPECT_EQ(2000, equilibrium(input));
}

TEST(TapEq, HundredK) // 4
{
    std::vector<int> input(100000, 1000);

    EXPECT_EQ(0, equilibrium(input));
}

TEST(TapEq, AlmostHundredK) // 5
{
    std::vector<int> input(99999, 1000);

    EXPECT_EQ(1000, equilibrium(input));
}

1. I don't feel right to pass around a non-const reference to a STL container if there is not any compelling reason to do that. Here the reason is just that Codility wants us to do that.
2. This is the test given with the problem description. The equilibrium point is such that the vector is split between (3, 1, 2) and (4, 3) leading to a difference of 1, that is going to be returned to the caller.
3. Minimal case, just two elements.
4. Biggest vector I could get in input, each element has value 100,000, so the result is zero.
5. Almost like the case (4), but here we have an odd number of elements, all of them have the same value, so it is not possible to get a perfect equilibrium.

Bad solution O(N**2)

We could think of looping on all the elements in the vector, from the first to the last but one. That would be our pivot, dividing the vector in two parts. Then we'll all the left and right elements, and compare the results:

int equilibrium(std::vector<int>& input)
{
    int result = std::numeric_limits<int>::max(); // 1
    for(unsigned i = 0; i < input.size() - 1; ++i)
    {
        int left = 0; // 2
        for(unsigned j = 0; j <= i; ++j)
            left += input[j];

        int right = 0; // 3
        for(unsigned j = i + 1; j < input.size(); ++j)
            right += input[j];

        int difference = std::abs(left - right); // 4
        if(difference < result)
            result = difference;
    }

    return result;
}

1. In the beginning we have no result, let's remark this initializing the variable to the highest available value.
2. Sum up all the elements on the left of the current pivot (included).
3. Sum up all the elements on the right of the current pivot (excluded).
4. Get the difference between left and right, and keep it, if it is the current minimum value.

This algorithm is straightforward, but it has a major issue. Its time complexity is in the order of N squared, as we can see immediately, given the twin for-loops in a for-loop.

And, all this summing up is not required. Or better, we repeats many time the same addictions we have already done. A better solution would permit us to minimize them to bare necessity.

Linear solution

Let's start splitting the vector in two parts. One contains just the leftmost element, the other one all the other elements. Then it would be just a matter of adding the current border element to the left and simultaneously subtracting it to the right. We still have two O(N) for-loops, but they are one after the other, so the resulting time complexity falls to linear:

int equilibrium(std::vector<int>& input)
{
    std::vector<int>::iterator pivot = input.begin();
    int left = *pivot;
    int right = std::accumulate(++pivot, input.end(), 0); // 1
    int result = std::abs(left - right);

    for(; pivot < input.end() - 1; ++pivot) // 2
    {
        left += *pivot;
        right -= *pivot;
        int diff = std::abs(left - right);
        if(diff < result)
            result = diff;
    }

    return result;
}

1. The first for-loop is hidden in this call to STL accumulate() function.
2. Second for-loop. Notice that I loop till the last but one element, since I want the right side of the vector to contain at least an element.

Go to the full post

The missing element

An easy problem, that you can find also in the Codility Time Complexity train section. I have written my solution in C++, you could test and submit your one in your preferred language (when available). Its Codility namecode is Perm-Missing-Elem.

In a few words: we are given in input a vector size N containing all the integers (with no duplicates) but one in [1..(N+1)] range. Our job is returning the missing one. We know that N is 100.000 maximum, we want a linear time and a constant space complexity.

First step, I have written the function prototype and I have tried to figure out a few test cases to help me in its design and development (I use Google Test, but any xUnit framework would do):

int missing(const std::vector<int>& input);

TEST(TestPME, CaseSample) // 1
{
  ASSERT_EQ(4, missing({ 2, 3, 1, 5 }));
}

TEST(TestPME, CaseEmpty) // 2
{
  ASSERT_EQ(1, missing({ }));
}

TEST(TestPME, CaseThree) // 3
{
  ASSERT_EQ(3, missing({ 2, 4, 1 }));
}

1. This is the test case provided by Codility. The vector is sized four, the biggest value contained could be five, it easy to see how the missing element is four. A technicality, I have used the C++11 handy notation to create a vector on the fly by an initialization list.
2. It is better to keep an eye on special cases. Here I check what happens when an empty vector is passed. N is zero, so only 1 could be the missing value.
3. There's not much to speculate on this function behavior. But for reason that would become evident in a little while, it is better to test it for both even and odd input sizes.

Naive solution

We could think of repeatedly scanning the vector looking for the missing value. But we should immediately see that this is not a viable solution. We need to loop for each possible value on all the vector elements, that means we are in the domain of a O(N square) time complexity.

Buying time with space

We can radically improve the time complexity using some more memory. Instead of performing a loop in a loop, we perform one after the other, reducing the time complexity to a mere O(N). Sure we have to pay for it, since we have to store the intermediate result somewhere, increasing the space complexity.

The idea is to scan the input vector, using the value read as index in another vector, to keep track that we have found it.
Then we scan the buffer, as soon as we find an element not set, we know that the original element was not present.

Here is my implementation:

int missing(const std::vector<int>& input)
{
  std::vector<bool> check(input.size()+1); // 1

  for(unsigned i = 0; i < input.size(); ++i) // 2
    check[input[i] - 1] = true;

  for(unsigned i = 0; i < check.size(); ++i) // 3
    if(check[i] == false)
      return i + 1;

  return 0;
}

1. I use as buffer a vector of boolean. Remember that we need one element more of the original vector. This STL vector constructor ensures that all the elements contained are set to its default value, that for boolean is false.
2. Loop on the input vector. Quite frighteningly, I haven't done any error handling, trusting the user to provide good stuff in. This is usually a very bad idea. In any case, I read the current value, I decrease it by one, getting a C-style 0-based index, and set the associated element in the check buffer to true, meaning "yeah, we have got this element".
3. Loop on the check buffer. The first element I find set to false is the one I am interested in (actually, it should be the first and only one). Convert back from zero-based index to actual value, increasing by one, and return it.

This solution is good enough to pass the Codility check, but doesn't look satisfactory to me.

Cheaper and more elegant

We are not so much interested in all the values we have in the input container. The unique value that is missing is what we are really interested in. We can spot it reasoning on the input nature. What we get is basically a scrambled sequence of the first N + 1 natural numbers from which we removed a single value. It is very easy to calculate an arithmetic series, if we don't consider the removal. At least since Gauss explained the trick. His classic formula for arithmetic series is usually written like this:

x = 1/2 * top * (top + 1)

What we do is adding the first and last element, top + 1, and multiply it for half the number of the elements. Think about it. To get the sum of all the integer in [1 .. 10], you can add (1 + 10), (2 + 9), ..., (5 + 6). That is 5 * 11.

Once we get the expected result for the full sequence, and remove from it the result we get summing the actual elements, what we get is the missing value:

int missing(const std::vector<int>& input)
{
  int64_t actual = std::accumulate(input.begin(), input.end(), int64_t(0)); // 1

  int64_t top = input.size() + 1;
  int64_t expected = top * (top + 1) / 2; // 2

  return expected - actual; // 3
}

1. The STL accumulate function sums up all the elements in the passed interval, using the third parameter as initial value. I have used as type an explicit 64 bit integer to get rid of problem related to different platforms on which this code could run. Pay attention to the fact that the size of the container and its biggest element could be 100K, so we could easily reach values in the range of billions.
2. I have rewritten the Gauss formula in this slightly different form to avoid the nuisance of multiplying for a floating number (1/2 = 0.5) when I know for sure that I am actually working with integer numbers only.
3. Implicit conversion from 64 bit integer to a plain (whatever means in your environment) integer. Since I am sure that the result is smaller than 100K, I am playing safely.

Go to the full post

Subscribe to: Posts (Atom)