The missing element

An easy problem, that you can find also in the Codility Time Complexity train section. I have written my solution in C++, you could test and submit your one in your preferred language (when available). Its Codility namecode is Perm-Missing-Elem.

In a few words: we are given in input a vector size N containing all the integers (with no duplicates) but one in [1..(N+1)] range. Our job is returning the missing one. We know that N is 100.000 maximum, we want a linear time and a constant space complexity.

First step, I have written the function prototype and I have tried to figure out a few test cases to help me in its design and development (I use Google Test, but any xUnit framework would do):
int missing(const std::vector<int>& input);

TEST(TestPME, CaseSample) // 1
  ASSERT_EQ(4, missing({ 2, 3, 1, 5 }));

TEST(TestPME, CaseEmpty) // 2
  ASSERT_EQ(1, missing({ }));

TEST(TestPME, CaseThree) // 3
  ASSERT_EQ(3, missing({ 2, 4, 1 }));
1. This is the test case provided by Codility. The vector is sized four, the biggest value contained could be five, it easy to see how the missing element is four. A technicality, I have used the C++11 handy notation to create a vector on the fly by an initialization list.
2. It is better to keep an eye on special cases. Here I check what happens when an empty vector is passed. N is zero, so only 1 could be the missing value.
3. There's not much to speculate on this function behavior. But for reason that would become evident in a little while, it is better to test it for both even and odd input sizes.

Naive solution

We could think of repeatedly scanning the vector looking for the missing value. But we should immediately see that this is not a viable solution. We need to loop for each possible value on all the vector elements, that means we are in the domain of a O(N square) time complexity.

Buying time with space

We can radically improve the time complexity using some more memory. Instead of performing a loop in a loop, we perform one after the other, reducing the time complexity to a mere O(N). Sure we have to pay for it, since we have to store the intermediate result somewhere, increasing the space complexity.

The idea is to scan the input vector, using the value read as index in another vector, to keep track that we have found it.
Then we scan the buffer, as soon as we find an element not set, we know that the original element was not present.

Here is my implementation:
int missing(const std::vector<int>& input)
  std::vector<bool> check(input.size()+1); // 1

  for(unsigned i = 0; i < input.size(); ++i) // 2
    check[input[i] - 1] = true;

  for(unsigned i = 0; i < check.size(); ++i) // 3
    if(check[i] == false)
      return i + 1;

  return 0;
1. I use as buffer a vector of boolean. Remember that we need one element more of the original vector. This STL vector constructor ensures that all the elements contained are set to its default value, that for boolean is false.
2. Loop on the input vector. Quite frighteningly, I haven't done any error handling, trusting the user to provide good stuff in. This is usually a very bad idea. In any case, I read the current value, I decrease it by one, getting a C-style 0-based index, and set the associated element in the check buffer to true, meaning "yeah, we have got this element".
3. Loop on the check buffer. The first element I find set to false is the one I am interested in (actually, it should be the first and only one). Convert back from zero-based index to actual value, increasing by one, and return it.

This solution is good enough to pass the Codility check, but doesn't look satisfactory to me.

Cheaper and more elegant

We are not so much interested in all the values we have in the input container. The unique value that is missing is what we are really interested in. We can spot it reasoning on the input nature. What we get is basically a scrambled sequence of the first N + 1 natural numbers from which we removed a single value. It is very easy to calculate an arithmetic series, if we don't consider the removal. At least since Gauss explained the trick. His classic formula for arithmetic series is usually written like this:
x = 1/2 * top * (top + 1)
What we do is adding the first and last element, top + 1, and multiply it for half the number of the elements. Think about it. To get the sum of all the integer in [1 .. 10], you can add (1 + 10), (2 + 9), ..., (5 + 6). That is 5 * 11.

Once we get the expected result for the full sequence, and remove from it the result we get summing the actual elements, what we get is the missing value:
int missing(const std::vector<int>& input)
  int64_t actual = std::accumulate(input.begin(), input.end(), int64_t(0)); // 1

  int64_t top = input.size() + 1;
  int64_t expected = top * (top + 1) / 2; // 2

  return expected - actual; // 3
1. The STL accumulate function sums up all the elements in the passed interval, using the third parameter as initial value. I have used as type an explicit 64 bit integer to get rid of problem related to different platforms on which this code could run. Pay attention to the fact that the size of the container and its biggest element could be 100K, so we could easily reach values in the range of billions.
2. I have rewritten the Gauss formula in this slightly different form to avoid the nuisance of multiplying for a floating number (1/2 = 0.5) when I know for sure that I am actually working with integer numbers only.
3. Implicit conversion from 64 bit integer to a plain (whatever means in your environment) integer. Since I am sure that the result is smaller than 100K, I am playing safely.

No comments:

Post a Comment