Pages

Most/least common letters

This is a simple interview question that is often used to break the ice. We have a string, find the most used letters in it. As a variation, we could be interested in which letters appears only once in it. It is not difficult to work on the most general case, but to keep it simple we can assume only lowercase letters are used and that the input string is not an extremely long one.

Here is a linear C++11 solution for both questions. The idea is keeping track of the letters' frequency in an array where each element represents a letter and then scanning it to get the required information. Firstly, I defined a class interface:
class Tracker
{
private:
  std::array<int, 26> frequency_ {};  // 1
public:
  Tracker(const std::string& input, int watermark = std::numeric_limits<int>::max()); // 2

  std::string mostUsed(); // 3
  std::string uniques();
};
1. Element zero keeps track of a's, one of b's up to the twenty-five for 'z's. Notice the default C++11 in-class member initializer, an elegant way to set to zero all the array elements at object constructing time.
2. The constructor expects in input the string we want to analyze, and an optional int that defines when we have to stop counting. By default it is set to the maximum int value, just to avoid an overflow. In this way we could pass in input even huge strings, if we are ready to accept a possible imprecise answer.
3. For getting the most used letters in the input string.
4. Gives all the letters used only once.

A few test cases (written for Google Test, it should be easy to port them to any xUnit framework) that should clarify how the code should behave:
TEST(TestTracker, CaseEmpty) // 1
{
  Tracker tracker("");
  EXPECT_TRUE(tracker.uniques().empty());
  EXPECT_TRUE(tracker.mostUsed().empty());
}

TEST(TestTracker, CaseNormal) // 2
{
  Tracker tracker("something may be in here");
  EXPECT_EQ("abgorsty", tracker.uniques());
  EXPECT_EQ("e", tracker.mostUsed());
}

TEST(TestTracker, CaseWobly4) // 3
{
  Tracker tracker("woblywoblywoblywobly");
  EXPECT_TRUE(tracker.uniques().empty());
  EXPECT_EQ("blowy", tracker.mostUsed());
}

TEST(TestTracker, CaseWatermark2) // 4
{
  Tracker tracker("and something else may be here", 2);
  EXPECT_EQ("bdgilorty", tracker.uniques());
  EXPECT_EQ("aehmns", tracker.mostUsed());
}
1. Empty string in input, neither unique nor most used letters are expected.
2. A random string (but only with lowercase letters, and remember that any other characters - like blanks - are discarded) with a few unique letters and 'e' as the most used one. Notice that the Tracker methods return the requested letters in alphabetical order.
3. A word compose by unique letters repeated four times. No unique letter, all the letter in input are in the most used output.
4. Setting a watermark sort of truncate the check on the most used letters. All the ones having a frequency equal or greater than the watermark are returned.

Here is how I have defined the class methods:
Tracker::Tracker(const std::string& input, int watermark)
{
  for(unsigned i = 0; i < input.size(); ++i)
  {
    char c = input[i];
    if(std::islower(c) && frequency_[c - 'a'] < watermark) // 1
      ++frequency_[c - 'a'];
  }
}

std::string Tracker::uniques() // 2
{
  std::string result;

  for(unsigned i = 0; i < frequency_.size(); ++i)
  {
    if(frequency_[i] == 1)
      result += 'a' + i;
  }

  return result;
}

std::string Tracker::mostUsed() // 3
{
  std::string result;

  int top = 0;

  for(unsigned i = 0; i < frequency_.size(); ++i)
  {
    if(frequency_[i] == 0 || frequency_[i] < top) // 4
      continue;

    if(frequency_[i] > top) // 5
    {
      top = frequency_[i];
      result.clear();
    }

    result += 'a' + i; // 6
  }

  return result;
}
1. The constructor loops on all the characters in the input string. If the current one is a lowercase letter, and if the watermark has not been reached yet, its counting is increased. As said above, the 'a' frequency is stored in the first element of the array, etc.
2. Getting the unique letters is quite simple. It is just a matter of looping on the frequency array, when a 1 is found, we add the associated letter to the list (actually, a string) of the accepted cases. If no unique letter is found, an empty string is returned.
3. A bit more complicated the algorithm for the most used ones. A simpler approach would be performing two for loops. First one to identify the top frequency, second one to push all the letter with such a value back to the result string. Here I have decided to use a more involute algorithm just to have some fun. I do not expect any substantial difference in performance or complexity.
4. If the current letter was not in the string (its frequency is zero) or it is less popular than an already checked one, drop it, and pass to consider the next one.
5. If the current letter has an higher frequency than the previous ones, I keep track of its frequency (in top) and I reset the list of most common letters (the string result).
6. Push the current letter back to the result, to keep the alphabetic order. Notice that this instruction is executed both when we have a new top frequency (in this case the current letter would be the first listed one) and when the frequency is high as previously checked top letter (and in this case it would be the current last one).

No comments:

Post a Comment