Pages

Minimal element in subsequences

We have a non-empty string, that could contain up to 100,000 characters (only the 'A','C','G','T' ones). The user would specify a few ranges, up to 50 thousand of them, in the string, and he expects as output the indices relative to the lowest letter in that ranges.

The problem itself is not complicated, it just asks us to check carefully the requirements. You can found it in the Codility train page, prefix sums section, under the name Genomic-range-query.

We can expect only ACGT letters in input, because the idea is that it represents a DNA sequence. Pay attention to the fact that the user specifies the subsequence ranges as zero-based indices, and he expects the output as one-based indices. If 'A' is the answer, we have to output 1, and so on.

As tradition in these kind of problems, there is no stress at all in error handling. If this function should be used in a real world environment, we should check thoroughly the input. Here we can just assume anything is fine.

There is a test case in the problem description, I added a few more of them to help me understanding better how to design the code. Here are the original one and a couple of new ones, written in C++ for the xUnit GoogleTest framework:
std::vector<int> solution(const std::string& dna, // 0
        const std::vector<int>& begs, const std::vector<int>& ends);

TEST(GeRa, Simple) // 1
{
    std::string input("ACGT");

    std::vector<int> P;
    P.push_back(0);
    P.push_back(1);
    P.push_back(2);
    P.push_back(3);

    std::vector<int> Q;
    Q.push_back(0);
    Q.push_back(1);
    Q.push_back(2);
    Q.push_back(3);

    std::vector<int> output = solution(input, P, Q);
    ASSERT_EQ(4, output.size());

    EXPECT_EQ(1, output[0]);
    EXPECT_EQ(2, output[1]);
    EXPECT_EQ(3, output[2]);
    EXPECT_EQ(4, output[3]);
}

TEST(GeRa, Given) // 2
{
    std::string input("GACACCATA");

    std::vector<int> P;
    P.push_back(0);
    P.push_back(0);
    P.push_back(4);
    P.push_back(7);

    std::vector<int> Q;
    Q.push_back(8);
    Q.push_back(2);
    Q.push_back(5);
    Q.push_back(7);

    std::vector<int> output = solution(input, P, Q);
    ASSERT_EQ(4, output.size());

    ASSERT_EQ(1, output[0]);
    ASSERT_EQ(1, output[1]);
    ASSERT_EQ(2, output[2]);
    ASSERT_EQ(4, output[3]);
}

TEST(GeRa, Huge) // 3
{
    std::string input(100000, 'A');

    std::vector<int> P(50000);

    std::vector<int> Q(50000, input.size() - 1);

    std::vector<int> output = solution(input, P, Q);
    ASSERT_EQ(50000, output.size());

    for(unsigned i = 0; i < output.size(); ++i)
        EXPECT_EQ(1, output[i]);
}
0. Codility asks for a slightly different function interface, no parameter is marked there as const.
1. On a very short sequence ("ACGT"), I want to get the lower element on each single element subsequence, (0, 0), (1, 1), (2, 2), (3, 3). The expected result is a four-sized int vector containing 1, 2, 3, 4 (meaning 'A', 'C', 'G', 'T').
2. The Codility given test. From the input sequence "GACACCATA", we should return 1 for (0, 8), 1 for (0, 2), 2 for (4, 5), 4 for (7, 7).
3. This test case is meant to let the developer know how good is the chosen algorithm in term of time complexity. The input sequence is as long as possible, and we have to check the biggest possible number of subsequences that are all covering the entire range. I don't care much of the actual data, so in the input string I have only 'A', consequently as a result a 50,000 sized vector containing only 1's is expected.

Simple but slow solution

A natural solution would be require to check all the subintervals, looping from begin to end, looking for the lowest element.

Its obvious disadvantage is that it would be implemented with a for-in-a-for loop, that in the worst case would lead to an O(N*M) complexity, where N is the sequence size and M is the number of subsequences we need to check.

Here is a first naive implementation:
std::vector<int> solution(const std::string& dna,
        const std::vector<int>& begs, const std::vector<int>& ends)
{
    assert(begs.size() == ends.size()); // 1

    std::vector<int> result(begs.size()); // 2
    for (unsigned i = 0; i < begs.size(); ++i) // 3
    {
        char nucleotide = 'T'; // 4
        for (int j = begs[i]; j <= ends[i]; ++j) // 5
            if (dna[j] < nucleotide) // 6
                nucleotide = dna[j];

        result[i] = nucleotide == 'A' ? 1 : // 7
                    nucleotide == 'C' ? 2 :
                    nucleotide == 'G' ? 3 : 4;
    }
    return result;
}
1. Even if it is not required, I feel too bad not adding at least this minimal check on the input. If you are less paranoid than me, you can happily skip this line.
2. The output is going to be generated in this vector.
3. Check all the intervals.
4. Worst case, the current subsequence has a minimal nucleotide 'T'.
5. Loop on all the elements in the subsequence.
6. If the current element is less than the previous recorded nucleotide, it becomes the new minimal one.
7. I need to convert the selected nucleotide in its representative code.

There are a few improvements that we could operate on this piece of code, for instance, we can break the loop on (5) if we find that the current nucleotide is an 'A', because we have already found the best solution. However, this won't improve our worst case time complexity, it would only help its best and average cases.

We need to look for an altogether different approach.

Using more space to save time

I would like to get the minimal nucleotide in a subinterval just checking its intervals. To achieve this, I can use the partial sum algorithm, and check the difference between the values before entering and at the end of the interval.

A minor issue is that we have to keep track of four different values (A,C,G,T) and not a single variable. It is easy to overcome it, for example creating a vector for each nucleotide.

This generates a small nuisance, I need to convert the character that represent any nucleotide in its index in the vectors of partial sums. This is done with something similar to what I did at the point 7 above, but here I need a zero-based index. To keep the code readable, I created a tiny function to do this mapping:
int getType(char nucleotide)
{
    switch (nucleotide)
    {
    case 'A':
        return 0;
    case 'C':
        return 1;
    case 'G':
        return 2;
    default:
        return 3;
    }
}
What I am going to do is calculating the partial sum for each nucleotide, and then use them on each passed interval to get which is the minimal one present there.

Here is a possibile solution:
std::vector<int> solution(const std::string& dna, const std::vector<int>& begs,
        const std::vector<int>& ends)
{
    assert(begs.size() == ends.size());

    std::vector<std::vector<int> > psums(4); // 1
    for (int i = 0; i < 4; ++i)
        psums[i].resize(dna.size());

    for (unsigned i = 0; i < dna.size(); ++i) // 2
        psums[getType(dna[i])][i] = 1;

    for (int i = 0; i < 4; ++i) // 3
        std::partial_sum(psums[i].begin(), psums[i].end(), psums[i].begin());

    std::vector<int> result(begs.size());
    for (unsigned i = 0; i < begs.size(); ++i) // 4
    {
        int type = 3; // 5
        for (unsigned j = 0; j < 3; ++j) // 6
        {
            int left = begs[i] > 0 ? psums[j][begs[i] - 1] : 0; // 7
            int right = psums[j][ends[i]]; // 8
            if (right != left) // 9
            {
                type = j;
                break;
            }
        }

        result[i] = type + 1; // 10
    }
    return result;
}
1. I create four int vectors, same size of the input DNA sequence. They are initialized with the default int value, that is zero.
2. Scan the DNA sequence. Detect the nucleotide type (0..3) and put in the relative partial sum vector a 1 to mark its presence there.
3. Calculate the partial sum for each nucleotide.
4. Loop on all the subintervals.
5. As in the previous implementation, we already know that the worst case is 'T'. Let's check if there is any "lower" nucleotide. By the way, this also means that we don't really need the fourth psums vector. We could completely get rid of it and save some space.
6. Check of A, C, G nucleotides in this subinterval.
7. This line is a bit tricky. I need to know how many of the current nucleotide (j) have been already seen in the sequence before entering in the subsequence. This information is stored in the element to the immediate left of begs[i]. With one exception, when begs[i] is zero, meaning we are at the beginning of the full sequence. In that case we should use 0 instead, since obviously there were no j-nucleotide at that point.
8. No problem for the right element, just fetch the j-vector of psums, and fetch its ends[i] component.
9. If there is a variation between the final partial sum and the initial one, at least one j-nucleotide is in that interval. So we can set it as the minimal one and stop looping.
10. Last tweak. I stored the nucleotide type as a zero-based index, I need to increase it to match the user expectations.

No comments:

Post a Comment