Pages

Codility GenomicRangeQuery

Given a string representing a DNA sequence as a collection of 'A', 'C', 'G', 'T' letters, and number of subsequences in it, we should tell which is the minimal impact factor for each subsequence, where nucleotides A, C, G and T have impact factors of 1, 2, 3 and 4, respectively.

You could solve this problem on your own in the Codility Prefix Sum section. I have already given a couple of C++ solutions on a previous post, here I would just refactor the linear one to C++11.

Here is the Codility given test case, ported to GTest:
TEST(GenomicRangeQuery, Given)
{
    std::string dna { "CAGCCTA" };
    std::vector<int> first { 2, 5, 0 };
    std::vector<int> last { 4, 5, 6 };
    ASSERT_EQ(first.size(), last.size());

    std::vector<int> result = solution(dna, first, last);

    ASSERT_EQ(first.size(), result.size());
    EXPECT_EQ(2, result[0]); // 1
    EXPECT_EQ(4, result[1]);
    EXPECT_EQ(1, result[2]);
}
1. Subsequence 2, 4 is "GC", C has a lower impact factor, the result should be 2.
2. "T" leads to 4.
3. The full sequence has A's in it, so 1 is its minimal impact factor.

This is my C++11 solution:
int getType(char nucleotide) // 1
{
    switch (nucleotide)
    {
    case 'A':
        return 0;
    case 'C':
        return 1;
    case 'G':
        return 2;
    default:
        return 3;
    }
}

std::vector<int> solution(std::string& sequence, std::vector<int>& first, std::vector<int>& last)
{
    std::vector<std::vector<int>> parsums(4); // 2
    std::for_each(parsums.begin(), parsums.end(), [&sequence](std::vector<int>& cur)
    {
        cur.resize(sequence.size());
    });

    for (unsigned i = 0; i < sequence.size(); ++i) // 3
    {
        parsums[getType(sequence[i])][i] = 1;
    }

    for (int i = 0; i < 4; ++i) // 4
    {
        std::partial_sum(parsums[i].begin(), parsums[i].end(), parsums[i].begin());
    }

    std::vector<int> result(first.size()); // 5
    for (unsigned i = 0; i < first.size(); ++i)
    {
        int type = 3;
        for (unsigned j = 0; j < 3; ++j)
        {
            int left = first[i] > 0 ? parsums[j][first[i] - 1] : 0; // 6
            int right = parsums[j][last[i]];
            if (right != left) // 7
            {
                type = j;
                break;
            }
        }

        result[i] = type + 1; // 8
    }

    return result;
}
1. Trivial utility function to convert any nucleotide in a zero-based index.
2. I am going to put each nucleotide in its own vector. It would cost quite a lot in term of memory, but it is worthy in terms of execution time.
3. If I spot a T on position i, the element i on the T-vector is marked as found.
4. I convert each nucleotide specific vector in its own partial sum. In this way I am going to see just accessing first and last element in a give subsequence if that nucleotide is there. No need to loop in a loop, no quadratic complexity.
5. This is the vector I am going to give back to the caller.
6. We need to pay attention to the special case that occurs when the subsequence starts at the beginning of the sequence. Usually we need to get the value of the previous element, in this case there is not such a thing.
7. If the right partial sum has changed (actually, it could only be greater or equal) we have a nucleotide of the specific type. Since I am checking from the lowest value up, as soon as I find it, I break the loop.
8. The user wants a 1-based index to identify the solution.

2 comments:

  1. So NIce ! I am loving it. How about return move(result);

    ReplyDelete
  2. Also as you mentioned in earlier poast , saving space of T

    vector< vector > psum(3); // 3 x S.size() array

    //Form placement matrix of A,C,G in the sequence
    for (int i = 0; i < S.size(); ++i) //N
    {
    int ntd = getType(S[i]);
    if (ntd<3) psum[ntd][i] = 1; //Leave T for now
    }


    //Form partial sum of each nucleotide A, C, G.
    for_each(psum.begin(), psum.end(), [](vector& x) {
    partial_sum(x.begin(), x.end(), x.begin());
    });

    ReplyDelete