Installing and configuring Oracle 12c

It has been a while since I last installed an instance of the Oracle database for local use. Not much has changed in the process, with the notable exception of the introduction of pluggable databases. Here I jot down a few remarks I made while setting up version 12c Release 2.

1. On the Typical Installation page we have to specify the name for our pluggable database. Remember your choice!
2. The Enterprise Manager Database Express is available at https://localhost:5500/em; access it as the SYS user (with the password specified during setup).
3. Add to tnsnames.ora (in the network\admin directory under the database home) the network configuration settings to access the pluggable database we created in (1). Something like:
PDBORCL =
  (DESCRIPTION =
    (ADDRESS = (PROTOCOL = TCP)(HOST = localhost)(PORT = 1521))
    (CONNECT_DATA =
      (SERVER = DEDICATED)
      (SERVICE_NAME = pdborcl)
    )
  )
4. Use sqlplus to connect to the pluggable database with SYS as sysdba:
sqlplus sys/Oracle_1@pdborcl as sysdba
5. As usual, unlock the hr user to access the test tables:
alter user hr identified by hr account unlock;
So far so good. All standard stuff. A problem arises if you plan to stop the Oracle service, which should be named OracleServiceORCL. Since Oracle is memory hungry, sometimes I just want it to stay out of my RAM. The first time I stopped it, I had the surprise of getting a nasty error when I tried to connect to the database after restarting. Basically, the issue was that the pluggable database was mounted but not opened.

This is the recipe I followed to solve it:

1. From the shell, enter sqlplus without logging in.
sqlplus /nolog
2. In sqlplus, connect as sysdba.
connect / as sysdba
3. Brutally shut down the database.
shutdown abort
4. Restart it in mount mode.
startup mount
5. Recover the database, then open it.
recover database 
alter database open;
6. Open all the pluggable databases, then ask them to save their state, so that we don't have to repeat the maneuver after each restart.
alter pluggable database all open;
alter pluggable database pdb_name save state;

CodeEval Minimum Path Sum

Given a square matrix of integers, find the minimal path sum from top-left to bottom-right, assuming that we can only move down or right.
This is CodeEval problem #72.

For example, given this matrix
4 6
2 8
We have just two possible paths: 4 -> 6 -> 8 and 4 -> 2 -> 8. It is immediately clear that the second one leads to the minimum sum, 14.

The structure of the problem recalls a Dynamic Programming solution table, and I found it quite straightforward to turn this thought into a first solution:
import math

def solution(data):  # 1
    dp = [[math.inf] * (len(data) + 1)]  # 2
    for i in range(len(data)):
        dp.append([math.inf])
        dp[i+1].extend(data[i])

    end = len(dp)
    for i in range(1, end):  # 3
        for j in range(1, end):
            if i == 1 and j == 1:  # 4
                continue
            dp[i][j] += min(dp[i-1][j], dp[i][j-1])  # 5
    return dp[-1][-1]  # 6
1. The input parameter data contains a list of lists of integers, representing the input matrix.
2. I copy the input into a DP-style table, where the first row and column hold fake values. Usually, in a DP problem, zero is used for those cells. Here, however, we need a value that can never be a candidate minimum. In a C-family language, I would have chosen a constant representing the maximum available integer. Python 3 has nothing like that, so I use math.inf, the floating-point representation of positive infinity.
3. Calculate all the possible minimal path sums, skipping the first row and column.
4. This check looks unsatisfactory to me, but it is the least confusing way I found to say that nothing has to be done for the top-left element of the original matrix.
5. All the other cells are adjusted by adding the smaller of the elements immediately to the left and above.
6. There is nothing else to do but return the value in the bottom-right cell.
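As a quick sanity check, this is how I would exercise it on the example above (a minimal sketch, assuming solution() is defined as shown):
matrix = [[4, 6],
          [2, 8]]
print(solution(matrix))  # 4 -> 2 -> 8, expected output: 14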

I find this solution elegant enough and easy to understand. However, I decided to write an alternative one to get rid of the buffer table, and also to navigate the table backward, from bottom-right to top-left.
def solution_2(data):
    last = len(data) - 1
    for i in range(last, -1, -1):
        for j in range(last, -1, -1):  # 1
            if i == last and j == last:  # 2
                continue
            if i == last:  # 3
                data[i][j] += data[i][j+1]
            elif j == last:
                data[i][j] += data[i+1][j]
            else:
                data[i][j] += min(data[i + 1][j], data[i][j+1])
    return data[0][0]  # 4
1. The loop variables start at the last row/column and are decreased down to zero.
2. Again, nothing has to be done for the starting cell, here the bottom-right one.
3. Not having the extra row/column, I have to add an explicit boundary check for both i and j.
4. In this case the solution ends up in the top-left cell.
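A caveat worth noting: solution_2() works in place, overwriting the input matrix. A caller that still needs the original data could pass a copy, for instance:
import copy

matrix = [[4, 6],
          [2, 8]]
print(solution_2(copy.deepcopy(matrix)))  # expected output: 14; matrix is left untouched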

Not using a second table should lead to a performance improvement, at least if large matrices are expected in input, and its cost in terms of reduced readability looks bearable to me. Scanning the table backward, instead, doesn't seem to give any advantage and makes the code a bit more obscure, so I would probably get rid of it.

I pushed the full python code on GitHub.

Using JdbcTemplate in Spitter Web App

In the previous post I developed a CommandLineRunner Spring App to show how to use JdbcTemplate to simplify JDBC access to an embedded instance of an H2 database. Here I port those changes to the Spittr Web App I built previously.

Actually, there is not much to say. There is some inconsistency between the two apps that spiced up the process, but nothing really complicated. I'd say the most interesting part is in the POM file. Here I removed, wherever possible, the version for each single dependency, relying instead on the parent element, where they are defined.
<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>1.5.4.RELEASE</version>
    <relativePath />
</parent>
However, for some reason, Spring could not find the version for the JSP dependency, so I kept it as before:
<dependency>
    <groupId>javax.servlet.jsp</groupId>
    <artifactId>javax.servlet.jsp-api</artifactId>
    <version>${jsp.version}</version>
</dependency>
And I left its version among the properties:
<properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <project.reporting.outputEncoding>UTF-8</project.reporting.outputEncoding>
    <java.version>1.8</java.version>
    <jsp.version>2.3.1</jsp.version>
</properties>
I pushed the new version of Spittr on GitHub.

Spring and JdbcTemplate

Let's say that we want to write a Spring application that requires some simple database activity. We decide to use an embedded database, H2, and, instead of choosing a higher-level persistence framework, we go for plain JDBC support, mediated by JdbcTemplate, a Spring class aimed at simplifying its use.

I create a Spring project using the STS wizard (File >> New >> Spring Starter Project).
Since I want JDBC and H2 database support, I select those two options from the dependencies page.
From database schema to POJOs

In the src/main/resources folder I put a file named schema.sql. This is enough for Spring, which knows it has to pick it up and run it at startup, so there I put the SQL commands to create my tables. I create two tables, spitter and spittle, connected by the id of the first, which is a foreign key in the second.

The tables defined in the database are reflected in a couple of plain old Java classes, Spitter.java and Spittle.java, whose fields mirror the table columns.

There is one thing that I find interesting. The SQL spittle table has, among its other fields, a datetime, postedTime. We need to find a counterpart for it in the Java world. Since we are not interested here in time zones and the like, just in a plain timestamp, I decided to use the Java 8 java.time.Instant.

That's it. If the SQL stuff is simple, and I have just a couple of tables, I could call it done. To see that, I have created a basic version of the SpringBootApplication that implements the CommandLineRunner interface:
@SpringBootApplication
public class UsingJdbcApplication implements CommandLineRunner {
    private static final Logger log = LoggerFactory.getLogger(UsingJdbcApplication.class);

    @Autowired
    private JdbcTemplate template;  // 1

    public static void main(String[] args) {
        SpringApplication.run(UsingJdbcApplication.class, args);
    }

    @Override
    public void run(String... args) throws Exception {
        Long counter = template.queryForObject("SELECT count(id) FROM spitter", Long.class);  // 2
        log.info("Rows on spitter table: " + counter);

        log.info("Pushing a spitter in");
        Object[] admin = new Object[] { "admin", "password", "Admin", "admin@example.dd" };
        template.update("INSERT INTO spitter(username, password, fullname, email) VALUES(?, ?, ?, ?)", admin);  // 3

        template.query("SELECT * FROM spitter WHERE username = ?", new Object[] { "admin" },  // 4
                (rs, rowNum) -> new Spitter(rs.getLong("id"), rs.getString("username"), rs.getString("password"),
                        rs.getString("fullname"), rs.getString("email"), rs.getBoolean("updateByEmail")))
                .forEach(spitter -> log.info("Spitter: " + spitter.toString()));
    }
}
1. Sort of magic. We ask Spring to wire this JdbcTemplate object. Since the H2 jar is on the classpath, it understands it has to use it as the database. We don't provide any settings, so it uses a whole bunch of defaults. Usually not a good idea for production code, but very handy for prototyping.
2. Let's run a simple query that returns a single value. This line shows that Spring has actually initialized the database using the script named schema.sql found in the main resources folder, and how JdbcTemplate makes our life much simpler. Notice that queryForObject() takes as first parameter the SQL statement we want to run, and as second the type of the expected return value.
3. A parametrized update(). Its first parameter is a parametrized JDBC SQL insert statement, the second an array of objects containing the values we want to use in the query.
4. Something a bit more complicated. The query() method executes a query in a way similar to the update() seen above but, as third parameter, it accepts a lambda that maps each row of the resulting result set to a new Spitter object. We typically have no use for the rowNum parameter, and just access any result set field we need.
This query() call returns a list of Spitters. On that list we apply a forEach() so that each element is logged. Actually, here we expect just one Spitter, but this is a common pattern that we should learn to recognize and use.

Running the Spring Boot App, I can see in the console window the expected log.

Repositories and configuration

Usually we don't want to scatter SQL statements all over our Java code. So we create two interfaces, one for Spittle and one for Spitter, that act as repositories, and then we implement them for our JDBC provider.
The two classes are quite similar. Let's see some parts of JdbcSpittleRepository, which is a tad more complicated.
// ...
private static final String SELECT = "select sp.id, s.id as spitterId, s.username, s.password, s.fullname, s.email, s.updateByEmail, sp.message, sp.postedTime from Spittle sp, Spitter s where sp.spitter = s.id";
private static final String SELECT_RECENT = SELECT + " order by sp.postedTime desc limit ?";

public List<Spittle> findRecent(int count) {
    return template.query(SELECT_RECENT, new SpittleRowMapper(), count);  // 1
}

// ...
private long insertSpittle(Spittle spittle) {
    SimpleJdbcInsert insert = new SimpleJdbcInsert(template).withTableName("Spittle");  // 2
    insert.setGeneratedKeyName("id");
    Map<String, Object> args = new HashMap<String, Object>();
    args.put("spitter", spittle.getSpitter().getId());
    args.put("message", spittle.getMessage());
    args.put("postedTime", Timestamp.from(spittle.getPostedTime()));
    long spittleId = insert.executeAndReturnKey(args).longValue();
    return spittleId;
}

// ...
private static final class SpittleRowMapper implements RowMapper<Spittle> {  // 3
    public Spittle mapRow(ResultSet rs, int rowNum) throws SQLException {
        long id = rs.getLong("id");
        String message = rs.getString("message");
        Instant postedTime = rs.getTimestamp("postedTime").toInstant();
        long spitterId = rs.getLong("spitterId");
        String username = rs.getString("username");
        String password = rs.getString("password");
        String fullName = rs.getString("fullname");
        String email = rs.getString("email");
        boolean updateByEmail = rs.getBoolean("updateByEmail");
        Spitter spitter = new Spitter(spitterId, username, password, fullName, email, updateByEmail);
        return new Spittle(id, spitter, message, postedTime);
    }
}
1. Almost an innocent JdbcTemplate wrapper around a JDBC query, if we don't pay much attention to the second parameter, which converts the ResultSet coming from the SQL select into a list of Spittles. See the SpittleRowMapper below for details.
2. Here we see how SimpleJdbcInsert helps us write readable code when we want to insert a row in a table, the database takes the burden of generating a key for us, and we need to return that key to the caller. After specifying the table we are working on, we set the name of the generated-key field and then, as a map, the field names and associated values. Finally, we just pass them to SimpleJdbcInsert via executeAndReturnKey().
3. The SpittleRowMapper seen in action in (1). Its mapRow() method maps a single row of the underlying table to a Spittle object.

In some way, Spring has to configure these repositories, passing them a viable JdbcTemplate object. In this app it is done by JdbcConfig, annotated as a Spring Configuration class. It generates a few beans (in the Spring sense of the term) representing the DataSource, the JdbcTemplate, the repositories, and also a transaction manager. We'll see them at work in the tests. It is interesting to see how the DataSource is set up when we don't rely on Spring's default initialization.
@Configuration
public class JdbcConfig {
    @Bean
    public DataSource dataSource() {
        return new EmbeddedDatabaseBuilder()
                .setType(EmbeddedDatabaseType.H2)
                .addScripts("schema.sql", "test-data.sql")
                .build();
    }
 
    // ...
This data source represents an embedded database, of H2 type, that runs a couple of scripts at startup. We have already seen the first one, schema.sql; here I added a second one, test-data.sql, which contains a bunch of SQL inserts for spitter and spittle rows.

Testing

There are a couple of JUnit tests, one for the Spitter repository, the other for the Spittle one. They are quite similar; let's have a quick look at the first.
@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = JdbcConfig.class)  // 1
public class JdbcSpitterRepositoryTest {
    @Autowired
    JdbcSpitterRepository spitterRepository;  // 2

    // ...
    @Test
    @Transactional  // 3
    public void findAll() {
        List<Spitter> spitters = spitterRepository.findAll();  // 4
        assertEquals(4, spitters.size());
        assertSpitter(0, spitters.get(0));
        assertSpitter(1, spitters.get(1));
        assertSpitter(2, spitters.get(2));
        assertSpitter(3, spitters.get(3));
    }

    //...
}
1. Here I tell Spring to fetch the JdbcConfig class and use it to configure itself for testing.
2. Thanks to (1), we can ask Spring to autowire the spitter repository.
3. Having defined a PlatformTransactionManager, we can use it to keep transaction management very simple. A transaction is started and completed, or possibly rolled back, behind the curtains.
4. The spitters come from test-data.sql, run at startup by the JdbcConfig dataSource().

Reference: Hitting the database with Spring and JDBC, from Spring in Action, Fourth Edition by Craig Walls, chapter ten.

This complete Spring application is on GitHub.

CodeEval Word Search

Given a board with a letter in each cell, check whether a given word can be traced on it. We can start anywhere, but we can move only horizontally and vertically, and each cell can be used only once. More details on the CodeEval page.

The board is fixed, and looks like this:
ABCE
SFCS
ADEE
I have converted the examples provided in the problem into a python test case:
class TestCodeEval(unittest.TestCase):
    def test_provided_1(self):
        self.assertEqual(False, solution('ASADB'))

    def test_provided_2(self):
        self.assertEqual(True, solution('ABCCED'))

    def test_provided_3(self):
        self.assertEqual(False, solution('ABCF'))
For instance, it should be easy to see that we can't follow ASADB on the board. After ASAD there is an 'F' that blocks the way to the final 'B'.

I have converted the board into a couple of dictionaries:
WHERE = {'A': {(0, 0), (2, 0)},
         'B': {(0, 1)},
         'C': {(0, 2), (1, 2)},
         'D': {(2, 1)},
         'E': {(0, 3), (2, 2), (2, 3)},
         'F': {(1, 1)},
         'S': {(1, 0), (1, 3)}}

NEIGHBORS = {
    (0, 0): {(0, 1), (1, 0)},
    (0, 1): {(0, 0), (1, 1), (0, 2)},
    (0, 2): {(0, 1), (1, 2), (0, 3)},
    (0, 3): {(0, 2), (1, 3)},

    (1, 0): {(0, 0), (1, 1), (2, 0)},
    (1, 1): {(1, 0), (0, 1), (2, 1), (1, 2)},
    (1, 2): {(1, 1), (1, 3), (0, 2), (2, 2)},
    (1, 3): {(1, 2), (0, 3), (2, 3)},

    (2, 0): {(1, 0), (2, 1)},
    (2, 1): {(2, 0), (2, 2), (1, 1)},
    (2, 2): {(2, 1), (1, 2), (2, 3)},
    (2, 3): {(2, 2), (1, 3)}
}
The first one, WHERE, is a pure utility that lets me easily find where each letter is on the board.
The second one, NEIGHBORS, is a plain adjacency list, and should hint that I saw a graph in this problem. Given the adjacency rules of this board it is not strictly necessary, but it helped me think more clearly about the solving algorithm.
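Both dictionaries are hard-coded here, but they could also be derived from the raw board. A possible sketch, assuming the board is stored as a list of strings (BOARD is my name, not part of the original script):
from collections import defaultdict

BOARD = ['ABCE', 'SFCS', 'ADEE']

WHERE = defaultdict(set)
NEIGHBORS = {}
for i, row in enumerate(BOARD):
    for j, letter in enumerate(row):
        WHERE[letter].add((i, j))
        NEIGHBORS[(i, j)] = {(i + di, j + dj)
                             for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                             if 0 <= i + di < len(BOARD) and 0 <= j + dj < len(row)}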

My solution is based on a graph search algorithm, adapted to the peculiarity of the problem. For each letter in the word, I check the possible continuations, backtracking when I don't find a good way, until the word is consumed or I can't find any way to continue the search.

After some startup code (sketched after the notes below), I only need a recursive function that moves from one step to the next:
def find(node, visited, letters):  # 1
    if not letters:  # 2
        return True

    letter = letters.popleft()  # 3
    nodes = WHERE[letter] & NEIGHBORS[node]  # 4
    for node in nodes:
        if node not in visited:  # 5
            visited.add(node)
            if find(node, visited, letters):  # 6
                return True
            visited.remove(node)
    letters.appendleft(letter)  # 7
    return False
1. I pass to the function the current position on the board, a set of already visited nodes, and a deque containing the letters still to be found.
2. No letters pending means I have already found the word. Job done.
3. Extract from the deque the next letter to find.
4. The intersection between the set of positions where the letter appears and the set of nodes adjacent to the current one gives me the nodes where I can look for the next step.
5. But only the nodes that I have not already visited are good candidates.
6. Recursive call.
7. If this branch is not satisfactory, I push the letter back into the deque and return False.
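For completeness, here is a minimal sketch of how the startup code could look, trying each occurrence of the first letter as a starting node (this solution() is my reconstruction, not the original script):
from collections import deque

def solution(word):
    letters = deque(word)
    first = letters.popleft()
    for start in WHERE.get(first, ()):  # try each occurrence of the first letter
        if find(start, {start}, deque(letters)):  # pass a fresh copy of the remaining letters
            return True
    return False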

Full code, both test and actual python script, is on GitHub.

CodeEval Levenshtein Distance

We are given in input a short list of words, a separator, then another, huge list of words. For each word in the first list, we have to return the number of elements in its "social network" from the second list. A word is a "friend" of another one if their Levenshtein distance is one, meaning they differ by just one letter, as "aa" with "aaa", or "aaa" with "aba". Be aware that the example provided by CodeEval, when tested against the file they pushed on GitHub, is slightly wrong.

The problem is conceptually simple; its main issue is performance. Let's first write a straightforward solution that scores poorly even when implemented in C++, and doesn't even get accepted in my python implementation, due to its slowness.

The definition of friendship given in this problem should make us think about graphs and cliques. Fortunately, we are not asked to find the maximum clique in the network, but only the social network of the few specified nodes. So we should expect "just" a Big-O N squared time complexity on the number of elements in the full social network which, unfortunately, we know to be large.

Still, on top of that N x N, we should consider the cost of calculating the Levenshtein distance for each couple of words. This problem leads almost naturally to a Dynamic Programming solution; however, also in this case we should expect an expensive time complexity, give or take in the order of m x n, where m and n are the lengths of the two words.

A Levenshtein distance of one

But, wait a minute, we are not really interested in the Levenshtein distance, just in determining whether for two words it equals one. This is a much simpler problem, and we can even break it into two parts. If the two words have the same size, we just check that there is one and only one different character between them, i.e., that we can pass from one string to the other with a single character swap:
def is_swap(lhs, rhs):
    difference = False
    for left, right in zip(lhs, rhs):  # 1
        if left != right:  # 2
            if difference:
                return False
            difference = True
    return difference  # 3
1. In production code I would have asserted the precondition len(lhs) == len(rhs); here I just rely on the caller to ensure it. Assuming lhs and rhs have the same size, I zip them, extracting the characters at the same position into the loop variables left and right.
2. If left does not equal right, I raise the difference flag. But if it was already raised, I know the difference is too strong, and I can stop checking.
3. I return True only if there is one and only one difference.
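A few quick checks I would expect to pass, given the rules above (my own examples):
assert is_swap('aaa', 'aba')      # exactly one character differs
assert not is_swap('abc', 'abc')  # identical words are not friends
assert not is_swap('abc', 'cba')  # two differences are too many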

The other part of my simplified Levenshtein check: if two strings differ in size by one, the longer should be the same as the shorter, except for one extra character.
def is_add(shorter, longer):
    if len(shorter) > len(longer):  # 1
        shorter, longer = longer, shorter

    x = 0  # 2
    i = 0  # 3
    while i < len(shorter):
        if shorter[i] != longer[i+x]:  # 4
            if x == 1:
                return False
            x += 1
        else:
            i += 1

    return True if x < 2 else False  # 5
1. I assume the caller ensures the precondition on the parameter sizes, but not which parameter is which. This simplifies the caller code, making this function a bit less readable. Anyway, the point of this check is ensuring that the name shorter refers to the string that is actually shorter.
2. Counter for the extra characters found in the longer string.
3. Loop variable on the shorter string. The longer one will be ahead by x positions.
4. I found a difference between the strings. Increase the difference counter; but if I had already seen a difference, I know the two strings are too different to be friends.
5. I return True whether x is 0 or 1 here. Zero means a case like "aa" vs. "aax": each comparison along the length of shorter succeeded, but longer has one character more. One means the difference is in the middle. More than one difference is trapped inside the while loop.

Setting up the graph

My first idea was to generate the social network graph, and then look in it for the friends of the requested nodes. Say that words is a dictionary; each word I get in input is pushed into it as a key, initializing its value with an empty set:
for test in file:
    words[test.rstrip('\n')] = set()
Then I call a utility function, check_friends(), to push each friend into its place - seen as a graph, the node is in the key, the edges are in the value.
def check_friends():
    for lhs in words.keys():  # 1
        for rhs in words.keys():
            if (len(lhs) == len(rhs) and is_swap(lhs, rhs) or  # 2
                    abs(len(lhs) - len(rhs)) == 1 and is_add(lhs, rhs)):  # 3
                words[lhs].add(rhs)
                words[rhs].add(lhs)
1. The infamous N x N loop: I check each word against all the other ones. Actually, I also let is_swap() detect when a word is tested against itself.
2. Same size: let is_swap() decide whether the words are friends.
3. One character of difference in length: ask is_add() about their friendship.

Getting the clique

Checking whether a given word is in the full social network is easy and fast, since it is stored in a dictionary. Then I apply a Breadth-First Search (BFS) to navigate the graph, collecting all the friends and friends of friends:
def solution(word):
    results = []  # 1
    visited = set()  # 2

    net = deque()  # 3
    if word not in words.keys():  # 4
        return 0
    net.append(word)  # 5
    visited.add(word)

    while net:  # 6
        friend = net.popleft()
        fofs = words[friend]
        for fof in fofs:  # 7
            if fof not in visited:
                net.append(fof)
                visited.add(fof)
        results.append(friend)  # 8

    return len(results)
1. Here I am going to push all the elements in the word's social network.
2. This is to avoid infinite looping. It is the equivalent of marking a node as visited.
3. In this queue I push the nodes in the personal net of the current node being checked.
4. It is not officially stated in the problem that the caller can ask only for a word in the list, so I added this check. In any case, it is cheap.
5. Initialize the algorithm before starting: the passed word is ready to be processed, and it is marked as visited.
6. Loop while there are nodes to be processed.
7. Push each unvisited friend of the current friend into the net, marking it as visited.
8. Push the current friend into the results list.

This approach works all right, and I'm quite happy with it. However, for the way the problem is conceived, it is too slow; we need to do something to improve performance. It is clear where the bottleneck is: generating the graph for the full social network is obviously very expensive. Do we really need to do it? Usually it would be a good idea, but here we know that we need to calculate the social network just for a handful of words. So, let's avoid doing it.

A slimmer algorithm

First change: words is not a dictionary anymore, but just a set. This means there is no value associated to each element; when I read a word, I just push it into the set, and that's it.
for test in file:
    words.add(test.rstrip('\n'))
Also, there is no need to call check_friends(); instead, I do almost the same job in a get_friends() function, but only for the relevant items:
def get_friends(item):
    result = []
    for word in words:
        if (len(item) == len(word) and is_swap(item, word) or
                abs(len(item) - len(word)) == 1 and is_add(item, word)):
            result.append(word)
    return result
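A quick way to exercise get_friends() on a toy word set (my own example, not the problem data):
words = {'aa', 'ab', 'aaa', 'aba', 'zzz'}
print(sorted(get_friends('aa')))  # expected: ['aaa', 'ab', 'aba']
Here 'ab' is a swap-friend, 'aaa' and 'aba' are add-friends, while 'zzz' and 'aa' itself are left out.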
The changes in solution() are minimal:
def solution(word):
    results = set()
    visited = set()

    net = deque()
    if word not in words:
        return 0
    net.append(word)
    visited.add(word)

    while net:
        friend = net.popleft()
        fofs = get_friends(friend)
        for fof in fofs:
            if fof not in visited:
                net.append(fof)
                visited.add(fof)
        results.add(friend)

    return len(results)
And now CodeEval happily accepts the code. I pushed both my python solutions to GitHub: the complete but slow one, and the sleeker one, tailored to the problem requirements.

CodeEval Type Ahead

Given the size of an n-gram in input, and its first n-1 components, we should write a function that returns the predicted possible last words, sorted by their calculated probability (from higher to lower) and then alphabetically. The solution assumes we generate our n-grams from the nursery rhyme Mary had a little lamb, which we can safely hard-code in our program, stripped of all non-words. More details on the CodeEval problem page.

I plan to write a python solution. So, first thing, I write a python test case:
def test_provided_1(self):  # 1
    self.assertEqual('lamb,0.375;teacher,0.250;children,0.125;eager,0.125;rule,0.125', solution(2, 'the'))

def test_lamb(self):
    self.assertEqual('at,0.200;its,0.200;love,0.200;was,0.200;you,0.200', solution(2, 'lamb'))

def test_the_lamb(self):
    self.assertEqual('love,0.333;was,0.333;you,0.333', solution(3, 'the lamb'))

def test_at(self):  # 2
    self.assertEqual('school,1.000', solution(2, 'at'))
1. The example given. Looking at the Mary & lamb rhyme, the possible resulting two-grams when the first word is 'the' are 'the lamb', 'the teacher', 'the children', 'the eager', 'the rule'. The most probable one, with 'lamb', has an estimated 0.375 chance. The less probable ones, all having the same chance, are presented in alphabetical order.
2. Just a single two-gram can be generated from 'at', so its probability is obviously 1. Notice the probability format: three decimals are always required.

The most boring part of the problem was converting the rhyme into a list of words, something like:
TEXT = ['Mary', 'had', 'a', 'little', 'lamb', 'its', 'fleece', 'was',
# ...
        'teacher', 'did', 'reply']

Running the CodeEval tester, I found out that I don't have to worry about case-insensitive comparison; upper and lower case are used as found in the rhyme.

Then I need a dictionary, where I will put all the n-grams of the rhyme:
n_grams = {}
I decided to be lazy. Actually, I'd say that in this case being lazy is the only sensible thing to do, if I don't want to create n-grams that are not required by the user. So I write a function that checks whether a given n-gram family has already been generated; if not, it is created and pushed into the dictionary:
def check_n_grams(n):
    if n in n_grams:  # 1
        return

    item = defaultdict(lambda: defaultdict(int))  # 2
    for i in range(len(TEXT) - n):  # 3
        outer = ' '.join(TEXT[i:i + n])  # 4
        intern = TEXT[i + n]
        item[outer][intern] += 1  # 5

    n_grams[n] = item
1. If the n-grams for the current n have already been generated, there is nothing to be done here.
2. Let's generate all the n-grams for the passed n. We need a bit of a perplexing data structure here, a dictionary of dictionaries (which is itself going to be pushed into a dictionary). Moreover, to simplify the following code a bit, I used defaultdicts. I can't pass defaultdict(int) directly to initialize the outer defaultdict, because that would be an already-built instance and not the callable factory it expects, so I use the trick of passing a lambda (which is a callable) that returns one.
3. Loop on all the words in TEXT, stopping a tad before the end; see below if you haven't already guessed why. This code fails spectacularly if the user has the insane, but legit, idea of asking for an extremely high n. For production code, a check on its size should be mandatory.
4. Generate the current n-gram, considering it divided in two parts. The first, stored as key in the outer dictionary, is built by joining n words starting from the current one; the second, the key in the intern dictionary, represents the last word of the n-gram. So, actually, I am storing in these maps all the possible (n+1)-grams.
5. Push the n-gram into the dictionary of dictionaries, setting the initial value to one, or increasing it if the combination is already there.
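To get a feel for the resulting structure, this is what I would expect for the bigrams starting with 'the', reconstructed from the probabilities in the first test above (0.375 = 3/8, and so on, out of 8 occurrences of 'the'):
check_n_grams(1)
assert n_grams[1]['the'] == {'lamb': 3, 'teacher': 2, 'children': 1, 'eager': 1, 'rule': 1}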

A large part of the job is done. The solution function does little more than call the utility function defined above and format the output.
def solution(n, given):
    n -= 1  # 1
    check_n_grams(n)  # 2

    target = n_grams[n][given]  # 3

    values = list(target.items())  # 4
    values.sort(key=lambda x: x[0])  # 5
    values.sort(key=lambda x: x[1], reverse=True)  # 6

    results = []  # 7
    population = sum(target.values())
    for value in values:
        results.append('{},{:.3f}'.format(value[0], value[1] / population))  # 8
    return ';'.join(results)
1. As seen in check_n_grams(), I generate in the n_grams dictionary the (n+1)-grams, splitting them into key and value on the internal dictionary. I felt that this trick made the code more readable; however, here I pay its price with this obscure decrement. Again, production code should perform some parameter checking to avoid runtime disasters.
2. Ensure the current n-grams have been generated.
3. Get the internal dictionary. For instance, if I am looking for the 2-grams for 'at', in target I expect to get a dictionary containing just one element, where the key is the last word in the n-gram and the value is its hit counter. In this case, 'school': 1.
4. I need to do some sorting now, so I convert the items of target to a list, a data type sporting a handy sort() method that is stable; in this way I can call it repeatedly to get the combined ordering.
5. First I sort values by the first component (index 0), representing the key of the internal dictionary.
6. Then I sort by the second component (index 1), the word counter, in reverse, so that the most seen comes first.
7. Format the data, put them in this temporary list, join its components and return them.
8. As required, the probability is formatted as a floating number with three decimals.
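The two-pass stable sort is a pattern worth remembering; a tiny isolated demonstration, on made-up values:
pairs = [('b', 2), ('a', 1), ('c', 2), ('d', 1)]
pairs.sort(key=lambda x: x[0])  # alphabetical: a, b, c, d
pairs.sort(key=lambda x: x[1], reverse=True)  # by count, descending; ties keep alphabetical order
print(pairs)  # [('b', 2), ('c', 2), ('a', 1), ('d', 1)]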

Full code pushed on GitHub, both the tester and the actual python script.
