This problem should look suspiciously similar to the previous one, which asked to validate Roman numerals. In fact, I took it from the same source: Dive into Python 3, chapter 5, which is about regular expressions.
So, the problem is defining a good pattern that matches the expected input. Here is a first try:

    import re

    pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

I asked the compile() function in the Python re library to turn the expression into a pattern object, so that I can call its search() method, as you can see below.
Notice that I passed compile() a "raw" string, marked by an 'r' before the opening quote. It is a useful trick: we don't have to double each backslash to have it treated as a literal character rather than as the start of a string escape sequence.
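A minimal sketch of what the raw string saves us, assuming nothing beyond the standard library. The names raw and escaped are mine, used only for illustration:

```python
# The same pattern written twice: once as a raw string,
# once as a normal string with every backslash doubled.
raw = r'^(\d{3})-(\d{3})-(\d{4})$'
escaped = '^(\\d{3})-(\\d{3})-(\\d{4})$'

# Both literals produce exactly the same sequence of characters.
print(raw == escaped)

# In both spellings, backslash-d is two characters: a backslash and a 'd'.
print(len(r'\d'), len('\\d'))
```

The raw version is simply easier to read, and that matters more and more as the pattern grows.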
Then I say that my number should use all the characters in the string, starting from the beginning, as the caret '^' anchor specifies, to the end, given the dollar '$' sign.
In the pattern I define three groups of digits, each delimited by round brackets. I'm using the '\d' shortcut to mean any digit. The curly brackets after it give the number of repetitions of that element: exactly three digits in the first and second groups, four in the third.
This pattern works fine for simple numbers, as I verified in a test case:

    result = pattern.search('800-555-1212')
    self.assertIsNotNone(result)  # 1
    groups = result.groups()
    self.assertEqual(3, len(groups))  # 2
    self.assertEqual('800', groups[0])
    self.assertEqual('555', groups[1])
    self.assertEqual('1212', groups[2])

1. Search succeeds.
2. In the result there are three groups, as expected, and each group contains the expected block of the phone number.
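The test fragment uses self, so it has to live inside a test case class. Here is one possible way to wrap it, as a sketch: the class and method names are hypothetical, not necessarily the ones in my repository.

```python
import re
import unittest

# The first pattern: three groups of digits separated by dashes.
pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')


class PhoneNumberTest(unittest.TestCase):  # hypothetical class name
    def test_simple_number(self):
        result = pattern.search('800-555-1212')
        # The search should succeed...
        self.assertIsNotNone(result)
        # ...and give back the three blocks of the number, in order.
        self.assertEqual(('800', '555', '1212'), result.groups())
```

Saved in a file, it can be run with the standard runner, for instance via python -m unittest.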
However, this pattern is too limited. We want a fourth, optional group representing the extension; we need to accept any kind of separator, and even no separator at all; and we should expect some extra leading characters that have to be skipped.
The last issue is easily solved: it's enough to get rid of the caret at the beginning of the pattern. In this way the number is not forced to start at the beginning of the string.
Adding the extension group is not a problem at all. We should just be prepared to check it, expecting an empty string if it is not present.
Separators require deciding what could actually be accepted in that position. Let's be permissive and accept anything that is not a digit.
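To see concretely where the first pattern falls short, here is a small sketch that runs it against inputs of the kinds listed above. The sample strings and the names first_pattern, samples, and rejected are my own choices for illustration:

```python
import re

# The first, strict pattern: anchored at both ends, dashes only.
first_pattern = re.compile(r'^(\d{3})-(\d{3})-(\d{4})$')

samples = [
    '800-555-1212-1234',  # has an extension
    '800.555.1212',       # dots instead of dashes
    '8005551212',         # no separators at all
    '1-800-555-1212',     # extra leading characters
]

# Collect every sample for which search() finds no match.
rejected = [s for s in samples if first_pattern.search(s) is None]
print(rejected == samples)
```

Every sample ends up in the rejected list, one for each limitation we want to remove.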
These considerations lead to a new pattern:

    pattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')

Notice that the '^' has disappeared and the literal dash has turned into '\D*'. The uppercase '\D' means any character that is not a digit, just as the lowercase '\d' means any digit, and the star '*' means repeated zero or more times.
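As a quick check, this sketch applies the relaxed pattern to a few inputs. The sample strings and the helper name split_phone are hypothetical, chosen only for this example:

```python
import re

# The relaxed pattern: no leading anchor, any non-digits as separators,
# and an optional extension as the fourth group.
pattern = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$')


def split_phone(text):
    """Return the four groups as a tuple, or None if there is no match."""
    match = pattern.search(text)
    return match.groups() if match else None


print(split_phone('800-555-1212'))        # ('800', '555', '1212', '')
print(split_phone('800-555-1212-1234'))   # ('800', '555', '1212', '1234')
print(split_phone('8005551212'))          # ('800', '555', '1212', '')
print(split_phone('800.555.1212 x1234'))  # ('800', '555', '1212', '1234')
print(split_phone('1-800-555-1212'))      # ('800', '555', '1212', '')
```

When the extension is missing, the fourth group is an empty string, not None, because '\d*' is happy to match zero digits.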
I have written a few tests that I pushed to GitHub, and I am quite satisfied with the result.