18.5. Optimizing List Operations
The third step in the Soundex algorithm is eliminating consecutive duplicate digits. What's the best way to do this?
Here's the code we have so far, in soundex/stage2/soundex2c.py:
Here are the performance results for soundex2c.py:
The first thing to consider is whether it's efficient to check digits[-1] each time through the loop. Are list indexes expensive? Would we be better off maintaining the last digit in a separate
variable, and checking that instead?
To answer this question, here is soundex/stage3/soundex3a.py:
soundex3a.py does not run any faster than soundex2c.py, and may even be slightly slower (although it's not enough of a difference to say for sure):
Why isn't soundex3a.py faster? It turns out that list indexes in Python are extremely efficient. Repeatedly accessing digits2[-1] is no problem at all. On the other hand, manually maintaining the last seen digit in a separate variable means we have two variable assignments for each digit we're storing, which wipes out any small gains we might have gotten from eliminating
the list lookup.
Let's try something radically different. If it's possible to treat a string as a list of characters, it should be possible
to use a list comprehension to iterate through the list. The problem is, the code needs access to the previous character
in the list, and that's not easy to do with a straightforward list comprehension.
However, it is possible to create a list of index numbers using the built-in range() function, and use those index numbers to progressively search through the list and pull out each character that is different
from the previous character. That will give you a list of characters, and you can use the string method join() to reconstruct a string from that.
Here is soundex/stage3/soundex3b.py:
Is this faster? In a word, no.
It's possible that the techniques so far as have been “string-centric”. Python can convert a string into a list of characters with a single command: list('abc') returns ['a', 'b', 'c']. Furthermore, lists can be modified in place very quickly. Instead of incrementally building a new list (or string) out of the source string, why not move elements around
within a single list?
Here is soundex/stage3/soundex3c.py, which modifies a list in place to remove consecutive duplicate elements:
Is this faster than soundex3a.py or soundex3b.py? No, in fact it's the slowest method yet:
We haven't made any progress here at all, except to try and rule out several “clever” techniques. The fastest code we've seen so far was the original, most straightforward method (soundex2c.py). Sometimes it doesn't pay to be clever.
Example 18.5. Best Result So Far: soundex/stage2/soundex2c.py
import string, re
allChar = string.uppercase + string.lowercase
charToSoundex = string.maketrans(allChar, "91239129922455912623919292" * 2)
isOnlyChars = re.compile('^[A-Za-z]+$').search
def soundex(source):
if not isOnlyChars(source):
return "0000"
digits = source[0].upper() + source[1:].translate(charToSoundex)
digits2 = digits[0]
for d in digits[1:]:
if digits2[-1] != d:
digits2 += d
digits3 = re.sub('9', '', digits2)
while len(digits3) < 4:
digits3 += "0"
return digits3[:4]
if __name__ == '__main__':
from timeit import Timer
names = ('Woo', 'Pilgrim', 'Flingjingwaller')
for name in names:
statement = "soundex('%s')" % name
t = Timer(statement, "from __main__ import soundex")
print name.ljust(15), soundex(name), min(t.repeat())