February 1, 2009

So You Think You Can Sort

Selection sort. Insertion sort. Quicksort. Mergesort. All are standard fare in your average intro-level CS class; the first two are inefficient but conceptually simple, whereas the last two are decent general sorting algorithms. This is all well and good, but no one uses them as-is for performance-critical applications:
  • Often, a sort is not required; what you really want in most large-scale cases are the top k results for some relatively small k.
  • Real-world data has patterns that general algorithms like your standard library quicksort fail to exploit.
  • Comparisons and swaps aren't always cheap.
For you C++ programmers, the first point means little more than the difference between sort and partial_sort. The second point, however, is slightly more subtle.
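
In case you haven't bumped into partial_sort before, here's a tiny sketch of the difference (the scores and the choice of k = 3 are made up, of course). sort orders everything in O(n log n) comparisons; partial_sort only promises the first k elements, in order, for roughly O(n log k):

    #include <algorithm>
    #include <cstddef>
    #include <functional>
    #include <iostream>
    #include <vector>

    int main() {
        // Hypothetical data: we only care about the best three scores.
        std::vector<int> scores = {42, 7, 93, 15, 68, 27, 81, 3, 56, 90};
        const std::size_t k = 3;

        // Place just the top k elements, in descending order, at the
        // front of the range; the rest is left in unspecified order.
        std::partial_sort(scores.begin(), scores.begin() + k, scores.end(),
                          std::greater<int>());

        for (std::size_t i = 0; i < k; ++i)
            std::cout << scores[i] << ' ';   // prints: 93 90 81
        std::cout << '\n';
    }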

As a simple example, suppose you have a billion integers to sort, but you know that each one is between 1 and 100. With that restriction alone, you can easily sort them in O(n) time (there's a counting-sort sketch after the list below). Other examples:
  • The data are k-sorted (i.e., A[i] >= A[i-k]).
  • The data aren't quite sorted, but they are unique; furthermore, abs(A[i] - A[i-1]) < k.
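
Back to the 1-to-100 example: the trick is a counting sort. Something along these lines, assuming the values really do stay in range:

    #include <array>
    #include <cstddef>
    #include <vector>

    // A minimal counting-sort sketch. It assumes every value lies in
    // [1, 100]: one pass tallies how often each value occurs, a second
    // pass rewrites the array in order. Total work is O(n) with a fixed
    // 101-entry table, no matter how large n gets.
    void counting_sort(std::vector<int>& a) {
        std::array<std::size_t, 101> counts{};   // counts[v] = occurrences of v
        for (int v : a)
            ++counts[v];

        std::size_t out = 0;
        for (int v = 1; v <= 100; ++v)
            for (std::size_t c = 0; c < counts[v]; ++c)
                a[out++] = v;
    }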
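As for the k-sorted case, one way to exploit the pattern is a k-way merge; here's a sketch, assuming the invariant from the first bullet holds (the function name is mine):

    #include <cstddef>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    // If A[i] >= A[i-k] for all i, the array is really k sorted
    // subsequences (indices 0, k, 2k, ...; 1, k+1, ...; and so on)
    // interleaved. A k-way merge driven by a min-heap of the k current
    // heads sorts it with O(n log k) comparisons instead of O(n log n).
    std::vector<int> merge_k_sorted(const std::vector<int>& a, std::size_t k) {
        using Entry = std::pair<int, std::size_t>;   // (value, position in a)
        std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heads;

        // Seed the heap with the first element of each subsequence.
        for (std::size_t j = 0; j < k && j < a.size(); ++j)
            heads.push({a[j], j});

        std::vector<int> out;
        out.reserve(a.size());
        while (!heads.empty()) {
            auto [value, pos] = heads.top();
            heads.pop();
            out.push_back(value);
            if (pos + k < a.size())                  // advance within that subsequence
                heads.push({a[pos + k], pos + k});
        }
        return out;
    }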
The third point might seem like a dubious proposition, especially that part about swaps. How can swaps be expensive? If I want to swap complex structures, I use pointers. Simple, right?

A while ago, I wrote a DHTML table sorter (for fun, naturally!). For reference, here's a standard quicksort implementation. Ignoring the horrendous colour scheme for a moment, why is the first one so much faster? Simple: it avoids swapping DOM elements. (There's a reason it's called permutation quicksort!)
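
The original sorter is DHTML, but the idea translates anywhere. A rough C++ sketch of it, using std::sort as the driver rather than a hand-rolled quicksort (the Row type and its fields are just stand-ins for whatever heavy objects you're sorting):

    #include <algorithm>
    #include <cstddef>
    #include <numeric>
    #include <string>
    #include <utility>
    #include <vector>

    // Hypothetical "heavy" row type standing in for a DOM table row: the
    // sort key is cheap to compare, but moving the row itself is costly.
    struct Row {
        int key;
        std::string payload;   // imagine a large blob or a live DOM node here
    };

    // Sort a permutation of indices rather than the rows themselves, then
    // apply that permutation once. Comparisons touch only the keys, and
    // each heavy row is moved exactly once: O(n) moves, in exchange for a
    // linear amount of extra memory (the index array plus the output array).
    std::vector<Row> permutation_sort(std::vector<Row> rows) {
        std::vector<std::size_t> perm(rows.size());
        std::iota(perm.begin(), perm.end(), 0);   // identity permutation 0, 1, 2, ...

        std::sort(perm.begin(), perm.end(),
                  [&](std::size_t i, std::size_t j) { return rows[i].key < rows[j].key; });

        std::vector<Row> sorted;
        sorted.reserve(rows.size());
        for (std::size_t i : perm)
            sorted.push_back(std::move(rows[i]));   // one move per row
        return sorted;
    }

The comparator only ever looks at cheap keys; the expensive objects move exactly once, at the very end.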

Fine. So we can guarantee O(n) swaps in exchange for linear memory overhead. How do you limit comparisons? Remember the second point: notice patterns.
