March 20, 2009

Yay for Efficiency!

(Caveat: This post is somewhat C++-centric, but its main points translate to any language.)

I've been continuously optimizing my first project so that I can run ever-larger experiments. The project boils down to two main problems:
  • searching a list of high-dimensional sparse vectors for nearest-neighbour pairs; and
  • keeping track of the sets generated by merging those pairs.
In the process, I've cut the runtime for runs against the GALE 2008 Chinese-English corpus from something like millions of years to a couple of hours.

(If the preceding part made absolutely no sense, you might want to stop reading now.)

Lesson The, er, Zeroth. Take advantage of sparse data.

This should be obvious - if I can ignore the holes in my data rather than trudging through each one, it can only help efficiency.

Lesson The First. boost::tr1::unordered_map is not always preferable to map.

The latter beats the former hands-down in iteration, and the ordering property of a map permits more efficient linear sweeps for a variety of iteration-based algorithms - sparse-vector arithmetic, for example. In some cases, a sorted array might be better still.

Lesson The Second. Use the structure that fits the problem best.

The second problem above is a classic use case for the disjoint sets structure. Using anything else is a recipe for gross inefficiency - you can't beat the amortized inverse Ackermann cost for all operations.

Lesson The Third. You can make file output faster.

std::endl flushes the write buffer - unless you need the output immediately, use "\n" instead. Dump large structures in compact binary formats where possible. If you go this route, test that your read/write operations undo each other; this sanity check prevents most nasty surprises. If you're really in a pinch, use <cstdio> instead of <iostream>.

Lesson The Fourth. Consider using heuristics.

Do you really need an exact answer? Most of the time - especially when dealing with massive datasets - the answer is no. Look for small assumptions or shortcuts that work for your dataset. For instance, when performing the nearest-pair search, I only consider pairs of vectors that share at least one non-zero coordinate. This meshes nicely with the sparse representation, letting me search over a sub-quadratic number of pairs.
