I've been continuously optimizing my first project so that I can run ever-larger experiments. The project boils down to two main problems:

- searching a list of high-dimensional sparse vectors for nearest-neighbour pairs; and
- keeping track of the sets generated by merging those pairs.

(If the preceding part made absolutely no sense, you might want to stop reading now.)

**Lesson The, er, Zeroth.**Take advantage of sparse data.

This should be obvious - if I can ignore the holes in my data rather than trudging through each one, it can only help efficiency.

**Lesson The First.**boost::tr1::unordered_map is not always preferable to map.

The latter beats the former hands-down in iteration, and the ordering property of a map permits more efficient linear sweeps for a variety of iteration-based algorithms - sparse-vector arithmetic, for example. In some cases, a sorted array might be better still.

**Lesson The Second.**Use the structure that fits the problem best.

The second problem above is a classic use case for the disjoint sets structure. Using anything else is a recipe for gross inefficiency - you can't beat the amortized inverse Ackermann cost for all operations.

**Lesson The Third.**You

*can*make file output faster.

std::endl flushes the write buffer - unless you need the output immediately, use "\n" instead. Dump large structures in compact binary formats where possible. If you go this route, test that your read/write operations undo each other; this sanity check prevents most nasty surprises. If you're really in a pinch, use <cstdio> instead of <iostream>.

**Lesson The Fourth.**Consider using heuristics.

Do you really

*need*an exact answer? Most of the time - especially when dealing with massive datasets - the answer is no. Look for small assumptions or shortcuts that work for your dataset. For instance, when performing the nearest-pair search, I only consider pairs that have non-zeros in the same coordinate at least once. This works well with the sparse representation, allowing me to search over a sub-quadratic number of pairs.

## No comments:

## Post a Comment