Showing posts with label NRC. Show all posts

March 20, 2009

Yay for Efficiency!

(Caveat: This post is somewhat C++-centric, but its main points translate to any language.)

I've been continuously optimizing my first project so that I can run ever-larger experiments. The project boils down to two main problems:
  • searching a list of high-dimensional sparse vectors for nearest-neighbour pairs; and
  • keeping track of the sets generated by merging those pairs.
In the process, I've cut the runtime for runs against the GALE 2008 Chinese-English corpus from something like millions of years to a couple of hours.

(If the preceding part made absolutely no sense, you might want to stop reading now.)

Lesson The, er, Zeroth. Take advantage of sparse data.

This should be obvious - if I can ignore the holes in my data rather than trudging through each one, it can only help efficiency.

Lesson The First. boost::tr1::unordered_map is not always preferable to map.

A map beats an unordered_map hands-down at iteration, and its ordering property permits more efficient linear sweeps for a variety of iteration-based algorithms - sparse-vector arithmetic, for example. In some cases, a sorted array might be better still.

Lesson The Second. Use the structure that fits the problem best.

The second problem above is a classic use case for the disjoint-sets (union-find) structure. Using anything else is a recipe for gross inefficiency - you can't beat its amortized inverse-Ackermann cost per operation.

Lesson The Third. You can make file output faster.

std::endl flushes the write buffer - unless you need the output immediately, use "\n" instead. Dump large structures in compact binary formats where possible. If you go this route, test that your read/write operations undo each other; this sanity check prevents most nasty surprises. If you're really in a pinch, use <cstdio> instead of <iostream>.

Lesson The Fourth. Consider using heuristics.

Do you really need an exact answer? Most of the time - especially when dealing with massive datasets - the answer is no. Look for small assumptions or shortcuts that work for your dataset. For instance, when performing the nearest-pair search, I only consider pairs of vectors that share at least one nonzero coordinate. This works well with the sparse representation, allowing me to search over a sub-quadratic number of pairs.

January 19, 2009

Second Week, Here We Come

I'm one week into my research internship here with the National Research Council's translation group in Gatineau. Like everything else even remotely connected to the federal government here in Canada, this is a bilingual environment; indeed, bilingualism is pervasive throughout the Ottawa-Hull region, and for obvious reasons. As such, I've been making a concerted effort to dust off my somewhat rusty French, left unused since my travels in Europe some four or so years ago. To finish off this post, I'll leave you with two tidbits of linguistic culture:
  • pourriel: spam, from a portmanteau of pourrir (to rot) and courriel (email, itself an abbreviated admixture of courrier and électronique).
  • The Québécois inherit from their Catholic past a legacy of religious-themed curse words. This famed stereotype has even reached far-off Mexico, where Quebecers are colloquially known as los tabernacos.
Other than that: I finally have a car, which is more or less essential if you want to do anything other than admire the vast snowbanks here in Gatineau. Google Maps can attest to the remoteness of my present location: searches for bars, supermarkets, and banks near here all turn up nothing within nearly two kilometres.

January 12, 2009

Another Term

A mere 36 hours after my hectic return from Japan, I'm kicking off the new term (and this blog!) in the ground-floor lounge of the National Research Council's Language Technologies Research Centre based in Gatineau. Unlike my previous attempts at blogging, I'm going to aim for a certain base level of regularity in my updates here - so keep posted!