
March 31, 2009

Who Says You Can't Swear in C++?

I currently have the pleasure of using Och's YASMET tool for maximum entropy model optimization. (Quick background primer: maximum entropy is really a simple concept - assume nothing about the data beyond what you're given. In slightly more technical terms, you want the most uniform distribution that fits the constraints.) Looking at the source code is akin to reading one too many JAPH snippets or sifting through Google's obfuscated JavaScript - it's an experience both awe-inspiring and twitch-inducing. Of course, it's GNU GPL'd, so anyone who can navigate the mess of two-letter variable names, magic numbers, and minimally-whitespaced text is free to modify it! (A re-indenting utility might help here - though a quick run against the K&R style expands it to 362 lines and reveals the terrifying depth of the conditional statements lurking within. Fair warning.)
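For the curious, here's the textbook formulation (nothing YASMET-specific): among all distributions matching the observed feature expectations, pick the one with maximum entropy. In LaTeX-ish notation:

    \max_p \; H(p) = -\sum_x p(x) \log p(x)
    \quad \text{subject to} \quad
    \sum_x p(x) f_i(x) = \tilde{E}[f_i] \ \ \forall i,
    \qquad \sum_x p(x) = 1

The solution always takes the log-linear form p(x) = exp(\sum_i \lambda_i f_i(x)) / Z(\lambda); estimating the weights \lambda_i is the optimization a trainer like YASMET performs.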

March 20, 2009

Yay for Efficiency!

(Caveat: This post is somewhat C++-centric, but its main points translate to any language.)

I've been continually optimizing my first project so that I can run ever-larger experiments. The project boils down to two main problems:
  • searching a list of high-dimensional sparse vectors for nearest-neighbour pairs; and
  • keeping track of the sets generated by merging those pairs.
In the process, I've cut the runtime for runs against the GALE 2008 Chinese-English corpus from something like millions of years to a couple of hours.

(If the preceding part made absolutely no sense, you might want to stop reading now.)

Lesson The, er, Zeroth. Take advantage of sparse data.

This should be obvious - if I can ignore the holes in my data rather than trudging through each one, it can only help efficiency.
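Concretely, here's a minimal sketch of what I mean (the type and function names are illustrative, not the project's actual code): store only the non-zero coordinates, and operations cost time proportional to the number of non-zeros rather than the full dimensionality.

    #include <map>

    // Illustrative sparse vector: coordinate index -> non-zero value.
    // Anything absent from the map is implicitly zero.
    typedef std::map<int, double> SparseVec;

    // Squared L2 norm: only stored entries contribute, so the cost is
    // proportional to the number of non-zeros, not the dimensionality.
    double norm_squared(const SparseVec& v) {
        double sum = 0.0;
        for (SparseVec::const_iterator it = v.begin(); it != v.end(); ++it)
            sum += it->second * it->second;
        return sum;
    }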

Lesson The First. boost::tr1::unordered_map is not always preferable to map.

A map beats an unordered_map hands-down at iteration, and its ordering property permits efficient linear sweeps for a variety of iteration-based algorithms - sparse-vector arithmetic, for example. In some cases, a sorted array is better still.
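To see why the ordering matters, here's a sketch of a sparse dot product as a linear merge of two sorted sequences - something an unordered container can't give you without sorting first (same illustrative SparseVec as above):

    #include <map>

    typedef std::map<int, double> SparseVec;  // as in the earlier sketch

    // Because map iterates in key order, a dot product is a linear merge:
    // advance whichever side has the smaller coordinate, multiply on a match.
    double dot(const SparseVec& a, const SparseVec& b) {
        double sum = 0.0;
        SparseVec::const_iterator i = a.begin(), j = b.begin();
        while (i != a.end() && j != b.end()) {
            if (i->first < j->first)      ++i;
            else if (j->first < i->first) ++j;
            else { sum += i->second * j->second; ++i; ++j; }
        }
        return sum;
    }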

Lesson The Second. Use the structure that fits the problem best.

The second problem above is a classic use case for the disjoint sets structure. Using anything else is a recipe for gross inefficiency - you can't beat the amortized inverse Ackermann cost for all operations.
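For reference, a bare-bones sketch of the structure with the two standard tricks - path compression and union by rank - that yield the near-constant amortized bound (this is the generic textbook version, not my project's code):

    #include <algorithm>  // std::swap
    #include <vector>

    // Disjoint sets over elements 0..n-1.
    struct DisjointSets {
        std::vector<int> parent, rank_;
        explicit DisjointSets(int n) : parent(n), rank_(n, 0) {
            for (int i = 0; i < n; ++i) parent[i] = i;
        }
        // Find with path compression: point every visited node at the root.
        int find(int x) {
            return parent[x] == x ? x : (parent[x] = find(parent[x]));
        }
        // Union by rank: hang the shallower tree under the deeper one.
        void merge(int a, int b) {
            a = find(a); b = find(b);
            if (a == b) return;
            if (rank_[a] < rank_[b]) std::swap(a, b);
            parent[b] = a;
            if (rank_[a] == rank_[b]) ++rank_[a];
        }
    };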

Lesson The Third. You can make file output faster.

std::endl flushes the write buffer - unless you need the output immediately, use "\n" instead. Dump large structures in compact binary formats where possible. If you go this route, test that your read/write operations undo each other; this sanity check prevents most nasty surprises. If you're really in a pinch, use <cstdio> instead of <iostream>.
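Here's a sketch of the binary-dump idea for a vector of doubles, round-trip check included (illustrative names; a production format would also pin down endianness and integer widths):

    #include <fstream>
    #include <vector>

    // Write a length prefix, then the raw doubles.
    void write_vec(const char* path, const std::vector<double>& v) {
        std::ofstream out(path, std::ios::binary);
        size_t n = v.size();
        out.write(reinterpret_cast<const char*>(&n), sizeof(n));
        if (n) out.write(reinterpret_cast<const char*>(&v[0]),
                         n * sizeof(double));
    }

    // Read back the same layout.
    std::vector<double> read_vec(const char* path) {
        std::ifstream in(path, std::ios::binary);
        size_t n = 0;
        in.read(reinterpret_cast<char*>(&n), sizeof(n));
        std::vector<double> v(n);
        if (n) in.read(reinterpret_cast<char*>(&v[0]), n * sizeof(double));
        return v;
    }

    // The sanity check from above: writing then reading must be a no-op,
    // i.e. assert(read_vec(path) == original) after write_vec(path, original).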

Lesson The Fourth. Consider using heuristics.

Do you really need an exact answer? Most of the time - especially when dealing with massive datasets - the answer is no. Look for small assumptions or shortcuts that hold for your dataset. For instance, in the nearest-pair search, I only consider pairs of vectors that share at least one non-zero coordinate. This meshes nicely with the sparse representation and lets me search a sub-quadratic number of pairs.
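In other words, an inverted index from coordinates to the vectors that are non-zero there; candidate pairs come from co-occurrence in that index instead of from all O(n^2) combinations. A rough sketch (again illustrative, with a std::set deduplicating pairs that share several coordinates):

    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    typedef std::map<int, double> SparseVec;  // as in the earlier sketches

    // Only vectors sharing a non-zero coordinate can pair up, so candidates
    // come from co-occurrence in an inverted index (coordinate -> vector ids)
    // rather than from all O(n^2) combinations.
    std::set<std::pair<int, int> >
    candidate_pairs(const std::vector<SparseVec>& vecs) {
        std::map<int, std::vector<int> > index;
        for (size_t id = 0; id < vecs.size(); ++id)
            for (SparseVec::const_iterator it = vecs[id].begin();
                 it != vecs[id].end(); ++it)
                index[it->first].push_back(static_cast<int>(id));

        std::set<std::pair<int, int> > pairs;  // the set deduplicates pairs
        for (std::map<int, std::vector<int> >::const_iterator c = index.begin();
             c != index.end(); ++c)
            for (size_t i = 0; i < c->second.size(); ++i)
                for (size_t j = i + 1; j < c->second.size(); ++j)
                    pairs.insert(std::make_pair(c->second[i], c->second[j]));
        return pairs;
    }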

January 30, 2009

Statement Of Content

On a tip from a co-worker, I pointed my browser at the CS Distinguished Lecture Series 08/09 presentations at the University of Toronto's Knowledge Media Design Institute. If you have the chance, check out Dr. Raghavan's talk on Web Search. Coles Notes version: semantic web.

What does that mean? To the average computer user today, web search is a done deal: we have Google (or Yahoo, or (ugh) Microsoft Live Search). They scarcely notice, for example, Google's push for universal search. Yet that push shows that, even at current levels of quality, traditional page/click-driven search falls short of the real goal: understanding and enabling the user's intent.

Yes, Google will return flight results, weather, cinema listings, and maps. Yes, it serves up definitions, exchange rates, and simple computations. The problem lies in the implementation: each of these is a special case, an extra branch in a behemoth decision tree roughly equivalent to this:

Does this look like a location query? No? Does it look like a request for movie times? No? Does it...

This doesn't generalize well, for obvious reasons: the search engine knows nothing whatsoever about what I'm trying to accomplish. If it has a model for user intent, it's stunningly rudimentary. Sure, there's room for optimizations, like promoting results that look like results I've clicked on before - but these are optimizations on top of a fundamentally limited model.

So how do we build such a semantic web or, as Dr. Raghavan puts it, a web of objects? That remains an open question. On the other hand, thanks to the success of Google et al., we now have massive datasets of user click patterns, queries, time spent on various pages - the list goes on. Perhaps we can harness that data to build this new semantic, intent-driven, user-centered web on top of the content-driven web we have now. In fact, I'd be truly surprised if any big-name search engine hasn't been actively researching this for years; one that hasn't is surely doomed to obsolescence.