January 30, 2009

Statement Of Content

Off a tip from a co-worker, I pointed my browser at the CS Distinguished Lecture Series 08/09 presentations at the University of Toronto's Knowledge Media Design Institute. If you have the chance, check out Dr. Raghavan's talk on Web Search. Coles Notes version: semantic web.

What does that mean? To the average computer user today, web search is a done deal: we have Google (or Yahoo, or, ugh, Microsoft Live Search). They scarcely notice, for example, Google's push for universal search. Yet that push shows that, even at current levels of quality, traditional page/click-driven search falls short of the goal: to understand and enable the user's intent.

Yes, Google will return flight results, weather, cinema listings, and maps. Yes, it serves up definitions, exchange rates, and simple computations. The problem lies in implementation: all of these are special exceptions, extra branches in a behemoth decision tree that is roughly equivalent to this:

Does this look like a location query? No? Does it look like a request for movie times? No? Does it...

This doesn't generalize well, for obvious reasons: the search engine knows nothing whatsoever about what I'm trying to accomplish. If it has a model of user intent at all, it's stunningly rudimentary. Sure, there's room for optimizations, like promoting results that resemble ones I've clicked on before - but these are optimizations on top of a fundamentally limited model.

So how do we form such a semantic web or, as Dr. Raghavan puts it, a web of objects? That remains an open question. On the other hand, thanks to the success of Google et al we now have massive datasets of user click patterns, queries, times spent on various pages - the list goes on. Perhaps we can harness that data to build this new semantic, intent-driven, user-centered web on top of the content-driven web we have now. In fact, I'd be truly surprised if there's a single big-name search engine out there that hasn't been actively researching this for years. If there is, they are certainly doomed to obsolescence.

January 27, 2009

A Good Head-bash-ing

Recursive functions in bash. Everyone seems to have the same advice: use local variables, stacks hacked together from bash arrays, or exitcodes. Ugh. Why not echo the result to standard output, like just about every other UNIX program?

ackermann() {
  local M=$1 N=$2
  ((M==0)) && echo $((N+1)) && return
  ((N==0)) && ackermann $((M-1)) 1 && return
  ackermann $((M-1)) `ackermann $M $((N-1))`
}

This is somewhat contrived and more than a little inefficient. Not only that, but it's also incorrect: bash, like most "real" programming languages, suffers from integer overflow. Or does it?

function compute {
  # bc wraps long results with backslash-newlines; the sed loop joins them.
  echo "$1" | bc | sed ':start /^.*$/N;s/\\\n//g; t start'
}

function ackermann {
  local M=$1 N=$2
  ((M==0)) && compute "${N}+1" && return
  ((M==1)) && compute "${N}+2" && return
  ((M==2)) && compute "2*${N}+3" && return
  ((M==3)) && compute "2^(${N}+3)-3" && return
  ((N==0)) && ackermann $((M-1)) 1 && return
  ackermann $((M-1)) `ackermann $M $((N-1))`
}


Of course, most useful recursive bash scripts (like make!) don't abuse standard output in this horribly screw-brained way. Instead, they have side-effects (like compilation!) and reserve output for things like status or error messages.  But that's not the point. The above snippets show that you can deal with recursion (and extended-precision arithmetic!) in bash in a relatively clean manner. (Some might object that, as a CS student, my definition of "relatively clean" is skewed. I'm declining comment on that one.)

(Then again:

./ackermann3.sh 4 3
Runtime error (func=(main), adr=19739): exponent too large in raise

I suppose there are limits to everything.)


January 26, 2009

Seven Habits Of The Status Quo

I was just reading through the Seven Habits of Highly Effective Programmers. All was well until I hit this gem:

There is Such a Thing as a Stupid Question

Really, there are lots of stupid questions. [...] Asking for clarification about a specification shows you know how to find and read the spec and your ability to detect ambiguities. [...] Let everyone know that you read the documentation and googled the subject.

This seems like a perfectly reasonable statement. If some script kiddie n00b can't be bothered to RTFM, they aren't worth the time of day, right?


Consider this: how many programmer-hours are wasted every day reading man pages, library documentation, language references, and so on? To paraphrase: ask a stupid question once, shame on you. Ask a stupid question millions of times - well, you can't. That's the point. If I had a dollar (yes, even a Canadian dollar) for every time someone asked a question about the intricacies of UNIX find...

The above excerpt says less about effective programming than it does about the programmer mindset: developers first, users second. Let's collectively ask ourselves a question. What if, instead of rejecting these supposedly dull-minded inquiries, we tabulated them and created a centralized, easily searchable FAQ? This forks a handful of stones and slays an entire flock of birds with them: less experienced developers would stop asking stupid questions, and the seasoned soi-disant experts could respond effectively to any stray requests in constant time by directing the wayward souls to said FAQ.

Until that happens, happy man-page reading!

Gripe Of The Day

A simple one, really: there's no const version of std::map::operator[].

Now, I understand the reasoning here. After all:

#include <map>
#include <iostream>
using namespace std;

class A {
  map<int, int> m;
public:
  void foo(int i) const { cout << m[i] << endl; } // what happens here?
};

int main(int argc, char* argv[]) {
  A a;
  a.foo(42);
  return 0;
}
Our mythical const version of operator[] can't return a reference, as there's no object to refer to. It can't return a default value, either; there's no guarantee that the map's value type will have a default constructor (though in this case, the type int defaults to 0).

On the other hand, why not simply do as the const overload of std::vector::operator[] does, and throw the programmer in the deep end (read: undefined behaviour) if they access an invalid index? Thoughts? Comments? Heated counterarguments?

January 25, 2009

Level 3, Almost

93 problems and counting, including two of the 25 most recent (which means I get a nice red-coloured progress bar under Current Performance!) I'm shooting for 100 by the end of the week - enough for Level 3 status and a permanent place on the Project Euler scoreboard.

Aside from that, I'm currently optimizing my first project at work. After all, a new statistical machine translation feature function isn't much use unless you can run it across a cluster on corpora of at least 10 million phrase pairs within your lifetime. (In my defense, the goal of the first implementation was to get something working; judging from the unit tests, this objective seems to have been met.) Following discussion with co-workers at NRC-IIT, some intense whiteboard sessions, and profuse streams of invective launched (unfairly, perhaps) at valgrind, I'm confident that it can easily be improved to run within a day.

(Now that would be a good use of Python generators - (theoretically) infinite streams of invective, scraped from the bowels of our beloved Internet via httplib. Some would argue that the Internet's capacity to provide said invective is necessarily finite. I (and Bucket - NSFW) respectfully disagree.)

January 20, 2009

Sonic Boom!

I happened upon an interesting web app today - The Next Big Sound - which lets users play the role of record mogul and "discover" talented independent bands. More proof, if needed, that the screw-the-customer-at-all-costs model currently pursued by the RIAA and its ilk is about to draw its terminal breath. Of course, I have no problem with the RIAA running itself into the ground, as it seems hell-bent on doing. I just wish it would happen faster, and that the courts would stop upholding a sorely antiquated and essentially unenforceable copyright model.

January 19, 2009

Second Week, Here We Come

I'm one week into my research internship here with the National Research Council's translation group in Gatineau. Like everything else even remotely connected to the federal government here in Canada, this is a bilingual environment; indeed, bilingualism is pervasive throughout the Ottawa-Hull region, and for obvious reasons. As such, I've been making a concerted effort to dust off my somewhat rusty French, left unused since my travels in Europe some four or so years ago. To finish off this post, I'll leave you with two tidbits of linguistic culture:
  • pourriel: spam, a portmanteau of pourrir (to rot) and courriel (email, itself an abbreviated admixture of courrier and électronique).
  • The Québécois inherit from their Catholic past a legacy of religious-themed curse words. This famed stereotype has even reached far-off Mexico, where they are known colloquially as los tabernacos.
Other than that: I finally have a car, which is more or less essential if you want to do anything other than admire the vast snowbanks here in Gatineau. Google Maps can attest to the remoteness of my present location: searches for bars, supermarkets, and banks near here all turn up nothing within nearly two kilometres.

January 12, 2009

Photographic Evidence!

Speaking of Japan, the pictures from that trip are now available for public consumption. Enjoy!

Another Term

A mere 36 hours after my hectic return from Japan, I'm kicking off the new term (and this blog!) in the ground-floor lounge of the National Research Council's Language Technologies Research Centre based in Gatineau. Unlike my previous attempts at blogging, I'm going to aim for a certain base level of regularity in my updates here - so keep posted!