Showing posts with label bash. Show all posts

September 3, 2009

Where Two Files Meet

(Note: I promise that I'll post something more generally interesting before this day draws to a close, for non-technical values of generally interesting.)

Let's say that you have two large unsorted files, each of which is essentially a long list of strings, and you want to find their intersection. There's the simple brute-force way:

while IFS= read -r line; do grep -Fx -- "$line" file2; done < file1

There's a problem - this is quadratic, since every line of file1 triggers a full scan of file2! Fortunately, we can cut this down pretty easily by sorting first, the old time-memory tradeoff:

sort -u file1 > file1.sorted
sort -u file2 > file2.sorted
sort -m file1.sorted file2.sorted > combined
uniq -d combined
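
(For what it's worth, the same sort-first idea can also be sketched with comm, which walks two sorted inputs in step and can print only the lines they share. The function name intersect is made up for illustration, and the sketch assumes bash, for the process substitution, and that you want each shared line reported once.)

```shell
#!/bin/bash
# intersect FILE1 FILE2: print the lines that appear in both files, once each.
# (Hypothetical helper name; not from the original post.)
intersect() {
    # LC_ALL=C gives sort and comm the same byte-wise ordering.
    # sort -u drops duplicates within each file; comm -12 suppresses the
    # lines unique to either input, leaving only the common ones.
    LC_ALL=C comm -12 <(LC_ALL=C sort -u -- "$1") <(LC_ALL=C sort -u -- "$2")
}
```

Calling intersect file1 file2 skips the intermediate files entirely; the cost is still dominated by the two sorts.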

April 4, 2009

Scripted Reality

As promised, here are the scripts.

April 3, 2009

The Sound of Scraping

After sitting on my CBC Radio 3 metadata for just over a week, I finally got around to throwing together a decent downloading script. Actually, the scraper/downloader is a loose federation of scripts, deliberately kept in separate modules so as to allow nice things like, say, running multiple copies thereof concurrently. I'll post a link to the source in the near future, along with a few words of explanation. Maybe I'll even write a README - after all, although CBC Radio 3 is afloat for now, there's no telling how long it will survive the budget axes of doom.

(And, to prevent the inevitable smartasses from chiming in with "you forgot wget, n00b" - nope, it's in there somewhere. That said, I think you'll find these scripts go a tiny bit further...)

March 26, 2009

And the total is...

76302. Of course, this is just the metadata - I wouldn't be so reckless as to hammer the CBC Radio 3 servers with 200 GB worth of download requests in a day! (Nor would my bandwidth permit me to suck down that much data in anything less than a couple of weeks. Oh well.) Next up: filter it down to a list of songs that I might actually want.

2017 Songs and Counting

Your guess as to what this does:

#!/bin/bash
# Emit the 26 artist-listing URLs (offsets 0 through 25), logging each one.
for i in $(seq 0 25); do
    echo "http://radio3.cbc.ca/nmc/artists.aspx?offset=${i}"
done | tee -a artists.log |
./url-dumper 1.0 |
grep -Eo '/bands/[^"]*' | uniq |
while read -r line; do
    # Rewrite /bands/<name> as /play/band/<name>.
    b=$(basename "$line")
    echo "/play/band/${b}"
done | uniq | tee -a bands.log | ./cbc3-get-music-info 1.0

(Yes, I've left out some details - like what exactly those scripts do under the covers. I'll post about that when it's finished!)
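
(If you want to poke at the middle of that pipeline without touching the site, here is a minimal sketch of just the extract-and-rewrite stage. The function name band_paths and the sample markup are invented for illustration; I'm only assuming the pages contain double-quoted links of the form /bands/<name>, which is what the pattern above looks for.)

```shell
#!/bin/bash
# band_paths: read HTML on stdin and print /play/band/<name> for every
# /bands/<name> link found. (Hypothetical helper; the sample input below is
# made up, not copied from the CBC Radio 3 site.)
band_paths() {
    grep -Eo '/bands/[^"]*' | uniq |
    while read -r line; do
        echo "/play/band/$(basename "$line")"
    done | uniq
}

# Two invented anchor tags standing in for a page of artist listings.
printf '%s\n' \
    '<a href="/bands/Example-Band">Example Band</a>' \
    '<a href="/bands/Another-Act">Another Act</a>' |
band_paths
```

Note that uniq only collapses adjacent repeats, so a band that shows up on two different listing pages would still come through twice.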

Better make that 3156 songs. And counting.