September 3, 2009

Where Two Files Meet

(Note: I promise that I'll post something more generally interesting before this day draws to a close, for non-technical values of generally interesting.)

Let's say that you have two large unsorted files, each of which is essentially a long list of strings, and you want to find their intersection. There's the simple brute-force way:

while read line; do grep "$line" file2; done < file1

There's a problem - this is quadratic! Fortunately, we can cut this down pretty easily with the old time-memory tradeoff:

sort file1 > file1.sorted
sort file2 > file2.sorted
sort -m file1.sorted file2.sorted > combined
diff combined <(uniq combined)

No comments:

Post a Comment