On Progression
7/29/2005
This week in review: we find a lake full of ice water on Mars and a Japanese scientist creates a female android. I want my hoverboard now!
Writing a Collaborative Filtering Engine
7/28/2005
Translating a collaborative filtering engine that works on paper to actual code can be a bit of a hassle, especially if the language doesn't support "map" or "reduce" - two cool tools adored by functional programmers. I want to take the opportunity to explain the road that I'm cruising down now.
At first, I thought the choice of language didn't matter. One could calculate weighted correlation coefficients and determine k-NN, i.e. nearest neighbors/nodes, in any language (scripting or compiled). From the software angle, you need to be kinky with database joins, severe looping, deep structures, and awkwardly scoped vars. On the hardware front, you need to be able to do fast math - all that superstar code is great, but you need to take a step back and verify that the engine is ripping the right data. What's the point if your engine spits badness?
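To make the two building blocks above concrete, here is a minimal sketch in Python. It assumes ratings live in a plain dict of dicts (user -> item -> score), and it uses an unweighted Pearson correlation rather than the weighted coefficients the engine calls for. The names `pearson` and `nearest_neighbors` are made up for illustration; this is not the engine's actual code.

```python
from functools import reduce
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two users over the items both have rated."""
    shared = [item for item in a if item in b]
    n = len(shared)
    if n < 2:
        return 0.0  # not enough overlap to say anything
    xs = list(map(a.get, shared))  # "map": pull each user's scores for shared items
    ys = list(map(b.get, shared))
    mean_x = reduce(lambda acc, v: acc + v, xs, 0.0) / n  # "reduce": fold scores into a sum
    mean_y = reduce(lambda acc, v: acc + v, ys, 0.0) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a user who rates everything the same has no usable signal
    return cov / sqrt(var_x * var_y)


def nearest_neighbors(ratings, user, k):
    """Return the k users most correlated with `user` as (score, name) pairs."""
    scores = [
        (pearson(ratings[user], other_ratings), other)
        for other, other_ratings in ratings.items()
        if other != user
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return scores[:k]
```

Note that `nearest_neighbors` compares one user against every other user, so scoring everybody means comparing every pair. That is where the looping cost piles up as the data set grows.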
After planning, I sat down and chose Python. It's a clean language, I'm relatively familiar with it (been using it for a few years now), and it has functional elements that I need to get the job done. I got a quarter of the way through, when I realized a significant problem. It was moving slower than a snail on a warm day.
In this engine, it all boils down to how fast you can churn through a set, store a summary, compare, and repeat. Memory takes a shoe-beating as the set grows. And we're not talking linear growth either - that would certainly save some cycles. So I went back to the drawing board and simplified the engine, accepting a small loss in quality. I kept refactoring and kept hitting brick walls.
I checked my code against people who've built this sort of application before. One of them, who'll go nameless, kept telling me that I should've been using Java. That is, in my opinion, a lame answer. When someone asks for code assistance, don't tell them that they have to go and learn a new language. Aside from that bump, I sat with others and simplified the engine further. After all this extra work, the results were the same.
I had started to think that maybe that engineer was right - maybe I need to be doing this in Java or C++. Of course I had doubts about her logic (scripting languages put up a good fight against 800 lb. gorillas), but I was in dismay. It was then that I decided to re-write a phase of the engine in Perl. I noticed a minor improvement, but nothing earth-shattering. Across various forums, people have documented that Perl executes a bit faster than Python - but not enough for my needs. I think I'll give Ruby a try next.
What's the point of this analysis? Once you build the engine, you have to keep it well-oiled. In software, that means choosing, and then building, the right environment that scales appropriately with your master data repository. As of right now, I don't want to keep simplifying the engine, mainly because I don't want to keep reducing the quality of the result set. Aside from experimenting with other languages, I'll check into my hardware - even though I have a gigabyte of RAM and a solid processor. Hopefully I'm not biting off more than I can chew this time.
Google API + Nifty Hacking = Social Network?
7/22/2005
Came across this interesting post on Justin's blog:
When will we be able to extract the search data available in our Google search history via API? People who search the same key words can build connections just as well as people who blog about the same things.
I too would like to see an API. I can already see a layered social network being built on top of a user's search terms. But not just any social network - one that promotes productivity and research. Is that the natural progression for Google? They can index your site, your email, and things you look for. When can they give me recommendations or allow me to connect with other people that have also searched for something? What about showing search results that others have found helpful?
Another Graphic Novel?
7/18/2005
I woke up this morning and had this strange urge to whip open a notebook and start writing out a storyboard for another graphic novel. The first one I worked on was with two students at the Olin School of Engineering. So it's been about a year since we churned that out (two weeks of combined shooting and post-production), but I'm hungry to do another one. Since graduating, I've been programming, writing, and exploring other forms of media. Maybe I've been repressing the urge for a while now...
Right off the bat, I've already started to think of constraints. Like the previous novel, this one will also be photo manipulated. Perhaps it's time to think of a compelling story that is both linear and a quick read (no more than 30 pages). Let's see, what interests me today?
The Olympics Rock
7/18/2005
About two weeks ago, my brother posted his thoughts about the Olympics:
Really, I should be a fan of the Olympics, given that it is not dominated by a single country, rather a litany of contenders surprise the viewing audience every year. But, I just can't relate to the events: they bore me. I would much rather watch something else, even a re-run of Entourage for 20 minutes than sit through another one of those stink-bomb gymnastic events.
Call me a sucker for mass-media and whatnot, but I love watching the Olympics. It's a spectacle to see the world's best athletes - arguably some of the most professionally dedicated and passionate people on the planet - gathered in one place in a civil fashion (don't see too much of that these days). They practice to put on a great performance and strive to break world records, over and above representing their respective countries. We should celebrate it, not as proud nationalists or advocates of sports, but as believers in healthy competition.
Caching Remote XML Feeds
7/15/2005
For those that don't know, I aggregate several feeds onto my portal (on root). However, I awoke this morning to discover that my main page was throwing an error while trying to parse one of those feeds. Apologies. It turns out WikiNews spit out an invalid feed - at least that's what it looks like at the moment. This shows two problems:
1. I'm inept and not throwing my own errors
2. My aggregator is fragile
Without diving in to rant about point one, let's jump to the second one. Any aggregator is only as accurate as its feeds. If feeds aren't there or are invalid, then you're out of luck. What makes the matter worse is that neither you nor I can control the output of those feeds. I can parse a feed fine today, but it's possible that it might fail tomorrow. That's kind of frightening, in a lame sort of way.
So to address the problem, I'm going to keep a cached copy of the most recent XML content for every feed. In the event that there's an error, you won't see an ugly message, and there will be something to stare at. I'll write these tests later tonight. Should websites keep a cached copy or should there be a site where we can access a cached feed if something breaks down?
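The fallback described above could look roughly like this in Python. It's only a sketch under a few assumptions: the function name `load_feed`, the `fetch` callable (anything that returns the feed's raw bytes), and the one-file-per-feed cache layout are all hypothetical, not what actually runs on this site.

```python
import os
import xml.etree.ElementTree as ET


def load_feed(name, fetch, cache_dir):
    """Return the parsed XML for a feed, falling back to the last good copy."""
    cache_path = os.path.join(cache_dir, name + ".xml")
    try:
        raw = fetch()              # may raise if the remote site is down
        root = ET.fromstring(raw)  # raises ET.ParseError on malformed XML
    except Exception:
        if not os.path.exists(cache_path):
            raise                  # nothing cached yet, so surface the error
        with open(cache_path, "rb") as fh:
            return ET.fromstring(fh.read())
    # Only content that parsed cleanly ever overwrites the cache.
    with open(cache_path, "wb") as fh:
        fh.write(raw)
    return root
```

The key design choice is that the cache is written only after a successful parse, so a broken feed can never replace the last good copy.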
Google Zeitgeist RSS Feed
7/12/2005
So I made a screen scraper in Perl that writes the Google Zeitgeist Top 10 Gaining Queries to a valid RSS feed. You can download the feed or read more about it. I'm going to bed - good night.
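For the curious, the "write it out as RSS" half of a scraper like this can be sketched in a few lines. This is Python rather than the Perl I used, and it is an illustration, not the real script: the function name `build_rss` and its parameters are made up, and the scraping step is left out entirely. It assumes you already have the queries as a list of strings.

```python
import xml.etree.ElementTree as ET


def build_rss(title, link, description, queries):
    """Build a minimal RSS 2.0 document with one <item> per query."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for rank, query in enumerate(queries, start=1):
        item = ET.SubElement(channel, "item")
        # ElementTree escapes characters like & and < when serializing.
        ET.SubElement(item, "title").text = "%d. %s" % (rank, query)
    return ET.tostring(rss, encoding="unicode")
```

Building the tree with a library instead of gluing strings together means special characters in a query can't break the feed.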
Atom API Test
7/11/2005
Testing the Atom API from ColdFusion! Where's Waldo?
Greasemonkey for Site Usability
7/06/2005
If any hacker writes a Greasemonkey script for your site - you should think about rolling the "feature" into your application. As much as I enjoy reading Paul Graham's essays, I dislike the fact that his footnote links are not clickable. That's where this nifty Greasemonkey script comes in and turns all potential links into something that I can click on. All I'm saying is that if a user takes the time and writes a script to improve the usability on your site, you might want to take a step back and address the issue.
Slower Innovation
7/03/2005
I've been saying that innovation has slowed down for months now. It's cool that other people also see it. I want my flying car and hoverboard. Got it?