whatspop - Kunal Anand

Setting up a database for tagging

1/17/2006

Thanks for the encouraging email about the ColdFusion tagging library. Since I have received a lot of questions about setting up a database to handle tagging, I presumed that it would be better to write a walkthrough for everyone. Before jumping ahead, I should mention that I am by no means a database administrator (DBA) or guru. However, I have had to build and scale very large web applications and found many of the following techniques valuable throughout my career. With that disclaimer, let us move on.

We need to create an example for this walkthrough. Let us pretend that we are going to create a database for a del.icio.us clone (you know my thoughts about clones). Even though I actually wanted to write some CFML code and deploy it, I am going to address the topic of setting up the necessary database tables. Writing the CFML is actually trivial for building the clone. In fact, I have listed the necessary queries for making your own del.icio.us.

Chances are that you already have a current database for your application. If you plan on growing the number of users or want to handle large site activity, I would recommend upsizing from SQLite or Microsoft Access to MySQL or PostgreSQL. The rationale is that adding tagging queries will potentially introduce lots of joins, which will exponentially increase with more site usage. I should mention that I am not slamming the smaller databases. If you are in a small environment with low usage, then you probably have nothing to worry about. Just realize that there are more robust and open-source alternatives out there. I prefer MySQL as it is free, simple to install (binaries and source are available), ubiquitous, and easy to replicate across a database farm.

So when you have decided on your database client, you have to design your tables. Right off the bat, you should know that normalizing your database tables is not a rule, it is an option. While it may be considered great practice, it may get in the way of writing and scaling your application. In fact, it may be prudent to de-normalize your tables if performance is an issue or if you want your database to breathe once in a while. You should not feel bad about de-normalizing either - lots of large web sites, like Flickr, realize the technical benefits of it. I will post a link to that article with Cal Henderson once I find it in my del.icio.us bookmarks. Also, I will not go into setting up foreign keys between your tables. This is an architectural issue that can be answered by one question: do you want the database or application source code to handle referential integrity, such as cascade updates and deletes for links and tags? Yes, this issue could argued both ways forever.

Here is the mostly-normalized approach (you can normalize this further):

  • users [id, email, name, password_hash, created_on]
  • links [url, title]
  • user_links [link_id, user_id, description, created_on]
  • user_link_tags [user_link_id, tag]
Here is the de-normalized approach:
  • users [id, email, name, password_hash, created_on]
  • user_links [user_id, url, title, description, created_on]
  • user_link_tags [user_id, user_link_id, tag, created_on]
The first approach has most things compartmentalized in its right place whereas the second approach creates data redundancy (note the missing linking table). While the first is good for feng shui, the second is more practical for working with users and their respective tags, without having to perform joins, which can use up more processing power as the application scales. Using the de-normalized approach, here are ten simple and effective queries:

1. Get all the user's links:

SELECT url, title
FROM user_links
WHERE user_id = #

2. Get all the user's tags:

SELECT tag
FROM user_link_tags
WHERE user_id = #

3. Get the user's 10 most popular tags:

SELECT tag, count(tag) as count
FROM user_link_tags
WHERE user_id = #
GROUP BY tag
ORDER BY count DESC
LIMIT 10

4. Get all the link's tags:

SELECT tag
FROM user_link_tags
WHERE user_link_id = #

5. Get the total number of times a url has been linked:

SELECT count(url) as count
FROM user_links
WHERE url = #

6. Get the latest 10 links added to the site:

SELECT url, title
FROM user_links
ORDER BY created_on DESC
LIMIT 10

7. Get the 50 all-time popular links:

SELECT url, title, count(url) as count
FROM user_links
GROUP BY url, title
ORDER BY count DESC
LIMIT 50

8. Get the 50 all-time popular tags:

SELECT tag, count(tag) as count
FROM user_link_tags
GROUP BY tag
ORDER BY count DESC
LIMIT 50

But what if we want to find the recent popular links or tags, essentially re-creating del.icio.us popular? The following queries explicitly make use of exclusive MySQL functions and syntax.

9. Get most 50 popular links over the last 24 hours:

SELECT url, title, count(url) as count
FROM user_links
WHERE DATE_SUB(CURDATE(),INTERVAL 1 DAY) <= created_on
GROUP BY url, title
ORDER BY count DESC
LIMIT 50

10. Get the 50 popular tags over the last 24 hours:

SELECT tag, count(tag) as count
FROM user_link_tags
WHERE DATE_SUB(CURDATE(),INTERVAL 1 DAY) <= created_on
GROUP BY tag
ORDER BY count DESC
LIMIT 50

Using the built in MySQL date function proved to be faster than using ColdFusion to loop and store popular links in a temporary hash table. If you are using a different database package, you should check your manual to find related functions. If you have a lot of data nodes (records in this case), you might want to cache the queries, store the data in a temporary database table, or write the static/dynamic page using a scheduled task (perhaps cron).

Now you need to figure out a way to search your links, tags, and descriptions. Picking the best method revolves around how serious you want to get. If this is going to be a huge application with lots of deep searching, I suggest compiling Lucene, an open-source search engine ported to various languages. Lucene's success is a function of its agility and accuracy. If this seems more like a chore to you, or if you want to start small, or if you are stuck with Windows Server, you can also establish full-text indexes on the appropriate columns. While this is less work to set up, you should remember to repair your table indexes every now and then.

Alas, here are my final thoughts. Setup a normalized database if you are purist and like to keep things clean. Just remember that you might have an issue with scaling your app. I recommend the latter if scale or query performance is an issue. For those interested, let me know if you want the MySQL schema file - I will email it over.

About I am currently a Senior Engineer at MySpace. Feel free to check out my personal collective.

Archives
April 2005
May 2005
June 2005
July 2005
August 2005
September 2005
October 2005
November 2005
December 2005
January 2006
March 2006
April 2006
May 2006
June 2006
July 2006
August 2006
September 2006
October 2006
November 2006
January 2007
February 2007
April 2007
November 2007
December 2007
January 2008
March 2008
April 2008
May 2008
June 2008

Subscribe to my feed