Setting up a database for tagging
1/17/2006
Thanks for the encouraging email about the ColdFusion tagging library. Since I have received a lot of questions about setting up a database to handle tagging, I presumed that it would be better to write a walkthrough for everyone. Before jumping ahead, I should mention that I am by no means a database administrator (DBA) or guru. However, I have had to build and scale very large web applications and found many of the following techniques valuable throughout my career. With that disclaimer, let us move on.
We need to create an example for this walkthrough. Let us pretend that we are going to create a database for a del.icio.us clone (you know my thoughts about clones). Even though I actually wanted to write some CFML code and deploy it, I am going to address the topic of setting up the necessary database tables. Writing the CFML is actually trivial for building the clone. In fact, I have listed the necessary queries for making your own del.icio.us.
Chances are that you already have a current database for your application. If you plan on growing the number of users or want to handle large site activity, I would recommend upsizing from SQLite or Microsoft Access to MySQL or PostgreSQL. The rationale is that adding tagging queries will potentially introduce lots of joins, which will exponentially increase with more site usage. I should mention that I am not slamming the smaller databases. If you are in a small environment with low usage, then you probably have nothing to worry about. Just realize that there are more robust and open-source alternatives out there. I prefer MySQL as it is free, simple to install (binaries and source are available), ubiquitous, and easy to replicate across a database farm.
So when you have decided on your database client, you have to design your tables. Right off the bat, you should know that normalizing your database tables is not a rule, it is an option. While it may be considered great practice, it may get in the way of writing and scaling your application. In fact, it may be prudent to de-normalize your tables if performance is an issue or if you want your database to breathe once in a while. You should not feel bad about de-normalizing either - lots of large web sites, like Flickr, realize the technical benefits of it. I will post a link to that article with Cal Henderson once I find it in my del.icio.us bookmarks. Also, I will not go into setting up foreign keys between your tables. This is an architectural issue that can be answered by one question: do you want the database or application source code to handle referential integrity, such as cascade updates and deletes for links and tags? Yes, this issue could argued both ways forever.
Here is the mostly-normalized approach (you can normalize this further):
- users [id, email, name, password_hash, created_on]
- links [url, title]
- user_links [link_id, user_id, description, created_on]
- user_link_tags [user_link_id, tag]
- users [id, email, name, password_hash, created_on]
- user_links [user_id, url, title, description, created_on]
- user_link_tags [user_id, user_link_id, tag, created_on]
1. Get all the user's links:
SELECT url, title
FROM user_links
WHERE user_id = #
2. Get all the user's tags:
SELECT tag
FROM user_link_tags
WHERE user_id = #
3. Get the user's 10 most popular tags:
SELECT tag, count(tag) as count
FROM user_link_tags
WHERE user_id = #
GROUP BY tag
ORDER BY count DESC
LIMIT 10
4. Get all the link's tags:
SELECT tag
FROM user_link_tags
WHERE user_link_id = #
5. Get the total number of times a url has been linked:
SELECT count(url) as count
FROM user_links
WHERE url = #
6. Get the latest 10 links added to the site:
SELECT url, title
FROM user_links
ORDER BY created_on DESC
LIMIT 10
7. Get the 50 all-time popular links:
SELECT url, title, count(url) as count
FROM user_links
GROUP BY url, title
ORDER BY count DESC
LIMIT 50
8. Get the 50 all-time popular tags:
SELECT tag, count(tag) as count
FROM user_link_tags
GROUP BY tag
ORDER BY count DESC
LIMIT 50
But what if we want to find the recent popular links or tags, essentially re-creating del.icio.us popular? The following queries explicitly make use of exclusive MySQL functions and syntax.
9. Get most 50 popular links over the last 24 hours:
SELECT url, title, count(url) as count
FROM user_links
WHERE DATE_SUB(CURDATE(),INTERVAL 1 DAY) <= created_on
GROUP BY url, title
ORDER BY count DESC
LIMIT 50
10. Get the 50 popular tags over the last 24 hours:
SELECT tag, count(tag) as count
FROM user_link_tags
WHERE DATE_SUB(CURDATE(),INTERVAL 1 DAY) <= created_on
GROUP BY tag
ORDER BY count DESC
LIMIT 50
Using the built in MySQL date function proved to be faster than using ColdFusion to loop and store popular links in a temporary hash table. If you are using a different database package, you should check your manual to find related functions. If you have a lot of data nodes (records in this case), you might want to cache the queries, store the data in a temporary database table, or write the static/dynamic page using a scheduled task (perhaps cron).
Now you need to figure out a way to search your links, tags, and descriptions. Picking the best method revolves around how serious you want to get. If this is going to be a huge application with lots of deep searching, I suggest compiling Lucene, an open-source search engine ported to various languages. Lucene's success is a function of its agility and accuracy. If this seems more like a chore to you, or if you want to start small, or if you are stuck with Windows Server, you can also establish full-text indexes on the appropriate columns. While this is less work to set up, you should remember to repair your table indexes every now and then.
Alas, here are my final thoughts. Setup a normalized database if you are purist and like to keep things clean. Just remember that you might have an issue with scaling your app. I recommend the latter if scale or query performance is an issue. For those interested, let me know if you want the MySQL schema file - I will email it over.