Keeping Your Sanity When Using Lookup Tables
Note: Originally posted on 2004-02-17 under the permalink 4aa1a488-fa83-4df5-83e8-a2b218e75d75… someone requested that this one be resurrected from the archives. Someday I’ll write a script to convert all that old content to WordPress! *sigh*
Marc Shiker has seen the light on how to address the issue of lookup tables.
The problem here is not one of identities primary key vs. storing the value, it is one of whether you should have a relation in the first place. Since the only attribute of Marc’s Gender table other than the ID is the name of the gender, this relation serves no purpose at all and should be rolled back into the Employee table.
Marc asks:
Without any performance testing to back this up, it just seems more efficient to filter rows on an int value than a varchar. […] Does an index on Gender in David’s table overcome this? What about physical disk space? 1000 rows of an int vs 1000 rows of a varchar(7)? How about a million rows?
My answers:
- Especially considering encoding comparisons and memory transfers, string comparisons are more expensive.
- That said, table joins to get the string back when you only have the ID are orders of magnitude more expensive. The only time they would be compared is when you are checking a constraint (i.e., on INSERT or UPDATE). You are much more likely to SELECT them and be forced into using a table join to look up your value by its ID.
- The processing of your SQL statement to turn “GenderID=1” into a query and the extra table locks easily trump the difference in looking up a string vs. an integer. String indexes are also very efficient, especially when the choices are unique within the first few characters.
- Physical disks are cheap and fast. Worst cast, the string takes 3x as much space as the int (assuming string in unicode + length integer). No big deal.
- You could also make FirstName a relation, but wouldn’t that be silly? If the only determining factor is the number of valid choices, they don’t belong in a related table.
Now the question becomes “how do you give the user a set of valid choices?” Here a few approaches, in order of my own preferences:
- Hard-coded in code. Bad. Bad. Bad. Choices should never be enumerated inside the code, period. They should ideally be cached in memory, but never compiled. “Magic” strings should be avoided here as well.
- Hard-coded in config files. Better, but no dice. There is no validation at the database level.
- Individual lookup tables. Better since constraints can be defined and choices can be changed without compilation. But now you have a bunch of tiny little tables, all alike.
-
Master lookup table. I’ve used this approach with much success in the past. Consider something like the following schema:
CREATE TABLE [Choices] [Category] [varchar] (150) NOT NULL, [Value] [varchar] (255) NOT NULL, [SortOrder] [int] NULL CONSTRAINT [DF_Choices_SortOrder] DEFAULT (0), CONSTRAINT [PK_Choices] PRIMARY KEY CLUSTERED ([Category],[Value]) ON [PRIMARY] ) ON [PRIMARY]
The advantages to this approach: a single table definition, check constraints and triggers are easily duplicated and applied, and you can manually override the sort order (e.g., “month” should come after “week”). The clustered index makes it lightning-fast, the actual values are stored in the other table, and there is no false relation involved. Any changes to the structure (for instance, extending the length of the allowed value, or adding a “context” filter) is system-wide with a single table change. A few other simple schema changes could allow you to have the choices translated into the user’s native language automatically.
- Dynamic selection. This is my current favorite, and it is really the simplest: to find out the valid choices, just do a SELECT DISTINCT columnname on the table. Disadvantages: you lose the extra sorting capability and it requires a representative corpus of records to get the choices populated. Advantages: with B-Tree indexing, the DISTINCT should be as fast or faster than a master lookup table, regardless of the number of rows in the table (it only scans the index for distinct values, not the table), the choices are never out of sync with the extant data, and new choices only require the INSERT/UPDATE that uses them. Also, since there are no constraints or triggers involved, this works beautifully with just about any RDBMS from Jet upward. Another advantage: you can filter the user’s choices easily based on other attributes of the record, such as only showing choices that are from records the user has access to.