SQL Server needs dynamic enums

2015-01-25 by richard in Software Development

I manage a number of databases at work that have varchar/nvarchar columns with very restricted values — usually fewer than a dozen valid choices.

The problem is that storing and indexing these values in Microsoft SQL Server is highly inefficient.

Enterprise Edition users can work around this by partitioning their tables on such a column, but (a) that only works for one column and (b) partitioning a table isn’t necessarily good for performance overall unless the partition function aligns well with the queries you tend to run on the table.

Anyone who has studied database theory knows the pat answer — create another relation (table) that has two columns: one with a meaningless unique int value, the other with the string value (also unique). Then in the main table, refer to each allowed value by its number, not its actual value, and join the tables for output. However, there are a few problems with this approach:

It’s a pain in the ass — extra joins and aliasing for SELECT, translation on INSERT/UPDATE, etc.
Adding new “valid” values over time requires inserting into the lookup table. That may work for, say, a list of countries, but is less useful for, say, a list of cities that is discrete but grows over time.
If you have a number of columns like this, you’ll end up adding a ton of these list-lookup tables, which is messy. And if you decide to be clever and use an EAV approach of one lookup table with the ID number, list name, and list value, you’ll run into problems setting up standard referential integrity constraints when a table has multiple “pick-list” columns.

MySQL has an alternative — ENUM columns. These store a numeric value, but act like varchar() columns, and the list of valid choices is part of the column definition. While it looks tempting at first, the evils with this approach are many.

Instead, I would recommend that MSSQL server add a new keyword for varchar/nvarchar column definitions called LOW_CARDINALITY*. This would tell MSSQL to to the following in the background (transparently):

Maintain a dictionary of “used” values in the column;
Assign each unique value a surrogate integer value;
In the table or indexes covering the column, store the integer value, not the string.

This would result in highly efficient storage of columns that contain a limited number of choices, without needing to rely on relational joins, “magic” numbers (usually managed by the business logic), or other clumsy workarounds. It could make a huge difference for table and index storage, memory usage, and query performance. And since it is transparent to all queries, it’s the best kind of new feature — one that can be added to an existing database without having to make additional changes to views, stored procedures, or code that connects to the database.

I’m sure the devil is in the details, particularly when it comes to things like replication, but the benefits would far outweigh the effort.

_* I’m sure someone at Microsoft could come up with a better keyword than LOWCARDINALITY. Something like LIST might work but seems a bit colloquial. DISCRETE or SET are other options, though SET is already a reserved word.