Natural Keys are a Fallacy
Originally posted 2004-02-18, part of my slow effort to convert old dasBlog content to WordPress.
On the big argument going around about the use of identity() fields as surrogate primary keys, I’ll just make a few comments from my own experience. My overall opinion: in a real-world system, your list of candidate keys is constantly being reduced, and composition required to generate uniqueness is constantly increasing. If you start using a natural candidate key, you might as well plan on redefining the PK at some point in the future, period. You can’t possibly know all of the goofy-but-valid data you will have to store in the future. Some observations:
-
Every single database I’ve written that relied solely on natural primary keys has inevitably had to be redesigned when valid duplicates (due to member keys being duplicated or being null) suddenly started coming in. A few examples:
- In an online education system I wrote, we initially used the member’s SSN as the primary key for that table. It was a good choice at the time because there was state-mandated reporting that required SSNs on the form. The problem: to torture Dorthy’s line, we soon weren’t in Texas anymore. The non-Texas folks didn’t like handing over their SSNs, the people signing up for the for-credit trial didn’t either, and the non-US people didn’t have one. The problem was, the SSN was entrenched across nearly every table in the database and tens of thousands of lines of code. It was a major, expensive deal to straighten this mess out.
- In a CRM database I wrote for a nationwide dialup ISP, we decided on the customer’s home telephone number as the primary key. It was an elegant choice in terms of database theory: logic dictated that dialup Internet service requires phone service, the customer knew their own numbers, and “username” wasn’t a good choice for corporate accounts with multiple usernames. But within a year, we ran into problems: customers who moved away and canceled service and whose phone numbers ended up reassigned to new customers (it does happen), people sharing phones but not accounts (e.g., roommates, college students using their parents’ permanent phone number but needing simultaneous, separately-billed service), etc. We added a suffix to the phone number for duplicates rather than redesign the database.
- In a database designed to store federal and state regulations, I decided to use the regulation’s citation (e.g., “40 CFR 60.112b(a)(1)(vi)”) as the PK for the regulation table. Well, less than a month ago I redesigned to use an identity()-based field. Why? The need to handle requirements with unssigned or no valid citation (e.g., various site-specific permit conditions or corporate mandates) and issues relating to updates when citations change (our lawmakers could care less about primary keys when they renumber a chapter).
- If the user can change the key, they can easily invalidate an existing screen in the interface. Maybe not a huge deal in the rich client world, but consider a trivial web app: a table of search results linked to the edit page for each record. When the user edits a record, you have to pass the entire PK (potentially 3 or more strings) to the edit page as parameters. This could include fields the user is uncomfortable seeing in the URL. Also, once they or someone else saves a change to one of the record’s PK attributes, the search page is suddenly invalid. But they could still be working from that page (using tabs, the Back button, opening new windows, using a page left open from yesterday, etc.). The disconnected nature of web applications makes immutable IDs a must.
- Composite keys (which most PKs eventually evolve into) screw up caching. Most developers try to at least do some object caching in web apps. But most caching mechanisms rely on having a single hashed key. What are you going to do, concatenate your PK fields with pipe characters into one big string? Yuck. Databases can take advantage of intersecting multiple indices, but a single hash can’t.
- Same thing goes for fields in web interfaces: if you are choosing from among many objects, are you going to put the entire mess of key fields in the value attribute? What if one of those fields contains a comma, a double-quote, or the character you choose to delimit your field values? You also have to consider here the amount of data you are passing across the wire.
- Composite keys make some SQL impossible. In particular, the IN comparison operator is useful for dynamically-generated queries.
- Changes to primary key definitions must be propogated throughout the entire application. Even if you are a staunch stored procedure user, you’ll have to change every selection SP to add parameters and every place the code calls them.
- Because of the above, natural primary keys may be a poor choice for applications developed using agile methodologies. Refactoring digs deep into the database structure, especially near the beginning.
- I am a fan of enforcing uniqueness, but the PK isn’t the only way to add a constraint to a table.
- If uniqueness is already guaranteed (say, due to a timestamp and a userid field) or cannot be evaluated but can be assumed and needs no constraint (e.g., a table with little but a text field), a surrogate is a perfectly acceptable way to handle the data internally rather than wasting cycles and code on useless matching.
Background: yes, I paid attention in my database theory classes. But I have the luxury (or curse) of being architect, engineer, developer, and DBA of most apps I am involved with. I think this gives me a perspective that non-coder DBAs simply don’t have: that the database is there to serve the application, not the other way around.