Randomize

Richard Tallent’s occasional blog

Natural Keys are a Fallacy

Originally posted 2004-02-18, part of my slow effort to convert old dasBlog content to WordPress.

On the big argument going around about the use of identity() fields as surrogate primary keys, I’ll just make a few comments from my own experience. My overall opinion: in a real-world system, your list of candidate keys is constantly being reduced, and composition required to generate uniqueness is constantly increasing. If you start using a natural candidate key, you might as well plan on redefining the PK at some point in the future, period. You can’t possibly know all of the goofy-but-valid data you will have to store in the future. Some observations:

  1. Every single database I’ve written that relied solely on natural primary keys has inevitably had to be redesigned when valid duplicates (due to member keys being duplicated or being null) suddenly started coming in. A few examples:
    • In an online education system I wrote, we initially used the member’s SSN as the primary key for that table. It was a good choice at the time because there was state-mandated reporting that required SSNs on the form. The problem: to torture Dorthy’s line, we soon weren’t in Texas anymore. The non-Texas folks didn’t like handing over their SSNs, the people signing up for the for-credit trial didn’t either, and the non-US people didn’t have one. The problem was, the SSN was entrenched across nearly every table in the database and tens of thousands of lines of code. It was a major, expensive deal to straighten this mess out.
    • In a CRM database I wrote for a nationwide dialup ISP, we decided on the customer’s home telephone number as the primary key. It was an elegant choice in terms of database theory: logic dictated that dialup Internet service requires phone service, the customer knew their own numbers, and “username” wasn’t a good choice for corporate accounts with multiple usernames. But within a year, we ran into problems: customers who moved away and canceled service and whose phone numbers ended up reassigned to new customers (it does happen), people sharing phones but not accounts (e.g., roommates, college students using their parents’ permanent phone number but needing simultaneous, separately-billed service), etc. We added a suffix to the phone number for duplicates rather than redesign the database.
    • In a database designed to store federal and state regulations, I decided to use the regulation’s citation (e.g., “40 CFR 60.112b(a)(1)(vi)”) as the PK for the regulation table. Well, less than a month ago I redesigned to use an identity()-based field. Why? The need to handle requirements with unssigned or no valid citation (e.g., various site-specific permit conditions or corporate mandates) and issues relating to updates when citations change (our lawmakers could care less about primary keys when they renumber a chapter).
  2. If the user can change the key, they can easily invalidate an existing screen in the interface. Maybe not a huge deal in the rich client world, but consider a trivial web app: a table of search results linked to the edit page for each record. When the user edits a record, you have to pass the entire PK (potentially 3 or more strings) to the edit page as parameters. This could include fields the user is uncomfortable seeing in the URL. Also, once they or someone else saves a change to one of the record’s PK attributes, the search page is suddenly invalid. But they could still be working from that page (using tabs, the Back button, opening new windows, using a page left open from yesterday, etc.). The disconnected nature of web applications makes immutable IDs a must.
  3. Composite keys (which most PKs eventually evolve into) screw up caching. Most developers try to at least do some object caching in web apps. But most caching mechanisms rely on having a single hashed key. What are you going to do, concatenate your PK fields with pipe characters into one big string? Yuck. Databases can take advantage of intersecting multiple indices, but a single hash can’t.
  4. Same thing goes for