UUIDs as Primary Keys

Aaron Poffenberger

Recently I've been developing a website for a customer that integrates with SugarCRM. When you look at the SugarCRM data model one thing that immediately leaps out to almost anyone but especially to experienced SQL developers is the use of UUIDs as primary keys. One's first reaction is something like "But why!".

It's not that we're unaware of UUIDs. If you've done any work with Microsoft SQL Server you know that to use Replication you need a ROWGUIDCOL to ensure that all rows across the various replicants are unique. This requirement makes plenty of sense since many, perhaps most, database tables use artificial primary keys rather than natural keys. And the artificial key of choice is almost always the monotonically increasing integer data type. Of course the problem with merging data from multiple databases all with primary keys ranging from 1 … n is to maintain uniqueness without clobbering data. Enter the UUID.

The UUID ("Universally Unique Identifier"), a.k.a GUID on Microsoft systems, is a 32 byte hexadecimal value usually presented as 5 groups separated by 4 dashes to make for a 36-byte value: e.g., 550e8400-e29b-41d4-a716-446655440000. With 128 bits of data UUIDs have a range of approximately 3 x 1038. This huge range of values and the associated methods for generating a unique value make UUIDs near perfect for uniquely identifying data. 36 bytes of storage-space requirements make them less so. Non-monotonic generation almost makes them unusable as primary keys where ordered indices are the difference between performance and table scans.

Enter the COMB UUID. The COMB UUID was invented by Jimmy Nilsson to bridge the gap between uniqueness and orderliness.[1] They're a clever solution to the problem. He calls them "COMBs" because they're "COMBined" UUIDs. What Nilsson realized was that if he truncated the last 6 bytes of a UUID and then appended a 6-byte timestamp he would have a value that would increase sequentially (at least for the next 77 years) in the low bytes and would be (almost universally) unique because of the high bytes. He further found that overhead of inserting lots of COMBs was about 1/30th that of inserting UUIDs — in fact the overhead is only slightly higher than that of using integers. Win-win!

There are still issues to consider when adopting UUIDs — whether COMBs or random values — as primary keys. The first is storage. When inserted as text values they consume a whopping 36 bytes (or 32 bytes when the dashes are stripped)! Even when converted to 16-byte binary values that's more space than most system-integer values. Second is ease of use. It's easy to look at most integer values and determine the correct ordering. COMBs are easier in this regard but only up to a point. True UUIDs are inscrutable in this regard. Integers are also easier to remember, at least with low values.

UUIDs do have their benefits as well. The first, of course, is almost guaranteed uniqueness. Where merging data from disparate hosts is likely or required using UUIDs is almost required, whether as primary key or secondary key. There is the possibility of collision but it's very remote and could be managed with exception handling. Another very interesting benefit is the ability to generate unique IDs at the application or client side without first inserting a record in the db. If all interested parties use compatible generation algorithms then it's not only possible but there's little downside. (At least for server-side generation. I'm less likely to use a value generated on the client side, especially for a web app where the client is very remote.)

What's the answer, then? While I like the idea of uniqueness for now I'm sticking with integer generation for single-host systems. That decision is mostly driven by a desire to write libraries to make using UUIDs seamless. There's no doubt, however, that I'll use UUIDs for multi-host systems where there's the possibility of data aggregation. Once my libraries are written and I've tested them out a bit more thoroughly I foresee moving almost exclusively to UUIDs as primary keys for heterogenous data. For common data, e.g., lookup tables, I'll likely stick with integer values.