sasbury.com

The IDs Have It

I was talking to my wife the other day about an application she uses at work. The application is used to track parts, and gives each part a part number. She was talking about how hard it is to find parts, and my immediate thought was that the unique part number would make it easy. The only problem is, the part numbers aren't guaranteed to be unique. They can be the same for different projects, and those different projects are all in the same database.

So I asked her if there was another, unique, identifier that she could use to find parts in the system. She said that she didn't know of one. Which got me thinking. Did the guys that designed this system really not provide a unique identifier? And that got me thinking about how important unique identifiers are, and what an important part they play in software architecture.

There are really two kinds of unique identifiers at play in a software system: user-visible and implementation-focused. User-visible identifiers are things like part numbers. These may be memorized in some cases and could be used to look up items. Implementation ids are generally hidden from the user and used internally; for example, they may provide a unique field for database joins. Implementation ids are pretty standard, but user-visible identifiers, like the part numbers in my wife's experience, require some more thought.

User Visible Identifiers

When building a REST API, or a web application, user visible identifiers (ids) are often used in URIs. For example, a part tracking application might have a URI like:

http://host:port/project/parts/856

where 856 is a unique part id. This example is really important because it demonstrates one of the key aspects of a user-visible identifier. Not only can a user see the id, they can memorize the id, or in this case bookmark a URI with the id. The public nature of this id makes it something that you do not want to change. It becomes an immutable property of the part in question.

Immutability has a price.

Imagine that you were writing a parts tracking system that only supported one project. You might assign every part a unique identifier starting from 1 (or maybe 0) and incrementing from there. Further, suppose that your system is used for 2 projects, but that they work separately. So each installation has its own counter and its own set of unique part numbers.

Now the boss comes along and asks you to combine the systems. You have a quandary. If the same part number appears in both projects, and you combine the data, the part numbers are no longer unique. In other words, if Project A has a part numbered 1 and Project B has a part numbered 1, you can't build a unique set of parts for A+B by reusing the numbers you already have.

Lest you think this is a made up example, our architecture for FogBugz is open to this specific problem. Currently each customer of FogBugz has a unique number for each case in the system. These unique values are customer dependent. There are as many case #1's as there are customers. If we had two customers ask to combine their databases, we would have to re-number at least one set of cases, making the currently public (immutable) id change.

One advantage of the FogBugz design is that customers experience their case ids in a very friendly manner. Your first case is #1, your second case is #2, etc. Moreover, unless you make a lot of cases, the numbers will be small and not too overwhelming, perhaps even memorizable.

Simple User-Visible IDs are more user friendly

This got me thinking about two questions: If you already have a system, how can you keep user-visible IDs unique when combining two databases? When you are designing a new system how do you provide future-proofed user-visible identifiers?

Architecting for User Visible IDs

The easiest way to combine two databases with overlapping ids is probably just to re-assign IDs. This isn't the nicest solution, but it certainly works. Simply build a new database and assign a new unique id to every item as you add it. The disadvantage of this solution is that the user suffers. If there are any bookmarks or other documentation that link to an item by its ID, those links will be broken. Consider the case of FogBugz: if someone entered a comment saying "see case #32", that comment would become invalid. So the "nice" way to perform this upgrade would also require finding all of these comments and fixing them.
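Here is a minimal sketch of the "nice" version of re-numbering, assuming each database is just a dictionary of case number to comment text and that references always look like "#32". The names merge_and_renumber and rewrite_refs are mine, made up for illustration, and a real system would have far more places to fix than one text field.

```python
import re

# Assumption: references to other cases always look like "#<number>".
CASE_REF = re.compile(r"#(\d+)")

def rewrite_refs(text, id_map):
    """Replace each '#N' with its new number; unknown numbers are left alone."""
    def swap(match):
        old = int(match.group(1))
        return f"#{id_map.get(old, old)}"
    return CASE_REF.sub(swap, text)

def merge_and_renumber(*databases):
    """Merge {old_id: text} dictionaries, handing out fresh sequential ids."""
    merged = {}
    next_id = 1
    for db in databases:
        # Build the old -> new map for this database first, so that
        # references to later cases can be rewritten as well.
        id_map = {}
        for old_id in sorted(db):
            id_map[old_id] = next_id
            next_id += 1
        for old_id, text in db.items():
            merged[id_map[old_id]] = rewrite_refs(text, id_map)
    return merged

project_a = {1: "first case", 2: "see case #1"}
project_b = {1: "other case", 2: "see case #1"}
print(merge_and_renumber(project_a, project_b))
```

Even in this toy form you can see the cost: Project B's "see case #1" has to become "see case #3", and anything outside the database (bookmarks, emails, sticky notes) is still wrong.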

Re-numbering is simple if you aren't nice, but could be very complex if you try to be nice. There is a second solution that provides some insight into how we could design these ids better. Suppose that when you combine two databases you simply copy the data as is, perhaps providing a new Implementation Identifier to make the items unique in the database. Now, instead of trying to use the ID alone, you update your code to use a second piece of the data, or possibly more, to identify the right item. For example, you might use the logged in user as a key to help find the correct item in the database. If Joe looks for item 1, you know Joe is part of Project A, so you find item 1 from Project A.
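As a rough sketch of that lookup, assume the combined data is keyed by the pair (project, visible number) and that each user belongs to exactly one project. The tables and the find_item function below are hypothetical stand-ins for whatever your database layer really looks like.

```python
# Hypothetical tables; in practice these would be database queries.
USER_PROJECT = {"joe": "A", "ann": "B"}

# Keyed by (project, visible item number). impl_id is the hidden
# implementation identifier that is unique across the combined database.
ITEMS = {
    ("A", 1): {"impl_id": 1001, "name": "bracket"},
    ("B", 1): {"impl_id": 1002, "name": "gasket"},
}

def find_item(user, item_number):
    """Resolve a visible item number using the user's project as extra context."""
    project = USER_PROJECT[user]
    return ITEMS[(project, item_number)]

print(find_item("joe", 1)["name"])  # bracket
print(find_item("ann", 1)["name"])  # gasket
```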

This idea works if you are combining databases but not combining the items visible to Joe. If you are combining items visible to Joe, you are stuck with the problem that he will see two items labelled 1.

Can we keep two items labelled 1 and allow Joe to find them both easily?

The answer is maybe. If the GUI or API that provides access to items was designed with a bit more information then you might be able to combine two databases and still provide unique access. Suppose that your product uses URIs of the form:

http://host:port/projectName/items/itemNumber

Now there are really two identifying pieces in the URI: the project name and the item number. It doesn't matter if you store the data in a single database or multiple databases; these URIs can still be resolved to the correct items.
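To make that concrete, here is a small sketch that pulls the two pieces out of a URI of that form using Python's standard urllib.parse. The function name parse_item_uri and the example host are assumptions for illustration only.

```python
from urllib.parse import urlparse

def parse_item_uri(uri):
    """Split .../projectName/items/itemNumber into (project name, item number)."""
    parts = urlparse(uri).path.strip("/").split("/")
    if len(parts) != 3 or parts[1] != "items":
        raise ValueError(f"not an item URI: {uri}")
    project, _, number = parts
    return project, int(number)

print(parse_item_uri("http://example.com:8080/projectA/items/1"))
print(parse_item_uri("http://example.com:8080/projectB/items/1"))
```

The pair that comes back can then be used as the lookup key, no matter how many databases sit behind it.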

But you are back to the situation my wife is in. If someone says "can you check part #2", that isn't sufficient information to find the part. They have to say "can you check part #2 in project B". If the person talking to you doesn't know that, then you can't find the part.

Future Proofing

So how can you future proof your user-visible IDs?

One possible lesson from the previous discussion is to build IDs from multiple pieces of information. Perhaps the simplest way to think of this is to build ids of the form:

projectName-itemNumber

So you might have an item A-1 and an item B-1. If you combine the two projects, you could create new IDs of the form C-1, where C is the new project name. Or, if you reuse one of the project names, you would keep A-1 through A-100 and B-1 through B-50, and all the new IDs would have the form A-x, where x starts at 101.
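A sketch of the second option, reusing project name A, might look like the following. The Project class and its absorb method are invented for this example; the only point is that existing IDs survive the merge untouched and the counter keeps going.

```python
class Project:
    """Toy project that hands out ids of the form projectName-itemNumber."""

    def __init__(self, name):
        self.name = name
        self.items = {}
        self.next_number = 1

    def add(self, description):
        item_id = f"{self.name}-{self.next_number}"
        self.items[item_id] = description
        self.next_number += 1
        return item_id

    def absorb(self, other):
        """Copy another project's items in, keeping their ids as they are."""
        overlap = self.items.keys() & other.items.keys()
        if overlap:
            raise ValueError(f"id collision: {sorted(overlap)}")
        self.items.update(other.items)

a = Project("A")
b = Project("B")
for n in range(100):
    a.add(f"part {n}")
for n in range(50):
    b.add(f"part {n}")

a.absorb(b)
print(a.add("first part after the merge"))  # A-101
```

Notice that absorb has to check for collisions. That check only passes because the project names differ.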

Multi-part IDs work, but they have a different problem. To really work well, and future proof against the combining problem, the project names have to be unique. So now you have the problem that two customers can't have a project A.

There is another way to craft this solution. If you have access to a shared database that cuts across all of the projects, you can assign unique project IDs to projects, and use them in the user visible item IDs:

projectNumber-itemNumber

Now you can be sure the project numbers are unique, and the item numbers are unique per project, so combining two project databases can allow you to keep the visible IDs the same. Then again, if you have a shared database, you could just assign unique ids to everything across the entire world.

Well, you can keep the visible ids the same if you combine the projects into a single database, but you may have to do some database shenanigans. The weakness of this design is that you made the project numbers visible. So now if you combine two projects, you can't assign the same project ID to all of the items, because there are different visible project IDs.

Ultimately, the problem is the relationship between the database and the user. If the user sees data that is used as a primary key in the database, or even a key for joins, then you start to run into problems when that user's data is combined with another user's data that might have different visible keys.

This got me thinking about a couple of ideas. Perhaps the least user-friendly is to assign every item a public globally unique id, or GUID. These are long, but you could be pretty sure that you would avoid conflicts if you had to combine two databases.
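To get a feel for how long these are, here is a quick sketch using Python's standard uuid module. Random (version 4) GUIDs are just one choice, and the Base64 variant with the padding stripped is my own assumption about how you might shorten them.

```python
import base64
import uuid

guid = uuid.uuid4()  # 128 random-ish bits

as_hex = guid.hex
as_b64 = base64.urlsafe_b64encode(guid.bytes).decode("ascii").rstrip("=")

print(as_hex, len(as_hex))  # 32 characters
print(as_b64, len(as_b64))  # 22 characters
```

Even the shorter form is not something anyone is going to memorize or read out over the phone.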

The idea of a GUID led me to another idea. What if you encode your ids? GUIDs are usually written in hexadecimal, and sometimes in Base64. What if we make a shorter id and encode it? For example, we could take a single byte, or even a character, as a flag to tell us the id format, followed by a set of bytes for something like the project id, and then a simple incrementing part number. This could lead to something like:

flag - database id - part number

or specifically:

0 - 23 - 461

which could then be encoded into hexadecimal or Base64, with some padding to ensure that the boundaries work out. This only leaves the question of what to use for the database id. For a commercial product, that could be a customer number or similar. It could possibly be some combination of a time and a random number. This has a small chance of collisions, but they would be extremely rare.

What I like about this encoding idea is that it would let you make the ids small or large, depending on what you think the user will be ok with. It can use flags to indicate different contents. So a flag of 0 might mean "customer id" while a flag of 1 might mean "time" to follow. Finally, this format hides the real data from the user. They won't see the 461, so even though that is our unique id, they see an encoded form of it. That allows us to remap that id later if we have to.
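Here is one way the flag, database id, and part number example could be sketched with Python's struct module. The field widths (one byte, two bytes, four bytes) and the function names are assumptions I picked for the example, not a fixed format. Also keep in mind that this is an encoding, not encryption: anyone who knows the layout can read the fields back out.

```python
import struct

# Assumed layout, big-endian with no padding:
#   B = 1-byte flag, H = 2-byte database id, I = 4-byte part number
LAYOUT = ">BHI"

def encode_id(flag, database_id, part_number):
    """Pack the three fields and return them as a hexadecimal string."""
    return struct.pack(LAYOUT, flag, database_id, part_number).hex()

def decode_id(text):
    """Reverse encode_id, returning (flag, database_id, part_number)."""
    return struct.unpack(LAYOUT, bytes.fromhex(text))

visible = encode_id(0, 23, 461)
print(visible)             # 000017000001cd
print(decode_id(visible))  # (0, 23, 461)
```

With these widths every id is 14 hexadecimal characters. Choosing narrower or wider fields is how you would trade length against capacity, and a different flag value could select a different layout.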

Some Other Thoughts

About fifteen years ago, when 32-bit was the standard of the day, I went to a seminar about a new object-oriented database. One of the things the vendors mentioned was that each object in a database had a unique identifier. These were all 32-bit ints, which put a limit of around four billion objects in a database. I asked "what if you have more objects?". Their response was a very salesy "if you have that much data we want to talk to you."

But I am not a sales person, so that was a BS response to me. I mean, realistically I won't have more data than that for most things, but I might. So I wanted to at least know that they had thought about the problem. Maybe they had. But their response said "sorry dude, you're on your own."

This brings me to today. Today, I think ids should be either self-encoding integer types, which means that they can be any length but use a bit of their space to say how long they are, or longs. A 64-bit long can encode a gazillion values (about nine quintillion, if it is signed). Most of the time you won't need a gazillion, but sometimes you will. And the pain of crossing that boundary is not one that you want to experience.
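One well-known flavor of self-encoding integer is the variable-length format sometimes called a varint or LEB128, where each byte carries seven bits of the number and one bit that says "more bytes follow". That is my choice of scheme for this sketch, not the only way to do it, and the function names are made up.

```python
def encode_varint(n):
    """Encode a non-negative int, 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("only non-negative ids are supported")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)

def decode_varint(data):
    """Return (value, bytes consumed) for a varint at the start of data."""
    result = 0
    shift = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, index + 1
        shift += 7
    raise ValueError("truncated varint")

print(len(encode_varint(1)))        # 1 byte
print(len(encode_varint(2 ** 32)))  # 5 bytes
```

Small ids stay small, and an id past the four billion mark just costs one more byte instead of a schema change.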

Conclusions

So those are some thoughts on identifiers. I have used my encoded id idea a bit, but not extensively. I am looking forward to a time when I can use it more. If you have some other ideas for how to deal with IDs, and more especially combining IDs, please leave them in the comments.

- Stephen (2013-11-13)

