Generating Order Numbers in a High-Volume Distributed System
The first system I built was in university, where my database professor taught us to use AUTO_INCREMENT fields or IDENTITY columns. These columns made excellent identifiers, starting at a number and counting up with each new row: 100, 101, 102… This technique served me well into my professional career and was the option I leveraged for IDs for many years.
Order numbers come with additional considerations:
- It should be as short as possible.
- It should be easy to read, avoid ambiguous characters.
- It must be unique for the company.
- It should not be predictable.
- It should not expose any company information.
An IDENTITY column covers most of these requirements, but fails horribly on the last two: it is predictable and it exposes order volume. With an incrementing number, a competitor can order from your store on a set frequency, say every Monday at 9am, and subtract one order number from the next to calculate how many orders were placed in between.
These final requirements can be covered by hashing the IDENTITY value to an alphanumeric string. This obscures the additional information that could otherwise be gleaned from an incrementing order number.
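As a rough illustration of the idea, the sketch below scrambles an incrementing ID and then encodes it with a reduced alphabet. It is a stand-in rather than a true hash: the function names, the multiplier, the XOR mask, and the alphabet are all hypothetical choices, and the scramble is a simple reversible bit mix, not a cryptographically secure one.

```typescript
// Hypothetical alphabet that leaves out easily confused characters (0/O, 1/I/L).
const SAFE_ALPHABET = "23456789ABCDEFGHJKMNPQRSTUVWXYZ";

// Hypothetical constants. The multiplier must be odd so that multiplication
// modulo 2^32 maps every 32-bit ID to a distinct value.
const MULTIPLIER = 0x9e3779b1;
const XOR_MASK = 0x5bd1e995;

// Mixes the bits of a 32-bit ID. Reversible, so two IDs never share an output.
function scrambleId(id: number): number {
  return (Math.imul(id, MULTIPLIER) ^ XOR_MASK) >>> 0;
}

// Writes a non-negative integer using the given alphabet as its digits.
function encodeWithAlphabet(value: number, alphabet: string): string {
  const base = alphabet.length;
  let remaining = value;
  let encoded = "";
  do {
    encoded = alphabet.charAt(remaining % base) + encoded;
    remaining = Math.floor(remaining / base);
  } while (remaining > 0);
  return encoded;
}

// Turns an IDENTITY value into a short string that does not count up visibly.
function obscureOrderId(id: number): string {
  return encodeWithAlphabet(scrambleId(id), SAFE_ALPHABET);
}

console.log(obscureOrderId(100), obscureOrderId(101), obscureOrderId(102));
```

Because the mix is reversible, anyone who learns the constants can recover the original sequence, so a production system would want a keyed, well-reviewed scheme instead.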
These hashed values are an excellent approach that covers the main requirements, but they do require a centralized system to track the IDENTITY column and increment it on each order. This poses a challenge in a distributed system.
Database Ticket Server
One approach to handling the issue of decentralized systems… is to centralize your ID creation. Stand up a service that provides IDs on demand. This service can track the current value and increment it appropriately.
Flickr did an excellent write-up on the concept - https://code.flickr.net/2010/02/08/ticket-servers-distributed-unique-primary-keys-on-the-cheap/
This can be an effective approach. The only drawback is that you are now creating a central bottleneck for your data creation. If the ID generation service fails, no orders can be created across the entire system.
This can be scaled through some creative logic, such as having one server issue odd values and a second server issue even values. This creates some redundancy, but it does not allow for additional scaling without changing the logic. The current value must also be preserved should a server fail and need to be replaced.
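The odd/even split can be sketched as two counters that share a step of 2 but start at different offsets. This is an in-memory illustration only: the class name TicketCounter and its parameters are hypothetical, and a real ticket server would persist its current value so that it survives a restart.

```typescript
// Hypothetical in-memory counter standing in for one ticket server.
class TicketCounter {
  private current: number;
  private step: number;

  // offset: first value issued; step: distance between consecutive values.
  constructor(offset: number, step: number) {
    this.current = offset - step;
    this.step = step;
  }

  // Returns the next ID in this server's sequence.
  next(): number {
    this.current += this.step;
    return this.current;
  }
}

const oddServer = new TicketCounter(1, 2); // issues 1, 3, 5, ...
const evenServer = new TicketCounter(2, 2); // issues 2, 4, 6, ...

const issuedByOdd = [oddServer.next(), oddServer.next(), oddServer.next()];
const issuedByEven = [evenServer.next(), evenServer.next(), evenServer.next()];
console.log(issuedByOdd, issuedByEven);
```

Adding a third server would mean changing the step on every existing server to 3, which is the scaling limitation described above.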
UUID
UUIDs or GUIDs (https://en.wikipedia.org/wiki/Universally_unique_identifier) are the perfect solution. They were created for this express purpose; the name stands for Universally Unique Identifier.
UUID v4 generates random strings that can be used as resource identifiers. These strings are long enough that the probability of collision is almost non-existent. You have probably seen these IDs before:
79280292-c2e1-4bea-abc4-59b5dbc66afd
While longer than an IDENTITY column, they are the perfect replacement for distributed high-volume systems and should be the first choice in creating a system identifier.
Unfortunately, they use ambiguous characters, are very hard to read, and are definitely not as short as possible.
UUIDs are designed to be unique globally regardless of the purpose, and their length is required to avoid ID conflicts. We know from the birthday problem (https://en.wikipedia.org/wiki/Birthday_problem) that short values hit conflicts much sooner than expected.
How many people do you need in a group for there to be a 50% chance that two of them share a birthday?
Only 23
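The figure of 23 can be checked with a few lines of code. The sketch below multiplies out the chance that every person in the group has a distinct birthday, assuming 365 equally likely birthdays; the function name is hypothetical.

```typescript
// Probability that at least two of `items` values collide when each value is
// drawn uniformly at random from `space` possibilities.
function birthdayCollisionProbability(items: number, space: number): number {
  let allDistinct = 1;
  for (let i = 0; i < items; i++) {
    // The (i + 1)-th value must avoid the i values already taken.
    allDistinct *= (space - i) / space;
  }
  return 1 - allDistinct;
}

console.log(birthdayCollisionProbability(22, 365).toFixed(3)); // just under 0.5
console.log(birthdayCollisionProbability(23, 365).toFixed(3)); // just over 0.5
```

The same function applies to identifiers: replace 365 with the number of possible IDs and the group size with the number of IDs issued.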
UUIDs do point us in a good direction though. Version 1 is the same length as version 4, but it leverages the date-time and the MAC address of the computer to help enforce uniqueness. Two IDs can only conflict if they come from the same computer at the same instant, so the remaining bits have far less work to do.
Order numbers only need to be unique for a specific company; two companies can use the same order number. So we can leverage the same ideas at a smaller scale to create shorter, more readable order numbers.
Solution
The solution is a new NPM package that leverages the current time and a random string to create new order numbers.
The package can leverage time in milliseconds or seconds, use a configurable length of random characters, and accept a custom character set to best fit the business needs.
To keep the identifier short, the epoch time is converted to a new base, determined by the size of the character set. So an order number leveraging millisecond epoch, 4 random characters, and any alphanumeric character would be 10 characters long (ex. 1kmJDM6CIO).
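The approach described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the package's actual code: the names createOrderNumber, toBase, randomSuffix, BASE62, and the option fields are all hypothetical. Math.random is used only to keep the example dependency-free; since order numbers should not be predictable, a real implementation would want a cryptographically secure random source.

```typescript
// Hypothetical 62-character alphanumeric set; swap in any custom set.
const BASE62 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

interface OrderNumberOptions {
  unit: "seconds" | "milliseconds"; // epoch granularity
  randomLength: number; // number of random characters appended
  alphabet: string; // character set, which also sets the base
  nowMs?: number; // optional clock override, useful in tests
}

// Converts a non-negative integer to a string in base `alphabet.length`.
function toBase(value: number, alphabet: string): string {
  const base = alphabet.length;
  let remaining = value;
  let digits = "";
  do {
    digits = alphabet.charAt(remaining % base) + digits;
    remaining = Math.floor(remaining / base);
  } while (remaining > 0);
  return digits;
}

// Picks `length` characters at random from the alphabet.
// Math.random is NOT cryptographically secure; it is a placeholder here.
function randomSuffix(length: number, alphabet: string): string {
  let suffix = "";
  for (let i = 0; i < length; i++) {
    suffix += alphabet.charAt(Math.floor(Math.random() * alphabet.length));
  }
  return suffix;
}

// Encoded epoch time followed by the random characters.
function createOrderNumber(options: OrderNumberOptions): string {
  const nowMs = options.nowMs !== undefined ? options.nowMs : Date.now();
  const time = options.unit === "seconds" ? Math.floor(nowMs / 1000) : nowMs;
  return toBase(time, options.alphabet) + randomSuffix(options.randomLength, options.alphabet);
}

console.log(createOrderNumber({ unit: "seconds", randomLength: 4, alphabet: BASE62 }));
```

With a plain Unix timestamp and this 62-character set, the time portion of the sketch is currently 6 characters in seconds and 7 in milliseconds, so it yields 10 or 11 characters when 4 random characters are appended. The package's own output may differ.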
The probability of ID collision will depend on the epoch granularity, character set, random string length, and most importantly the company's order volume.
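To get a feel for these factors, the sketch below estimates the chance that at least two orders created in the same time unit receive the same random characters, using the standard birthday-problem approximation 1 - e^(-n(n-1)/(2d)). The function name and the sample volumes are hypothetical. It assumes the random characters are drawn uniformly and that every order in the same second or millisecond shares the same time prefix.

```typescript
// Approximate probability that at least two of `ordersPerTick` orders created
// in the same time unit draw the same random suffix.
function suffixCollisionProbability(
  ordersPerTick: number,
  alphabetSize: number,
  randomLength: number
): number {
  const space = Math.pow(alphabetSize, randomLength); // possible suffixes
  const pairs = (ordersPerTick * (ordersPerTick - 1)) / 2; // pairs that could clash
  return 1 - Math.exp(-pairs / space);
}

// Hypothetical volumes: 2, 10, and 100 orders landing in the same time unit,
// each with 4 random characters from a 62-character set.
[2, 10, 100].forEach((orders) => {
  console.log(orders, suffixCollisionProbability(orders, 62, 4));
});
```

Shortening the random string or switching to a coarser time unit raises the estimate, while a larger character set lowers it.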
This solution handles all of the criteria:
- It should be as short as possible.
- It should be easy to read, avoid ambiguous characters.
- It must be unique for the company.
- It should not be predictable.
- It should not expose any company information.
- It can be leveraged in a high-volume distributed system.
You can find the source code and additional details on GitHub.
Bonus Random Fact
Why is Number abbreviated to No?
No. comes from the abbreviation of "numero", ablative case of the Latin "numerus"