Yet another UUID layout #34

Closed · edo1 opened this issue Aug 19, 2021 · 12 comments
Labels: Out of Scope (Topics not in scope for the RFC/Draft)

Comments

@edo1 commented Aug 19, 2021

Goals:

  1. Be collision-free
    This is the most important goal! If the format is to be widely used, even low collision rates can be dangerous. A UUID with a timestamp has much less entropy than UUIDv4, and 128 bits (really 122 or 125) isn't that much;
  2. Be sortable
    This is the main reason a new standard is being developed;
  3. Be as monotonic as possible (even with UUIDs from different sources);
  4. Be easy to use and avoid pitfalls;
  5. Be interoperable with RFC4122.

Some decisions:

  1. UUID should be only an identifier, not a storage box for a timestamp, machine ID, or whatever. The reader is prohibited from relying on any internals (see Treat UUID like a black box #28) (goal n.4);
  2. Make the Epoch reasonably short (goal n.1)
    As mentioned in Discussion: Unix Timestamp Granularity in UUIDv7 #23, this will reduce the collision probability severalfold (or even by an order of magnitude);
  3. Do not include any machine-id or other pre-configured id (goal n.4)
    Bad id generation/selection can greatly increase the collision probability compared to a random value of the same size;
  4. Do not include an explicit sequence id (goal n.1)
    A sequence id contains too little entropy, which increases the collision probability;
  5. Always leave a large enough part of the UUID random so that the next UUID is unpredictable (goal n.4);
  6. Do not enable monotonicity guarantees by default (optional only)
    Monotonic sequence generation is difficult to scale/maintain because locks/atomic operations are required to prevent race conditions.
    Therefore, by default, we do not try to be monotonic. But the high-resolution timestamp should provide good ordering (UUIDs will be monotonic or near-monotonic in most cases);

Sample format

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ 
   |                    unix_timestamp_100ns_u56                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    unix_timestamp_100ns_u56   |  filling_u8   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | var |                     random_u61                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           random_u61                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
  • part1 (64 bit):
    unix_timestamp_100ns_u56: Unix timestamp at 100 ns resolution, stripped down to 56 bits, in big-endian byte order for sortability;
    filling_u8: usually random (can be non-random to keep monotonicity).
  • part2 (64 bit):
    var: 0b111 for interoperability with RFC4122;
    random_u61: must be random.

I preferred to use variant = 0b111; it leaves 125 bits for timestamp + random.
But the format can be easily adapted for variant = 0b10x, version = 0b0111 (122 effective bits) by shrinking filling_u8.

Timestamp length/resolution can also be adjusted during the standard review.

With the current proposal, the timestamp will overflow in 2198 (2^56 × 100 ns ≈ 7.2e9 seconds ≈ 228 years after the Unix epoch). Nothing bad should happen at that moment.
Timestamp reuse will begin at about 2250. After that, the UUID will still work, only the collision resistance will be slightly reduced (but still better than ULID and other proposals that have a large epoch). There may be minor issues (e.g. reuse of old partitions in a DBMS if the UUID is used as a partitioning key), but I assume that larger UUIDs will be in use by then.

The proposed format is mostly ULID-like, but differs in the ways described above: a shorter timestamp epoch, higher (100 ns) resolution, and monotonicity as an option rather than the default.

Pseudocode

Generation without guaranteed monotonicity

It is very simple:

unix_timestamp_100ns_u56 = tv_sec*10000000 + tv_nsec/100
filling_u8 = random
part1 = unix_timestamp_100ns_u56<<8 | filling_u8
random_u61 = random
part2 = 0b111<<61 | random_u61
UUID = part1<<64 | part2
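
For concreteness, here is a minimal Go sketch of the above; it is a sketch only, assuming crypto/rand as the entropy source, and the package and function names are illustrative:

package uuidsketch

import (
    "crypto/rand"
    "encoding/binary"
    "time"
)

// New returns one UUID in the proposed layout, without monotonicity
// guarantees. Field names follow the pseudocode above.
func New() [16]byte {
    var u [16]byte
    ts := uint64(time.Now().UnixNano() / 100) // unix_timestamp_100ns_u56
    var r [9]byte                             // filling_u8 plus 8 random bytes
    rand.Read(r[:])
    binary.BigEndian.PutUint64(u[:8], ts<<8|uint64(r[0])) // part1; the top 8 bits of ts drop off, leaving 56
    copy(u[8:], r[1:])                                    // part2 random bytes
    u[8] |= 0xE0                                          // var = 0b111 in the top 3 bits
    return u
}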

Generation with guaranteed monotonicity

unix_timestamp_100ns_u56 = tv_sec*10000000 + tv_nsec/100

// the following line is for illustration only; real code must use a correct unsigned comparison
// if (part1_prev - 256e6) < (unix_timestamp_100ns_u56<<8) <= part1_prev {
if part1_prev - (unix_timestamp_100ns_u56<<8) < (unsigned) 256e6 {
	part1 = part1_prev + 1
} else {
	filling_u8 = random
	part1 = unix_timestamp_100ns_u56<<8 | filling_u8
}
part1_prev = part1
random_u61 = random
part2 = 0b111<<61 | random_u61
UUID = part1<<64 | part2
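
A hedged Go sketch of the monotonic variant, continuing the sketch above (add sync to the imports); the mutex and the part1Prev variable are my additions, since the pseudocode leaves the locking strategy implicit:

var (
    mu        sync.Mutex
    part1Prev uint64
)

// NewMonotonic mirrors the pseudocode above; the lock serializes
// generators within one process to prevent races on part1Prev.
func NewMonotonic() [16]byte {
    var u [16]byte
    ts := uint64(time.Now().UnixNano()/100) << 8 // unix_timestamp_100ns_u56 << 8

    mu.Lock()
    // Unsigned wraparound makes this true only when the shifted timestamp
    // is at most 256e6 part1 units (1e6 ticks = 100 ms) behind part1_prev.
    if part1Prev-ts < 256000000 {
        part1Prev++ // filling_u8 acts as a counter here
    } else {
        var fill [1]byte
        rand.Read(fill[:])
        part1Prev = ts | uint64(fill[0]) // fresh random filling_u8
    }
    part1 := part1Prev
    mu.Unlock()

    binary.BigEndian.PutUint64(u[:8], part1)
    rand.Read(u[8:]) // random_u61
    u[8] |= 0xE0     // var = 0b111
    return u
}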

This allows:

  • generating at least 2 monotonic UUIDs per nanosecond continuously (256 part1 increments per 100 ns tick ≈ 2.56 per ns);
  • keeping monotonicity in case of burst UUID generation;
  • keeping monotonicity in case of a small clock correction.
@fabiolimace

I agree with all goals.

As for the decisions taken:

  1. Disagree. At least the timestamp must be reliable;
  2. Agree;
  3. Agree;
  4. Agree;
  5. Agree;
  6. Agree;

As for the sample format, I have to think about it.

@nerg4l commented Aug 19, 2021

I think UUIDs should be opaque. If you want to read the time, version, variant, or random data from one, then do it. The RFC shouldn't contain any suggestion on how or how not to process the data of a UUID; it should focus on the layout and provide pseudocode for generating UUIDs.

Since time from Windows is only reliable up to 100 ns, I think that should be an acceptable precision. But what should happen in languages where 100 ns precision is not possible (e.g. JavaScript's ms precision)? Should they simply multiply the best available time, or can they use a fake clock or a high-resolution timer, e.g. process.hrtime? At least a suggestion should be added for this case.

I like the idea of filling_u8, which can be random or monotonic depending on the implementation/needs.

Could you provide the exact layout for the var = 0b10 case? If filling_u8 is shrunk to 4 bits, then that section will be ver_4bit|timestamp_segment_8bit|filling_4bit, which from a hex perspective might work (4-8-4).

@bradleypeabody (Contributor)

@edo1 Please have a look at the updated UUIDv7 from here: https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md. The layout is very similar, except it uses an unsigned 64-bit int holding the number of nanoseconds since the Unix epoch. And what you're describing with filling_u8 should be allowed per the spec (the language needs to be updated): if nanosecond-precise timestamp info is not available, it should be fine to fill the rest with random data.

   0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    unix_nano_timestamp_u64                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    unix_nano_timestamp_u64                    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | var |  ver    |         seq_rand_etc                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         seq_rand_etc                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Does this address at least some of your concerns?

@sergeyprokhorenko commented Aug 19, 2021

filling_u8 is too short as a clock sequence when the real clock precision is 1 millisecond. You should add 7 more bits.
The random part is too short as well.

@edo1 (Author) commented Aug 21, 2021

Since time from Windows is only reliable up to 100 ns, I think that should be an acceptable precision. But what should happen in languages where 100 ns precision is not possible (e.g. JavaScript's ms precision)?

A general rule of thumb is to add a random number to fill the gap; for example, the 0-9999 range should work for 1 ms precision (1 ms = 10000 × 100 ns).
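
A sketch of that rule in Go (the helper name is mine; it assumes the crypto/rand, encoding/binary, and time imports from the earlier sketch):

// timestamp100ns derives unix_timestamp_100ns_u56 from a clock that is
// only millisecond-precise, filling the gap with a random 0-9999 value.
func timestamp100ns() uint64 {
    ms := uint64(time.Now().UnixMilli())
    var b [2]byte
    rand.Read(b[:])
    return ms*10000 + uint64(binary.BigEndian.Uint16(b[:]))%10000 // slight modulo bias; fine for a sketch
}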

Should they simply multiply the best available time, or can they use a fake clock or a high-resolution timer, e.g. process.hrtime?

Perhaps process.hrtime is a better option (it provides both some ordering and unpredictability). We need to figure this out.

@edo1 (Author) commented Aug 21, 2021

Could you provide the exact layout for the var = 0b10 case? If filling_u8 is shrunk to 4 bits, then that section will be ver_4bit|timestamp_segment_8bit|filling_4bit, which from a hex perspective might work (4-8-4).

You get the idea. But I would prefer to add a new 125-bit layout (with variant = 0b111 and no version field) to have as many random bits as possible.

@edo1 (Author) commented Aug 21, 2021

@bradleypeabody @broofa
I have studied your proposals (#33, #35), and I do not like some of the points in both of them.

  1. The UUID was originally intended to be globally unique. It is already widely used and may become much more widespread soon. There are billions and billions of chips around the world that can generate UUIDs.
    Therefore, an ultra-low collision probability is very important.
    I'm surprised that you are concerned about 10k AD compatibility rather than the collision rate over the next several decades. I am sure that today's standards will lose their relevance in a few hundred years (let alone thousands of years).
  2. I'm not sure what the optimal timestamp resolution is. But 1 second is definitely too coarse, and 1 nanosecond seems too fine.
    2.1. What's wrong with one-second resolution? It does not guarantee ordering if UUIDs are generated across multiple hosts. The timestamp resolution should definitely not be coarser than the typical network latency/NTP accuracy (1-10 ms).
    Moreover, in a 10/25G LAN, latencies of 20-30 μs are achievable. For UUIDs generated concurrently on the same host, even sub-microsecond timestamp resolution can be useful.
    2.2. What's wrong with one-nanosecond resolution? Nothing, at first glance.
    But one-nanosecond resolution is useless in terms of time-ordering: the jitter in IPC communication is much higher, and in RPC communication it is many orders of magnitude higher.
    The weak point is that a higher-resolution timestamp field is wider, so the random part must be truncated. Not all time sources provide a high-quality timestamp, and the least significant bits of the timestamp often do not contain sufficient entropy.
    IMO, 100 ns resolution allows more entropy to be stored in the UUID without any drawbacks. And this resolution is already used in RFC4122.

@sergeyprokhorenko commented Aug 21, 2021

IMO, 100 ns resolution allows more entropy to be stored in the UUID without any drawbacks. And this resolution is already used in RFC4122.

Good point for a 160-bit UUID. But 1 ms is better for a 128-bit UUID because of the longer random part. Therefore the 128-bit UUID should be considered outdated, or for limited use.

@bradleypeabody (Contributor) commented Aug 22, 2021

TL;DR:

  • Global uniqueness cannot be guaranteed except by implementations sharing knowledge. It's important to address this head-on. We should explain the tradeoffs instead of assuming a certain amount of collision resistance is "enough". (We have the option to create or adopt some global registry of namespaces, like MAC addresses, etc. (see Discussion: Shared Knowledge Schemes and Node Identifiers #36), but short of that, there is no other way to guarantee uniqueness; it doesn't matter how many bits we use for what.)
  • Collision probability is a vital factor, but there is no one right answer; implementors will need to make their own decisions and choose the length for their application (this is why I feel strongly about allowing variable length).
  • Nanoseconds are more common than 100 ns precision; they are just more familiar (a nanosecond count also aligns well in a uint64).
  • Yes, clocks will be wrong and will lack granularity; the implementation is allowed to compensate for this by using random data instead.

@edo1 I understand and share the concerns. Here’s my take on it and how I propose we deal with it in the draft:

  1. Global uniqueness.

The basic problem is that global uniqueness is impossible to guarantee without some sort of shared knowledge (prearranged agreement about how otherwise independent implementations will each have certain parts set a certain way.) The MAC address in RFC4122 was an attempt at this shared knowledge approach by delegating the problem to the organization that assigns MAC addresses. I’m not aware of any perfect shared knowledge system that we can use that will be applicable to every use case.

It’s also worth noting that many applications don’t need global uniqueness. If you’re making primary keys for a database table, the uniqueness requirements are pretty lax (only uniqueness within a single table is actually required from a functional "what would actually break" perspective). I think this should be allowed for such use cases, otherwise someone will make an implementation for a database that is totally workable but not "globally unique enough" and we’ll have the “that’s not really UUID” argument when it literally does not matter in any real world sense for the application in question.

In the absence of shared knowledge, all we can do is reduce collision probability. And again, we run into the issue that how much collision resistance is “good enough” is subjective and application specific. There simply is no one right answer to this.

So, instead of trying to prescribe one solution for this, I’m thinking we instead:

A) explain in the document that shared knowledge is the only way to guarantee uniqueness, and allow and encourage this where it is warranted. (We could provide suggestions around using MAC addresses or, these days, IPv6 addresses, but these still have possible problems, so I'm tempted to just leave it as "here's the problem, pick the solution that fits your needs".) (UPDATE: see #36; this warrants its own discussion.)

B) clearly outline the collision probabilities in the document and allow implementations to lower them by adding more bytes at the end.

Instead of trying to solve the problem for everyone, the spec should explain the problem and encourage the implementation to choose an appropriate approach for the given use case.

  2. Timestamp resolution.

I agree with all of your points on timestamps except the mention of not needing timestamps more precise than network latency: I can think of plenty of cases where UUIDs could be created in rapid succession without networks coming into the matter.

That said, I don't see the issue with using nanosecond precision (a bit more precise than is practically useful, as you were saying above) and just letting implementations decide if they want to fill whatever least-significant portion with random data.

I.e. if you only need/want millisecond precision, then take the millisecond timestamp, multiply by 1 million to make it a nanosecond timestamp, and add a random number between 0 and 999,999. Easy, fast, simple. And it addresses the concern about "I'd rather use those extra timestamp bits for random data": I'm saying that's totally fine, and the spec should allow it, using the approach above.
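
The same recipe as a Go sketch (an illustrative helper, not from the draft; crypto/rand supplies the sub-millisecond fill):

// nanoTimestamp turns a millisecond clock into a nanosecond timestamp,
// filling the 1 ms = 1e6 ns gap with randomness as described above.
func nanoTimestamp() uint64 {
    ms := uint64(time.Now().UnixMilli())
    var b [4]byte
    rand.Read(b[:])
    return ms*1000000 + uint64(binary.BigEndian.Uint32(b[:]))%1000000
}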

Many time implementations deal with even time divisions of 1000, e.g. https://man7.org/linux/man-pages/man2/clock_gettime.2.html (the nsec field) or Go's time package, etc. That's really the motivation for using nanoseconds: it aligns more closely with what other software already provides (it also happens to fit well in a uint64 and provides good alignment). The fact that the clock is often not useful at X precision, while I agree, is highly environment-dependent, and whatever decision is made by the implementation will have to be in the context of the system it's running on, the time source available, and the intended use of the UUID.

Does that help clarify? I think the concerns are valid, but I also think they are covered by this latest concept (or, at least, each of the things mentioned as an issue can be solved with a correct implementation).

@sergeyprokhorenko

1 ns precision instead of 100 ns precision would cost 7 bits of UUID length (a factor of 100 ≈ 2^7). These usually-null (empty) bits sit between the timestamp and the clock sequence, so they cannot contain random numbers. Therefore these 7 bits would increase the probability of collision or the volume of the DB.

Concurrent 1 ns UUID generation for the same DB table (dictionary) is bad design. UUID generation for different tables (dictionaries) will not lead to UUID collisions, especially if entity types are used in the UUID.

@edo1 (Author) commented Aug 22, 2021

In the absence of shared knowledge, all we can do is reduce collision probability. And again, we run into the issue that how much collision resistance is “good enough” is subjective and application specific

Agree. But this does not mean that the standard should not provide the lowest possible collision probability.
"Extra" collision resistance never hurts.

UUID variant 1 has 122 effective bits, this layout has 125 effective bits, and yours (#33) has only 120 effective bits.
This layout provides 1e9/100 × 2^69 ≈ 5.9e27 distinct values per second; yours provides 1e9 × 2^56 ≈ 7.2e25.
So the collision probability will be about two orders of magnitude higher.
In fact, this is not an entirely fatal flaw. But I do not understand what the pros are.
Room for a lot of new versions in the new variant? Variant 1 has only 5 of its 15 versions used.
1 ns resolution? I do not see any advantage in it.

Yes, clocks will be wrong and will lack granularity; the implementation is allowed to compensate for this by using random data instead.

There is at least one drawback: the application may not be aware of the real timer resolution. There is clock_getres(2), but it is Unix-only, and there is no guarantee that the returned value is correct.

Many time implementations deal with even time divisions of 1000, e.g. man7.org/linux/man-pages/man2/clock_gettime.2.html (the nsec field) or Go's time package, etc. That's really the motivation for using nanoseconds: it aligns more closely with what other software already provides (it also happens to fit well in a uint64 and provides good alignment).

The 100 ns resolution adds no complexity. I adapted your example code for this layout:

  // assumes imports: crypto/rand, encoding/binary, time
  var v [16]byte
  binary.BigEndian.PutUint64(v[:8], uint64(time.Now().UnixNano()/100)<<8) // unix_timestamp_100ns_u56 << 8
  rand.Read(v[7:]) // filling_u8 plus the part2 bytes
  v[8] |= 0xE0     // var = 0b111

And this proposal defines a monotonic version as well. It is also very simple; 64-bit arithmetic is enough. I will publish my proposal for a reference implementation if anyone is interested.

@edo1 (Author) commented Aug 24, 2021

  1. Disagree. At least the timestamp must be reliable;

If you want to read the time, version, variant, or random data from one, then do it. The RFC shouldn't contain any suggestion on how or how not to process the data of a UUID; it should focus on the layout and provide pseudocode for generating UUIDs.

UUID parsing is an additional opportunity to shoot oneself in the foot.
A reader may fail on an unexpected UUID variant/version, on some UUID-like identifier such as ULID, etc.

I agree with this point of view:

UUIDs are supposed to be as opaque as possible. I would also wager that much of the code that is trying to extract version numbers and perform validation is probably doing something not terribly relevant to what most applications need anyway. Why are people checking the version? (Can't you just use the opaque value?) Why are they checking whether a UUID is valid? (Are you sure you can't just compare it to all zeros to determine whether there's a UUID here or not?)
