Yet another UUID layout #34
I agree with all goals. As for the decisions taken:
As for the sample format, I have to think.
I think UUIDs should be opaque. If you want to read the time, version, variant, or random bits from one, then do it. The RFC shouldn't contain any suggestion on how (or how not) to process the data of a UUID; it should focus on the layout and provide pseudocode for generating UUIDs. Since time from Windows is only reliable up to 100 ns, I think that should be an acceptable precision. But what should happen in languages where 100 ns precision is not available (e.g. JavaScript's ms precision)? Should they simply multiply the best available time, or can they use a fake clock or high-resolution time? I like the idea of … Could you provide the exact layout for the 10 var case? If …
@edo1 Please have a look at the updated UUIDv7 here: https://github.com/uuid6/uuid6-ietf-draft/blob/master/LATEST.md. The layout is very similar, except it uses an unsigned 64-bit int holding the number of nanoseconds since the Unix epoch. And what you're describing with …
Does this address at least some of your concerns?
`filling_u8` is too short as a clock sequence when the real clock precision is 1 millisecond. You should add 7 more bits.
A general rule of thumb is to add a random number to fill the gap, for example, …
Perhaps …
You get the idea. But I would prefer to add a new 125-bit layout (with …)
@bradleypeabody @broofa
Good point for a 160-bit UUID. But 1 ms is better for a 128-bit UUID because of the longer random part. Therefore the 128-bit UUID should be considered outdated, or reserved for limited use.
TL;DR:
@edo1 I understand and share the concerns. Here’s my take on it and how I propose we deal with it in the draft:
The basic problem is that global uniqueness is impossible to guarantee without some sort of shared knowledge (a prearranged agreement about how otherwise independent implementations will each have certain parts set a certain way). The MAC address in RFC4122 was an attempt at this shared-knowledge approach, delegating the problem to the organization that assigns MAC addresses. I'm not aware of any perfect shared-knowledge system that we can use that will be applicable to every use case.

It's also worth noting that many applications don't need global uniqueness. If you're making primary keys for a database table, the uniqueness requirements are pretty lax (only uniqueness within a single table is actually required from a functional "what would actually break" perspective). I think this should be allowed for such use cases; otherwise someone will make an implementation for a database that is totally workable but not "globally unique enough", and we'll have the "that's not really UUID" argument when it literally does not matter in any real-world sense for the application in question.

In the absence of shared knowledge, all we can do is reduce collision probability. And again, we run into the issue that how much collision resistance is "good enough" is subjective and application-specific. There simply is no one right answer to this. So, instead of trying to prescribe one solution, I'm thinking we instead:

A) explain in the document that shared knowledge is the only way to guarantee uniqueness, and allow and encourage it where warranted (we could provide suggestions around using MAC addresses, or these days IPv6 addresses, but these still have possible problems, so I'm tempted to just leave it as "here's the problem, pick the solution that fits your needs") (UPDATE: see #36, which warrants its own discussion);

B) clearly outline the collision probabilities in the document and allow implementations to lower them by adding more bytes at the end.
Instead of trying to solve the problem for everyone, the spec should explain the problem and encourage the implementation to choose an appropriate approach for the given use case.
I agree with all of your points on timestamps except the mention of not needing timestamps more precise than network latency; I can think of plenty of cases where UUIDs could be created in rapid succession without networks coming into the matter.

That said, I don't see the issue with using nanosecond precision (a bit more precise than is practically useful, as you were saying above) and just letting implementations decide if they want to fill whatever least-significant portion with random data. I.e. if you only need/want millisecond precision, then take the millisecond timestamp, multiply by 1 million to make it a nanosecond timestamp, and add a random number between 1 and a million. Easy, fast, simple. And it addresses the concern of "I'd rather use those extra timestamp bits for random data": I'm saying that's totally fine, and the spec should allow it using the approach above.

Many time implementations deal with even time divisions of 1000, e.g. https://man7.org/linux/man-pages/man2/clock_gettime.2.html (the nsec field), or Go's time package, etc. That's really the motivation for using nanoseconds: it aligns more closely with what other software already provides (it also happens to fit well in a uint64 and provides good alignment). The fact that the clock is often not useful at X precision, while I agree, is highly environment-dependent, and whatever decision is made by the implementation will have to be in the context of what system it's running on, the time source available, and the intended use of the UUID.

Does that help clarify? I think the concerns are valid, but I also think they are covered by this latest concept (or at least each of the things that have been mentioned as an issue is possible to solve with a correct implementation).
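The millisecond-to-nanosecond trick described above can be sketched in Go. This is my own illustration, not code from the thread: `nanoTimestamp` is a made-up helper name, and it fills the gap with a uniform random offset in [0, 1e6) rather than [1, 1e6].

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nanoTimestamp widens a trusted millisecond-precision Unix timestamp to
// nanosecond precision by filling the unknown sub-millisecond digits with
// random data, as suggested in the comment above.
func nanoTimestamp(unixMilli int64) int64 {
	return unixMilli*1_000_000 + rand.Int63n(1_000_000)
}

func main() {
	fmt.Println(nanoTimestamp(time.Now().UnixMilli()))
}
```

The ordering property is preserved at millisecond granularity, while the low bits contribute entropy instead of sitting at zero.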
1 ns precision instead of 100 ns precision would cost 7 bits of UUID length. These usually-null (empty) bits sit between the timestamp and the clock sequence, so they cannot contain random numbers. Therefore these 7 bits would increase either the probability of collision or the volume of the DB. Concurrent 1 ns UUID generation for the same DB table (dictionary) is bad design. UUID generation for different tables (dictionaries) will not lead to UUID collisions, especially if entity types are used in the UUID.
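The 7-bit figure can be double-checked: 1 ns resolution has 100 times more distinct values than 100 ns resolution, and ceil(log2(100)) = 7. A quick sketch of the arithmetic (mine, not from the thread; `extraBits` is a made-up helper):

```go
package main

import (
	"fmt"
	"math"
)

// extraBits returns how many additional timestamp bits are needed to make
// the resolution finer by the given factor.
func extraBits(factor float64) int {
	return int(math.Ceil(math.Log2(factor)))
}

func main() {
	// Cost of moving from 100 ns to 1 ns resolution.
	fmt.Println(extraBits(100)) // 7
}
```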
Agreed. But this does not mean the standard should not provide the lowest possible collision probability. UUID variant 1 has 122 effective bits, this layout has 125 effective bits, and yours (#33) has only 120 effective bits.
There is at least one drawback: the application may not be aware of the real timer resolution. There is clock_getres(2), but it is Unix-only, and there is no guarantee that the returned value is correct.
The 100 ns resolution adds no complexity. I adapted your example code for this layout:

```go
// uses encoding/binary, time, and a rand package (math/rand or crypto/rand)
var v [16]byte
binary.BigEndian.PutUint64(v[:8], uint64(time.Now().UnixNano()/100)<<8) // 56-bit timestamp in the top 7 bytes
rand.Read(v[7:])                                                        // filling_u8 + random part
v[8] |= 0xE0                                                            // variant 0b111
```

And this proposal defines a monotonic version as well. It is very simple too; 64-bit arithmetic is enough. I will publish my proposal for a reference implementation if anyone is interested.
UUID parsing is an additional opportunity to shoot oneself in the foot. I agree with this point of view: …
Goals:
- It is most important! If the format is to be widely used, even low collision rates can be dangerous. A UUID with a timestamp has much less entropy than a UUIDv4, and 128 bits (really 122 or 125) isn't that much;
- This is the main goal of developing a new standard;
Some decisions:
- As mentioned in Discussion: Unix Timestamp Granularity in UUIDv7 (#23), this will reduce the collision probability severalfold (or even by an order of magnitude);
- Bad id generation/selection can greatly increase the collision probability compared to random data of the same size;
- A sequence id contains too little entropy, which increases the collision probability;
- Monotonic sequence generation is difficult to scale/maintain because locks/atomic operations are required to prevent race conditions. Therefore, by default, we do not try to be monotonic. But the high-resolution timestamp should provide good ordering (UUIDs will be monotonic or near-monotonic in most cases);
Sample format

part1 (64 bit):
- `unix_timestamp_100ns_u56`: unix_timestamp at 100 ns resolution, stripped down to 56 bits, in BE byte order for sortability;
- `filling_u8`: usually random (can be non-random to keep monotonicity).

part2 (64 bit):
- `var`: 0b111 for interoperability with RFC4122;
- `random_u61`: must be random.

I preferred to use `variant = 0b111`; it leaves 125 bits for `timestamp + random`. But the format can easily be adapted to `variant = 0b10x, version = 0b0111` (122 effective bits) by shrinking `filling_u8`. The timestamp length/resolution can also be adjusted during the standard review.
With the current proposal, the timestamp will overflow in 2198. Nothing bad should happen at that moment.
Timestamp reuse will occur around 2250. After that, the UUID will still work; only the collision resistance will be slightly reduced (but still better than ULID and other proposals that have a large epoch). There may be minor issues (e.g. reusing old partitions in the DBMS if the UUID is used as a partitioning key), but I assume that larger UUIDs will be in use by that day;
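The 2198 figure checks out: 2^56 ticks of 100 ns span roughly 228 years from the 1970 epoch. A quick sketch to verify the arithmetic (my own; `overflowYear` is a made-up helper):

```go
package main

import "fmt"

// overflowYear returns the approximate year in which a timestamp of the
// given bit width, counting ticks of tickSeconds since 1970, overflows.
func overflowYear(bits uint, tickSeconds float64) int {
	const secondsPerYear = 31_557_600 // Julian year, 365.25 days
	span := float64(uint64(1)<<bits) * tickSeconds
	return 1970 + int(span/secondsPerYear)
}

func main() {
	// 56-bit timestamp, 100 ns ticks, as in the proposed layout.
	fmt.Println(overflowYear(56, 100e-9)) // 2198
}
```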
The proposed format is mostly ULID-like, but:
Pseudocode
Generation without guaranteed monotonicity
It is very simple
Generation with guaranteed monotonicity
This allows: