Add Monotonicity #306

Closed
sergeyprokhorenko opened this issue Dec 20, 2018 · 17 comments · Fixed by #473

@sergeyprokhorenko

No description provided.

@ahawker (Owner) commented Dec 28, 2018

Monotonicity wasn't actually part of the spec when I wrote this library.

I'll take a deeper look into the details.

https://github.com/ulid/spec#monotonicity

ahawker added the ready label Dec 28, 2018
ahawker self-assigned this Dec 28, 2018
@sergeyprokhorenko (Author)

Andrew,
You'd better take the monotonicity approach from the popular Go implementation of ULID. It's much more secure, because it's impossible to find another valid ID by incrementing or decrementing a known ID.

Sergey

@ahawker (Owner) commented Jan 5, 2019

Based on my reading, the number of values you can generate within the same millisecond is itself random: if your first randomness value is sufficiently high, you can only generate within that remaining window until you overflow the 80-bit field.

The implementation in https://github.com/oklog/ulid appears to do this, so it's really best effort, but it's unlikely you'll be able to create 2^80 values within the same millisecond.
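
To make the "limited window" point concrete, a rough back-of-the-envelope sketch (not this library's code): once the first 80-bit randomness value of a millisecond is drawn, only the headroom up to 2^80 - 1 is left for spec-style +1 increments (an implementation that increments by a random step shrinks the window even further).

```python
import os

# First randomness value drawn in a given millisecond (80 bits).
r0 = int.from_bytes(os.urandom(10), "big")

# With an increment-by-one rule, at most this many more IDs fit into the
# same millisecond before the 80-bit field overflows.
remaining = (2 ** 80 - 1) - r0
print(remaining)
```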

Further reading led me to ulid/spec#11, which brings up this issue/discrepancy.

If we implement a best effort approach, it would require the following:

Changes to the following to use an entropy source instead of os.urandom directly:

  • api.new
  • api.from_timestamp

The big addition here is going to be a stateful entropy source. The concurrency guarantees of this entropy source are still unknown and don't appear to be defined within the spec. The default JS implementation does not appear to be thread safe. Investigation of other implementations is necessary to see what guarantees other libraries are attempting to make. Based on this investigation, I can imagine this entropy source being a global instance, thread safe via locks, and created at import time when a race condition is less likely. Another option would be a thread local, where the entropy source maintains the monotonicity requirement per thread but not per process. This appears to be what the default implementation does, but I'm not really a fan of that approach.
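
For illustration, a minimal sketch of what such a global, lock-protected entropy source might look like (names and structure here are hypothetical, not this library's API):

```python
import os
import threading


class MonotonicEntropy:
    """Hypothetical stateful entropy source, shared process-wide and guarded
    by a lock so concurrent callers keep the monotonic guarantee."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_timestamp = -1
        self._last_randomness = 0

    def randomness(self, timestamp_ms):
        """Return 10 bytes (80 bits) of randomness for the given millisecond."""
        with self._lock:
            if timestamp_ms == self._last_timestamp:
                self._last_randomness += 1          # same millisecond: increment
                if self._last_randomness >= 1 << 80:
                    raise OverflowError("80-bit randomness overflowed")
            else:
                self._last_timestamp = timestamp_ms
                self._last_randomness = int.from_bytes(os.urandom(10), "big")
            return self._last_randomness.to_bytes(10, "big")


# Created at import time, before threads are likely to race on construction.
DEFAULT_ENTROPY = MonotonicEntropy()
```

api.new and api.from_timestamp would then pull their 80 random bits from something like DEFAULT_ENTROPY.randomness(timestamp) instead of calling os.urandom directly.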

Next Steps:

  • Investigate other implementations to see their concurrency guarantees (or lack thereof).

@sergeyprokhorenko (Author) commented Jan 7, 2019

To solve these problems, I would advise calculating a monotonic ULID as

ttttttttttsssrrrrrrrrrrrrr
(string representation in Crockford's base32)

where
t is Timestamp (10 characters), UNIX-time in milliseconds (UTC)
s is Sequence (3 characters), generated for each ULID within the same database and Timestamp
r is Randomness (13 characters), generated in advance by a true random number generator, separately for each ULID
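
For illustration only, a rough sketch of how such a value could be packed and encoded, using the 10/3/13 character split above (48-bit timestamp, 15-bit sequence, 65-bit randomness); the helper name is hypothetical:

```python
import os
import time

# Crockford base32 alphabet used for the ULID string encoding.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_with_sequence(sequence):
    """Pack timestamp (48 bits) | sequence (15 bits) | randomness (65 bits)
    into 128 bits and encode as 26 Crockford base32 characters (10 + 3 + 13)."""
    timestamp = int(time.time() * 1000) & ((1 << 48) - 1)
    randomness = int.from_bytes(os.urandom(9), "big") & ((1 << 65) - 1)
    value = (timestamp << 80) | ((sequence & 0x7FFF) << 65) | randomness
    # 26 characters of 5 bits each, most significant first (top 2 of 130 bits unused).
    return "".join(ALPHABET[(value >> shift) & 0x1F] for shift in range(125, -1, -5))
```

ULIDs produced this way with sequence values 0, 1, 2, ... within the same millisecond sort in generation order.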

@vincent-grosbois

Would it be possible to just loop, creating ULIDs until the new ULID is bigger than the previous one?
Then it would always be monotonic... at worst you would loop for 1 ms until the timestamp part increments.
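
Something along those lines is easy to try with this library's ulid.new() (a sketch only; it retries until the new value sorts after the previous one, so in the worst case it busy-waits until the timestamp ticks over):

```python
import ulid


def next_bigger(previous):
    """Retry ulid.new() until the result sorts after `previous` (string order
    matches ULID order because the Crockford base32 alphabet is in ascending order)."""
    candidate = ulid.new()
    while previous is not None and str(candidate) <= str(previous):
        candidate = ulid.new()
    return candidate
```

This keeps a single caller monotonic, but it burns CPU under load and gives no guarantee across threads or processes.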

@sergeyprokhorenko (Author) commented Nov 3, 2019

Just look at this implementation of guaranteed multiple increasing ULIDs per millisecond: ULID with sequence

@neg3ntropy

Just an idea, sorry if it's stupid, but since time sources can have microsecond precision, how about taking anything under the millisecond as the most significant random bits? Wouldn't it improve monotonicity by quite a lot in practice and still be very efficient and simple?
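
If I read the suggestion right, it would look roughly like this (a sketch, not this library's code; the field widths are assumptions: 48-bit millisecond timestamp, the top 10 bits of the 80-bit randomness taken from the sub-millisecond clock, 70 truly random bits below them):

```python
import os
import time


def ulid_int_with_microsecond_bits():
    """Build a 128-bit ULID-style integer whose most significant randomness
    bits come from the sub-millisecond part of the clock."""
    now_ns = time.time_ns()
    millis = (now_ns // 1_000_000) & ((1 << 48) - 1)   # 48-bit timestamp field
    micros = (now_ns // 1_000) % 1_000                 # 0..999 fits in 10 bits
    random_bits = int.from_bytes(os.urandom(9), "big") & ((1 << 70) - 1)
    return (millis << 80) | (micros << 70) | random_bits
```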

@sergeyprokhorenko (Author)

Hi Andrea,
Microseconds are better than milliseconds. However, microseconds, unlike a Sequence, will not solve the problem of short-term peak loads.
For example, during a single microsecond thousands of records may appear that need ULIDs, while before and after that microsecond the flow of new records is much weaker. Adding microseconds will not solve this problem, but a Sequence will.

@neg3ntropy

I guess generating a ULID takes more than a microsecond even on a fast CPU, so that covers the single thread case.

In [1]: from time import time                                                                                      

In [2]: a,b=(time(),time())                                                                                        

In [3]: a                                                                                                          
Out[3]: 1593030169.206382

In [4]: b                                                                                                          
Out[4]: 1593030169.2063825
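
For the generation-cost half of that claim, a quick check with the standard timeit module and this library's ulid.new (a sketch; actual numbers vary by machine):

```python
import timeit

import ulid

# Average cost of one ulid.new() call; compare this against one microsecond.
per_call = timeit.timeit(ulid.new, number=100_000) / 100_000
print(f"{per_call * 1e6:.2f} microseconds per ULID")
```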

When you have multiple threads, the order of operations is already indeterminate to a point, so probably nobody would care. Furthermore, whether such a case is even possible in Python remains to be demonstrated.

So microsecond precision does not solve the issue in theory, but I would say the case where it would actually ruin somebody's day just does not exist.

@sergeyprokhorenko (Author)

Taking into account that generating a ULID on the fly takes more than a microsecond, you should generate the 32768 ULIDs with Sequence for every millisecond into a buffer in advance and use them from the buffer, including at peak load.
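
As an illustration of that buffering idea (a sketch only, reusing the 48/15/65-bit layout proposed earlier): all 2^15 = 32768 sequence slots for a future millisecond are pre-built, then handed out in order.

```python
import os
import time

SEQUENCE_BITS = 15                                     # 3 base32 characters


def prefill_millisecond(timestamp_ms):
    """Pre-generate all 2**15 ULIDs (as 128-bit ints) for one millisecond:
    timestamp (48 bits) | sequence (15 bits) | randomness (65 bits)."""
    ulids = []
    for sequence in range(1 << SEQUENCE_BITS):
        randomness = int.from_bytes(os.urandom(9), "big") & ((1 << 65) - 1)
        ulids.append((timestamp_ms << 80) | (sequence << 65) | randomness)
    return ulids


# Fill the buffer for an upcoming millisecond ahead of time; under peak load,
# IDs are popped from the buffer instead of being generated on the fly.
buffer = prefill_millisecond(int(time.time() * 1000) + 1)
```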

But your proposed use of a microsecond counter as a sequence will result in gaps in the sequence of microseconds recorded in ULIDs. That is, the three characters allocated for microseconds would be used less efficiently than the same three characters allocated for the Sequence: fewer ULIDs would be generated than possible. And besides, thousands of new records would have to wait for new ULIDs to be generated on the fly, at a rate of one ULID per microsecond or slower.

I think the best place for a ULID generator is the DBMS, not Python. But unfortunately DBMS developers have the wrong priorities.

@neg3ntropy

I hope I am not being irritating, but this is an interesting discussion, so I will say one more thing.

The gap in the space of possible ULIDs is a non-issue. ULIDs are already very sparse and use 128 bits, while a DB sequence would very easily work with 32 bits, 64 at most. Lots of possible ULIDs will never be used just because you don't create objects every millisecond, and even if you did, the randomness would still make the space very sparse.

The bits used for microsecond precision would have the same efficiency as the random bits, since they are more or less random. They would need to be 10 bits, as there are 1000 microseconds in a millisecond, but we don't need to worry about eating into the randomness, because those bits still reduce the chance of collision.

We accept the sparse nature of ULIDs (and UUIDs), because it's distributed id generation: we don't need to hit a database to get one. Everything can create a ULID and expect no collision.

This is also why I would not go down the sequence road: it assumes you have shared memory between consumers, but that could very well not be the case with multiple processes. In Python we use processes more than threads because of the GIL.

In conclusion, I think that microseconds would work better in distributed environments, while being easier to implement and faster. The sequence approach is valid when you don't have microsecond precision available.

@ahawker (Owner) commented Jun 26, 2020

@neg3ntropy I had come back to this issue a week or so ago and actually started researching the same thing; definitely not a stupid idea! It's another approach to trading a bit of entropy for a value that is guaranteed to be incrementing, in this case the system clock. (Ignoring the fact that the system clock can be changed, thus breaking any monotonic property of time.time.)

The reasons I have not (yet) tried implementing it:

  1. It's against the ULID spec. This is a concern, albeit a small one, as we're not getting much/any guidance from the spec author.
  2. It's more prone to the whims of the platform the code is running on. Windows was notorious for only having ~16 ms accuracy on its clocks. The PEP 564 annex outlines some of the observed values by platform (a quick check is sketched after this list). If these numbers are accurate, I believe there would be some cases where trading randomness bits for timestamp bits that don't change fast enough (because of clock resolution) would mean more chances of collision/out-of-order values.
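
The quick check mentioned above, using only the standard library (the announced resolution of the clock behind time.time(), plus a rough empirical measurement):

```python
import time

# Announced resolution of the clock backing time.time(); on some Windows
# configurations this can be as coarse as ~16 ms.
info = time.get_clock_info("time")
print(info.implementation, info.resolution)


def measured_resolution(samples=100):
    """Smallest positive step observed between consecutive time.time() readings."""
    deltas = []
    while len(deltas) < samples:
        t0 = time.time()
        t1 = time.time()
        while t1 == t0:            # spin until the clock actually advances
            t1 = time.time()
        deltas.append(t1 - t0)
    return min(deltas)


print(measured_resolution())
```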

Neither of these issues is a true blocker; I just wanted to post where my current research and thinking stand.

@sergeyprokhorenko (Author)

Hi Andrea,

I don't care about the gap between two sequential ULIDs in the space of possible ULIDs at all. I mean specifically the gaps in the sequence of microseconds recorded in ULIDs. These two kinds of gaps are very different things.

Imagine you are near the end of a millisecond, and suddenly a peak load occurs. You don't have enough remaining microseconds in this millisecond to generate a sufficient number of ULIDs. But you do have enough Sequence values in the pre-generated ULIDs for this millisecond.

Ordered IDs are important for databases only to shorten search time. ULIDs with Sequence are intended for generation within the database, and in this case they are better than microseconds (see the previous paragraph). If you merge records carrying ULIDs with Sequence from several databases, you should sort the records within every millisecond in a buffer, or generate new ULIDs within the target database. The sorting would be easier with microseconds (and the best available time resolution is 100 ns) instead of milliseconds plus Sequence, but the lack of a Sequence may cause problems in the source databases. If the sources are not databases, and you don't need quick search in them, then microseconds (without Sequence) would be enough.

@ahawker (Owner) commented Aug 25, 2020

A monotonic implementation was added in #473 with the goal of implementing it "as defined by the spec". As discussed here and in other GitHub issues/threads, there are some problems/questions with that spec. However, the goal at this point should be interoperability with other ULID implementations, with future effort put into proposing changes to the spec.

I'll try to get a new version on PyPI today with these additions.

@neg3ntropy

I have reviewed providers/monotonic.py only. I still would have favoured a solution with extra timestamp bits as randomness, perhaps only on platforms where such precision is available, but with no thread locks and no state. I think that would be better for performance as well as for ordering in distributed/multiprocessing environments.

However, I am happy with the development, and this already covers my use case well enough.

@ahawker (Owner) commented Aug 26, 2020

I took into account the discussions in this thread during design, and I believe implementing a microsecond-based ULID would be straightforward with a new provider implementation. I'll probably take a swing at it one of these weekends to see if that assumption is correct.

ahawker mentioned this issue Aug 30, 2020
@ahawker (Owner) commented Aug 30, 2020

Implementation using microseconds is in #476
