Add Monotonicity #306

Closed
sergeyprokhorenko opened this issue Dec 20, 2018 · 17 comments · Fixed by #473

@sergeyprokhorenko

No description provided.

@ahawker (Owner) commented Dec 28, 2018

Monotonicity wasn't actually part of the spec when I wrote this library.

I'll take a deeper look into the details.

https://github.com/ulid/spec#monotonicity

ahawker added the ready label Dec 28, 2018
ahawker self-assigned this Dec 28, 2018
@sergeyprokhorenko (Author)

Andrew,
You'd better take the monotonicity approach from the popular Go implementation of ULID. It's much more secure, because it's impossible to find another valid ID by incrementing or decrementing a known ID.

Sergey

@ahawker (Owner) commented Jan 5, 2019

Based on my reading, the number of values you can generate within the same millisecond is itself random: if your first randomness value is sufficiently high, you can only generate within that remaining window until you overflow the 80-bit field.

The implementation in https://github.com/oklog/ulid appears to do this, so it's really best effort, but it's unlikely you'll be able to create 2^80 values within the same millisecond.
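
To make the "limited window" point concrete, a rough back-of-the-envelope sketch (not this library's code): once the first 80-bit randomness value of a millisecond is drawn, only the headroom up to 2^80 - 1 is left for spec-style +1 increments (an implementation that increments by a random step shrinks the window even further).

```python
import os

# First randomness value drawn in a given millisecond (80 bits).
r0 = int.from_bytes(os.urandom(10), "big")

# With an increment-by-one rule, at most this many more IDs fit into the
# same millisecond before the 80-bit field overflows.
remaining = (2 ** 80 - 1) - r0
print(remaining)
```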

Further reading led me to ulid/spec#11, which brings up this issue/discrepancy.

If we implement a best effort approach, it would require the following:

Changes to the following to use an entropy source instead of os.urandom directly:

  • api.new
  • api.from_timestamp

The big addition here is going to be a stateful entropy source. The concurrency guarantees of this entropy source are still unknown and don't appear to be defined within the spec. The default JS implementation does not appear to be thread safe. Investigation of other implementations is necessary to see what guarantees other libraries are attempting to make. Based on this investigation, I can imagine this entropy source being a global instance, thread safe via locks, and created at import time when a race condition is less likely. Another option would be a thread local, where the entropy source maintains the monotonicity requirement per thread but not per process. This appears to be what the default implementation does, but I'm not really a fan of that approach.
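
For illustration, a minimal sketch of what such a global, lock-protected entropy source might look like (names and structure here are hypothetical, not this library's API):

```python
import os
import threading


class MonotonicEntropy:
    """Hypothetical stateful entropy source, shared process-wide and guarded
    by a lock so concurrent callers keep the monotonic guarantee."""

    def __init__(self):
        self._lock = threading.Lock()
        self._last_timestamp = -1
        self._last_randomness = 0

    def randomness(self, timestamp_ms):
        """Return 10 bytes (80 bits) of randomness for the given millisecond."""
        with self._lock:
            if timestamp_ms == self._last_timestamp:
                self._last_randomness += 1          # same millisecond: increment
                if self._last_randomness >= 1 << 80:
                    raise OverflowError("80-bit randomness overflowed")
            else:
                self._last_timestamp = timestamp_ms
                self._last_randomness = int.from_bytes(os.urandom(10), "big")
            return self._last_randomness.to_bytes(10, "big")


# Created at import time, before threads are likely to race on construction.
DEFAULT_ENTROPY = MonotonicEntropy()
```

api.new and api.from_timestamp would then pull their 80 random bits from something like DEFAULT_ENTROPY.randomness(timestamp) instead of calling os.urandom directly.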

Next Steps:

  • Investigate other implementations to see their concurrency guarantees (or lack thereof).

@sergeyprokhorenko (Author) commented Jan 7, 2019

To solve these problems, I would advise calculating a monotonic ULID as

ttttttttttsssrrrrrrrrrrrrr
(string representation in Crockford's base32)

where
t is Timestamp (10 characters), UNIX-time in milliseconds (UTC)
s is Sequence (3 characters), generated for each ULID within the same database and Timestamp
r is Randomness (13 characters), generated in advance by a true random number generator, separately for each ULID
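
For illustration only, a rough sketch of how such a value could be packed and encoded, using the 10/3/13 character split above (48-bit timestamp, 15-bit sequence, 65-bit randomness); the helper name is hypothetical:

```python
import os
import time

# Crockford base32 alphabet used for the ULID string encoding.
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def ulid_with_sequence(sequence):
    """Pack timestamp (48 bits) | sequence (15 bits) | randomness (65 bits)
    into 128 bits and encode as 26 Crockford base32 characters (10 + 3 + 13)."""
    timestamp = int(time.time() * 1000) & ((1 << 48) - 1)
    randomness = int.from_bytes(os.urandom(9), "big") & ((1 << 65) - 1)
    value = (timestamp << 80) | ((sequence & 0x7FFF) << 65) | randomness
    # 26 characters of 5 bits each, most significant first (top 2 of 130 bits unused).
    return "".join(ALPHABET[(value >> shift) & 0x1F] for shift in range(125, -1, -5))
```

ULIDs produced this way with sequence values 0, 1, 2, ... within the same millisecond sort in generation order.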

@vincent-grosbois

Would it be possible to just loop, creating ULIDs until the new ULID is bigger than the previous one?
Then it would always be monotonic... at worst you would loop for 1 ms until the timestamp part increments.
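
Something along those lines is easy to try with this library's ulid.new() (a sketch only; it retries until the new value sorts after the previous one, so in the worst case it busy-waits until the timestamp ticks over):

```python
import ulid


def next_bigger(previous):
    """Retry ulid.new() until the result sorts after `previous` (string order
    matches ULID order because the Crockford base32 alphabet is in ascending order)."""
    candidate = ulid.new()
    while previous is not None and str(candidate) <= str(previous):
        candidate = ulid.new()
    return candidate
```

This keeps a single caller monotonic, but it burns CPU under load and gives no guarantee across threads or processes.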

@sergeyprokhorenko (Author) commented Nov 3, 2019

Just look at this implementation of guaranteed multiple increasing ULIDs per millisecond: ULID with sequence

@neg3ntropy

Just an idea, sorry if it's stupid, but since time sources can have microsecond precision, how about taking anything under the millisecond as the most significant random bits? Wouldn't it improve monotonicity by quite a lot in practice and still be very efficient and simple?
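
If I read the suggestion right, it would look roughly like this (a sketch, not this library's code; the field widths are assumptions: 48-bit millisecond timestamp, the top 10 bits of the 80-bit randomness taken from the sub-millisecond clock, 70 truly random bits below them):

```python
import os
import time


def ulid_int_with_microsecond_bits():
    """Build a 128-bit ULID-style integer whose most significant randomness
    bits come from the sub-millisecond part of the clock."""
    now_ns = time.time_ns()
    millis = (now_ns // 1_000_000) & ((1 << 48) - 1)   # 48-bit timestamp field
    micros = (now_ns // 1_000) % 1_000                 # 0..999 fits in 10 bits
    random_bits = int.from_bytes(os.urandom(9), "big") & ((1 << 70) - 1)
    return (millis << 80) | (micros << 70) | random_bits
```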

@sergeyprokhorenko (Author)

Hi Andrea,
Microseconds are better than milliseconds. However, microseconds, unlike a Sequence, will not solve the problem of short-term peak loads.
For example, during a single microsecond thousands of records may appear that need ULIDs, while before and after that microsecond the flow of new records is much weaker. Adding microseconds will not solve this problem, but a Sequence will.

@neg3ntropy

I guess generating a ULID takes more than a microsecond even on a fast CPU, so that covers the single thread case.

In [1]: from time import time                                                                                      

In [2]: a,b=(time(),time())                                                                                        

In [3]: a                                                                                                          
Out[3]: 1593030169.206382

In [4]: b                                                                                                          
Out[4]: 1593030169.2063825
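
For the generation-cost half of that claim, a quick check with the standard timeit module and this library's ulid.new (a sketch; actual numbers vary by machine):

```python
import timeit

import ulid

# Average cost of one ulid.new() call; compare this against one microsecond.
per_call = timeit.timeit(ulid.new, number=100_000) / 100_000
print(f"{per_call * 1e6:.2f} microseconds per ULID")
```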

When you have multiple threads, the order of operations is already indeterminate to a point, so probably nobody would care. Furthermore, whether such a case is even possible in Python remains to be demonstrated.

So microsecond precision does not solve the issue in theory, but I would say the case where it would actually ruin somebody's day just does not exist.

@sergeyprokhorenko (Author)

Taking into account that generating a ULID on the fly takes more than a microsecond, you should generate the 32768 ULIDs with Sequence for every millisecond into a buffer in advance and use them from the buffer, including at peak load.
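
As an illustration of that buffering idea (a sketch only, reusing the 48/15/65-bit layout proposed earlier): all 2^15 = 32768 sequence slots for a future millisecond are pre-built, then handed out in order.

```python
import os
import time

SEQUENCE_BITS = 15                                     # 3 base32 characters


def prefill_millisecond(timestamp_ms):
    """Pre-generate all 2**15 ULIDs (as 128-bit ints) for one millisecond:
    timestamp (48 bits) | sequence (15 bits) | randomness (65 bits)."""
    ulids = []
    for sequence in range(1 << SEQUENCE_BITS):
        randomness = int.from_bytes(os.urandom(9), "big") & ((1 << 65) - 1)
        ulids.append((timestamp_ms << 80) | (sequence << 65) | randomness)
    return ulids


# Fill the buffer for an upcoming millisecond ahead of time; under peak load,
# IDs are popped from the buffer instead of being generated on the fly.
buffer = prefill_millisecond(int(time.time() * 1000) + 1)
```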

But your proposed use of a microsecond counter as a sequence will result in gaps in the sequence of microseconds recorded in ULIDs. That is, the three characters allocated for microseconds would be used less efficiently than the same three characters allocated for the Sequence: fewer ULIDs would be generated than possible. And besides, thousands of new records would have to wait for new ULIDs to be generated on the fly, at a rate of one ULID per microsecond or slower.

I think the best place for a ULID generator is the DBMS, not Python. But unfortunately DBMS developers have the wrong priorities.

@neg3ntropy

I hope I am not being irritating, but this is an interesting discussion, so I will say one more thing.

The gap in the space of possible ULIDs is a non-issue. ULIDs are already very sparse and use 128 bits, while a DB sequence would very easily work with 32 bits, 64 at most. Lots of possible ULIDs will never be used just because you don't create objects every millisecond, and even if you did, the randomness would still make the space very sparse.

The bits used for microsecond precision would have the same efficiency as the random bits, since they are more or less random. They would need to be 10 bits, as there are 1000 microseconds in a millisecond, but we don't need to worry about eating into the randomness, because those bits still reduce the chance of collision.

We accept the sparse nature of ULIDs (and UUIDs), because it's distributed id generation: we don't need to hit a database to get one. Everything can create a ULID and expect no collision.

This is also why I would not go down the sequence road: it assumes you have shared memory between consumers, but that could very well not be the case with multiple processes. In Python we use processes more than threads because of the GIL.

In conclusion, I think that microseconds would work better in distributed environments, while being easier to implement and faster. The sequence approach is valid when you don't have microsecond precision available.

@ahawker (Owner) commented Jun 26, 2020

@neg3ntropy I had come back to this issue a week or so ago and actually started researching the same thing; definitely not a stupid idea! It's another approach to trading a bit of entropy for a value that is guaranteed to be incrementing, in this case the system clock. (Ignoring the fact that the system clock can be changed, thus breaking any monotonic property of time.time.)

The reasons I have not (yet) tried implementing it:

  1. It's against the ULID spec. This is a concern, albeit a small one, as we're not getting much/any guidance from the spec author.
  2. It's more prone to the whims of the platform the code is running on. Windows was notorious for only having ~16 ms accuracy on its clocks. The PEP 564 annex outlines some of the observed values by platform (a quick check is sketched after this list). If these numbers are accurate, I believe there would be some cases where trading randomness bits for timestamp bits that don't change fast enough (because of clock resolution) would mean more chances of collision/out-of-order values.
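
The quick check mentioned above, using only the standard library (the announced resolution of the clock behind time.time(), plus a rough empirical measurement):

```python
import time

# Announced resolution of the clock backing time.time(); on some Windows
# configurations this can be as coarse as ~16 ms.
info = time.get_clock_info("time")
print(info.implementation, info.resolution)


def measured_resolution(samples=100):
    """Smallest positive step observed between consecutive time.time() readings."""
    deltas = []
    while len(deltas) < samples:
        t0 = time.time()
        t1 = time.time()
        while t1 == t0:            # spin until the clock actually advances
            t1 = time.time()
        deltas.append(t1 - t0)
    return min(deltas)


print(measured_resolution())
```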

Neither of these issues is a true blocker; I just wanted to post where my current research and thinking stand.

@sergeyprokhorenko (Author)

Hi Andrea,

I don't care about the gap between two sequential ULIDs in the space of possible ULIDs at all. I mean specifically the gaps in the sequence of microseconds recorded in ULIDs. These two kinds of gaps are very different things.

Imagine you are near the end of a millisecond, and suddenly a peak load occurs. You don't have enough remaining microseconds in this millisecond to generate a sufficient number of ULIDs. But you do have enough Sequence values in the pre-generated ULIDs for this millisecond.

Ordered IDs are important for databases only to shorten search time. ULIDs with Sequence are intended for generation within the database, and in this case they are better than microseconds (see the previous paragraph). If you merge records carrying ULIDs with Sequence from several databases, you should sort the records within every millisecond in a buffer, or generate new ULIDs within the target database. The sorting would be easier with microseconds (and the best available time resolution is 100 ns) instead of milliseconds plus Sequence, but the lack of a Sequence may cause problems in the source databases. If the sources are not databases, and you don't need quick search in them, then microseconds (without Sequence) would be enough.

@ahawker (Owner) commented Aug 25, 2020

A monotonic implementation was added in #473 with the goal of implementing it "as defined by the spec". As discussed here and in other GitHub issues/threads, there are some problems/questions with that spec. However, the goal at this point should be interoperability with other ULID implementations, with future effort put into proposing changes to the spec.

I'll try to get a new version on PyPI today with these additions.

@neg3ntropy

I have reviewed providers/monotonic.py only. I still would have favoured a solution with extra timestamp bits as randomness, perhaps only on platforms where such precision is available, but with no thread locks and no state. I think that would be better for performance as well as for ordering in distributed/multiprocessing environments.

However, I am happy with the development, and this already covers my use case well enough.

@ahawker (Owner) commented Aug 26, 2020

I took into account the discussions in this thread during design, and I believe implementing a microsecond-based ULID would be straightforward with a new provider implementation. I'll probably take a swing at it one of these weekends to see if that assumption is correct.

ahawker mentioned this issue Aug 30, 2020
@ahawker (Owner) commented Aug 30, 2020

Implementation using microseconds is in #476
