Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calculate the Uniqueness #19

Closed
ekelvin opened this issue Apr 27, 2020 · 3 comments · Fixed by #27
Closed

Calculate the Uniqueness #19

ekelvin opened this issue Apr 27, 2020 · 3 comments · Fixed by #27

Comments

@ekelvin
Copy link

ekelvin commented Apr 27, 2020

It is really great to use shorter unique id, however as we all know this comes with a price.
It would be good to be able to know based on the options what is the probability to have the same id again.
This is really important as will increase easily the usage number of this library once the probability of your usage is easily known.
example for UUID V4 is: ...Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion (ref: https://en.wikipedia.org/wiki/Universally_unique_identifier )

@jeanlescure
Copy link
Collaborator

jeanlescure commented Apr 27, 2020

Hi @ekelvin! Really glad you find the library to be useful :)

I had actually included a task in issue #11 to document why 6 was chosen as the default character length. But now that you mention it, I might as well add a function that returns a probability that a collision may be encountered, what you called uniqueness.

There are two values needed to calculate this, which are:

  • the total number of possible UUIDs (Hashes) in relation to the given dictionary (let's call this number H)
  • and the expected number of values we have to choose before finding the first collision (let's call this quantity Q(H))

H is simply the number n of unique characters in the dictionary to the power of the UUID length l:

And Q(H) can be approximated as:

(source)

So uniqueness (in the case of this library) in a scale from 0 to 1 would be defined as:

@ekelvin, I shall add this task to our v3 proposals (#11) and will keep this issue open until we merge to master (max. by May 14th, although it seems we might be ready to release this week 🤞 ).

Cheers!

@jeanlescure jeanlescure mentioned this issue Apr 27, 2020
14 tasks
@jeanlescure
Copy link
Collaborator

I've started implementing this feature.

Just wanted to note that the aforementioned uniqueness value assumes that one will perform H rounds of UUID generation. In other words as stated the uniqueness value would be the answer to the problem:

I have a set of n characters and am allowed to create "words" of length l. If I iterate times and on each iteration I generate a new "word" by selecting random characters from the set, what is the additive inverse of the probability of generating a "word" I had previously generated (a duplicate) at any given iteration?

As such, the value is useful in of itself as a score of sorts, but I'm sure people will be more inclined to ask:

If I use this lib and expect to perform at most r rounds of UUID generations, what is the probability p that I will hit a duplicate UUID?

The answer would be approximately:

So we should probably implement a function that receives a number of rounds as input and generates the aforementioned probability 🤓

@ekelvin
Copy link
Author

ekelvin commented Apr 28, 2020

That is fantastic :)

This was referenced May 1, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants