Calculate the Uniqueness #19

ekelvin · 2020-04-27T08:11:22Z

Shorter unique IDs are really great to use, but as we all know they come at a price.
It would be good to know, based on the options, the probability of generating the same ID again.
This is really important: once the collision probability for a given use case is easy to find out, the library becomes much easier to adopt.
For example, for UUID v4: "...Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion" (ref: https://en.wikipedia.org/wiki/Universally_unique_identifier).
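
The quoted figure can be sanity-checked with the standard birthday-problem approximation $p\approx 1-e^{-n^{2}/(2H)}$, taking $H=2^{122}$ because a version-4 UUID carries 122 random bits. This is only a rough sketch; the function name `birthday_collision_probability` is made up for illustration and is not part of any library.

```python
import math


def birthday_collision_probability(draws: float, space: float) -> float:
    """Approximate chance of at least one duplicate among `draws` random
    picks from `space` equally likely values: 1 - exp(-draws^2 / (2 * space))."""
    exponent = (draws * draws) / (2.0 * space)
    # expm1 keeps precision when the exponent is very small
    return -math.expm1(-exponent)


uuid_v4_space = 2 ** 122  # 122 random bits in a version-4 UUID
p = birthday_collision_probability(103e12, uuid_v4_space)
print(p)  # on the order of 1e-9, i.e. roughly one in a billion
```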

jeanlescure · 2020-04-27T12:26:12Z

Hi @ekelvin! Really glad you find the library to be useful :)

I had actually included a task in issue #11 to document why 6 was chosen as the default character length. But now that you mention it, I might as well add a function that returns a probability that a collision may be encountered, what you called uniqueness.

Two values are needed to calculate this:

- the total number of possible UUIDs (hashes) for the given dictionary (let's call this number H)
- the expected number of values we have to generate before finding the first collision (let's call this quantity Q(H))

H is simply the number n of unique characters in the dictionary to the power of the UUID length l:

$H=n^l$

And Q(H) can be approximated as:

$Q(H)\approx\sqrt{\frac{\pi}{2}H}$

(source)

So uniqueness (in the case of this library) in a scale from 0 to 1 would be defined as:

$1-\frac{Q(H)}{H}$
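
As a rough illustration, the three formulas above can be evaluated directly. The default length of 6 comes from this thread; the dictionary size of 62 (digits plus upper- and lower-case letters) is an assumption made only for this example, and all function names are hypothetical rather than the library's API.

```python
import math


def total_ids(dictionary_size: int, id_length: int) -> int:
    """H = n^l: number of distinct IDs for n characters and length l."""
    return dictionary_size ** id_length


def expected_draws_before_collision(h: int) -> float:
    """Q(H) ~ sqrt(pi/2 * H): expected draws until the first duplicate."""
    return math.sqrt(math.pi / 2 * h)


def uniqueness(h: int) -> float:
    """Score between 0 and 1, defined in the comment above as 1 - Q(H)/H."""
    return 1 - expected_draws_before_collision(h) / h


H = total_ids(62, 6)  # assumed: 62-character dictionary, length 6
Q = expected_draws_before_collision(H)
U = uniqueness(H)
print(H, round(Q), U)
```

With these assumed inputs, H is about 5.7e10, Q(H) is a little under 300,000, and the score sits very close to 1.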

@ekelvin, I shall add this task to our v3 proposals (#11) and will keep this issue open until we merge to master (max. by May 14th, although it seems we might be ready to release this week 🤞 ).

Cheers!

jeanlescure · 2020-04-27T14:45:34Z

I've started implementing this feature.

Just wanted to note that the aforementioned uniqueness value assumes that one will perform H rounds of UUID generation. In other words, as stated, the uniqueness value would be the answer to the problem:

I have a set of n characters and am allowed to create "words" of length l. If I iterate $n^{l}$ times and on each iteration I generate a new "word" by selecting random characters from the set, what is the complement (one minus) of the probability of generating a "word" I had previously generated (a duplicate) at any given iteration?

As such, the value is useful in and of itself as a score of sorts, but I'm sure people will be more inclined to ask:

If I use this lib and expect to perform at most r rounds of UUID generations, what is the probability p that I will hit a duplicate UUID?

The answer would be approximately:

$p(r; H)\approx\frac{\sqrt{\frac{\pi}{2}r}}{H}$

So we should probably implement a function that receives a number of rounds as input and generates the aforementioned probability 🤓
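
A minimal sketch of such a function, following the approximation exactly as written above. The name `collision_probability`, its signature, and the 62-character dictionary in the usage lines are assumptions for illustration, not the library's actual API. Note that this is the formula as posted in this thread; the textbook birthday estimate $1-e^{-r^{2}/(2H)}$ can give noticeably different values, so the result is best read as a rough score.

```python
import math


def collision_probability(rounds: int, total_ids: int) -> float:
    """p(r; H) ~ sqrt(pi/2 * r) / H, as stated in the comment above."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    if total_ids <= 0:
        raise ValueError("total_ids must be positive")
    return math.sqrt(math.pi / 2 * rounds) / total_ids


# Assumed example: 62-character dictionary, length 6, 1000 generated IDs
H = 62 ** 6
p = collision_probability(1000, H)
print(p)
```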

ekelvin · 2020-04-28T08:30:55Z

That is fantastic :)

jeanlescure mentioned this issue Apr 27, 2020: Version 3 #11 (Closed, 14 tasks)

This was referenced May 1, 2020: Uniqueness jeanlescure/short_uuid#3 (Merged); Contributors #14 (Closed)

jeanlescure mentioned this issue May 12, 2020: Release v3.0.0 #27 (Merged)

jeanlescure closed this as completed in #27 May 14, 2020
