Improve uniqueness metric #92

Open
tsegall opened this issue May 8, 2024 · 0 comments
tsegall (Owner) commented May 8, 2024

Currently FTA only computes uniqueness when the cardinality of the set is less than maxCardinality (12,000 by default). As long as the number of distinct elements stays below that limit, which is typically highly likely, the uniqueness percentage (and hence the uniqueness count) should be exact. Once the number of distinct elements exceeds the defined maximum, FTA simply reports -1 to indicate that it does not know the answer.
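To make the cutoff behaviour concrete, here is a toy Python sketch. It is not FTA code: the class name `CappedDistinctCounter`, its methods, and the exact boundary handling (giving up on the first distinct value past the cap) are assumptions for illustration only. Only the default of 12,000 and the -1 sentinel come from the description above.

```python
class CappedDistinctCounter:
    """Exact distinct counting that gives up once a cap is exceeded (hypothetical)."""

    def __init__(self, max_cardinality=12_000):
        self.max_cardinality = max_cardinality
        self.seen = set()
        self.overflowed = False

    def add(self, value):
        # Nothing to do once we have given up, or for a value already tracked.
        if self.overflowed or value in self.seen:
            return
        if len(self.seen) >= self.max_cardinality:
            # One more distinct value than we are willing to track:
            # drop the set and remember that the answer is now unknown.
            self.overflowed = True
            self.seen.clear()
            return
        self.seen.add(value)

    def distinct_count(self):
        # -1 signals "unknown", mirroring the sentinel described above.
        return -1 if self.overflowed else len(self.seen)
```

Below the cap the count is exact; one distinct value past it, the answer is lost entirely, which is the gap this issue is about.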

The obvious solution to this problem is to use something like HyperLogLog (see https://en.wikipedia.org/wiki/HyperLogLog). Note: we need to use a variant that supports merging. This would allow us to implement an approx_count_distinct and hence generate the uniqueness metric even when the cardinality of the set is high. The result would be extremely close but not exact.
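As a rough illustration of what a mergeable sketch looks like, here is a minimal HyperLogLog in Python. It is a sketch under assumptions, not a proposed FTA implementation: the class and method names, the precision parameter `p` (2^p registers), and the choice of BLAKE2b as the 64-bit hash are all hypothetical. It uses the standard raw estimator with the linear-counting correction for small cardinalities and omits the bias corrections found in production variants.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal mergeable HyperLogLog (illustrative, names are hypothetical)."""

    def __init__(self, p=14):
        # The alpha formula used in count() is only valid for m >= 128 (p >= 7).
        if not 7 <= p <= 16:
            raise ValueError("p must be between 7 and 16")
        self.p = p
        self.m = 1 << p                      # number of registers
        self.registers = bytearray(self.m)   # one small counter per register

    @staticmethod
    def _hash64(value):
        digest = hashlib.blake2b(str(value).encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def add(self, value):
        x = self._hash64(value)
        bits = 64 - self.p
        index = x >> bits                    # top p bits select the register
        rest = x & ((1 << bits) - 1)         # remaining bits feed the rank
        # Rank = 1-based position of the leftmost 1-bit in `rest`
        # (bits + 1 when `rest` is zero).
        rank = bits - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Merging is a register-wise max, so it is lossless:
        # the result equals a sketch built over the union of both inputs.
        if self.p != other.p:
            raise ValueError("cannot merge sketches with different precision")
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)
```

Two sketches built over different partitions of the data can be combined with `a.merge(b)` and then queried with `a.count()`, which is the property an approx_count_distinct needs. The relative standard error of the estimate is roughly 1.04 / sqrt(m), so about 0.8% for `p=14`, at a fixed cost of 16 KiB of registers in this layout.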
