Improve uniqueness metric #92

tsegall · 2024-05-08T13:45:04Z

Currently FTA only computes uniqueness when the cardinality of the set is less than maxCardinality (12,000 by default). As long as the number of distinct elements stays below that limit, which is typically the case, the uniqueness percentage (and hence the uniqueness count) should be exact. Once the number of distinct elements exceeds the defined maximum, FTA simply reports -1 to indicate that it does not know the answer.
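To make the limitation concrete, here is a rough sketch of capped exact counting. It is not FTA's code: the class name `CappedDistinctCounter` and its methods are hypothetical, and the only details taken from this issue are the default cap of 12,000 and the -1 sentinel.

```python
class CappedDistinctCounter:
    """Exact distinct counting that gives up past a fixed cap.

    Hypothetical illustration only, not FTA's implementation.
    """

    def __init__(self, max_cardinality=12_000):
        self.max_cardinality = max_cardinality
        self.seen = set()
        self.overflowed = False

    def add(self, value):
        if self.overflowed:
            return  # already past the cap, nothing more is tracked
        self.seen.add(value)
        if len(self.seen) > self.max_cardinality:
            # Past the cap we stop tracking, so the exact answer is lost.
            self.overflowed = True
            self.seen.clear()

    def distinct_count(self):
        # -1 signals "unknown", mirroring the sentinel described above.
        return -1 if self.overflowed else len(self.seen)
```

Below the cap the count is exact; once the cap is exceeded there is no way to recover it, which is the gap this issue is about.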

The obvious solution to this problem is to use something like HyperLogLog (see https://en.wikipedia.org/wiki/HyperLogLog). Note: we need a variant that supports merging. This would allow us to implement an approx_count_distinct and hence generate the uniqueness metric even when the cardinality of the set is high. The result would be extremely close, but not exact.
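A minimal sketch of what a mergeable HyperLogLog could look like, in Python for brevity. Everything here is an assumption made for illustration rather than a proposed FTA design: the class and method names, the precision `p=14`, and the use of SHA-1 as the hash. It uses the basic estimator with the small-range (linear counting) correction only.

```python
import hashlib
import math


class HyperLogLog:
    """Basic HyperLogLog with 2**p registers (illustrative, not FTA code)."""

    def __init__(self, p=14):
        self.p = p
        self.m = 1 << p
        self.registers = bytearray(self.m)

    def add(self, value):
        # Take 64 bits of a stable hash of the value's string form.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)              # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        # Rank = position of the first 1 bit in the remaining bits (1-based).
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        # Merging is a register-wise max, so sketches built on separate
        # shards combine into the sketch of the union.
        if other.p != self.p:
            raise ValueError("cannot merge sketches with different precision")
        for i, r in enumerate(other.registers):
            if r > self.registers[i]:
                self.registers[i] = r

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)  # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(m * math.log(m / zeros))
        return round(raw)
```

The merge step is the reason the variant matters: two sketches built independently can be combined without revisiting the data. With 2**14 registers the standard error is roughly 1.04 / sqrt(m), a little under 1%, which fits the "extremely close, but not exact" expectation.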

tsegall added the enhancement label May 8, 2024

tsegall self-assigned this May 8, 2024

tsegall mentioned this issue May 8, 2024

Unique Count of the values for a column/field #82

Closed
