Allow options to be passed to GlobalIndexUidAggregator #1091
Conversation
Currently, the only way to change the maximum number of UIDs kept in the GlobalIndexUidAggregator's protocol buffer is to change the code and recompile. It would be nice to be able to configure this on the iterator using the shell. This change modifies PropogatingIterator [sic] to support passing options to the configured combiner(s).

Note that this change has a side effect stemming from the fact that the aggregators used by PropogatingIterator aren't really used as Combiners (init is never called, for example) but rather like the old, since-removed Aggregator interface. The side effect is that if you wish to configure the max UID list size on GlobalIndexUidAggregator, you must also set the "all" option on GlobalIndexUidAggregator, since it is a Combiner: even though the "all" option won't actually be used, the Combiner's option validation requires that either it or a specific column list be set.

Another consideration for anyone using this feature is the impact of changing the max UID list size after data has been loaded. Decreasing the max size will likely cause UID lists to be purged once they exceed the new max UID count. Increasing it also won't quite behave as you'd expect for any lists that had already exceeded the previous max and had their UIDs cleared. Really, the main use for this is when setting up a new cluster.
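As a rough sketch of what this enables from the Accumulo shell: the table name, iterator name, and the `maxUids` option key below are illustrative assumptions, not names confirmed by this PR; check your own table configuration for the actual iterator name, scope, and priority.

```
# Inspect the existing iterator configuration first (names/priorities vary by deployment)
config -t shardIndex -f table.iterator

# Hypothetical option keys: pass maxUids through to the combiner, and set the
# "all" flag on GlobalIndexUidAggregator, which its Combiner option validation
# requires even though the flag is not otherwise used here
config -t shardIndex -s table.iterator.majc.UIDAggregator.opt.maxUids=100
config -t shardIndex -s table.iterator.majc.UIDAggregator.opt.all=true
```

The same `.opt.` properties would presumably need to be set for the other scopes (`scan`, `minc`) on which the iterator is configured.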
...use/ingest-core/src/main/java/datawave/ingest/table/aggregator/GlobalIndexUidAggregator.java
My gut reaction here is that it will become too easy to totally screw up an existing index on a system. I am wondering if we can update the aggregator to not be destructive: if there are already N UIDs in the list, and the configuration is M where M < N, then the UIDs will remain for that list.
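A minimal sketch of the non-destructive policy being suggested, using simplified stand-in types rather than the real protobuf-backed `Uid.List` handling in GlobalIndexUidAggregator (class and method names here are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: an existing list that already exceeds the configured
 * max is carried through unchanged (never truncated), while lists still
 * under the max stop growing once they reach it.
 */
final class NonDestructiveUidMerge {
    static List<String> merge(List<String> existing, List<String> incoming, int maxUids) {
        List<String> merged = new ArrayList<>(existing);
        for (String uid : incoming) {
            // Stop adding once at/over the max, but never remove UIDs that
            // were already present -- the non-destructive part.
            if (merged.size() >= maxUids) {
                break;
            }
            if (!merged.contains(uid)) {
                merged.add(uid);
            }
        }
        return merged;
    }
}
```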
Sure, that could be done, but it would lead to some inconsistencies. Any new terms would "follow the rules" and their protocol buffers would stop keeping UIDs at the new max count, whereas existing terms whose counts were over the new max would keep more UIDs. I suppose old terms whose counts were below the new max would also behave like new terms. Just trying to think of other options: what if you had another parameter that has to be set to true in order for the maxUIDs parameter to be honored? It's like a confirmation at that point, and not really different from "droptable shard". :) Of course, there are plenty of ways to screw up a system if you don't know what you're doing; "deleterows -t shard -f" doesn't require confirmation...
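A hedged sketch of that confirmation-parameter idea; both option keys below are hypothetical names invented for illustration, not options defined by this PR:

```java
import java.util.Map;

/**
 * Sketch: honor a hypothetical "maxUids" option only when an explicit
 * acknowledgement option is also set -- a droptable-style confirmation.
 */
final class MaxUidsOptionCheck {
    static final String MAX_UIDS_OPT = "maxUids";              // hypothetical
    static final String CONFIRM_OPT  = "maxUids.acknowledged"; // hypothetical

    static int resolveMaxUids(Map<String, String> options, int defaultMax) {
        if (options.containsKey(MAX_UIDS_OPT)
                && Boolean.parseBoolean(options.get(CONFIRM_OPT))) {
            return Integer.parseInt(options.get(MAX_UIDS_OPT));
        }
        // Without the acknowledgement flag, fall back to the compiled-in default.
        return defaultMax;
    }
}
```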
The base branch was changed.