Allow options to be passed to GlobalIndexUidAggregator #1091
Conversation
Currently, the only way to change the maximum number of UIDs kept in the GlobalIndexUidAggregator's protocol buffer is to change the code and recompile. It would be nice to be able to configure this on the iterator using the shell. This change modifies PropogatingIterator [sic] to support passing options to the configured combiner(s).

Note that this change has a side effect stemming from the fact that the aggregators used by PropogatingIterator aren't really used as Combiners (init is never called, for example) but rather like the old, since-removed Aggregator interface. The side effect is that if you wish to configure the max UID list size on GlobalIndexUidAggregator, you must also set the "all" option on GlobalIndexUidAggregator, since it is a Combiner: even though the "all" option won't actually be used, the Combiner's option validation requires that either it or a specific column list be set.

Another consideration for anyone using this feature is the impact of changing the max UID list size after data has been loaded. Decreasing the max size will likely cause UID lists to be purged once they exceed the new max UID count. Increasing it also won't quite behave as you'd expect for any lists that had already exceeded the previous max and had their UIDs cleared. Really, the main use for this is when setting up a new cluster.
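As a rough sketch of what this enables from the Accumulo shell: the table name, iterator name, and the `maxUids` option key below are illustrative assumptions, not names confirmed by this PR; check your own table configuration for the actual iterator name, scope, and priority.

```
# Inspect the existing iterator configuration first (names/priorities vary by deployment)
config -t shardIndex -f table.iterator

# Hypothetical option keys: pass maxUids through to the combiner, and set the
# "all" flag on GlobalIndexUidAggregator, which its Combiner option validation
# requires even though the flag is not otherwise used here
config -t shardIndex -s table.iterator.majc.UIDAggregator.opt.maxUids=100
config -t shardIndex -s table.iterator.majc.UIDAggregator.opt.all=true
```

The same `.opt.` properties would presumably need to be set for the other scopes (`scan`, `minc`) on which the iterator is configured.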
...use/ingest-core/src/main/java/datawave/ingest/table/aggregator/GlobalIndexUidAggregator.java
My gut reaction here is that it will become too easy to totally screw up an existing index on a system. I am wondering if we can update the aggregator to not be destructive: if there are already N UIDs in the list, and the configuration is M where M < N, then the UIDs will remain for that list.
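A minimal sketch of the non-destructive policy being suggested, using simplified stand-in types rather than the real protobuf-backed `Uid.List` handling in GlobalIndexUidAggregator (class and method names here are illustrative only):

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative only: an existing list that already exceeds the configured
 * max is carried through unchanged (never truncated), while lists still
 * under the max stop growing once they reach it.
 */
final class NonDestructiveUidMerge {
    static List<String> merge(List<String> existing, List<String> incoming, int maxUids) {
        List<String> merged = new ArrayList<>(existing);
        for (String uid : incoming) {
            // Stop adding once at/over the max, but never remove UIDs that
            // were already present -- the non-destructive part.
            if (merged.size() >= maxUids) {
                break;
            }
            if (!merged.contains(uid)) {
                merged.add(uid);
            }
        }
        return merged;
    }
}
```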
Sure, that could be done, but it would lead to some inconsistencies. Any new terms would "follow the rules" and their protocol buffers would stop keeping UIDs at the new max count, whereas existing terms whose counts were over the new max would keep more UIDs. I suppose old terms whose counts were below the new max would also behave like new terms. Just trying to think of other options: what if you had another parameter that has to be set to true in order for the maxUIDs parameter to be honored? It's like a confirmation at that point, and not really different from "droptable shard". :) Of course, there are plenty of ways to screw up a system if you don't know what you're doing; "deleterows -t shard -f" doesn't require confirmation...
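A hedged sketch of that confirmation-parameter idea; both option keys below are hypothetical names invented for illustration, not options defined by this PR:

```java
import java.util.Map;

/**
 * Sketch: honor a hypothetical "maxUids" option only when an explicit
 * acknowledgement option is also set -- a droptable-style confirmation.
 */
final class MaxUidsOptionCheck {
    static final String MAX_UIDS_OPT = "maxUids";              // hypothetical
    static final String CONFIRM_OPT  = "maxUids.acknowledged"; // hypothetical

    static int resolveMaxUids(Map<String, String> options, int defaultMax) {
        if (options.containsKey(MAX_UIDS_OPT)
                && Boolean.parseBoolean(options.get(CONFIRM_OPT))) {
            return Integer.parseInt(options.get(MAX_UIDS_OPT));
        }
        // Without the acknowledgement flag, fall back to the compiled-in default.
        return defaultMax;
    }
}
```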
The base branch was changed.