Reindex: Remove ability to sort #47567

Open
henningandersen opened this issue Oct 4, 2019 · 3 comments
Labels
>deprecation · :Distributed Indexing/Reindex (Issues relating to reindex that are not caused by issues further down) · Team:Distributed (Meta label for distributed team (obsolete))

Comments

@henningandersen
Contributor

henningandersen commented Oct 4, 2019

A reindex request can specify a sort order. The documentation describes using it together with max_docs to extract either a specific or a random subset of data.

However, specifying a sort is not compatible with the upcoming resilient reindex mechanism, because that mechanism relies on sorting by seq_no. Any reindex request that sorts by anything but seq_no first will not be resilient.

When copying the full data set, sorting does not really make a difference: the end result will be the same. Extracting subsets of data can likely be done by adding queries instead. To avoid cases where reindex is not resilient, I propose to deprecate sorting in reindex in 7.x and remove it in 8.0.

This issue is created to gather feedback on this proposal. If you rely on being able to sort while reindexing, please let us know here.
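
The difference between a sort-based subset and a query-based one can be sketched as request bodies for `POST _reindex`. This is only a minimal sketch under stated assumptions: the index names (`logs`, `logs_recent`), the `@timestamp` field, and the helper function names are hypothetical, and the code only builds Python dictionaries without contacting a cluster. Note that the query form selects documents by a threshold rather than by an exact document count, so it is not a drop-in equivalent.

```python
import json
from typing import Any, Dict


def sorted_subset_body(source_index: str, dest_index: str,
                       sort_field: str, max_docs: int) -> Dict[str, Any]:
    """Body that uses the sort option this issue proposes to deprecate.

    Copies at most `max_docs` documents, taken in descending `sort_field` order.
    """
    return {
        "max_docs": max_docs,
        "source": {"index": source_index, "sort": {sort_field: "desc"}},
        "dest": {"index": dest_index},
    }


def query_subset_body(source_index: str, dest_index: str,
                      field: str, gte: str) -> Dict[str, Any]:
    """Body that selects the subset with a range query and no sort."""
    return {
        "source": {
            "index": source_index,
            "query": {"range": {field: {"gte": gte}}},
        },
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    # Hypothetical names; either body would be sent as the JSON payload of POST _reindex.
    print(json.dumps(sorted_subset_body("logs", "logs_recent", "@timestamp", 1000), indent=2))
    print(json.dumps(query_subset_body("logs", "logs_recent", "@timestamp", "now-7d/d"), indent=2))
```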

@henningandersen henningandersen added >deprecation :Distributed Indexing/Reindex Issues relating to reindex that are not caused by issues further down labels Oct 4, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Reindex)

@henningandersen
Contributor Author

We discussed this in our FixitFriday meeting and concluded that removing the ability to sort in reindex is the right path forward. The following is a summary of the use cases discussed that might be affected, with possible workarounds:

  1. Resilience
    1. Sort might be useful to add resilience to reindex.
    2. Most such cases would directly do search/scroll and bulk requests since it gives more control.
    3. Resilience will become native to reindex in the future. We should ensure that resilient reindex is a reality before completely removing the sort option.
  2. Incremental reindex
    1. Sorting by a timestamp (or _seq_no) field might be used for providing incremental reindex, trying to only index newly arrived docs since last reindex.
    2. An alternative is to do range based reindex requests instead.
    3. A separate alternative is to handle the search/scroll and bulk outside ES instead.
    4. It should be noted that handling deletes on source will likely require an external program doing bulk requests anyway.
  3. Get last X docs.
    1. This is a documentation example and thus not necessarily a valid use case.
    2. Getting a specific subset of docs could be expressed as a range filter instead.
  4. Get a random subset of docs
    1. This is a documentation example and thus not necessarily a valid use case.
    2. A similar result can be obtained without sorting by adding a suitable min_score to the function_score clause, for instance "min_score" : 0.9.
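
Two of the workarounds above, the range-based incremental reindex (2.2) and the random subset via min_score (4.2), can be sketched as `POST _reindex` request bodies. This is a hedged illustration rather than an agreed design: the index names, the `@timestamp` field, the timestamps, and the helper function names are all hypothetical, and nothing here talks to a cluster. The random variant assumes the `function_score` query wraps the default match-all query, so the random score in [0, 1) is used as is and `min_score` of 0.9 keeps roughly a tenth of the documents.

```python
import json
from typing import Any, Dict


def incremental_body(source_index: str, dest_index: str, timestamp_field: str,
                     since: str, until: str) -> Dict[str, Any]:
    """Reindex only documents whose timestamp falls in [since, until).

    The caller is assumed to remember `until` and pass it as `since` next time.
    """
    return {
        "source": {
            "index": source_index,
            "query": {"range": {timestamp_field: {"gte": since, "lt": until}}},
        },
        "dest": {"index": dest_index},
    }


def random_subset_body(source_index: str, dest_index: str,
                       min_score: float = 0.9) -> Dict[str, Any]:
    """Reindex an approximately random fraction of documents without sorting.

    Each document gets a random score; only those at or above `min_score` match.
    """
    return {
        "source": {
            "index": source_index,
            "query": {
                "function_score": {
                    "random_score": {},
                    "min_score": min_score,
                }
            },
        },
        "dest": {"index": dest_index},
    }


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(json.dumps(incremental_body("logs", "logs_v2", "@timestamp",
                                      "2019-10-01T00:00:00Z",
                                      "2019-10-04T00:00:00Z"), indent=2))
    print(json.dumps(random_subset_body("logs", "logs_sample"), indent=2))
```

Neither body contains a `sort` clause, so both stay compatible with a reindex that orders by _seq_no internally. As noted in 2.4, deletes on the source are not covered by either request.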

The plan is to move forward with deprecating the ability to sort during reindex. We will give this another 2-3 weeks here to gather input before taking concrete actions. Our current plan after that is to:

  1. Deprecate sorting in reindex in 7.x such that deprecation warnings are emitted if used. We prefer to deprecate early to maximize the period where users could run into the deprecation warning.
  2. Remove the documentation examples using sort in reindex from master.
  3. When resilient reindex has reached feature parity with current reindex, remove the ability to sort in master (breaking change).

henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Nov 21, 2019
Reindex sort never gave a guarantee about the order of documents being
indexed into the destination, though it could give a sense of locality
of source data.

It prevents us from doing resilient reindex and other optimizations and
it has therefore been deprecated.

Related to elastic#47567
henningandersen added a commit that referenced this issue Nov 29, 2019
henningandersen added a commit that referenced this issue Nov 29, 2019
henningandersen added a commit to henningandersen/elasticsearch that referenced this issue Nov 29, 2019
henningandersen added a commit that referenced this issue Dec 1, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this issue Jan 23, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this issue Feb 4, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this issue Feb 9, 2020
russcam added a commit to elastic/elasticsearch-net that referenced this issue Feb 9, 2020
Relates: #4341

Deprecate sorting in reindex elastic/elasticsearch#49458 (issue: elastic/elasticsearch#47567)

Closes #4356

(cherry picked from commit 20a2133)
@rjernst rjernst added the Team:Distributed Meta label for distributed team (obsolete) label May 4, 2020
@aksdb

aksdb commented Dec 8, 2020

Maybe I'm missing something, but isn't a sorted reindex necessary for timeseries data?
For example, we have around 50 indices with timeseries data from the last few years. ILM is used to automatically roll indices over once a certain document count is hit. If I have to reindex them (after an upgrade or because I need new index sizes), I usually create an appropriate ILM rule, create the initial index, and then let the Reindex API read from the old alias covering all old indices into the new write alias. ILM constantly monitors the new index and performs rollovers as necessary during the reindex.

If I can no longer apply sorting here, it sounds like the timeseries could end up being spread randomly over all the resulting indices. That doesn't sound optimal regarding query plans. Since time based queries usually query consecutive time frames, it should be better if data from the same time period is close together, shouldn't it? And this would not be possible without applying sorting, from what I understand.
