
Monitor sweep index #4694

Open
mlissner opened this issue Nov 15, 2024 · 6 comments
@mlissner
Member

We need to monitor the new sweep index. From our earlier PR:

Do we just need to check on the cronjob tomorrow, then, I assume?

Yes, we could also re-check the ES query to confirm that the reindexing task is running.

GET _tasks?detailed=true&actions=*reindex

Originally posted by @albertisfu in #4672 (comment)

@mlissner mlissner self-assigned this Nov 15, 2024
@mlissner mlissner added this to Sprint Nov 15, 2024
@mlissner mlissner moved this to In progress in Sprint Nov 15, 2024
@mlissner
Member Author

@albertisfu, I spun this into a new issue, so we can discuss it here. I ran the tasks command just now and it actually had results:

{
  "nodes": {
    "nL1wCXyqQCW64ODQvubEnA": {
      "name": "elastic-cluster-es-master-data-nodes-v3-5",
      "transport_address": "172.30.0.203:9300",
      "host": "172.30.0.203",
      "ip": "172.30.0.203:9300",
      "roles": [
        "data",
        "data_cold",
        "data_content",
        "data_frozen",
        "data_hot",
        "data_warm",
        "ingest",
        "master",
        "ml",
        "remote_cluster_client",
        "transform"
      ],
      "attributes": {
        "ml.allocated_processors": "4",
        "k8s_node_name": "ip-172-30-0-75.us-west-2.compute.internal",
        "ml.allocated_processors_double": "4.0",
        "ml.machine_memory": "31138512896",
        "xpack.installed": "true",
        "ml.max_jvm_size": "17179869184"
      },
      "tasks": {
        "nL1wCXyqQCW64ODQvubEnA:1261302059": {
          "node": "nL1wCXyqQCW64ODQvubEnA",
          "id": 1261302059,
          "type": "transport",
          "action": "indices:data/write/reindex",
          "status": {
            "total": 792895,
            "updated": 0,
            "created": 390000,
            "deleted": 0,
            "batches": 391,
            "version_conflicts": 0,
            "noops": 0,
            "retries": {
              "bulk": 0,
              "search": 0
            },
            "throttled_millis": 0,
            "requests_per_second": -1,
            "throttled_until_millis": 0
          },
          "description": "reindex from [recap_vectors] to [recap_sweep]",
          "start_time_in_millis": 1731628870357,
          "running_time_in_nanos": 2474749945113,
          "cancellable": true,
          "cancelled": false,
          "headers": {}
        }
      }
    }
  }
}

Surprising, no?
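As a side note, the fields that matter in that response reduce to a quick summary. A minimal sketch in plain Python (no client library; `summarize_reindex_tasks` is a hypothetical helper, not part of the codebase) that parses the response above:

```python
def summarize_reindex_tasks(tasks_response: dict) -> list[dict]:
    """Summarize each reindex task found in an ES _tasks API response."""
    summaries = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            status = task["status"]
            done = status["created"] + status["updated"] + status["deleted"]
            total = status["total"]
            summaries.append({
                "task_id": task_id,
                "description": task.get("description", ""),
                "progress_pct": round(100 * done / total, 1) if total else 0.0,
                # running_time_in_nanos -> minutes
                "running_minutes": round(task["running_time_in_nanos"] / 60e9, 1),
            })
    return summaries


# Trimmed-down version of the response above:
sample = {
    "nodes": {
        "nL1wCXyqQCW64ODQvubEnA": {
            "tasks": {
                "nL1wCXyqQCW64ODQvubEnA:1261302059": {
                    "status": {"total": 792895, "updated": 0,
                               "created": 390000, "deleted": 0},
                    "description": "reindex from [recap_vectors] to [recap_sweep]",
                    "running_time_in_nanos": 2474749945113,
                },
            },
        },
    },
}
print(summarize_reindex_tasks(sample))
# one task, ~49.2% done after ~41.2 minutes of running time
```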

@albertisfu
Contributor

Are there any cron job instances running for the command?

Is it possible that the code deployment restarted one of the cron job processes? Since you removed the Redis keys, it would just have started again.

According to the running time (2474749945113 ns), it has been running for only about 41 minutes.

The total number of documents targeted for re-indexing is 792,895, and it had only reached 390,000, even though the original run should have finished by now.

If not, we can cancel it with:

POST _tasks/nL1wCXyqQCW64ODQvubEnA:1261302059/_cancel

That way, when the cron job runs again, only one ES task will be running.

It might also be necessary to clean up the Redis keys again.
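To avoid cancelling the wrong thing, the task IDs can first be collected from the same `_tasks` response before issuing the `_cancel` call. A rough sketch in plain Python (the helper name is made up; only the `action` and `cancelled` fields come from the actual response format):

```python
def running_reindex_task_ids(tasks_response: dict) -> list[str]:
    """Collect IDs of live reindex tasks, suitable for POST _tasks/<id>/_cancel."""
    ids = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            if (task.get("action") == "indices:data/write/reindex"
                    and not task.get("cancelled", False)):
                ids.append(task_id)
    return ids


# Stripped-down version of the response in the first comment:
sample = {
    "nodes": {
        "nL1wCXyqQCW64ODQvubEnA": {
            "tasks": {
                "nL1wCXyqQCW64ODQvubEnA:1261302059": {
                    "action": "indices:data/write/reindex",
                    "cancelled": False,
                },
            },
        },
    },
}
print(running_reindex_task_ids(sample))
# -> ['nL1wCXyqQCW64ODQvubEnA:1261302059']
```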

@mlissner
Member Author

Well, this took me down a rabbit hole, but luckily I had some time. We didn't have timezone information set on this or any of our other cronjobs, so I just audited and tweaked them all. This means this one won't run until tonight, so we'll have to check on it on Monday.

I did clear out the Redis keys a second time, though, so that should be good.

@ERosendo
Contributor

Here are the logs from the latest execution on November 15th:

2024-11-15 20:01:09.712 INFO Re-indexing task scheduled ID: 0cQl85qiTiyiNppgQkkPOA:1952097644
2024-11-15 20:02:09.717 INFO Task progress: 26000/627645 documents. Estimated time to finish: 1388.953806 seconds.
2024-11-15 20:17:09.720 INFO Task progress: 26000/627645 documents. Estimated time to finish: 22215.187001 seconds.
2024-11-15 20:32:09.723 INFO Task progress: 192000/627645 documents. Estimated time to finish: 4220.377473 seconds.
2024-11-15 20:47:09.727 INFO Task progress: 315000/627645 documents. Estimated time to finish: 2739.398105 seconds.
2024-11-15 21:02:09.730 INFO Task progress: 402000/627645 documents. Estimated time to finish: 2054.400233 seconds.
2024-11-15 21:17:09.734 INFO Task progress: 507000/627645 documents. Estimated time to finish: 1085.100631 seconds.
2024-11-15 21:32:09.738 INFO Task progress: 625000/627645 documents. Estimated time to finish: 60.0 seconds.

2024-11-15 21:33:09.760 INFO Resuming re-indexing process for date: 2024-11-15 00:00:00
2024-11-15 21:33:10.286 INFO Re-indexing task scheduled ID: 0cQl85qiTiyiNppgQkkPOA:1952406648
2024-11-15 21:34:10.289 INFO Task progress: 9000/564072 documents. Estimated time to finish: 3701.636893 seconds.
2024-11-15 21:49:10.292 INFO Task progress: 9000/564072 documents. Estimated time to finish: 59208.99034 seconds.
2024-11-15 22:04:10.296 INFO Task progress: 150000/564072 documents. Estimated time to finish: 5134.561793 seconds.
2024-11-15 22:19:10.299 INFO Task progress: 283000/564072 documents. Estimated time to finish: 2741.225053 seconds.
2024-11-15 22:34:10.303 INFO Task progress: 395000/564072 documents. Estimated time to finish: 1566.604886 seconds.
2024-11-15 22:49:10.306 INFO Task progress: 537000/564072 documents. Estimated time to finish: 229.886932 seconds.

I've included timestamps to help us understand how long it takes to reindex records.
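The "Estimated time to finish" figures look consistent with a simple average-rate extrapolation. A sketch of that arithmetic (hypothetical; not necessarily how the command actually computes it):

```python
def eta_seconds(done: int, total: int, elapsed_seconds: float) -> float:
    """Naive ETA: assume the average indexing rate so far continues to the end."""
    if done <= 0:
        return float("inf")  # no progress yet, so the ETA is undefined
    rate = done / elapsed_seconds  # documents per second so far
    return (total - done) / rate


# First progress line above: 26000/627645 documents after roughly one minute.
print(round(eta_seconds(26000, 627645, 60.0), 1))
# close to the logged 1388.953806 s; the exact value depends on the measured elapsed time
```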

It looks like the missing_error_index issue is resolved. However, the command was unable to complete due to issue #4698.

@mlissner
Member Author

Cool. Sounds like we should close #4646 and put #4698 on our next sprint, with this as a parent?

@ERosendo
Contributor

Sounds like we should close #4646 and put #4698 on our next sprint, with this as a parent?

Sounds good! Let's close #4646 and move #4698 to the next sprint.
