503s Solr Performance Issues #10287

Closed
3 of 13 tasks
mekarpeles opened this issue Jan 6, 2025 · 3 comments
Labels
Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed] Type: Post-Mortem Log for when having to resolve a P0 issue

mekarpeles commented Jan 6, 2025

Summary

  • What is wrong?

Since ~Jan 1, 2025, the website has been returning lots of 503s and performing slowly.

  • Evidence

(Annotations needed on each)

Sentry performance shows increase in request load time

image

Digging into the book page, solr seems to be the bottleneck

image

Solr strain evident on CPU charts

image

image

Increase in errors (related to solr 503s in sentry):

image

image

Load times correlated with decrease in page views and error response codes:

image

image

  • What caused it?

Part of our solr re-indexing flow is to run an index optimize. If we don't rebuild the index frequently enough, the current index may become fragmented.

Possibly the lack of an optimization step on our index is partially responsible for the problem.

Free disk space was also low.
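
To make the fragmentation hypothesis checkable, here is a minimal sketch that summarizes segment count and deleted-document ratio from Solr's Luke handler response. It assumes the same host and core as the optimize command used for this incident; the thresholds in `needs_optimize` are made-up illustrations, not tuned values, and the function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: same host/core as the optimize command in this issue.
LUKE_URL = "http://localhost:8983/solr/openlibrary/admin/luke?numTerms=0&wt=json"


def fetch_luke(url: str = LUKE_URL, timeout: float = 10.0) -> dict:
    """Fetch and parse the Luke handler response (needs a reachable solr)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def index_fragmentation(luke_response: dict) -> dict:
    """Summarize fragmentation indicators from a parsed Luke response."""
    index = luke_response["index"]
    max_doc = index["maxDoc"]
    deleted = max_doc - index["numDocs"]  # docs deleted but not yet merged away
    return {
        "segment_count": index["segmentCount"],
        "deleted_docs": deleted,
        "deleted_fraction": deleted / max_doc if max_doc else 0.0,
    }


def needs_optimize(summary: dict, max_segments: int = 30,
                   max_deleted_fraction: float = 0.2) -> bool:
    """Hypothetical thresholds; pick real ones from observed behavior."""
    return (summary["segment_count"] > max_segments
            or summary["deleted_fraction"] > max_deleted_fraction)
```

Calling `needs_optimize(index_fragmentation(fetch_luke()))` on a schedule would give an early signal before fragmentation shows up as latency, if fragmentation is in fact the cause.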

  • What fixed it?

  • What was the impact?

Degraded website performance over multiple days as solr traffic increased

Followup / What could have gone better

Monitoring of:

  • solr disk space
  • solr total # requests
  • solr average response time
  • stats.inc every time solr restarted
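
A rough sketch of what these checks could look like. It assumes we can periodically sample two cumulative counters from solr (total request count and total request time in ms); how those are collected is left open, and all function names here are hypothetical. The disk check uses only the standard library.

```python
import shutil
from typing import Optional, Tuple

# (request_count, total_time_ms): cumulative counters sampled from solr.
Snapshot = Tuple[int, float]


def disk_free_fraction(path: str) -> float:
    """Fraction of the filesystem containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def solr_restarted(prev: Snapshot, curr: Snapshot) -> bool:
    """Cumulative counters going backwards suggests solr was restarted."""
    return curr[0] < prev[0]


def requests_since(prev: Snapshot, curr: Snapshot) -> int:
    """Requests served between two snapshots (0 across a restart)."""
    return max(curr[0] - prev[0], 0)


def average_response_ms(prev: Snapshot, curr: Snapshot) -> Optional[float]:
    """Mean response time between snapshots, or None if no new requests."""
    d_count = curr[0] - prev[0]
    if d_count <= 0:
        return None
    return (curr[1] - prev[1]) / d_count
```

Each value could then be reported to whatever stats system we already use, with an alert when `disk_free_fraction` drops below a chosen floor.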

More routine:

Steps to close

  1. Assignment: Is someone assigned to this issue? (notetaker, responder)
  2. Labels: Is there an Affects: label applied?
  3. Diagnosis: Add a description and scope of the issue
  4. Updates: As events unfold, is notable provenance documented in issue comments? (i.e. useful debug commands / steps / learnings / reference links)
  5. "What caused it?" - please answer in summary
  6. "What fixed it?" - please answer in summary
  7. "Followup actions:" actions added to summary
@mekarpeles mekarpeles added Module: Solr Issues related to the configuration or use of the Solr subsystem. [managed] Priority: 0 Fix now: Issue prevents users from using the site or active data corruption. [managed] Lead: @mekarpeles Issues overseen by Mek (Staff: Program Lead) [managed] Type: Post-Mortem Log for when having to resolve a P0 issue labels Jan 6, 2025
@mekarpeles

The optimize took ~2.5h; the last re-index was ~July 2024 (so ~6 months ago).

mekarpeles commented Jan 7, 2025

Example of optimize command:

stage('Optimize indices') {
    // This is kind of like a disk defrag, but it's ESSENTIAL after a big import. Shouldn't
    // really be needed after that.
    steps {
        dir(env.HOST_SOLR_BUILDER_DIR) {
            sh "docker compose exec -T solr curl -s 'http://localhost:8983/solr/openlibrary/update?optimize=true&maxSegments=1'"
        }
    }
}
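
For running the same request outside Jenkins, a small sketch that builds the URL with a configurable `maxSegments` and sends it with a long timeout, since the optimize took hours here. The default base URL is copied from the command above; the function names and the timeout value are assumptions.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def optimize_url(base: str = "http://localhost:8983/solr/openlibrary",
                 max_segments: int = 1) -> str:
    """Build the update URL that asks solr to merge down to max_segments."""
    query = urlencode({"optimize": "true", "maxSegments": max_segments})
    return f"{base}/update?{query}"


def run_optimize(url: str, timeout_s: float = 4 * 60 * 60) -> int:
    """Send the optimize request; blocks until solr responds or times out."""
    with urlopen(url, timeout=timeout_s) as resp:
        return resp.status
```

A `max_segments` larger than 1 asks for a less aggressive merge, which might shorten the run at the cost of leaving some fragmentation.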

tfmorris commented Jan 8, 2025

Anecdotally, the site was completely unusable while the index optimization task was (presumably) being run yesterday, with a pasted OL..A identifier taking multiple seconds to resolve in the work edit author autocomplete. If possible, it would be good to schedule future reindexing/index optimization runs for off-peak times.

Also, it seems a little strange that 6 months' worth of index fragmentation would suddenly become a huge issue in the last two weeks, so I would keep an eye out for indications that it wasn't the (only) problem.
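
One way to enforce the off-peak suggestion is a small guard that the optimize job checks before starting. This is only a sketch: the 02:00 to 06:00 window is a placeholder, not a measured low-traffic period, and `is_off_peak` is a hypothetical helper.

```python
from datetime import datetime, time


def is_off_peak(now: datetime,
                start: time = time(2, 0),
                end: time = time(6, 0)) -> bool:
    """True if `now` falls in [start, end); the window may wrap midnight."""
    t = now.time()
    if start <= end:
        return start <= t < end
    # Window such as 22:00 to 04:00 that crosses midnight.
    return t >= start or t < end
```

The job could exit early (or sleep and retry) when `is_off_peak(datetime.now())` is false, with the window set from the page-view charts.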
