Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature flag to disable asset stats #668

Merged
merged 1 commit into from
Sep 19, 2018
Merged

Conversation

bartekn
Copy link
Contributor

@bartekn bartekn commented Sep 18, 2018

Asset stats calculations have become very CPU intensive with the recent growth of operations per ledger and assets in, both, public and test networks. Because it's part of the ingestion session it sometimes caused delays in ingesting new ledgers but also slows down connected stellar-core DB (calculations are done in core DB).

The original plan I discussed with @MonsieurNicolas was to extract asset stats calculations outside ingestion and make it run every X seconds, independently of ingestion. The problem is that it's a huge architectural change and requires much more work than I expected to make it right and test it properly. Given all the arguments above and the fact that it's already causing operational issues for some organizations I think we should design it along with the upcoming big architectural change in Horizon and ingest package and have a quickfix now.

This PR introduces a feature flag that can be set using environment variable or command line param. When DISABLE_ASSET_STATS=true assets stats calculation are turned off during ingestion and /assets endpoint is not present. This is not a breaking change as the default value is false. It can be released in 0.14.1 patch release.

@bartekn bartekn added this to the Horizon v0.14.x - patch releases milestone Sep 18, 2018
@MonsieurNicolas
Copy link
Contributor

MonsieurNicolas commented Sep 18, 2018

Right now it's updating stats every time a ledger is ingested, this PR makes it never update.
While this helps with one off Horizon deployments, it doesn't really help with the public facing endpoints.

What I would suggest then as it's a very similar change (code wise) is to change this from a boolean to an update period (default is 1 = updated every ledger close): that way we would make the SDF public nodes expose the endpoint but only update every 120 ledgers (10 minutes) for example.

What do you think?

@bartekn
Copy link
Contributor Author

bartekn commented Sep 18, 2018

What I would suggest then as it's a very similar change (code wise) is to change this from a boolean to an update period (default is 1 = updated every ledger close): that way we would make the SDF public nodes expose the endpoint but only update every 120 ledgers (10 minutes) for example.

This is one of the ideas I had too but it will still trigger high CPU load every X minutes this time possibly causing longer delays as list of assets to check grows bigger. And it still involves a big rearchitecture of the system.

Currently every ingest.Session keeps track of assets that should be recalculated. We have two options: keep track of the asset in the higher layer (possibly ingest.System) or recalculate all assets stats. The former option seems better but there are still some unknowns, ex. the list of assets can grow to max ops per ledger * # ledgers in 10 minutes (currently 600k in testnet correct?). What are memory and DB CPU implications of this? What we can end up with is instead a of short delay every busy ledger we can have, say, 5 minutes of calculations every 10 minutes. Both situations are really bad.

I propose the following solution:

  1. Let's merge this PR and release in 0.14.1 today/tomorrow. This way we can quickly solve delays for organizations that run core+horizon and experience high CPU load on their core instances.
  2. Start testing "calculate asset stats every X ledgers" solution, if it works fine we can release it later this month as another feature flag.
  3. In the long term, let's think about and implement solution to solve this problem completely (make this calculations faster). Possibly connected with ingest package rearchitecture/plugins etc.

@MonsieurNicolas
Copy link
Contributor

Oh I see, I thought the cost of running was pretty much fixed cost.

This seems fine then. I'll let other @stellar/platform-committers merge.

Copy link
Contributor

@MonsieurNicolas MonsieurNicolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

Copy link
Contributor

@nikhilsaraf nikhilsaraf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

some thoughts on how to amortize the cost over N ledgers for a future PR:

  1. if you write the list of modified assets to a db table ever ledger, there will be many overwrites because of repeats (which is good). This will effectively reduce the average number of assets needed to be calculated on a per-ledger basis, thereby amortizing the cost (because of the repeats).

  2. you can have a background thread that looks at this table, copies over the first K values to a new table (to unblock insertions from the main ingestion thread), and then recomputes these K values. This will ensure we keep the cost fixed to a max of K values and the insertion table will also contain dates so assets are not "out of sync" for too long.

  3. params N and K should be configurable, as indicated in the comments above.

@bartekn bartekn merged commit 9eca875 into master Sep 19, 2018
@bartekn bartekn deleted the disable-asset-stats-flag branch September 19, 2018 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants