
[ML] Store calendar periodicity historical error statistics in compressed format #127

Merged

Conversation

@tveasey (Contributor) commented Jun 14, 2018

Following on from #100, the calendar periodicity test is another part of the model state which benefits from storing some of its state in a compressed format. This PR switches to storing the historical error statistics in compressed format. On the functional side, I took the opportunity not simply to reduce memory usage, but also to:

  1. Extend the window of historical error statistics, which improves the test accuracy; the window was previously being limited to keep memory usage down.
  2. Simplify the implementation, which was previously bit packing the count statistics to reduce memory usage.
  3. Rationalise the test statistic, accounting for the fact that we select the most significant from a number of possible features. This effect is fixed, so it could be rolled into the significance at which we accept a feature, but I think it is better to make it explicit.
  4. Extend the unit testing of this class: extending the tests of true positives and adding a test to check false positives.

Even with the first two changes I still get close to 30% memory reduction for this data structure for typical usage with essentially no compute overhead, since inflate/deflate only needs to happen approximately once per day of data processed.

Finally, I decided to split CCalendarCyclicTest out from CTrendTests. This file historically grouped more tests, but some of these have since been superseded or factored out, so it was an anachronism. I made this change in the first commit, so it makes sense to review that commit first.

This will affect results. In particular, we get fewer false positives from the test, primarily due to item 1. These constitute overfitting, so we will be more robust to noise as a result. The test is also more sensitive to marginal effects, since the test statistic is sensitive to the number of repeats of the feature for which it has error statistics, and it now has more. This improves our true positive rate, and so prediction accuracy, when calendar features are helpful predictors. Finally, there is also an impact on results for memory limited jobs. For example, in CAnomalyJobLimitTest we were able to model 7% and 5% more by and partition fields respectively for the same memory.

Release note: Reduces model memory by storing the state for testing predictive calendar
features in a compressed format

std::uint64_t checksum(std::uint64_t seed = 0) const;

//! Debug the memory used by this object.
void debugMemoryUsage(core::CMemoryUsage::TMemoryUsagePtr mem) const;
Contributor:

#include <core/CMemoryUsage.h>


std::uint32_t s_Count = 0;
std::uint32_t s_LargeErrorCount = 0;
CFloatStorage s_LargeErrorSum = 0.0;
Contributor:

#include <core/CFloatStorage.h>

Contributor Author:

For historical reasons this is forwarded in to the maths namespace in MathsTypes.h which I included instead.

}

void CCalendarCyclicTest::propagateForwardsByTime(double time) {
if (!CMathsFuncs::isFinite(time) || time < 0.0) {
Contributor:

nit '!' -> '== false'

inserter.insertValue(CURRENT_BUCKET_INDEX_6_4_TAG, m_CurrentBucketIndex);
inserter.insertValue(CURRENT_BUCKET_ERROR_STATS_6_4_TAG,
m_CurrentBucketErrorStats.toDelimited());
TErrorStatsVec errors{this->inflate()};
Contributor:

so the compressed buffer is only an internal detail and will not be persisted/restored?

would be good to document that somewhere

}

void CRandomizedPeriodicityTestTest::testPersist() {
// Check that persistence is idempotent.
Contributor:

this test only compares the XML and the rng. It would be good if it could actually test the class instances themselves.

I see that the checksum is disabled (unrelated to this PR), but maybe you can checksum some of the members and compare them. It should be possible as you have friend access.

Contributor Author:

In fact, I was able to re-enable the checksum implementation because of changes I'd made to maths::CTimeSeriesDecomposition.

test.add(t, 0.2);
}

std::string origXml;
@hendrikmuhs (Contributor) commented Jun 27, 2018:

#include <string>


CCalendarCyclicTest::TErrorStatsVec CCalendarCyclicTest::inflate() const {
bool lengthOnly{false};
core::CInflator inflator{lengthOnly};
@hendrikmuhs (Contributor) commented Jun 27, 2018:

I wonder if there is room for improvement here: instantiating the inflator/deflator is quite expensive and adds to heap fragmentation (it would be good to do a long running test to be sure this isn't a problem). We could have some compression/decompression service which internally reuses buffers.

Nit?: In terms of memory usage, it feels a bit like cheating: we are not accounting for the memory that is temporarily used. I understand that this isn't a big deal if you have a lot of partitions, but for small jobs it adds an extra 4 KB or more which is no longer accounted for.

@hendrikmuhs (Contributor) commented Jun 27, 2018:

"happen approximately once per day"

Ok, in this case my comment above is probably solved; I did not see that this runs only once per day.

^ Maybe it would be good to add a note to the class description.

Contributor:

Am I correct that it's once per day per time series?

It also happens in persist/restore, which is an extra 6-8 times per day per time series at the default background persist frequency.

But it depends on whether creating and destroying the zlib data structures is really significant compared to all the other processing we're doing. A shared service would need to consider thread safety, maybe by caching an inflator and deflator per thread, so if the overhead isn't measurable compared to the other CPU-intensive processing we're doing then it may not be worth the added complexity.

Contributor:

Agreed. I think it only makes sense to think about a service if it's used by more than this, which is the case: "another part of the model state which benefits from storing some of its state in a compressed format."

That said, before looking into that, I would first check zlib alternatives, e.g. snappy: it's faster and does not require initialisation/deinitialisation (so a service would be superfluous), but it might be somewhat weaker in compression (needs measurement!).

^ Not important though, just an idea for a potential future improvement which might be of interest if we use compression in more places.

@tveasey (Contributor Author) commented Jun 27, 2018:

Regarding runtime, you're right: strictly it's (1 + persists + restores) x "number of time series" per day.

I profiled running the executable standalone for the "worst case scenario": polled, at the bucket length (= 5m), normally distributed data with many series. This gives us about the simplest possible modelling and the minimum possible overhead for processing the raw data to extract bucket statistics. I'd say this is pathological, i.e. I don't think it is realistic, but even so the cost still only comes in at about 4% increase in run time due to the compress/decompress. That also excludes any time spent in Java. So I think this will not introduce a new CPU bottleneck.

Since this solution is performant enough I think I'd prefer to just stick with zlib and temporaries. (I agree it would be nice to avoid memory churn, but I think it makes sense to wait for the C++17 standard library enhancements, when we could revisit allocation strategies in general.)

Regarding memory usage, there is as much (and often more) memory not accounted for elsewhere, for example in api::CAnomalyJob and in temporaries we create, principally the results object. In terms of job allocations to nodes, we have a pad for a single job which I think more than covers this. Since it is fixed, independent of the number of time series modelled, I think the small additional temporary memory this PR introduces is ok.

I wouldn't be against trying to improve any memory accounting we do, and in particular to estimate peak usage. I think my preferred solution would be to empirically determine a margin, but I'd prefer to keep that as a separate change.

@hendrikmuhs (Contributor):

LGTM, only some recommendations

@hendrikmuhs (Contributor) left a review:

LGTM

@tveasey tveasey merged commit 4dd90fa into elastic:master Jun 27, 2018
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jul 2, 2018
…ssed format (elastic#127)

Also, extend the window of historical error statistics, which means we can improve the test
accuracy (this was previously being limited to keep memory usage down) and simplify the
implementation, which was previously bit packing the count statistics to reduce memory
usage.
tveasey added a commit that referenced this pull request Jul 2, 2018
@tveasey tveasey deleted the enhancement/deflate-calendar-periodicity-test branch April 10, 2019 10:48