
[ML] Store calendar periodicity historical error statistics in compressed format #127

Merged

Conversation

@tveasey (Contributor) commented Jun 14, 2018

Following on from #100, the calendar periodicity test is another part of the model state which benefits from storing some of its state in a compressed format. This PR switches to storing the historical error statistics in compressed format. On the functional side, I took the opportunity not simply to reduce memory usage, but also to:

  1. Extend the window of historical error statistics, which improves the test accuracy; the window was previously being limited to keep memory usage down.
  2. Simplify the implementation, which was previously bit packing the count statistics to reduce memory usage.
  3. Rationalise the test statistic, accounting for the fact that we select the most significant from a number of possible features. This effect is fixed, so it could be rolled into the significance at which we accept a feature, but I think it is better to make it explicit.
  4. Extend the unit testing of this class: extending the tests of true positives and adding a test to check false positives.

Even with the first two changes I still get close to 30% memory reduction for this data structure for typical usage with essentially no compute overhead, since inflate/deflate only needs to happen approximately once per day of data processed.

Finally, I decided to split CCalendarCyclicTest out from CTrendTests. This file historically grouped more tests, but some of these have since been superseded or factored out, so it was an anachronism. I made this change in the first commit, so it makes sense to review that commit first.

This will affect results. In particular, we get fewer false positives from the test, primarily due to item 1. These constitute overfitting, so we will be more robust to noise as a result. The test is also more sensitive to marginal effects, since the test statistic is sensitive to the number of repeats of the feature for which it has error statistics, and it now has more. This improves our true positive rate, and so prediction accuracy, when calendar features are helpful predictors. Finally, there is also an impact on results for memory limited jobs. For example, in CAnomalyJobLimitTest we were able to model 7% and 5% more by and partition fields respectively for the same memory.

Release note: Reduces model memory by storing the state for testing predictive calendar
features in a compressed format

std::uint64_t checksum(std::uint64_t seed = 0) const;

//! Debug the memory used by this object.
void debugMemoryUsage(core::CMemoryUsage::TMemoryUsagePtr mem) const;
Contributor:

#include <core/CMemoryUsage.h>


std::uint32_t s_Count = 0;
std::uint32_t s_LargeErrorCount = 0;
CFloatStorage s_LargeErrorSum = 0.0;
Contributor:

#include <core/CFloatStorage.h>

Contributor Author:

For historical reasons this is forwarded in to the maths namespace in MathsTypes.h which I included instead.

}

void CCalendarCyclicTest::propagateForwardsByTime(double time) {
if (!CMathsFuncs::isFinite(time) || time < 0.0) {
Contributor:

nit '!' -> '== false'

inserter.insertValue(CURRENT_BUCKET_INDEX_6_4_TAG, m_CurrentBucketIndex);
inserter.insertValue(CURRENT_BUCKET_ERROR_STATS_6_4_TAG,
m_CurrentBucketErrorStats.toDelimited());
TErrorStatsVec errors{this->inflate()};
Contributor:

so the compressed buffer is only an internal detail and will not be persisted/restored?

would be good to document that somewhere

}

void CRandomizedPeriodicityTestTest::testPersist() {
// Check that persistence is idempotent.
Contributor:

this test only compares the XML and the rng. It would be good if it could actually test the class instances themselves.

I see that the checksum is disabled (unrelated to this PR), but maybe you can checksum some of the members and compare them. It should be possible as you have friend access.

Contributor Author:

In fact, I was able to re-enable the checksum implementation because of changes I'd made to maths::CTimeSeriesDecomposition.

test.add(t, 0.2);
}

std::string origXml;
@hendrikmuhs (Contributor) commented Jun 27, 2018:

#include <string>


CCalendarCyclicTest::TErrorStatsVec CCalendarCyclicTest::inflate() const {
bool lengthOnly{false};
core::CInflator inflator{lengthOnly};
@hendrikmuhs (Contributor) commented Jun 27, 2018:

I wonder if there is room for improvement here: instantiating the inflator/deflator is quite expensive and adds to heap fragmentation (it would be good to do a long running test to be sure this isn't a problem). We could have some compression/decompression service which internally reuses buffers.

Nit?: In terms of memory usage, it feels a bit like cheating: we are not accounting for the memory that is temporarily used. I understand that this isn't a big deal if you have a lot of partitions, but for small jobs it adds an extra 4 KB or more which is no longer accounted for.

@hendrikmuhs (Contributor) commented Jun 27, 2018:

"happen approximately once per day"

Ok, in this case my comment above is probably solved; I did not see that this runs only once per day.

^ Maybe it would be good to add a note to the class description.

Contributor:

Am I correct that it's once per day per time series?

It also happens in persist/restore, which is an extra 6-8 times per day per time series at the default background persist frequency.

But it depends on whether creating and destroying the zlib data structures is really significant compared to all the other processing we're doing. A shared service would need to consider thread safety, maybe by caching an inflator and deflator per thread, so if the overhead isn't measurable compared to the other CPU-intensive processing we're doing then it may not be worth the added complexity.

Contributor:

Agreed. I think it only makes sense to think about a service if it's used by more than this, which is the case: "another part of the model state which benefits from storing some of its state in a compressed format."

That said, before looking into that, I would first check zlib alternatives, e.g. snappy: it's faster and does not require initialisation/deinitialisation (so a service would be superfluous), but it might be somewhat weaker in compression (needs measurement!).

^ Not important though, just an idea for a potential future improvement which might be of interest if we use compression in more places.

@tveasey (Contributor Author) commented Jun 27, 2018:

Regarding runtime, you're right: strictly it's (1 + persists + restores) x "number of time series" per day.

I profiled running the executable standalone for the "worst case scenario": polled, at the bucket length (= 5m), normally distributed data with many series. This gives us about the simplest possible modelling and the minimum possible overhead for processing the raw data to extract bucket statistics. I'd say this is pathological, i.e. I don't think it is realistic, but even so the cost still only comes in at about 4% increase in run time due to the compress/decompress. That also excludes any time spent in Java. So I think this will not introduce a new CPU bottleneck.

Since this solution is performant enough I think I'd prefer to just stick with zlib and temporaries. (I agree it would be nice to avoid memory churn, but I think it makes sense to wait for the C++17 standard library enhancements, when we could revisit allocation strategies in general.)

Regarding memory usage, there is as much (and often more) memory not accounted for elsewhere, for example in api::CAnomalyJob and in temporaries we create, principally the results object. In terms of job allocations to nodes, we have a pad for a single job which I think more than covers this. Since it is fixed, independent of the number of time series modelled, I think the small additional temporary memory this PR introduces is ok.

I wouldn't be against trying to improve any memory accounting we do, and in particular to estimate peak usage. I think my preferred solution would be to empirically determine a margin, but I'd prefer to keep that as a separate change.

@hendrikmuhs (Contributor):

LGTM, only some recommendations

@hendrikmuhs (Contributor) left a review:

LGTM

@tveasey tveasey merged commit 4dd90fa into elastic:master Jun 27, 2018
tveasey added a commit to tveasey/ml-cpp-1 that referenced this pull request Jul 2, 2018
…ssed format (elastic#127)

Also, extend the window of historical error statistics, which means we can improve the test
accuracy (this was previously being limited to keep memory usage down) and simplify the
implementation, which was previously bit packing the count statistics to reduce memory
usage.
tveasey added a commit that referenced this pull request Jul 2, 2018
@tveasey tveasey deleted the enhancement/deflate-calendar-periodicity-test branch April 10, 2019 10:48