
perf: regression since 20190805 release #41206

Closed

maddyblue opened this issue Sep 30, 2019 · 20 comments

Labels: regression (Regression from a release.)

Comments

@maddyblue (Contributor)

While testing the 20190930 release in #41128, the test clusters diverged pretty heavily in their performance:

[screenshot: SQL latency on the 20190930 release]

vs

[screenshot: SQL latency on the 20190805 release]

The top image (the newer release) shows heavily increasing time and variance for SQL latency compared to the 0805 release, which shows only a slight increase over time. The various transaction, SQL, and storage metrics are similar. This looks like a performance regression introduced since then.

maddyblue added the regression label on Sep 30, 2019
@andreimatei (Contributor)

Just to eliminate one thing, can you please disable the periodic generation of the new replication reports by running set cluster setting kv.replication_reports.interval = 0? I think the chances of this helping are not great, though.

@maddyblue (Contributor, Author)

Sure. I executed set cluster setting kv.replication_reports.interval = '0s' just now on mjibson-release-250f4c36de2b88eff443cf9be9cd5d2759312c88. We can check back tomorrow and see what changes.
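For reference, a minimal sketch of applying and then verifying a cluster setting like this from Go over the SQL wire protocol. The connection string and the lib/pq driver choice are assumptions; only the setting name comes from the thread:

```go
// Sketch only: apply and verify a cluster setting over a SQL connection.
// The node address and credentials below are placeholders.
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // CockroachDB speaks the PostgreSQL wire protocol.
)

func main() {
	db, err := sql.Open("postgres", "postgresql://root@localhost:26257/?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Disable periodic generation of replication reports.
	if _, err := db.Exec("SET CLUSTER SETTING kv.replication_reports.interval = '0s'"); err != nil {
		log.Fatal(err)
	}

	// Read the setting back to confirm it took effect.
	var interval string
	if err := db.QueryRow("SHOW CLUSTER SETTING kv.replication_reports.interval").Scan(&interval); err != nil {
		log.Fatal(err)
	}
	fmt.Println("kv.replication_reports.interval =", interval)
}
```

The same SHOW CLUSTER SETTING check works from any SQL shell if you just want to confirm the value on the cluster.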

@bdarnell (Contributor)

My first guess was that the uninitialized variable that Andrew K just discovered was causing a lot of unnecessary compactions (and for some obscure reason this uninitialized variable happened to be zero most of the time in the old build). But it looks like the number of rocksdb compactions is the same between the two builds. (We don't know the size of the compactions, so there could be a change in this area, but it's unlikely that the count of compactions would match so closely with a substantially different size).

Is this regression visible in roachperf, or does that not run long enough to see the performance degrade like this? It would be nice to narrow this change down to a smaller window than two months.

@nvanbenschoten (Member)

The high latencies appear at very regular intervals, with a period of almost exactly 30 minutes.

@nvanbenschoten (Member)

CPU utilization doesn't seem to be correlated with the spikes, but Go allocated memory usage does. We should wait for the next spike and grab a heap profile.
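A sketch of how heap profiles could be captured on a schedule so a spike can be bracketed. It assumes the node exposes Go's pprof handlers at http://localhost:8080/debug/pprof/heap, which is an assumption about the deployment rather than something stated in the thread:

```go
// Sketch: periodically snapshot the heap profile from a node's pprof
// endpoint so profiles from before and during a latency spike can be diffed.
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

// snapshot downloads one heap profile and writes it to a timestamped file.
func snapshot(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	name := fmt.Sprintf("heap-%s.pb.gz", time.Now().Format("20060102-150405"))
	f, err := os.Create(name)
	if err != nil {
		return "", err
	}
	defer f.Close()

	_, err = io.Copy(f, resp.Body)
	return name, err
}

func main() {
	const heapURL = "http://localhost:8080/debug/pprof/heap" // assumed endpoint
	for {
		name, err := snapshot(heapURL)
		if err != nil {
			log.Println("snapshot failed:", err)
		} else {
			log.Println("wrote", name)
		}
		time.Sleep(5 * time.Minute)
	}
}
```

Two snapshots, one from a quiet period and one from a spike, can then be compared with go tool pprof -diff_base=<low>.pb.gz <high>.pb.gz, which is roughly the kind of high-vs-low diff shown a few comments below.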

@jordanlewis (Member)

Could this be statistics collection? I didn't think runs were so regularly scheduled, but collecting statistics does use a good amount of Go memory on each run. There's nothing in the jobs table though, and I think statistics runs do get jobs, so I think this theory is wrong.

@nvanbenschoten (Member)

nvanbenschoten commented Sep 30, 2019

I'm seeing large fluctuations in Go allocated memory in the following sources:

  457.52MB 29.25% 29.25%   551.03MB 35.23%  github.com/cockroachdb/cockroach/pkg/sql.(*optTableStat).init
   93.51MB  5.98% 35.23%    93.51MB  5.98%  github.com/cockroachdb/cockroach/pkg/sql/sqlbase.(*DatumAlloc).NewDInt
   41.74MB  2.67% 37.89%   593.77MB 37.96%  github.com/cockroachdb/cockroach/pkg/sql.newOptTable

EDIT: this is a diff between the high and the low profiles. (*optTableStat).init is responsible for more than this.

We also saw this in #40566 (comment) and we addressed it by reducing the SQL connection limit. Here we have 2k connections.

cc @RaduBerinde: is it possible that something changed here in the past two months? It is suspicious that we suddenly needed #40924 after never needing it before.

@nvanbenschoten (Member)

nvanbenschoten commented Sep 30, 2019

We do have very regular stats recomputation on the warehouse and district tables. Perhaps this is invalidating all of the connections' optimizer metadata and forcing recreation of query plans.

@nvanbenschoten (Member)

I'm going to turn off auto-stats collection (sql.stats.automatic_collection.enabled) and see if that has an effect.

@jordanlewis (Member)

#39743 turned on histograms by default, which was merged 8/20. I wonder if that's related.

@nvanbenschoten (Member)

It's possible. There is a sql.stats.histogram_collection.enabled cluster setting, so we can play with that once we see whether sql.stats.automatic_collection.enabled has any effect.
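Continuing the database/sql pattern from the sketch above, the two settings discussed here could be toggled as follows; only the setting names come from the thread, the helper names are made up:

```go
// Sketch: toggling the two stats-related cluster settings under discussion.
package settings

import "database/sql"

// disableAutoStats turns off automatic statistics recomputation cluster-wide.
func disableAutoStats(db *sql.DB) error {
	_, err := db.Exec("SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false")
	return err
}

// disableHistograms keeps auto stats but skips histogram collection, to
// isolate the histogram memory cost mentioned above.
func disableHistograms(db *sql.DB) error {
	_, err := db.Exec("SET CLUSTER SETTING sql.stats.histogram_collection.enabled = false")
	return err
}
```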

@RaduBerinde (Member)

I believe the delta of memory usage in optTableStats is due to the histograms. @rytaft might know better how much of a difference we should expect. I do agree that having separate instances of these objects across connections is silly, so I plan to fix that for 20.1 (tracked by #40922). We can also try a smaller fix to make the stats cache itself store these objects.

@RaduBerinde (Member)

RaduBerinde commented Sep 30, 2019

I started working on a change to push the decoding of the histograms to the stats cache. That will reuse the histogram memory across all connections (and make things more efficient when a new statistic becomes available).

RaduBerinde added a commit to RaduBerinde/cockroach that referenced this issue Sep 30, 2019
Currently the stats cache exposes the encoded histogram data, and the
opt catalog decodes it. The problem is that opt catalog objects are
not shared across connections so this is inefficient (especially in
memory usage).

This change moves the decoding into the stats cache. The opt catalog
objects now simply point to the data computed by the cache.

There are still inefficiencies to improve on in the future: the opt
catalog can hold on to multiple versions of tables, so we will keep
many versions of histograms "alive".

Informs cockroachdb#41206.
Informs cockroachdb#40922.

Release justification: fix for new functionality.

Release note: None
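A minimal illustration of the ownership change the commit above describes, with hypothetical type names rather than the real stats-cache or opt-catalog code: histograms are decoded once in the shared cache, and per-connection objects hold pointers instead of their own decoded copies.

```go
// Illustrative only; these types and methods are not the actual CockroachDB
// code. The point is the ownership change: decoded histograms live in the
// node-wide cache, and per-connection catalog objects just point at them.
package statsdemo

import "sync"

// Histogram stands in for the decoded (memory-heavy) representation.
type Histogram struct {
	Buckets []struct{ NumRange, NumEq float64 }
}

// TableStatistic is what the shared cache hands out.
type TableStatistic struct {
	RowCount  uint64
	Histogram *Histogram // decoded once, shared by all connections
}

// StatsCache is shared by every connection on the node.
type StatsCache struct {
	mu    sync.Mutex
	stats map[int64][]*TableStatistic // keyed by table ID
}

// optTableStat is the per-connection view: it references the cache's decoded
// histogram rather than decoding the raw bytes itself.
type optTableStat struct {
	stat *TableStatistic
}

// GetTableStats returns shared statistics without any per-connection
// decoding or copying.
func (c *StatsCache) GetTableStats(tableID int64) []*TableStatistic {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.stats[tableID]
}
```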
@nvanbenschoten (Member)

> I'm going to turn off auto-stats collection (sql.stats.automatic_collection.enabled) and see if that has an effect.

Unfortunately, it looks like that didn't resolve the issue. We still see the oscillations.

Regardless, @RaduBerinde thanks for getting #41220 out so fast!

@nvanbenschoten (Member)

That said, it does seem like the memory fluctuations are from optTableStats. Here's a series of heap profiles that show its growing contribution.

[screenshots: four successive heap profiles from Oct 1, 2019 (11:28–11:29 AM) showing optTableStat's growing share of allocations]

@RaduBerinde do you have any idea why this memory would be getting flushed every 30 minutes even without auto stats collection?

@nvanbenschoten (Member)

#41230 was also contributing to memory usage in the cluster, but that memory was stable so it's not responsible for this issue.

craig bot pushed a commit that referenced this issue Oct 1, 2019
41220: sql: move histogram decoding into the stats cache r=RaduBerinde a=RaduBerinde

(PR description identical to RaduBerinde's commit message quoted above.)

41231: util/log: ensure that secondary loggers do not leak memory r=petermattis,nvanbenschoten a=knz

Fixes #41230.

Note: the primary cause of this issue is removed by #40993, but that PR is blocked until 19.2 is out. I'm cherry-picking the subset of those changes sufficient to solve issue #41230 here.

Prior to this patch, logging via a secondary logger would allocate a
buffer, then add it to the buffer free list of the secondary logger.

This was causing a memory leak: only the main logger's free list is ever used to allocate buffers (even for secondary loggers), so buffers returned to a secondary logger's free list were never reused. They simply accumulated, hidden from Go's GC because they remained referenced.

Release justification: bug fix

Release note (bug fix): A memory leak was fixed that affected
secondary logging (SQL audit logs, statement execution, and RocksDB
logging).

Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
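The free-list bug described in the #41231 message above follows a pattern that can be sketched with hypothetical logger types (this is not the real util/log code): buffers are always taken from the main logger's list but handed back to the secondary logger's own list, which nothing ever reads.

```go
// Hypothetical sketch of the leak pattern, not the actual util/log code.
// Allocation always draws from mainLogger's free list, but secondary loggers
// return buffers to their own list, which is never drained, so those buffers
// stay reachable (the GC keeps them) and grow without bound.
package logdemo

import "sync"

type buffer struct {
	data []byte
	next *buffer
}

type logger struct {
	mu       sync.Mutex
	freeList *buffer
}

var mainLogger = &logger{}

// getBuffer is only ever called on mainLogger in this pattern.
func (l *logger) getBuffer() *buffer {
	l.mu.Lock()
	b := l.freeList
	if b != nil {
		l.freeList = b.next
	}
	l.mu.Unlock()
	if b == nil {
		b = &buffer{data: make([]byte, 0, 256)}
	}
	return b
}

// putBuffer prepends the buffer to this logger's free list.
func (l *logger) putBuffer(b *buffer) {
	l.mu.Lock()
	b.next = l.freeList
	l.freeList = b
	l.mu.Unlock()
}

// logTo shows the buggy flow: allocate from the main free list, but return
// to the secondary logger's own list, which nothing ever reads from.
func logTo(secondary *logger, msg string) {
	b := mainLogger.getBuffer()
	b.data = append(b.data[:0], msg...)
	// ... write b.data to the secondary log's output ...
	secondary.putBuffer(b) // bug: should be mainLogger.putBuffer(b)
}
```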
@RaduBerinde (Member)

The optCatalog (one per connection) has a cache of dataSources. When this reaches 100 entries, it gets cleared. Very simplistic, but these were meant to be small objects.

With auto stats off, this should only happen if something else is causing the table descriptor versions to keep changing.

We should see if #41220 fixes it.
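A rough sketch of the cache policy Radu describes, with hypothetical names rather than the actual optCatalog code: once the per-connection map holds 100 entries it is wiped, and every table touched afterwards is rebuilt from scratch, statistics included, which is what makes churn expensive with thousands of connections.

```go
// Hypothetical sketch of the per-connection cache policy, not the real code.
package catdemo

const maxDataSources = 100

type dataSource struct {
	// table metadata, statistics, histograms, ...
}

type optCatalog struct {
	dataSources map[string]*dataSource // one catalog (and map) per connection
}

// resolve returns a cached dataSource or rebuilds it, clearing the whole
// cache ("very simplistic") once it reaches the size limit.
func (oc *optCatalog) resolve(name string, build func() *dataSource) *dataSource {
	if ds, ok := oc.dataSources[name]; ok {
		return ds
	}
	if oc.dataSources == nil || len(oc.dataSources) >= maxDataSources {
		oc.dataSources = make(map[string]*dataSource)
	}
	ds := build() // expensive: re-reads descriptors and re-materializes stats
	oc.dataSources[name] = ds
	return ds
}
```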

@knz (Contributor)

knz commented Oct 2, 2019

Comment from @nvanbenschoten

the v19.2-beta cluster looks much better running cbbf40a, which contains #41220 and #41231

[screenshot: v19.2-beta cluster metrics after cbbf40a, captured Oct 1, 2019, 11:52 PM]

we should continue to monitor the cluster, but as long as things are still looking ok tomorrow morning, we should consider #41206 sufficiently resolved to where it’s no longer a blocker

knz mentioned this issue on Oct 2, 2019
@nvanbenschoten (Member)

It looks like this is fixed. Thanks for the help @knz and @RaduBerinde.

[screenshots: cluster metrics captured Oct 2, 2019, 11:19 AM]
