Fix metric sdk when multiple readers are present #4436

jack-berg · 2022-05-04T21:20:01Z

I’ve found a rather important conceptual flaw in the metrics SDK that prevents proper function when multiple metric readers are registered.

The root of the problem is that there is one metric storage per instrument and matching view. This single storage is shared among all metric readers, each of which resets it after collection. As a result, readers interfere with each other and each receives only partial measurements.

The solution in this PR is to create one metric storage per reader per instrument and matching view. When measurements are recorded to an instrument, they accumulate in each registered reader's storage. When a reader collects metrics, it reads and resets only the storages associated with it. Along the way, I was able to remove a fair amount of complexity.
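The per-reader layout described above can be sketched as follows. This is a minimal, simplified model and not the SDK's actual code: `PerReaderStorageSketch`, `SumStorage`, `Counter`, and the string reader IDs and attribute keys are all hypothetical names chosen for illustration, and only delta-style sums are modeled.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of per-reader storage; not the SDK's real classes. */
final class PerReaderStorageSketch {

  /** Accumulates a sum per attribute key and resets when collected. */
  static final class SumStorage {
    private final Map<String, Long> sums = new HashMap<>();

    void record(String attributes, long value) {
      sums.merge(attributes, value, Long::sum);
    }

    /** Returns a snapshot of the accumulated sums, then clears the storage. */
    Map<String, Long> collectAndReset() {
      Map<String, Long> snapshot = new HashMap<>(sums);
      sums.clear();
      return snapshot;
    }
  }

  /** An instrument that keeps one storage per registered reader. */
  static final class Counter {
    private final Map<String, SumStorage> storageByReader = new LinkedHashMap<>();

    Counter(List<String> readerIds) {
      for (String id : readerIds) {
        storageByReader.put(id, new SumStorage());
      }
    }

    void add(String attributes, long value) {
      // Each measurement is accumulated into every reader's own storage.
      for (SumStorage storage : storageByReader.values()) {
        storage.record(attributes, value);
      }
    }

    Map<String, Long> collect(String readerId) {
      // A reader reads and resets only the storage associated with it.
      return storageByReader.get(readerId).collectAndReset();
    }
  }

  public static void main(String[] args) {
    Counter counter = new Counter(List.of("readerA", "readerB"));
    counter.add("k=1", 5);
    System.out.println("readerA: " + counter.collect("readerA"));
    counter.add("k=1", 7);
    // readerB has not collected yet, so it still sees both measurements.
    System.out.println("readerB: " + counter.collect("readerB"));
  }
}
```

With a single shared storage, the first collection by `readerA` would have reset the sum and `readerB` would have received only the second measurement.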

I discovered this issue while investigating adding support for allowing metric readers (and exporters) to specify their own default aggregation for each instrument type. This solution paves the way for that as well.

codecov · 2022-05-04T21:32:47Z

Codecov Report

Merging #4436 (ab11d74) into main (078d55a) will decrease coverage by 0.07%.
The diff coverage is 84.86%.

@@             Coverage Diff              @@
##               main    #4436      +/-   ##
============================================
- Coverage     90.20%   90.12%   -0.08%     
+ Complexity     5030     5002      -28     
============================================
  Files           572      569       -3     
  Lines         15513    15438      -75     
  Branches       1497     1488       -9     
============================================
- Hits          13994    13914      -80     
- Misses         1048     1061      +13     
+ Partials        471      463       -8

Impacted Files	Coverage Δ
...try/sdk/metrics/internal/SdkMeterProviderUtil.java	`77.77% <ø> (+0.63%)`	⬆️
...sdk/metrics/internal/state/EmptyMetricStorage.java	`0.00% <0.00%> (-45.46%)`	⬇️
...etry/sdk/metrics/internal/state/MetricStorage.java	`0.00% <ø> (-100.00%)`	⬇️
.../metrics/internal/state/MetricStorageRegistry.java	`92.59% <ø> (ø)`
.../sdk/metrics/internal/export/RegisteredReader.java	`71.42% <71.42%> (ø)`
...nternal/state/DefaultSynchronousMetricStorage.java	`78.57% <81.39%> (-4.36%)`	⬇️
...k/metrics/internal/state/CallbackRegistration.java	`97.43% <93.75%> (-2.57%)`	⬇️
...y/sdk/metrics/internal/state/MeterSharedState.java	`97.80% <94.28%> (-1.01%)`	⬇️
...in/java/io/opentelemetry/sdk/metrics/SdkMeter.java	`100.00% <100.00%> (ø)`
...io/opentelemetry/sdk/metrics/SdkMeterProvider.java	`100.00% <100.00%> (+5.55%)`	⬆️
... and 12 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jack-berg · 2022-05-04T21:21:48Z

sdk/metrics/src/test/java/io/opentelemetry/sdk/metrics/SdkMeterProviderTest.java

@@ -748,22 +748,21 @@ void viewSdk_capturesBaggageFromContext() {
  }

  @Test
-  void sdkMeterProvider_supportsMultipleCollectorsCumulative() {


These two tests were written in a way that allowed the bug to hide. Recording additional measurements with attributes reveals the true behavior. These tests as currently written fail on main.

jack-berg · 2022-05-04T21:23:12Z

sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/SdkMeterProvider.java

@@ -41,10 +35,8 @@ public final class SdkMeterProvider implements MeterProvider, Closeable {

  private final ComponentRegistry<SdkMeter> registry;
  private final MeterProviderSharedState sharedState;
-  private final Map<CollectionHandle, CollectionInfo> collectionInfoMap;


CollectionHandle and CollectionInfo had overlapping responsibilities. I've replaced them with a single class whose name more clearly conveys its responsibilities: RegisteredReader.

jack-berg · 2022-05-04T21:26:25Z

sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/SdkMeterProvider.java

  private final AtomicBoolean isClosed = new AtomicBoolean(false);
-  private final AtomicLong lastCollectionTimestamp;
-  private final long minimumCollectionIntervalNanos;


The minimumCollectionIntervalNanos we've previously talked about turned out to be central to the bug. Its true purpose appears to be ensuring that when multiple readers are present, they each receive the same data if they collect within a narrow enough interval of time. However, I believe the mechanism to be flawed, as it didn't account for correctness when the readers are on different schedules (a perfectly reasonable scenario).

The refactor negates the need for it and assists in reducing complexity.

jack-berg · 2022-05-04T21:29:25Z

sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/internal/export/RegisteredReader.java

+ * <p>This class is internal and is hence not for public use. Its APIs are unstable and can change
+ * at any time.
+ */
+public class RegisteredReader {


A simple wrapper of MetricReader which is assigned a UUID to allow internal code to differentiate readers.

Later, we may choose to use this class to track when a reader has last collected, which would assist in solving #4400.

As an FYI - you could continue to use CollectionHandle if you wanted, which is a GUID (from the standpoint of the Metrics SDK), and performs similarly to using an integer ID.

They may be less needed given the overall scope here.

If I understand correctly, CollectionHandle and CollectionInfo are both somewhat involved in identifying a unique reader. I didn't see the need to have two classes for that concept, and figured renaming to RegisteredReader was more representative of what the class does. That is, it represents a unique registered reader. Internal components can rely on hashCode() and equals() that are unique among readers, and any metadata that needs to be stored with the reader can be associated with the RegisteredReader.
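A rough sketch of the identity idea discussed in this thread, under stated assumptions: `Reader` is a hypothetical stand-in for MetricReader, and `RegisteredReaderSketch` is an illustrative name, not the class added in this PR. It only shows a wrapper whose `equals()` and `hashCode()` are based on a per-registration UUID, so the wrapper can be used as a map key.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch of a reader wrapper with a unique identity. */
final class ReaderIdentitySketch {

  /** Stand-in for MetricReader, present only to keep the sketch self-contained. */
  interface Reader {}

  /** Wraps a reader and gives each registration its own identity. */
  static final class RegisteredReaderSketch {
    private final UUID id = UUID.randomUUID();
    private final Reader reader;

    RegisteredReaderSketch(Reader reader) {
      this.reader = reader;
    }

    Reader getReader() {
      return reader;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) {
        return true;
      }
      if (!(o instanceof RegisteredReaderSketch)) {
        return false;
      }
      // Equality is based on the registration id, not on the wrapped reader.
      return id.equals(((RegisteredReaderSketch) o).id);
    }

    @Override
    public int hashCode() {
      return id.hashCode();
    }
  }

  public static void main(String[] args) {
    Reader reader = new Reader() {};
    RegisteredReaderSketch first = new RegisteredReaderSketch(reader);
    RegisteredReaderSketch second = new RegisteredReaderSketch(reader);

    // Per-reader state (for example, a storage) can be keyed by the wrapper.
    Map<RegisteredReaderSketch, String> stateByReader = new HashMap<>();
    stateByReader.put(first, "storage for first registration");
    stateByReader.put(second, "storage for second registration");
    System.out.println(stateByReader.size());
  }
}
```

Metadata such as a last-collection timestamp could later be added as a field on the wrapper without changing how it is used as a key.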

jack-berg · 2022-05-04T21:29:59Z

...ics/src/main/java/io/opentelemetry/sdk/metrics/internal/state/AsynchronousMetricStorage.java

+    this.aggregationTemporality =
+        registeredReader
+            .getReader()
+            .getAggregationTemporality(metricDescriptor.getSourceInstrument().getType());


A small optimization to calculate the temporality once instead of each time a collection occurs.
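The optimization amounts to resolving a value in the constructor and reusing it on every collection. A minimal sketch, assuming a hypothetical `selector` function in place of the reader's temporality lookup; the enums and `Storage` class here are illustrative and not the SDK's types:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical sketch: resolve temporality once at construction time. */
final class CachedTemporalitySketch {

  enum Temporality { DELTA, CUMULATIVE }

  enum InstrumentType { COUNTER, HISTOGRAM }

  static final class Storage {
    // Resolved once in the constructor instead of on every collection.
    private final Temporality temporality;

    Storage(Function<InstrumentType, Temporality> selector, InstrumentType type) {
      this.temporality = selector.apply(type);
    }

    Temporality collect() {
      // Collection reuses the cached value; the selector is not called again.
      return temporality;
    }
  }

  public static void main(String[] args) {
    AtomicInteger lookups = new AtomicInteger();
    // Illustrative selector only; the mapping below is an assumption.
    Function<InstrumentType, Temporality> selector =
        type -> {
          lookups.incrementAndGet();
          return type == InstrumentType.COUNTER ? Temporality.DELTA : Temporality.CUMULATIVE;
        };

    Storage storage = new Storage(selector, InstrumentType.COUNTER);
    storage.collect();
    storage.collect();
    System.out.println("selector lookups: " + lookups.get());
  }
}
```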

jack-berg · 2022-05-04T21:40:18Z

...metrics/src/main/java/io/opentelemetry/sdk/metrics/internal/state/MetricStorageRegistry.java

@@ -38,7 +38,7 @@ public class MetricStorageRegistry {
  private final Map<MetricDescriptor, MetricStorage> registry = new HashMap<>();

  /** Returns a {@link Collection} of the registered {@link MetricStorage}. */
-  public Collection<MetricStorage> getMetrics() {
+  public Collection<MetricStorage> getStorages() {


Rename for improved clarity.

Makes sense. It could even be something like "getAllStorages" to make it clear that all registered ones will be returned.
It may also make sense to change the Javadoc, e.g. line 48 to "Registers the storage..." (I can't add a review comment to these lines directly).

jsuereth · 2022-05-09T16:26:54Z

sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/internal/export/RegisteredReader.java

+ * <p>This class is internal and is hence not for public use. Its APIs are unstable and can change
+ * at any time.
+ */
+public class RegisteredReader {


As an FYI - you could continue to use CollectionHandle if you wanted, which is a GUID (from the standpoint of the Metrics SDK), and performs similarly to using an integer ID.

They may be less needed given the overall scope here.

jsuereth · 2022-05-09T16:28:16Z

sdk/metrics/src/main/java/io/opentelemetry/sdk/metrics/internal/state/CallbackRegistration.java

  }

  public InstrumentDescriptor getInstrumentDescriptor() {
    return instrumentDescriptor;
  }

-  void invokeCallback() {
+  void invokeCallback(RegisteredReader reader) {


Is this method synchronized in some fashion?

At the moment all collections across all readers are synchronized such that only one happens at a time. This is controlled in MeterSharedState, which obtains collectLock during collection.
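The locking described in this reply can be illustrated roughly as below. This is a sketch under assumptions, not the code in MeterSharedState: `SharedState`, `registerCallback`, `collectAll`, and the use of a plain monitor lock are hypothetical. The helper `maxConcurrentCallbacks` starts several collecting threads and reports the highest number of callbacks observed running at the same time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of serializing collections with a single lock. */
final class CollectLockSketch {

  static final class SharedState {
    private final Object collectLock = new Object();
    private final List<Runnable> callbacks = new ArrayList<>();

    void registerCallback(Runnable callback) {
      synchronized (collectLock) {
        callbacks.add(callback);
      }
    }

    void collectAll() {
      // Only one collection runs at a time, whichever reader triggers it.
      synchronized (collectLock) {
        for (Runnable callback : callbacks) {
          callback.run();
        }
      }
    }
  }

  /** Runs concurrent collections and returns the peak number of overlapping callbacks. */
  static int maxConcurrentCallbacks(int readerThreads) {
    AtomicInteger active = new AtomicInteger();
    AtomicInteger maxActive = new AtomicInteger();
    SharedState state = new SharedState();
    state.registerCallback(
        () -> {
          int now = active.incrementAndGet();
          maxActive.accumulateAndGet(now, Math::max);
          try {
            Thread.sleep(5); // widen the window in which an overlap could occur
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
          active.decrementAndGet();
        });

    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < readerThreads; i++) {
      Thread thread = new Thread(state::collectAll);
      threads.add(thread);
      thread.start();
    }
    for (Thread thread : threads) {
      try {
        thread.join();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }
    return maxActive.get();
  }

  public static void main(String[] args) {
    System.out.println("peak concurrent callbacks: " + maxConcurrentCallbacks(4));
  }
}
```

Because the callback is invoked only while the lock is held, it does not need its own synchronization in this model.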

ghost

Looks solid, some minor comments from my side.

Fix metric sdk when multiple readers are present

3602eb4

jack-berg requested review from anuraaga, jkwatson and jsuereth as code owners May 4, 2022 21:20

jack-berg requested a review from a user May 4, 2022 21:20

jack-berg requested a review from Oberon00 as a code owner May 4, 2022 21:20

jack-berg mentioned this pull request May 5, 2022

Fix delta intervals #4437

Merged

jsuereth reviewed May 9, 2022

jack-berg added 2 commits May 11, 2022 14:47

Merge branch 'main' of https://github.com/open-telemetry/opentelemetr…

148cc51

…y-java into fix-multiple-readers

Merge DeltaMetricStorage into DefaultSynchronousMetricStorage

ec9f1b8

anuraaga approved these changes May 13, 2022

ghost approved these changes May 13, 2022

Merge branch 'main' of https://github.com/open-telemetry/opentelemetr…

ab11d74

…y-java into fix-multiple-readers

jack-berg merged commit 8659a82 into open-telemetry:main May 13, 2022
