OTLP Metrics update #162

Closed · wants to merge 7 commits

Conversation

@jmacd (Contributor) commented Jul 1, 2020

This is a substantial refactoring. The highlights:

Resolves #125
Resolves #124
Resolves #81
Resolves #78

Partially implemented: open-telemetry/opentelemetry-go#884

This PR is a draft mainly because there are still a number of comments that have to be fixed in the lower half of the file.

@jmacd (Contributor, Author) commented Jul 2, 2020

Please review the Metric, MetricDescriptor, and DataPoint types (lines 47-389). (There are still comments to incorporate from open-telemetry/oteps#117, and a few comments that need updating in Histogram).

@MrAlias (Contributor) left a comment

Looks good. Nothing blocking; I'm interested in moving forward with this approach. There will likely be iterations on the documentation, but structurally it looks good.

opentelemetry/proto/metrics/v1/metrics.proto: four review threads (outdated, resolved)
Comment on lines +452 to +457
// Remove the labels, start_time, and time TODO.
//
// The cumulative value is over the time interval (start_time_unix_nano, time_unix_nano].
// Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
// fields: Add min, max, last fields
// as described in https://github.com/open-telemetry/oteps/pull/117
//
// Value of 0 indicates that the timestamp is unspecified. In that case the timestamp
// may be decided by the backend.
fixed64 start_time_unix_nano = 2;
// (Same comment)
Contributor commented:

Done?

// time_unix_nano is the moment when this value was recorded.
// Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
fixed64 time_unix_nano = 3;
// (Same comment)
Contributor commented:

missing TODO?

@bogdandrutu (Member) commented:

I remembered the problem with using one DataPoint type. The main argument was overall memory usage (not the number of allocations). For one metric you have a lot of data points, and now each data point requires more space (even when using pointers for Histogram and Summary). For example, for an int metric with 10 data points you allocate 10 × 3 extra int64s of memory (assuming pointers are used for Histogram and Summary): every DataPoint now carries 3 extra int64-sized fields.
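
A minimal Go sketch of this size concern (the struct shapes and field names are simplified illustrations, not the actual generated OTLP types):

  package main

  import (
      "fmt"
      "unsafe"
  )

  // Per-type point, roughly as in the existing protocol: one value field.
  type Int64DataPoint struct {
      StartTimeUnixNano uint64
      TimeUnixNano      uint64
      Value             int64
  }

  // Unified point, roughly as proposed here: every point carries a slot for
  // each kind of value, even though only one of them is populated.
  type UnifiedDataPoint struct {
      StartTimeUnixNano uint64
      TimeUnixNano      uint64
      Int64Value        int64
      DoubleValue       float64
      Histogram         *struct{} // nil for scalar points
      Summary           *struct{} // nil for scalar points
  }

  func main() {
      fmt.Println(unsafe.Sizeof(Int64DataPoint{}))   // 24 bytes on 64-bit
      fmt.Println(unsafe.Sizeof(UnifiedDataPoint{})) // 48 bytes: three extra 8-byte words per point
  }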

@jmacd (Contributor, Author) commented Jul 3, 2020

The main argument was overall memory usage

I see. This makes me wonder about using a custom decoder, encoder, and representation of metric data as it passes through the collector. At least I'm much happier to deal with a performance concern than a philosophical one. Thanks, I'll evaluate.

opentelemetry/proto/metrics/v1/metrics.proto: two more review threads (resolved)
// application called the API, as opposed to the SDK calling the
// application via a callback. When set, SYNCHRONOUS indicates
// that there may be an associated trace context, otherwise
// data points are considered asynchronous. Asynchronous Counts
Member commented:

What Count does this refer to?

Contributor (Author) replied:

I clarified that this refers to DataPoint.Histogram.Count or DataPoint.Summary.Count.

// MONOTONIC indicates that the calculated sum can be monitored
// as a rate of change.
//
// Synchronicity applies to data points produced when the
Member commented:

How is this consumed by a backend? If the data are coming from another library like Prometheus, how would I know what to set here? They define exemplars as well, and in the collector I don't know in advance whether a metric will have exemplars associated with it. Is this something that we really need, or can we remove it to avoid complicating translation from other protocols?

Contributor (Author) replied:

I rewrote this as:

  //   Synchronicity applies to data points produced when the
  //   application called the API, as opposed to the SDK calling the
  //   application via a callback.  When set, SYNCHRONOUS indicates
  //   that there may be an associated trace context.  The rate of
  //   synchronous events has a meaningful interpretation, when
  //   considering the DataPoint.Histogram.Count or
  //   DataPoint.Summary.Count fields, as these are true
  //   application-level events.
  //
  //   Data points without SYNCHRONOUS are considered asynchronous.
  //   The DataPoint.Histogram.Count and DataPoint.Summary.Count
  //   fields record a number of observations, which is controlled
  //   both by the number of distinct label sets observed and by the
  //   rate of collection.  Because asynchronous metric Count fields
  //   depend on the rate of collection, these Counts should not be
  //   interpreted as rates (however they may still be used to form
  //   an average value of these metrics).

Contributor (Author) added:

If the data are coming from other library like Prometheus how would I know what to set here?

Hopefully SYNCHRONOUS is always a safe option. This makes me think that the optional flag should be ASYNCHRONOUS, and that it would only be set when the value was definitely generated by a variable-rate callback controlled by the SDK, not by user-level events. How does this sound?

Member replied:

It feels to me that this information cannot be guaranteed, and I would like to challenge whether this is an important and well-defined property that should be in the protocol. Maybe consider removing it instead.

Contributor (Author) replied:

If we cannot include this information, the entire part of the specification that deals with asynchronous instrument semantics was for nothing. I think when ASYNCHRONOUS is true it is guaranteed.

@jmacd (Contributor, Author) commented Jul 14, 2020

I remembered the problem with using one DataPoint. The main argument was overall memory usage

I did some performance analysis: open-telemetry/opentelemetry-collector#1346

It led me to think this PR is OK as it stands and that we can proceed as follows:
(1) we hand-optimize it for the collector later;
(2) we refine the protocol to support interned labels (a rough sketch of the idea follows below).
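
For (2), a rough Go sketch of the interned-labels idea (the type and field names here are hypothetical illustrations, not a concrete proposal for the .proto layout): the batch carries each distinct label string once, and data points refer to keys and values by index instead of repeating the strings.

  package main

  import "fmt"

  // MetricBatch holds every distinct label string once for the whole batch.
  type MetricBatch struct {
      StringTable []string
      Points      []DataPoint
  }

  // DataPoint refers to its labels as (key, value) index pairs into StringTable.
  type DataPoint struct {
      LabelRefs    []uint32
      TimeUnixNano uint64
      Value        float64
  }

  func main() {
      b := MetricBatch{
          StringTable: []string{"host", "web-1", "region", "us-east"},
          Points: []DataPoint{
              {LabelRefs: []uint32{0, 1, 2, 3}, Value: 42},
          },
      }
      p := b.Points[0]
      for i := 0; i+1 < len(p.LabelRefs); i += 2 {
          // Resolve the indexes back to the interned label strings.
          fmt.Printf("%s=%s ", b.StringTable[p.LabelRefs[i]], b.StringTable[p.LabelRefs[i+1]])
      }
      fmt.Println(p.Value) // host=web-1 region=us-east 42
  }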

@MrAlias (Contributor) commented Jul 14, 2020

@jmacd do you have the results of the performance test you could post?

@jmacd (Contributor, Author) commented Jul 14, 2020

@MrAlias I performed a hack-and-slash operation on Tigran's repository, branch here. I took a number of shortcuts to avoid creating build rules that would automatically run protoc and produce the generated artifacts (since that is a tough problem and I feel time pressure). What I did instead was to check out two branches of opentelemetry-proto, run the same protoc/protoc-gen-go on both, and use go.mod replacements to build both dependencies in a single binary (a rough sketch of this arrangement follows the list below):

  1. The baseline is opentelemetry-proto + a small MMLSC change.
  2. The experiment is this PR, with one extra commit full of hacks to rename the package to avoid trouble with the protobuf library, but no structural changes.

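The go.mod arrangement looks roughly like this (module names and local paths below are placeholders for illustration, not the ones used in the experiment):

  module example.com/otlpbench // hypothetical benchmark module

  go 1.14

  require (
      example.com/otelproto-baseline v0.0.0
      example.com/otelproto-experiment v0.0.0
  )

  // Point both generated-proto dependencies at local checkouts of the two
  // opentelemetry-proto branches, each generated with the same protoc/protoc-gen-go.
  replace example.com/otelproto-baseline => ../opentelemetry-proto-baseline

  replace example.com/otelproto-experiment => ../opentelemetry-proto-experiment
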
There are a lot of parameters we could study to compare performance. First I'll post a sample comparing the two protocols with small requests consisting of one metric, one point, 10 labels per metric, 3 values per label:

640000 nano batches, 1 metrics per batch, 10 labels per metric
http11/otlp/int64/1           640000 points| 89.3 cpusec| 50.8 wallsec| 7166.1 batches/cpusec| 12594.4 batches/wallsec|139.5 cpuμs/point
http11conc/otlp/int64/10      640000 points| 82.1 cpusec| 12.2 wallsec| 7799.2 batches/cpusec| 52438.3 batches/wallsec|128.2 cpuμs/point
http11/otlp/float64/1         640000 points| 94.7 cpusec| 54.6 wallsec| 6761.8 batches/cpusec| 11724.5 batches/wallsec|147.9 cpuμs/point
http11conc/otlp/float64/10    640000 points| 80.8 cpusec| 12.7 wallsec| 7925.7 batches/cpusec| 50277.0 batches/wallsec|126.2 cpuμs/point
http11/otlp/mmlsc/1           640000 points| 93.9 cpusec| 54.1 wallsec| 6817.2 batches/cpusec| 11819.9 batches/wallsec|146.7 cpuμs/point
http11conc/otlp/mmlsc/10      640000 points| 79.5 cpusec| 11.7 wallsec| 8054.4 batches/cpusec| 54600.4 batches/wallsec|124.2 cpuμs/point
http11/experiment/int64/1     640000 points| 89.5 cpusec| 50.8 wallsec| 7148.4 batches/cpusec| 12602.3 batches/wallsec|139.9 cpuμs/point
http11conc/experiment/int64/10  640000 points| 78.3 cpusec| 11.3 wallsec| 8176.8 batches/cpusec| 56441.9 batches/wallsec|122.3 cpuμs/point
http11/experiment/float64/1   640000 points| 89.8 cpusec| 51.1 wallsec| 7123.8 batches/cpusec| 12531.7 batches/wallsec|140.4 cpuμs/point
http11conc/experiment/float64/10  640000 points| 77.8 cpusec| 11.3 wallsec| 8224.1 batches/cpusec| 56762.5 batches/wallsec|121.6 cpuμs/point
http11/experiment/mmlsc/1     640000 points| 91.2 cpusec| 51.9 wallsec| 7019.1 batches/cpusec| 12335.5 batches/wallsec|142.5 cpuμs/point
http11conc/experiment/mmlsc/10  640000 points| 77.4 cpusec| 11.4 wallsec| 8266.6 batches/cpusec| 56346.1 batches/wallsec|121.0 cpuμs/point

The protocols are HTTP with 1 thread (http11) and 10 threads (http11conc). Three aggregations are tested: int64, double, and MMLSC. Comparing http11conc/experiment/mmlsc/10 with http11conc/otlp/mmlsc/10, for example, gives 121.0 cpuμs/point vs 124.2 cpuμs/point for the concurrent MMLSC case (124.2 / 121.0 ≈ 1.026), a difference of about 2.5% in CPU cost per point, with the experiment slightly ahead in this small-request configuration. I will post memory profiles for the same example next, then I'll post throughput results for a greater number of points per metric.

@jmacd (Contributor, Author) commented Jul 14, 2020

The same test with 10 points per metric, reducing the number of batches to keep approximately the same wall time of these tests:

320000 nano batches, 1 metrics per batch, 10 labels per metric
http11/otlp/int64/1          3200000 points| 66.9 cpusec| 36.6 wallsec| 4781.8 batches/cpusec|  8753.4 batches/wallsec| 20.9 cpuμs/point
http11conc/otlp/int64/10     3200000 points| 71.3 cpusec|  9.8 wallsec| 4486.8 batches/cpusec| 32508.4 batches/wallsec| 22.3 cpuμs/point
http11/otlp/float64/1        3200000 points| 66.9 cpusec| 36.6 wallsec| 4784.7 batches/cpusec|  8737.9 batches/wallsec| 20.9 cpuμs/point
http11conc/otlp/float64/10   3200000 points| 70.5 cpusec|  9.7 wallsec| 4538.4 batches/cpusec| 32993.3 batches/wallsec| 22.0 cpuμs/point
http11/otlp/mmlsc/1          3200000 points| 67.8 cpusec| 37.5 wallsec| 4721.2 batches/cpusec|  8539.2 batches/wallsec| 21.2 cpuμs/point
http11conc/otlp/mmlsc/10     3200000 points| 70.8 cpusec|  9.7 wallsec| 4522.3 batches/cpusec| 32869.2 batches/wallsec| 22.1 cpuμs/point
http11/experiment/int64/1    3200000 points| 66.5 cpusec| 36.5 wallsec| 4812.8 batches/cpusec|  8758.2 batches/wallsec| 20.8 cpuμs/point
http11conc/experiment/int64/10 3200000 points| 69.1 cpusec|  9.5 wallsec| 4629.0 batches/cpusec| 33573.6 batches/wallsec| 21.6 cpuμs/point
http11/experiment/float64/1  3200000 points| 66.6 cpusec| 36.6 wallsec| 4801.9 batches/cpusec|  8739.7 batches/wallsec| 20.8 cpuμs/point
http11conc/experiment/float64/10 3200000 points| 68.8 cpusec|  9.4 wallsec| 4654.5 batches/cpusec| 33946.5 batches/wallsec| 21.5 cpuμs/point
http11/experiment/mmlsc/1    3200000 points| 67.6 cpusec| 37.6 wallsec| 4732.3 batches/cpusec|  8519.5 batches/wallsec| 21.1 cpuμs/point
http11conc/experiment/mmlsc/10 3200000 points| 70.7 cpusec|  9.7 wallsec| 4527.4 batches/cpusec| 32899.3 batches/wallsec| 22.1 cpuμs/point

Here the throughput difference is negligible, I suspect because label encoding becomes the dominant cost of transmitting metrics. Possibly I've used a larger number of labels than is realistic? I think that varies widely. I'm still reaching the conclusion that the detail we were focused on is relatively minor from a throughput perspective. Next I will share memory profile results.

@jmacd (Contributor, Author) commented Jul 14, 2020

I repeated a specific pair of tests with a larger number of metrics per batch, which reduces the overhead of HTTP requests in the benchmark. Here with 100 metrics per batch (above it was 1 / batch):

$ go run ./cmd/benchmark/main.go -protocol http11conc -encoding otlp -batches=20000 -metricsperbatch=100 -pointspermetric=10 -labelspermetric=10 -labelnumvalues=3 -aggregation mmlsc -memprofile=mmlsc.otlp.mem
http11conc/otlp/mmlsc/10     20000000 points|225.8 cpusec| 31.0 wallsec|   88.6 batches/cpusec|   644.8 batches/wallsec| 11.3 cpuμs/point
$ go run ./cmd/benchmark/main.go -protocol http11conc -encoding experiment -batches=20000 -metricsperbatch=100 -pointspermetric=10 -labelspermetric=10 -labelnumvalues=3 -aggregation mmlsc -memprofile=mmlsc.experiment.mem
http11conc/experiment/mmlsc/10 20000000 points|261.0 cpusec| 40.0 wallsec|   76.6 batches/cpusec|   500.0 batches/wallsec| 13.0 cpuμs/point

Now this shows a 15% loss of performance with the experimental change. (That's bad!)

Here's the pair of go tool pprof -alloc_objects top-10 lists.

For the baseline (OTLP+MMLSC):

Showing nodes accounting for 578831252, 99.60% of 581179683 total
Dropped 73 nodes (cum <= 2905898)
Showing top 10 nodes out of 27
      flat  flat%   sum%        cum   cum%
 225059320 38.72% 38.72%  225059320 38.72%  reflect.New
 202279983 34.81% 73.53%  202279983 34.81%  google.golang.org/protobuf/encoding/protowire.ConsumeString (inline)
 110054989 18.94% 92.47%  110054989 18.94%  google.golang.org/protobuf/internal/impl.pointer.AppendPointerSlice (inline)
  33130975  5.70% 98.17%   33130975  5.70%  github.com/tigrannajaryan/exp-otelproto/encodings/otlp.otlpLabels
   5088219  0.88% 99.04%   41892360  7.21%  github.com/tigrannajaryan/exp-otelproto/encodings/otlp.genMMLSC
   3045076  0.52% 99.57%    3045076  0.52%  github.com/tigrannajaryan/exp-otelproto/core.(*LabelMaker).Get
     65537 0.011% 99.58%  537979329 92.57%  net/http.(*conn).serve
     62808 0.011% 99.59%  537632028 92.51%  github.com/tigrannajaryan/exp-otelproto/protoimpls/http11.telemetryReceiver
     44345 0.0076% 99.60%   42220965  7.26%  github.com/tigrannajaryan/exp-otelproto/encodings/otlp.(*Generator).GenerateMetricBatch
         0     0% 99.60%  537394292 92.47%  github.com/golang/protobuf/proto.Unmarshal

and for the EXPERIMENT, which as you can see allocates about 624M objects vs 581M (roughly 7% more allocations by count):

Showing nodes accounting for 618562859, 99.06% of 624460737 total
Dropped 107 nodes (cum <= 3122303)
Showing top 10 nodes out of 28
      flat  flat%   sum%        cum   cum%
 224461993 35.94% 35.94%  224461993 35.94%  reflect.New
 200313871 32.08% 68.02%  200313871 32.08%  google.golang.org/protobuf/encoding/protowire.ConsumeString (inline)
 112267887 17.98% 86.00%  112267887 17.98%  google.golang.org/protobuf/internal/impl.pointer.AppendPointerSlice (inline)
  32442794  5.20% 91.20%   32442794  5.20%  github.com/tigrannajaryan/exp-otelproto/encodings/otlpexperiment.otlpLabels
  20676923  3.31% 94.51%   20676923  3.31%  google.golang.org/protobuf/encoding/protowire.AppendVarint
  20251242  3.24% 97.75%  577971916 92.56%  google.golang.org/protobuf/internal/impl.(*MessageInfo).unmarshalPointer
   8043799  1.29% 99.04%   44284284  7.09%  github.com/tigrannajaryan/exp-otelproto/encodings/otlpexperiment.genMMLSC
     65537  0.01% 99.05%  578525904 92.64%  net/http.(*conn).serve
     30621 0.0049% 99.05%   44811557  7.18%  github.com/tigrannajaryan/exp-otelproto/encodings/otlpexperiment.(*Generator).GenerateMetricBatch
      8192 0.0013% 99.06%  578170927 92.59%  github.com/tigrannajaryan/exp-otelproto/protoimpls/http11.telemetryReceiver

The use of MMLSC for this test is indicative of what we'd see across the board with a proper protobuf oneof. This leads me to suggest that we reject the structural changes here.

@tigrannajaryan (Member) commented Jul 14, 2020

@jmacd I'd advise running with larger batches to see the effect of your changes more easily. For tiny batches the performance profile is likely dominated by per-request constant overhead, which obscures the serialization performance differences that I assume you are optimizing here.
Alternatively, you could use just the encoding benchmarks, which measure purely the marshaling/unmarshaling; make encoding-benchmark should do it.

@jmacd (Contributor, Author) commented Jul 14, 2020

For the same data described above, the -alloc_space summary shows a 1.4% increase in total space allocated.

For OTLP (baseline):

Showing nodes accounting for 45306.84MB, 99.67% of 45454.92MB total
Dropped 62 nodes (cum <= 227.27MB)
Showing top 10 nodes out of 38
      flat  flat%   sum%        cum   cum%
18531.60MB 40.77% 40.77% 18531.60MB 40.77%  reflect.New
 9921.67MB 21.83% 62.60%  9921.67MB 21.83%  bytes.makeSlice
 5203.46MB 11.45% 74.04%  5203.46MB 11.45%  google.golang.org/protobuf/internal/impl.pointer.AppendPointerSlice (inline)
 4525.51MB  9.96% 84.00%  4526.01MB  9.96%  google.golang.org/protobuf/proto.MarshalOptions.marshal
 3121.05MB  6.87% 90.87%  3121.05MB  6.87%  google.golang.org/protobuf/encoding/protowire.ConsumeString (inline)
 2527.69MB  5.56% 96.43%  2527.69MB  5.56%  github.com/tigrannajaryan/exp-otelproto/encodings/otlp.otlpLabels
  929.28MB  2.04% 98.47%   929.28MB  2.04%  github.com/tigrannajaryan/exp-otelproto/core.(*LabelMaker).Get
  534.57MB  1.18% 99.65%  4029.05MB  8.86%  github.com/tigrannajaryan/exp-otelproto/encodings/otlp.genMMLSC
    8.50MB 0.019% 99.67%  4060.12MB  8.93%  github.com/tigrannajaryan/exp-otelproto/encodings/otlp.(*Generator).GenerateMetricBatch
    3.50MB 0.0077% 99.67% 36775.27MB 80.90%  github.com/tigrannajaryan/exp-otelproto/protoimpls/http11.telemetryReceiver

For the experiment:

Showing nodes accounting for 44.85GB, 99.62% of 45.03GB total
Dropped 95 nodes (cum <= 0.23GB)
Showing top 10 nodes out of 40
      flat  flat%   sum%        cum   cum%
   17.17GB 38.13% 38.13%    17.17GB 38.13%  reflect.New
    9.80GB 21.76% 59.89%     9.80GB 21.76%  bytes.makeSlice
    5.12GB 11.36% 71.25%     5.12GB 11.36%  google.golang.org/protobuf/internal/impl.pointer.AppendPointerSlice (inline)
    4.55GB 10.11% 81.37%     4.55GB 10.11%  google.golang.org/protobuf/proto.MarshalOptions.marshal
    3.02GB  6.70% 88.06%     3.02GB  6.70%  google.golang.org/protobuf/encoding/protowire.ConsumeString (inline)
    2.42GB  5.37% 93.43%     2.42GB  5.37%  github.com/tigrannajaryan/exp-otelproto/encodings/otlpexperiment.otlpLabels
    0.96GB  2.14% 95.57%     4.33GB  9.62%  github.com/tigrannajaryan/exp-otelproto/encodings/otlpexperiment.genMMLSC
    0.91GB  2.02% 97.59%     0.91GB  2.02%  github.com/tigrannajaryan/exp-otelproto/core.(*LabelMaker).Get
    0.60GB  1.34% 98.93%    26.21GB 58.22%  google.golang.org/protobuf/internal/impl.(*MessageInfo).unmarshalPointer
    0.31GB  0.68% 99.62%     0.31GB  0.68%  google.golang.org/protobuf/encoding/protowire.AppendVarint

@jmacd (Contributor, Author) commented Jul 14, 2020

I'd advise to run with larger batches to see the effect of your changes more easily

Yes, I think the 100/batch test shows enough of a problem that I'm going to revert the structural change.

#162 (comment)


// GROUPING structure means the value has been computed by combining
// individual values in a meaningful aggregation. GROUPING structure
// implies the sum of
Member commented:

incomplete sentence?

@jmacd (Contributor, Author) commented Jul 14, 2020

Closing this. I will factor the Kind and ValueType changes apart into a separate change, develop the MMLSC change independently, and recommend that @cnnradams re-attempt the RawValue portion of this PR afterwards.
