Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

apollo-server-core: unified Studio reporting #4142

Merged
merged 46 commits into from
Apr 28, 2021

Conversation

jsegaran
Copy link
Contributor

@jsegaran jsegaran commented May 21, 2020

The usage reporting plugin in apollo-server-core is not the first tool Apollo
built to report usage to Studio. Previous iterations such as optics-agent and
engineproxy reported a combination of detailed per-field single-operation
performance traces and summarized stats of operations to Apollo's
servers. When we built this TypeScript usage reporting plugin in 2018, for the
sakes of expediency we did something different: it only sent traces to Apollo's
servers. This meant that the performance of every single single user operation
was described in detail to Apollo's servers. Studio is not an exhaustive trace
warehouse: we have always sampled the traces received, making only some of
them available via Studio's Traces UI. The other traces were converted to stats
inside Studio's servers.

While this meant that the reporting agent was simpler than the previous
implementations (no need to be able to describe performance statistics), it also
meant that the protocol used to talk to Studio consumed a lot more bandwidth (as
well as CPU time for encoding traces).

This PR returns us to the world where Studio usage is reported as a combination
of stats and traces. It takes a slightly different approach than the previous
implementations: instead of reporting stats and traces in parallel, usage
reports contain both stats and traces. Each GraphQL operation is described
either as a trace or as stats, not both.

We expect this to significantly reduce the network and CPU requirements of
sending usage reports to Studio. It should not significantly affect the
experience of using Studio: we have always heavily sampled traces in Studio
before saving them to the trace warehouse, and the default heuristic for which
operations to send as traces works similarly to the heuristic used in Studio's
servers.

This PR introduces an option experimental_sendOperationAsTrace to allow you to
control whether a given operation is sent as trace or stats. This is truly an
experimental option that may change at any time. For example, you should not
rely on the fact that this will be called on all operations after the operation
is done with a full, or on its signature, or even that it exists. It is likely
that future improvements to the usage reporting plugin will change how
operations are observed so that we don't have to collect a full trace before
deciding how to represent the operation.

Some other notes:

  • Upgrade our fork @apollo/protobufjs with a few improvements:
    • New js_use_toArray option which lets you encode repeated fields from
      objects that aren't stored in memory as arrays but expose toArray
      methods. We use this so that we can build up DurationHistograms and
      map-like objects in a non-array fashion and only convert to array at
      encoding time.
    • New js_preEncoded option which allows you to encode messages in repeated
      fields as buffers (Uint8Arrays). This helps amortize encoding cost of a
      large message over time instead of freezing the event loop to encode the
      whole message at once. This replaces an old hack we used for one field with
      something built in to the protobuf compiler (including correct TypeScript
      typings).
    • New --no-from-object flag which we use to reduce the size of generated
      code (as we don't use the fromObject protobuf.js API).
  • In order to help us validate that the trace->stats code in this PR matches
    similar code in Studio's servers, the flag
    internal_includeTracesContributingToStats sends the traces that contribute
    to stats in a special field. This is something we only use as part of our own
    validation in our servers; for your graphs it will have no effect other than
    increasing message size.
  • Viewing traces in Studio is only available on paid plans. The usage-reporting
    endpoint now tells the plugin whether traces are supported on your graph's
    plan; if not supported, the plugin will switch to sending all operations as
    stats (regardless of the value of experimental_sendOperationAsTrace) after
    the first report.
  • We try to estimate the message size compared to maxUncompressedReportSize via
    a rough estimate about how big the leaf nodes of the stats messages will be
    rather than carefully counting how much space is used by each number and
    histogram. We do take the lengths of all strings into account.
  • By mistake, this plugin never sent the cache policy on traces, meaning that
    visualizing cache-specific stats in Studio did not work. This is now fixed.

This project was begun by @jsegaran and completed by @glasser.

@jsegaran jsegaran changed the base branch from master to release-2.14.0 May 21, 2020 06:28
@jsegaran jsegaran changed the base branch from release-2.14.0 to jsegaran/should_report_api May 21, 2020 06:29
@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch 2 times, most recently from 5903a0c to 636a4a6 Compare May 21, 2020 17:21
@glasser
Copy link
Member

glasser commented May 26, 2020

lmk when this is ready for eyes again

@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch 6 times, most recently from dec357f to aa4d6d9 Compare June 17, 2020 06:48
@jsegaran jsegaran changed the base branch from jsegaran/should_report_api to release-2.15.0 June 17, 2020 06:49
@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch from aa4d6d9 to 39d8ed0 Compare June 17, 2020 06:52
Base automatically changed from release-2.15.0 to master June 17, 2020 18:25
@@ -4,6 +4,12 @@ package mdg.engine.proto;

import "google/protobuf/timestamp.proto";

import "google/protobuf/descriptor.proto";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think my PR description that implemented this explained that you need to leave this stuff out of the version of the file you use in protobufjs somehow, because protobufjs doesn't need the option to be declared but adding the import will massively inflate the size of generated code. Bah.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still happening

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I need to fix this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still happening

packages/apollo-engine-reporting/src/agent.ts Outdated Show resolved Hide resolved
packages/apollo-engine-reporting/src/agent.ts Outdated Show resolved Hide resolved
packages/apollo-engine-reporting/src/agent.ts Outdated Show resolved Hide resolved
}

public addTrace(trace: Trace) {
const queryLatencyStats = this.queryLatencyStats;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done an in-depth review of this function for accuracy (I think you were mostly looking for overall structural review so far?)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i got a bit distracted and i want to see your fixes from the thing we talked about on slack re the interfaces anyway. will get back to this in the next round after you push!

Base automatically changed from master to main June 24, 2020 18:17
@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch from 5b2047f to 84a3ecb Compare June 29, 2020 21:27
@jsegaran jsegaran requested a review from glasser June 29, 2020 22:28
@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch 2 times, most recently from 2a3aaef to 9c2d289 Compare June 29, 2020 23:00
Copy link
Member

@glasser glasser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first, just looking at the stuff that i had comment on last time

@@ -4,6 +4,12 @@ package mdg.engine.proto;

import "google/protobuf/timestamp.proto";

import "google/protobuf/descriptor.proto";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still happening

packages/apollo-engine-reporting/src/agent.ts Outdated Show resolved Hide resolved
packages/apollo-engine-reporting/src/agent.ts Outdated Show resolved Hide resolved
);
reportData.size +=
encodedTrace.length + Buffer.byteLength(statsReportKey);
reportData.traceCache.set(traceCacheKey, true);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is traceCache going to grow indefinitely? That's not going to be good. Could do some sort of LRU thing. Or alternatively, instead of making trace.endTime.seconds be part of each entry's key, we can use just use the time that addTrace is called (presumably vaguely monotonic) and throw away the whole map every time we tick over to the next minute.

If we want to match the Apollo-cloud-service-side calculation which is based on the minute from the trace, and behave better when the clock turns back by a few seconds around a minute break, perhaps keep 2 or 3 maps of the most recent minutes? This can just be a tiny map where every time we scan through it and drop sub-maps that are too old (it's O(n) but n=3 so whatever).

Also it's not really a "cache", it's more of a "seen" map

}

public addTrace(trace: Trace) {
const queryLatencyStats = this.queryLatencyStats;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok i got a bit distracted and i want to see your fixes from the thing we talked about on slack re the interfaces anyway. will get back to this in the next round after you push!

@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch from 9c2d289 to b9c6b49 Compare July 24, 2020 18:32
@jsegaran jsegaran requested a review from glasser July 31, 2020 22:06
@jsegaran jsegaran force-pushed the jsegaran/unified_reporting branch 4 times, most recently from 93e55a4 to 50211ca Compare August 1, 2020 18:48
glasser added 19 commits April 28, 2021 13:46
 - apollo-cache-control@0.12.1-unified.0
 - apollo-datasource-rest@0.12.1-unified.0
 - apollo-datasource@0.8.1-unified.0
 - apollo-reporting-protobuf@0.6.3-unified.0
 - apollo-server-azure-functions@2.23.1-unified.0
 - apollo-server-cache-memcached@0.7.1-unified.0
 - apollo-server-cache-redis@1.4.1-unified.0
 - apollo-server-caching@0.6.1-unified.0
 - apollo-server-cloud-functions@2.23.1-unified.0
 - apollo-server-cloudflare@2.23.1-unified.0
 - apollo-server-core@2.23.1-unified.0
 - apollo-server-env@3.0.1-unified.0
 - apollo-server-express@2.23.1-unified.0
 - apollo-server-fastify@2.23.1-unified.0
 - apollo-server-hapi@2.23.1-unified.0
 - apollo-server-integration-testsuite@2.23.1-unified.0
 - apollo-server-koa@2.23.1-unified.0
 - apollo-server-lambda@2.23.1-unified.0
 - apollo-server-micro@2.23.1-unified.0
 - apollo-server-plugin-base@0.11.1-unified.0
 - apollo-server-plugin-operation-registry@0.9.1-unified.0
 - apollo-server-plugin-response-cache@0.7.1-unified.0
 - apollo-server-testing@2.23.1-unified.0
 - apollo-server-types@0.7.1-unified.0
 - apollo-server@2.23.1-unified.0
 - apollo-tracing@0.13.1-unified.0
 - graphql-extensions@0.13.1-unified.0
 - apollo-cache-control@0.12.1-unified.1
 - apollo-datasource-rest@0.12.1-unified.1
 - apollo-reporting-protobuf@0.6.3-unified.1
 - apollo-server-azure-functions@2.23.1-unified.1
 - apollo-server-cloud-functions@2.23.1-unified.1
 - apollo-server-cloudflare@2.23.1-unified.1
 - apollo-server-core@2.23.1-unified.1
 - apollo-server-express@2.23.1-unified.1
 - apollo-server-fastify@2.23.1-unified.1
 - apollo-server-hapi@2.23.1-unified.1
 - apollo-server-integration-testsuite@2.23.1-unified.1
 - apollo-server-koa@2.23.1-unified.1
 - apollo-server-lambda@2.23.1-unified.1
 - apollo-server-micro@2.23.1-unified.1
 - apollo-server-plugin-base@0.11.1-unified.1
 - apollo-server-plugin-operation-registry@0.9.1-unified.1
 - apollo-server-plugin-response-cache@0.7.1-unified.1
 - apollo-server-testing@2.23.1-unified.1
 - apollo-server-types@0.7.1-unified.1
 - apollo-server@2.23.1-unified.1
 - apollo-tracing@0.13.1-unified.1
 - graphql-extensions@0.13.1-unified.1
 - apollo-server-azure-functions@2.23.1-unified.2
 - apollo-server-cloud-functions@2.23.1-unified.2
 - apollo-server-cloudflare@2.23.1-unified.2
 - apollo-server-core@2.23.1-unified.2
 - apollo-server-express@2.23.1-unified.2
 - apollo-server-fastify@2.23.1-unified.2
 - apollo-server-hapi@2.23.1-unified.2
 - apollo-server-integration-testsuite@2.23.1-unified.2
 - apollo-server-koa@2.23.1-unified.2
 - apollo-server-lambda@2.23.1-unified.2
 - apollo-server-micro@2.23.1-unified.2
 - apollo-server-testing@2.23.1-unified.2
 - apollo-server@2.23.1-unified.2
 - apollo-cache-control@0.12.1-unified2.0
 - apollo-datasource-rest@0.12.1-unified2.0
 - apollo-datasource@0.8.1-unified2.0
 - apollo-reporting-protobuf@0.6.3-unified2.0
 - apollo-server-azure-functions@2.23.1-unified2.0
 - apollo-server-cache-memcached@0.7.1-unified2.0
 - apollo-server-cache-redis@1.4.1-unified2.0
 - apollo-server-caching@0.6.1-unified2.0
 - apollo-server-cloud-functions@2.23.1-unified2.0
 - apollo-server-cloudflare@2.23.1-unified2.0
 - apollo-server-core@2.23.1-unified2.0
 - apollo-server-env@3.0.1-unified2.0
 - apollo-server-express@2.23.1-unified2.0
 - apollo-server-fastify@2.23.1-unified2.0
 - apollo-server-hapi@2.23.1-unified2.0
 - apollo-server-integration-testsuite@2.23.1-unified2.0
 - apollo-server-koa@2.23.1-unified2.0
 - apollo-server-lambda@2.23.1-unified2.0
 - apollo-server-micro@2.23.1-unified2.0
 - apollo-server-plugin-base@0.11.1-unified2.0
 - apollo-server-plugin-operation-registry@0.9.1-unified2.0
 - apollo-server-plugin-response-cache@0.7.1-unified2.0
 - apollo-server-testing@2.23.1-unified2.0
 - apollo-server-types@0.7.1-unified2.0
 - apollo-server@2.23.1-unified2.0
 - apollo-tracing@0.13.1-unified2.0
 - graphql-extensions@0.13.1-unified2.0
 - apollo-server-azure-functions@2.23.1-unified2.1
 - apollo-server-cloud-functions@2.23.1-unified2.1
 - apollo-server-cloudflare@2.23.1-unified2.1
 - apollo-server-core@2.23.1-unified2.1
 - apollo-server-express@2.23.1-unified2.1
 - apollo-server-fastify@2.23.1-unified2.1
 - apollo-server-hapi@2.23.1-unified2.1
 - apollo-server-integration-testsuite@2.23.1-unified2.1
 - apollo-server-koa@2.23.1-unified2.1
 - apollo-server-lambda@2.23.1-unified2.1
 - apollo-server-micro@2.23.1-unified2.1
 - apollo-server-testing@2.23.1-unified2.1
 - apollo-server@2.23.1-unified2.1
 - apollo-cache-control@0.12.1-unified2.1
 - apollo-datasource-rest@0.12.1-unified2.1
 - apollo-reporting-protobuf@0.6.3-unified2.1
 - apollo-server-azure-functions@2.23.1-unified2.2
 - apollo-server-cloud-functions@2.23.1-unified2.2
 - apollo-server-cloudflare@2.23.1-unified2.2
 - apollo-server-core@2.23.1-unified2.2
 - apollo-server-express@2.23.1-unified2.2
 - apollo-server-fastify@2.23.1-unified2.2
 - apollo-server-hapi@2.23.1-unified2.2
 - apollo-server-integration-testsuite@2.23.1-unified2.2
 - apollo-server-koa@2.23.1-unified2.2
 - apollo-server-lambda@2.23.1-unified2.2
 - apollo-server-micro@2.23.1-unified2.2
 - apollo-server-plugin-base@0.11.1-unified2.1
 - apollo-server-plugin-operation-registry@0.9.1-unified2.1
 - apollo-server-plugin-response-cache@0.7.1-unified2.1
 - apollo-server-testing@2.23.1-unified2.2
 - apollo-server-types@0.7.1-unified2.1
 - apollo-server@2.23.1-unified2.2
 - apollo-tracing@0.13.1-unified2.1
 - graphql-extensions@0.13.1-unified2.1
 - apollo-server-azure-functions@2.23.1-unified2.3
 - apollo-server-cloud-functions@2.23.1-unified2.3
 - apollo-server-cloudflare@2.23.1-unified2.3
 - apollo-server-core@2.23.1-unified2.3
 - apollo-server-express@2.23.1-unified2.3
 - apollo-server-fastify@2.23.1-unified2.3
 - apollo-server-hapi@2.23.1-unified2.3
 - apollo-server-integration-testsuite@2.23.1-unified2.3
 - apollo-server-koa@2.23.1-unified2.3
 - apollo-server-lambda@2.23.1-unified2.3
 - apollo-server-micro@2.23.1-unified2.3
 - apollo-server-testing@2.23.1-unified2.3
 - apollo-server@2.23.1-unified2.3
@glasser glasser force-pushed the jsegaran/unified_reporting branch from 03fd83d to 7fb3df0 Compare April 28, 2021 20:48
@glasser glasser changed the base branch from main to release-2.24.0 April 28, 2021 20:50
@glasser glasser merged commit 8ce26dd into release-2.24.0 Apr 28, 2021
@glasser glasser deleted the jsegaran/unified_reporting branch April 28, 2021 20:52
@glasser glasser mentioned this pull request Apr 28, 2021
@glasser
Copy link
Member

glasser commented Apr 30, 2021

This is released in AS 2.24.0!

We expect this to be a no-op for most users (other than a potential performance benefit). If it causes problems, please file an issue and/or contact Apollo Support; you can get behavior like the previous version by passing experimental_sendOperationAsTrace: () => true to ApolloServerPluginUsageReporting. I don't recommend you do so other than in the process of working with us to fix your problem, though!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants