Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SDK Trace: Adding span collection limits #942

Conversation

toumorokoshi
Copy link
Member

@toumorokoshi toumorokoshi commented Sep 13, 2020

To help safeguard against a common error, adding default limits
to the number of events per span in the SDK.

Fixes #182

Changes

add default limits to the number of events per span in the SDK

  • update changelog
  • update spec compliance matrix

Fixes open-telemetry#182

To help safeguard against a common error, adding default limits
to the number of events per span in the SDK.
@toumorokoshi toumorokoshi requested a review from a team as a code owner September 13, 2020 05:48
@toumorokoshi toumorokoshi requested a review from a team September 13, 2020 05:48
@toumorokoshi toumorokoshi requested a review from a team as a code owner September 13, 2020 05:48
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
Adding optional additional configurations for span limits.

Modifying the unmlimited value to -1 to enable restricting collections
to 0 elements.
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
@Oberon00 Oberon00 added area:sdk Related to the SDK spec:trace Related to the specification/trace directory labels Sep 15, 2020
- modifying logging messaging to SHOULD, to imply configurability
- clarifying collections types rather than using "each"
@SergeyKanzhelev
Copy link
Member

@tigrannajaryan @bogdandrutu @Oberon00 are you satisfied with the PR update? Can you pls re-review?

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading this, I am not sure if you ask for a "bytes limit" or "number of attributes", "number of events", etc limits.

I would say that the first one may sound very interesting but it comes with some performance overhead and hard to implement it right. So I am more supportive of the "max number of attributes", etc.

Independent of what is suggested here, I think the fact that is not clear has to be fixed.

toumorokoshi and others added 2 commits September 16, 2020 20:55
Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
Ensuring there is no ambiguation between limits being in bytes, or
numbers of elements in a collection.
@toumorokoshi
Copy link
Member Author

Independent of what is suggested here, I think the fact that is not clear has to be fixed.

Good point! I've fixed to element count.

I agree that bytes has some interesting ramifications and addressed the immediate concern of stressing memory, but maybe a follow-up in another PR.

specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
@andrewhsu
Copy link
Member

from the issue triage meeting today with TC, if this PR is desired before GA, need 2 approvals an no disapprovals to merge, else this will be closed by end of day today @open-telemetry/technical-committee @open-telemetry/specs-approvers @SergeyKanzhelev

Copy link
Contributor

@jmacd jmacd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I might object to this spec change is simply that it's a lot of work to require, but it's work that was inherited from the OpenCensus libraries in many cases, e.g., I believe these limits are already implemented for OTel-Go.

@andrewhsu andrewhsu added the release:after-ga Not required before GA release, and not going to work on before GA label Sep 25, 2020
@andrewhsu
Copy link
Member

from the issue triage mtg today with TC, marking as after-ga because corresponding issue is also after-ga

Addressing feedback arguing for a non-configurable limit,
as is the case with other limits present in OpenTelemetry.
@tigrannajaryan
Copy link
Member

@open-telemetry/technical-committee should we add such a sanity limit to prevent app crashes due to instrumentation problems for the GA version?

I do not quite like the hard stance taken by this proposal. I think the spec is overstepping into what arguably can be also seen as an implementation decision.

I also do not like that we prescribe only one way to limit resource exhaustion by telemetry data. Limiting the total number of elements is not the only possible way and arguably is not necessarily completely effective either because we do not simultaneously limit the size of the element.

Some SDK's may have alternate solutions to this problem, e.g. implement limits for the total byte size rather then for the size of an individual element. By prescribing the specific behavior in the spec we prevent innovation in this area.

I would be more comfortable with this proposal if it was a recommendation.

@toumorokoshi
Copy link
Member Author

Thanks for the feedback! It's great to have more input.

I also do not like that we prescribe only one way to limit resource exhaustion by telemetry data.

Do you believe this will restrict people from adding other ways? I thought the spec was always a dictation of what is a requirement for an implementation, with anything not explicitly stated up to the liberty of the implementations.

Some SDK's may have alternate solutions to this problem, e.g. implement limits for the total byte size rather then for the size of an individual element. By prescribing the specific behavior in the spec we prevent innovation in this area.

I think this is the tradeoff to be made: the con is forcing choices that are not appropriate for the implementation. the pro is standard behavior that's enables knowledge that transfers across language boundaries. We dictate several implementation details that could be argued to prevent innovation, such as dictating that the behavior of the default span processor, and it's defaults.

If the consensus (and I'd like to hear a few more chime in too) is "we don't know if this is the right behavior to dictate across implementations", I think it's fine to leave this PR be (or close).

But I would argue that this is a practical implementation detail that would prevent a significant category of human errors. And the cost of differing implementations is end user confusion.

@SergeyKanzhelev
Copy link
Member

Some SDK's may have alternate solutions to this problem, e.g. implement limits for the total byte size rather then for the size of an individual element. By prescribing the specific behavior in the spec we prevent innovation in this area.

the proposal in current form is very toothless. 1000 is non-limit in real life. Just a statement that SDKs needs to watch out for these collections.

If anybody disagree with this PR - please mark it with "changes requested". Otherwise I will merge mid day (pacific time) tomorrow.

@tigrannajaryan
Copy link
Member

tigrannajaryan commented Oct 2, 2020

Thanks for the feedback! It's great to have more input.

I also do not like that we prescribe only one way to limit resource exhaustion by telemetry data.

Do you believe this will restrict people from adding other ways? I thought the spec was always a dictation of what is a requirement for an implementation, with anything not explicitly stated up to the liberty of the implementations.

@toumorokoshi Not restrict, but possibly discourage. There is limited amount of time SDK authors are willing to spend on this topic. As an SDK author if I spend effort to implement a spec requirement I am less likely to go ahead and implement another additional way to limit resource exhaustion. Developer's time is a finite resource too.

Some SDK's may have alternate solutions to this problem, e.g. implement limits for the total byte size rather then for the size of an individual element. By prescribing the specific behavior in the spec we prevent innovation in this area.

I think this is the tradeoff to be made: the con is forcing choices that are not appropriate for the implementation. the pro is standard behavior that's enables knowledge that transfers across language boundaries. We dictate several implementation details that could be argued to prevent innovation, such as dictating that the behavior of the default span processor, and it's defaults.

I agree that it is beneficial to have a good standard behavior. I am not convinced the proposed standard behavior is valuable enough to be enforced across all SDKs.

If the consensus (and I'd like to hear a few more chime in too) is "we don't know if this is the right behavior to dictate across implementations", I think it's fine to leave this PR be (or close).

But I would argue that this is a practical implementation detail that would prevent a significant category of human errors. And the cost of differing implementations is end user confusion.

I would also like to hear more opinions. I am willing to be convinced.

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reasons I posted in my previous comments I am putting a soft block on this to prevent premature merging, but I am open to be convinced.

@iNikem
Copy link
Contributor

iNikem commented Oct 2, 2020

I would like to point to yet another aspect of the current discussion. Even if this PR is not merged, it will be allowed for SDK to implement these or similar limits. Then it will be beneficial if different languages at least used the same configuration options for these limits. E.g. the same environment variable name. Is the specification open to augment sdk-environment-variables.md with configuration for such optional behaviour?

@tigrannajaryan
Copy link
Member

Is the specification open to augment sdk-environment-variables.md with configuration for such optional behaviour?

Yes, I think so. I would prefer configurable behavior.

@SergeyKanzhelev
Copy link
Member

Is the specification open to augment sdk-environment-variables.md with configuration for such optional behaviour?

Yes, I think so. I would prefer configurable behavior.

Configurability was removed from this PR as we cannot formalize use cases. For instance, if somebody set the limit to 2 attributes, should it be supported? Why? If somebody sets the limit to infinity, should it be supported? No exporter will probably need those so why to allow to set it?

Configurability was important in OpenCensus as default was quite low. Like 10 or 20 as far as I remember. In OpenTelemetry I believe the general agreement that we need to start with much higher default. Which defeats the purpose of having this limits configurable on SDK side.

I see this PR in current form as a warning that something need to be done to control for unbound collections. Alternative to this PR is doing nothing - SDKs needs to be safe in general. Switching to controllable behavior would need to start with the scenarios.

@tigrannajaryan
Copy link
Member

If somebody sets the limit to infinity, should it be supported? No exporter will probably need those so why to allow to set it?

There is a use case where collection of all telemetry is important: audit. In extreme cases it may be preferable to crash the actual operation if the telemetry that describes the operation cannot be exported. I understand that we do not necessarily care about this particular use case.

I see this PR in current form as a warning that something need to be done to control for unbound collections.

Makes sense. If it's a warning perhaps we can it a recommendation rather than hard requirement?

@toumorokoshi
Copy link
Member Author

I think there's a good comprise with the optional aspect of it. Aggregating the proposal and calling out how this is addressing concerns, how about:

  • make it optional
    • addresses @SergeyKanzhelev's point about having something in the spec to bring awareness to this issue.
    • enables experimentation as per @tigrannajaryan's point.
  • keep the configurability, as per @iNikem and @tigrannajaryan's ask.
    • add suggested environment variables in sdk-environment-variables.md

I'm still concerned about this not being required, but I'd much prefer some awareness of the issue in the spec vs none at all. And it can be a good building block if we want to expand that into something more standardized.

@iNikem
Copy link
Contributor

iNikem commented Oct 3, 2020

SDKs needs to be safe in general

I am sorry, what does it mean? SDKs cannot be safe by magic, they have to build themselves in such a way. This PR is one way to help/advice SDKs how to do that. SDKs have to implement some safety measures, this PR tries to make them similar across different languages.

Do we want to have ever language different? One has low limit, one has no limit, everyone using different configuration variables?

Addressing feedback to make collection limits optional for now. The
motivation is to allow innovation around safe SDK behavior, while
providing a guideline and awareness for the issue.

Reducing the logging on discarded attributes, events, or links to
once per process to further reduce spam.
Giving the SDK implementors more freedom in how they choose to log,
while setting guidelines on the frequency.
@tigrannajaryan
Copy link
Member

Thank you @toumorokoshi

@tigrannajaryan
Copy link
Member

@SergeyKanzhelev @iNikem I re-requested your review since the PR is substantially different now.

spec-compliance-matrix.md Outdated Show resolved Hide resolved
To help make consumers aware of which SDKS implement span
collection limits.
specification/trace/sdk.md Outdated Show resolved Hide resolved
specification/trace/sdk.md Outdated Show resolved Hide resolved
Copy link
Member

@arminru arminru left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM once @iNikem's final suggestions are applied.

Thanks for carrying this through, @toumorokoshi!

Co-authored-by: Nikita Salnikov-Tarnovski <gnikem@gmail.com>
@toumorokoshi
Copy link
Member Author

@arminru added!

@toumorokoshi
Copy link
Member Author

@SergeyKanzhelev could you do a final review, and merge if you feel it's appropriate? I believe I've addressed all outstanding comments.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:sdk Related to the SDK release:after-ga Not required before GA release, and not going to work on before GA spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Limitation on number of attributes, links, events and behavior