Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: New way of handling data mutability #6794

Closed
bogdandrutu opened this issue Dec 13, 2022 · 11 comments · Fixed by #8634
Closed

Proposal: New way of handling data mutability #6794

bogdandrutu opened this issue Dec 13, 2022 · 11 comments · Fixed by #8634
Assignees

Comments

@bogdandrutu
Copy link
Member

Background

Currently we ask every component to declare their "Capabilities". Capabilities are part of the consumer.Consumer* interface, to allow "helper consumer" to also declare their own capabilities see in contrib batchperesourceattr

This design proved in the recent weeks to be very limited, see issues/PRs (most of the recent issues are related to sort, but this is not the only mutable func that may cause this):

There are few issues with this design:

  1. Currently there is no way to test components. An possible solution is to provide a different pdata package maybe using a build tag that checks for incorrect access.
  2. In case of crashes like Collector constantly breaking down #6420 (I was lucky that I am very familiar with the pdata and collector code to find) could take a long time to find an the root cause with the current design since we do not crash/log when UB happens (this is an UB since concurrent access to non thread safe data is defined as UB).

Proposal

The current issue/document proposes to replace the current "Capabilities" model from the consumer.Consumer* interface and add a notion of "Shared(ReadOnly)" and "Exclusive(Mutable)" (Shared/Exclusive are nice since they are also used in MESI protocol) to the top signal objects ptrace.Traces/pmetric.Metrics/plog.Logs. Here is the example API for traces (similar changes will be applied to all):

type Traces struct {
  // state is propagated to every "sub struct" only to enforce correctness.
  state pcommon.State
}

// State returns the current access state of the data. If StateShared the object cannot be mutated, and any call that mutates 
// the current instance or a sub-struct (Span, SpanEvent, etc.) will panic. If mutability is required a copy of the object should
// be made using CopyTo (TODO: may consider a Clone func since this will become used quite some time)
func (t Traces) State() pcommon.State {
  return t.state
}

// Changes the state of the StateShared. Must be called only when the object is fan-out to multiple consumers, or in tests.
func (t *Traces) MarkShared() {
  t.state = pcommon.StateShared;
}

func (ms Traces) MoveTo(dest Traces) {
	// Similar changes will be made to every mutable func. It is ok to crash since this is a UB.
	if t.state == pcommon.StateShared {
		panic("invalid access to shared data")
	}
	*dest.getOrig() = *ms.getOrig()
	*ms.getOrig() = otlpcollectortrace.ExportTraceServiceRequest{}
}

With this change, consumers/components that need to mutate the data can check first thing if the data are "Shared" or "Exclusive" and do a copy of the data if needed.

This proposal has the following advantages:

  • Easy to test by passing "shared" data to the consumer and ensure no crashes.
  • Easy to debug failures, since the panic stacktrace will be printed and "guilty" components will be present in the stacktrace.
  • Allow components to "delay" cloning if only in some cases the data are mutated. For example a transform processor may delay cloning until a match happens.

This proposal has the following disadvantages:

  • Still require code that authors may forget to add (the initial clone if needed);
  • In case of fan-out components, authors need to manually call MarkShared. Though there should be a very limited number of components that do this.

Alternative Options

An alternative is to just provide some testing helpers, these helpers will be very similar in terms of the generated code with the proposed solution, but will not run in production code. The advantage is that it will remove a branch check and will "not crash" immediately (since this is an UB sooner than later will crash randomly as in the provided examples) in production.

@tigrannajaryan
Copy link
Member

The current issue/document proposes to replace the current "Capabilities" model from the consumer.Consumer* interface

Is it necessary to remove this? I would imagine the ability to enforce what is declared (your "state" proposal) to be useful as an addition, instead of replacing the declaration.

If we remove the ability to declare that the component mutates data then every component that can mutate data has to be fixed and the Clone calls must be added. I wonder what the impact is on all existing components someone may have created outside Otel repos.

To be efficient for cases like transform processor the processor can claim that it doesn't mutate data and then use Clone when it needs to mutate. It is not a lie since we define MutatesData to mean that input data is not modified. The processor is still free to Clone and output a modified cloned data.

I would also like to understand what the performance implications of adding the state field to every pdata struct is (likely small, but still worth checking).

@bogdandrutu
Copy link
Member Author

bogdandrutu commented Dec 13, 2022

Is it necessary to remove this? I would imagine the ability to enforce what is declared (your "state" proposal) to be useful as an addition, instead of replacing the declaration.

It is not "necessary" but redundant. We can still provide a "consumer.MutableConsumer(consumer.Consumer)" (equivalent with the component helpers that deal with exposing the Capability function that does the "cloning". So for example in the exporterhelper when specified we can automatically clone the data in advance if needed so components don't actually need to do the check and clone.

To be efficient for cases like transform processor the processor can claim that it doesn't mutate data and then use Clone when it needs to mutate. It is not a lie since we define MutatesData to mean that input data is not modified. The processor is still free to Clone and output a modified cloned data.

Sure they can do half of it, since they always Clone even when not needed. (e.g. they have exclusive access to the data, but they don't know about since this is not expose to the consumers today).

If we remove the ability to declare that the component mutates data then every component that can mutate data has to be fixed and the Clone calls must be added. I wonder what the impact is on all existing components someone may have created outside Otel repos.

Internally we can check to see if they implement this and as mentioned do some sort of "wrapping" initially. Because we remove the "component.Capabilities" object at one point (after deprecation, etc.) if they don't upgrade to latest they will get compilation error (not ideal, but happens with every breaking change we do), so not worry about having unexpected crashes.

I would also like to understand what the performance implications of adding the state field to every pdata struct is (likely small, but still worth checking).

On it. My gut feeling tells me that this is not a problem, since the compiler and CPU branch prediction should be smart enough and identify that these branches are never hit.

Update: golang/go#11451 this is implemented as an optimization.

Results: as expected is less then a cycle

type state int32

const (
	stateExclusive state = iota
	stateShared
)

type safeStruct struct {
	val int
	s   state
}

func (ss *safeStruct) set(val int) {
	if ss.s == stateShared {
		panic("")
	}
	ss.val = val
}

type unsafeStruct struct {
	val int
}

func (ss *unsafeStruct) set(val int) {
	ss.val = val
}

func BenchmarkSafeStruct(b *testing.B) {
	ss := safeStruct{s: stateExclusive}
	b.ReportAllocs()
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		ss.set(n)
		if ss.val != n {
			b.Fail()
		}
	}
}

func BenchmarkUnsafeStruct(b *testing.B) {
	ss := unsafeStruct{}
	b.ReportAllocs()
	b.ResetTimer()
	for n := 0; n < b.N; n++ {
		ss.set(n)
		if ss.val != n {
			b.Fail()
		}
	}
}
goos: darwin
goarch: amd64
pkg: go.opentelemetry.io/collector/pdata/ptrace
cpu: Intel(R) Core(TM) i9-9880H CPU @ 2.30GHz
BenchmarkSafeStruct
BenchmarkSafeStruct-16      	1000000000	         0.3721 ns/op	       0 B/op	       0 allocs/op
BenchmarkUnsafeStruct
BenchmarkUnsafeStruct-16    	1000000000	         0.2560 ns/op	       0 B/op	       0 allocs/op
PASS

@dmitryax
Copy link
Member

dmitryax commented Dec 14, 2022

In case of fan-out components, authors need to manually call MarkShared. Though there should be a very limited number of components that do this.

May be we can wrap fan-out components and call MarkShared automatically if they are sending to more than one consumer?

@bogdandrutu
Copy link
Member Author

May be we can wrap fan-out components and call MarkShared automatically if they are sending to more than one consumer?

yes, we can expose our fan-out consumer and they can use that to ensure they do the right thing.

@MovieStoreGuy
Copy link
Contributor

Sorry if this is obvious, but do you intend to keep the panic's in or would methods that mutate return an error saying the data is unmodifiable?

In the scenario that a component like transformprocessor, schemaprocessor that are built to intentionally modify incoming data based on configuration. What would be the expected action when incoming pdata is unmodifiable? Should the component clone the data?

@dmitryax
Copy link
Member

I'm also thinking of another possible approach that doesn't require introducing panics and passing the state field to all nested pdata structs. It requires many changes in the pdata tho. The idea is to have a separate set of API for mutable data that can only be accessed after calling a particular method on pmetric.Metrics (same for other signals).

// MutableResourceMetrics returns the MutableResourceMetricsSlice associated with this Metrics object.
// This method should be called to get a slice that can be changed. 
// If changing the slice is not required, use ResourceMetrics method.
// This method makes a copy of the whole Metric object if used in a fan-out component with more than one consumer.
func (ms Metrics) MutableResourceMetrics() MutableResourceMetricsSlice {
	if ms.state == pcommon.StateShared {
		rms := NewResourceMetricsSlice()
		newResourceMetricsSlice(&ms.getOrig().ResourceMetrics).CopyTo(rms)
		*ms.getOrig() = otlpcollectormetrics.ExportMetricsServiceRequest{
			ResourceMetrics: *rms.getOrig(),
		}
		ms.state = pcommon.StateExclusive
		return rms
	}
	return newMutableResourceMetricsSlice(&ms.getOrig().ResourceMetrics)
}

In that case, state is needed only on the top Metrics struct, and users don't need to worry about the state themselves. Whenever they need to mutate pdata, they use MutableResourceMetrics, and that method does a copy if necessary.

I tried some prototyping in #6798. It's not polished yet, but I can proceed if you think it's worth trying.

@bogdandrutu
Copy link
Member Author

bogdandrutu commented Dec 14, 2022

In the scenario that a component like transformprocessor, schemaprocessor that are built to intentionally modify incoming data based on configuration. What would be the expected action when incoming pdata is unmodifiable? Should the component clone the data?

So I expect to add 3 lines to their Consume funcs:

func (t *transform) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
  if td.State == pcommon.StateShared {
    td = td.Clone()
  }
  ....
}

Update after a discussion with @dmitryax, we can offer a helper and user have to do only this. where inside the MakeExclusive we do the cloning/refcounting.

func (t *transform) ConsumeTraces(ctx context.Context, td ptrace.Traces) error {
  td.MakeExclusive()
  ....
}

@codeboten
Copy link
Contributor

Just to summarize, the proposal is to:

  • add a state on the pdata model for each signal
  • add a MakeExclusive helper to make the data mutable (clone and set the state flag)
  • modify all components to call MakeExclusive if there's a need to mutate the data
  • deprecate the use of component Capabilities
  • panic anytime someone tries to modify data that hasn't been explicitly made exclusive?

If my understanding is correct, I think it seems like an OK plan. It removes the ability for someone to unexpectedly modify data they shouldn't, and give component author a mechanism to be made aware of the problem before the code is shipped.

So long as the performance isn't impacted (based on @bogdandrutu's results above it doesn't appear to be a problem) but it would be good to confirm once the code is in a PR anyways.

@andrzej-stencel
Copy link
Member

I agree that the current state is undesirable and I applaud this idea to change it. The solution with panics is already so much better than current state. I like the non-panics approach even more though.

As a component developer, I'd much prefer being told by the compiler that I cannot modify the data, as opposed to compiling successfully and only finding out during runtime (hopefully when running my tests) that my implementation is incorrect. In case of components that only mutate data in specific circumstances, this means the possibility that I've missed calling the MakeExclusive() function in one of the code execution paths and haven't covered it by tests, which means that I (or the users of my component) will find this out during runtime.

I've looked at Dmitrii's PR, but I'm afraid I'm not able to grasp the scope of changes needed to the API and understand the burden of maintaining the duplication that the non-panics approach would introduce.

@dmitryax
Copy link
Member

dmitryax commented Jan 27, 2023

@astencel-sumo thanks for your feedback.

I'm working on alternative compile-time solutions that would reduce the amount of duplicated code. Will submit another PR soon.

For now, I'd like to add this issue to https://github.com/open-telemetry/opentelemetry-collector/milestone/24 until we make a decision.

@bogdandrutu
Copy link
Member Author

This also is related to https://github.com/open-telemetry/opentelemetry-collector/milestone/31

bogdandrutu pushed a commit that referenced this issue Oct 6, 2023
…ns (#8494)

This change introduces an option to enable runtime assertions to catch
unintentional pdata mutations in components that are claimed as
non-mutating pdata. Without these assertions, runtime errors may still
occur, but thrown by unrelated components, making it very difficult to
troubleshoot.

For now, this doesn't change the default behavior. It just introduces a
new method on `[Metrics/Traces|Logs].Shared()` that returns pdata marked
as shared. The method will be applied to fan-out consumers in the next
PR.

Later, if we want to remove the need of `MutatesData` capability, we can
introduce another method `[Metrics/Traces|Logs].Exclusive()` which
returns a copy of the pdata if it's shared.

This change unblocks the 1.0 release by implementing the original
solution proposed by @bogdandrutu in
#6794.
Going forward, we may introduce a copy-on-write solution that doesn't
require the runtime assertions. That will likely be part of the 2.0
release.
dmitryax added a commit that referenced this issue Oct 16, 2023
…8634)

This change enables the runtime assertions to catch unintentional pdata
mutations in components claiming as non-mutating pdata. Without these
assertions, runtime errors may still occur, but thrown by unrelated
components, making it very difficult to troubleshoot.

This required introducing extra API to get the pdata mutability state:
- p[metric|trace|log].[Metrics|Traces|Logs].IsReadOnly()

Resolves:
#6794
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants