This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

Possible bug in trigger event count #1920

Closed
zhongduo opened this issue Nov 16, 2020 · 18 comments
Labels: area/broker, area/observability, area/test-and-release (Test infrastructure, tests or release), kind/bug (Something isn't working), lifecycle/stale, priority/1 (Blocks current release defined by release/* label or blocks current milestone)

Comments

@zhongduo (Contributor)

Describe the bug
During the upgrade test, I found that the broker event count is not the same as the sum of all the trigger event counts. Note that the receiver pod does receive as many events as the broker event count reports, so it is possible that we did not record all of the trigger event counts.

Expected behavior
The broker event count should be the same as the sum of all trigger event counts in Stackdriver.

To Reproduce
Run the upgrade test, then check the event counts for the broker and its triggers.

Knative-GCP release version

Additional context

zhongduo added the kind/bug (Something isn't working) label on Nov 16, 2020
grantr added the area/broker and area/test-and-release (Test infrastructure, tests or release) labels on Nov 20, 2020
@Harwayne (Contributor)

The description of the bug is a bit ambiguous. Was the test set up such that every event sent to the Broker should pass the only Trigger's filter? Or such that every event sent into the Broker passes exactly one Trigger's filter?

Which was greater, the Broker's event_count or Sum(Trigger event_count)?


The Broker's event_count and sum of all the Trigger event_counts for that Broker do not need to equal each other.

Broker event_count > Sum(Trigger event_count)

I send 1 event to the Broker and that event does not pass the filter for any of my 5 Triggers.
Broker event_count = 1
Sum(Trigger event_count) = 0

Broker event_count < Sum(Trigger event_count)

I send 1 event to the Broker and that event passes the filter for all of my 5 Triggers.
Broker event_count = 1
Sum(Trigger event_count) = 5

Broker event_count < Trigger event_count

I send 1 event to the Broker and that event passes the filter for my 1 Trigger. However, the initial delivery fails because the downstream service returns a 503. On the next attempt, the Trigger successfully sends the event to the downstream service, which returns a 202.

Broker event_count = 1
Trigger event_count = 2

Harwayne added the priority/awaiting-more-evidence (Items which might need to be fixed, but which need more details first) label on Nov 25, 2020
@zhongduo (Contributor, Author) commented Nov 25, 2020

received events == Broker event_count > Sum(Trigger event_count)

The difference is about 300 missing events out of roughly 4200.

@zhongduo (Contributor, Author)

There are two Triggers: step gets almost all the events from the Broker, while finished gets only one.

@grantr (Contributor) commented Dec 1, 2020

@quentin-cha

@Harwayne (Contributor) commented Dec 2, 2020

@zhongduo and I talked; this issue is reproducible via the TestEventingUpgrades E2E test.

There is a Broker that is being sent events continuously at roughly 30 QPS. Each event going into the Broker should match a single Trigger's filter, and all of the Triggers point to a receiver Pod. At the end of the test, the Broker event_count is 4200, which exactly matches the number of events seen by the receiver Pod. However, Sum(Trigger event_count) is only 3900, which is very odd, as it should be at least as large as the number of events seen by the receiver Pod.

Harwayne added the priority/1 (Blocks current release defined by release/* label or blocks current milestone) and area/observability labels and removed the priority/awaiting-more-evidence (Items which might need to be fixed, but which need more details first) label on Dec 2, 2020
@Harwayne (Contributor) commented Dec 3, 2020

Because the Pods were restarting, I thought it might have something to do with metrics not being flushed properly in the fanout/retry Pods, but from what I can see, the ingress, fanout, and retry Pods all flush metrics in exactly the same way (a sketch of that shared pattern follows below).
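
For illustration only (this is not the actual knative-gcp shutdown code): a minimal sketch of the kind of flush-on-shutdown pattern these pods share, assuming the public contrib.go.opencensus.io/exporter/stackdriver API; the project ID and the exact wiring are made up.

package main

import (
    "context"
    "log"
    "os/signal"
    "syscall"

    "contrib.go.opencensus.io/exporter/stackdriver"
)

func main() {
    // Hypothetical project ID; the real pods get this from their environment.
    exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
    if err != nil {
        log.Fatal(err)
    }

    // StartMetricsExporter starts an interval reader that periodically copies
    // recorded view data into the exporter's bundlers.
    if err := exporter.StartMetricsExporter(); err != nil {
        log.Fatal(err)
    }

    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    // ... serve traffic and record metrics until the pod is told to terminate ...
    <-ctx.Done()

    // Stop the interval reader and drain the exporter's bundlers before exiting.
    exporter.StopMetricsExporter()
    exporter.Flush()
}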

@Harwayne (Contributor) commented Dec 3, 2020

As far as I can tell this line is the only place a Trigger sends an event to the subscriber:

resp, err := p.sendMsg(ctx, target.Address, msg, transformer.DeleteExtension(eventutil.HopsAttribute))

The only way to get out of that function without calling p.StatsReporter.ReportEventDispatchTime, which is what increments the event_count metric, is if that line returns a non-nil error that is not a timeout.

resp, err := p.sendMsg(ctx, target.Address, msg, transformer.DeleteExtension(eventutil.HopsAttribute))
if err != nil {
    var result *url.Error
    if errors.As(err, &result) && result.Timeout() {
        // If the delivery is cancelled because of timeout, report event dispatch time without resp status code.
        p.StatsReporter.ReportEventDispatchTime(ctx, time.Since(startTime))
    }
    return err
}

But even if that happens, the Trigger should retry the event, which results in the same code being called again; once the event is sent successfully, p.StatsReporter.ReportEventDispatchTime is called, which increments event_count.

@quentin-cha (Member)

Reran the test (TestEventingUpgrades) a couple of times and confirmed the issue.
Observations so far:

  • There is always a difference between the Broker event count and the Trigger event count.
  • The Trigger event count is always less than the Broker event count.
  • The delta varies from run to run.

To investigate:

  • Make sure Triggers have a chance to report all counts before test teardown.
  • Add logging to figure out how many times the Broker and Trigger event counts are actually recorded (a counting-wrapper sketch follows below).
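
A hypothetical sketch of what such counting/logging could look like: wrap the reporter and log every invocation. The StatsReporter interface below is inferred from the single call shape quoted earlier in this thread (ReportEventDispatchTime(ctx, duration)); the real knative-gcp interface and wiring may differ.

package metricsdebug

import (
    "context"
    "sync/atomic"
    "time"

    "go.uber.org/zap"
)

// StatsReporter mirrors the call seen in the dispatch code quoted above;
// it is a stand-in, not the real knative-gcp interface.
type StatsReporter interface {
    ReportEventDispatchTime(ctx context.Context, d time.Duration)
}

// countingReporter delegates to the real reporter and logs a running total,
// so the number of recorded dispatches can be compared with Stackdriver.
type countingReporter struct {
    delegate StatsReporter
    calls    int64
    logger   *zap.SugaredLogger
}

func (c *countingReporter) ReportEventDispatchTime(ctx context.Context, d time.Duration) {
    n := atomic.AddInt64(&c.calls, 1)
    c.logger.Infow("ReportEventDispatchTime invoked", "totalInvocations", n)
    c.delegate.ReportEventDispatchTime(ctx, d)
}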

@quentin-cha (Member)

Observation update. Conclusions so far:

  • The actual Broker and Trigger event counts are being recorded correctly, based on the log instance counts.
  • Pods do get the chance to invoke flushing before being killed.

@quentin-cha (Member)

After some more investigation, a possible theory is that:

When we record metrics, they do not get pushed into the bundlers right away; instead, they are stored in intermediaries called views:

v.addSample(cmd.tm, m.Value(), cmd.attachments, cmd.t)

There is a separate mechanism that periodically puts this view data into the bundlers:

if err := e.StartMetricsExporter(); err != nil {

func (e *statsExporter) startMetricsReader() error {

ir.reader.ReadAndExport(ir.exporter)

This is where the metrics actually get into the bundler to be exported:
func (se *statsExporter) ExportMetrics(ctx context.Context, metrics []*metricdata.Metric) error {

So after the metric is recorded and before the reader puts it into the bundler, the metric is essentially 'in-flight' within the system.
If the pod terminates, flushing of the bundlers is initiated:

func flushExporters(logger *zap.SugaredLogger) {

Notice how flushing only flushes the bundlers, so there is a chance that the reader has not yet pushed the metrics from the views into those bundlers (the 'in-flight' metrics). This likely results in up to one interval of metrics being discarded.
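
A sketch of one way to close that gap (an illustration only, not the eventual upstream fix): run the metric reader once by hand before flushing, so the last interval of view data reaches the bundlers first. This assumes the public go.opencensus.io/metric/metricexport API and that *stackdriver.Exporter satisfies metricexport.Exporter via its ExportMetrics method.

package metricsdebug

import (
    "contrib.go.opencensus.io/exporter/stackdriver"
    "go.opencensus.io/metric/metricexport"
)

// flushAll pushes the current ("in-flight") view data into the exporter's
// bundlers and then drains the bundlers, so the final interval of metrics is
// not discarded on shutdown.
func flushAll(exporter *stackdriver.Exporter) {
    // ReadAndExport reads the accumulated view/metric data and hands it to the
    // exporter's ExportMetrics, which stages it in the bundlers.
    metricexport.NewReader().ReadAndExport(exporter)

    // Flush then drains the bundlers to Stackdriver.
    exporter.Flush()
}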

@zhongduo (Contributor, Author) commented Dec 16, 2020 via email

@quentin-cha (Member)

Based on what I have seen during the investigation, Broker event counts can also be off, so the root cause is most likely in the common layers.

@zhongduo (Contributor, Author) commented Dec 16, 2020 via email

@Harwayne (Contributor) commented Feb 1, 2021

@quentin-cha found the root cause to be OpenCensus not flushing all metric buffers on process shutdown. census-instrumentation/opencensus-go#1248 was made to add an API to flush this buffer.

/unassign

@grantr (Contributor) commented Feb 3, 2021

Looks like that PR was merged. When can we get the fix into this repo?

@grantr (Contributor) commented Feb 3, 2021

Also thanks @quentin-cha for tracking this down!

@quentin-cha (Member)

No worries.
A couple more PRs need to happen for the change to propagate here.
The second PR just got approved: census-ecosystem/opencensus-go-exporter-stackdriver#282
There will be at least two more PRs, one in knative/pkg and one in this repo.
I will close this when all the PRs are merged.

@github-actions (bot) commented May 5, 2021

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
