This repository has been archived by the owner on Jun 19, 2022. It is now read-only.

Possible bug in trigger event count #1920

Closed
zhongduo opened this issue Nov 16, 2020 · 18 comments
Labels: area/broker, area/observability, area/test-and-release (Test infrastructure, tests or release), kind/bug (Something isn't working), lifecycle/stale, priority/1 (Blocks current release defined by release/* label or blocks current milestone)

Comments

@zhongduo (Contributor)

Describe the bug
During the upgrade test, I found that the broker event count is not the same as the sum of all the trigger event counts. Note that the receiver pod does receive as many events as the broker event count reports, so it is possible that we did not record all of the trigger event counts.

Expected behavior
The broker event count should be the same as the sum of all trigger event counts in Stackdriver.

To Reproduce
Run the upgrade test, then check the event counts for the broker and its triggers.

Knative-GCP release version

Additional context

zhongduo added the kind/bug (Something isn't working) label on Nov 16, 2020
grantr added the area/broker and area/test-and-release (Test infrastructure, tests or release) labels on Nov 20, 2020
@Harwayne (Contributor)

The description of the bug is a bit ambiguous. Was the test set up such that every event sent to the Broker should pass the only Trigger's filter? Or such that every event sent into the Broker passes exactly one Trigger's filter?

Which was greater, the Broker's event_count or Sum(Trigger event_count)?


The Broker's event_count and sum of all the Trigger event_counts for that Broker do not need to equal each other.

Broker event_count > Sum(Trigger event_count)

I send 1 event to the Broker and that event does not pass the filter for any of my 5 Triggers.
Broker event_count = 1
Sum(Trigger event_count) = 0

Broker event_count < Sum(Trigger event_count)

I send 1 event to the Broker and that event passes the filter for all of my 5 Triggers.
Broker event_count = 1
Sum(Trigger event_count) = 5

Broker event_count < Trigger event_count

I send 1 event to the Broker and that event passes the filter for my 1 Trigger. However, the initial delivery fails because the downstream service returns a 503. On the next attempt, the Trigger successfully sends the event to the downstream service, which returns a 202.

Broker event_count = 1
Trigger event_count = 2

Harwayne added the priority/awaiting-more-evidence (Items which might need to be fixed, but which need more details first) label on Nov 25, 2020
@zhongduo (Contributor, Author) commented Nov 25, 2020

received events == Broker event_count > Sum(Trigger event_count)

The difference is about 300 missing events out of roughly 4200.

@zhongduo (Contributor, Author)

There are two Triggers: step gets almost all the events from the Broker, while finished gets only one.

@grantr (Contributor) commented Dec 1, 2020

@quentin-cha

@Harwayne (Contributor) commented Dec 2, 2020

@zhongduo and I talked; this issue is reproducible via the TestEventingUpgrades E2E test.

There is a Broker that is being sent events continuously at roughly 30 QPS. Each event going into the Broker should match a single Trigger's filter, and all of the Triggers point to a receiver Pod. At the end of the test, the Broker event_count is 4200, which exactly matches the number of events seen by the receiver Pod. However, Sum(Trigger event_count) is only 3900, which is very odd, as it should be at least as large as the number of events seen by the receiver Pod.

Harwayne added the priority/1 (Blocks current release defined by release/* label or blocks current milestone) and area/observability labels and removed the priority/awaiting-more-evidence (Items which might need to be fixed, but which need more details first) label on Dec 2, 2020
@Harwayne (Contributor) commented Dec 3, 2020

Because the Pods were restarting, I thought it might have something to do with metrics not being flushed properly in the fanout/retry Pods, but from what I can see, the ingress, fanout, and retry Pods all flush metrics in exactly the same way (a sketch of that shared pattern follows below).
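
For illustration only (this is not the actual knative-gcp shutdown code): a minimal sketch of the kind of flush-on-shutdown pattern these pods share, assuming the public contrib.go.opencensus.io/exporter/stackdriver API; the project ID and the exact wiring are made up.

package main

import (
    "context"
    "log"
    "os/signal"
    "syscall"

    "contrib.go.opencensus.io/exporter/stackdriver"
)

func main() {
    // Hypothetical project ID; the real pods get this from their environment.
    exporter, err := stackdriver.NewExporter(stackdriver.Options{ProjectID: "my-project"})
    if err != nil {
        log.Fatal(err)
    }

    // StartMetricsExporter starts an interval reader that periodically copies
    // recorded view data into the exporter's bundlers.
    if err := exporter.StartMetricsExporter(); err != nil {
        log.Fatal(err)
    }

    ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
    defer stop()

    // ... serve traffic and record metrics until the pod is told to terminate ...
    <-ctx.Done()

    // Stop the interval reader and drain the exporter's bundlers before exiting.
    exporter.StopMetricsExporter()
    exporter.Flush()
}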

@Harwayne (Contributor) commented Dec 3, 2020

As far as I can tell this line is the only place a Trigger sends an event to the subscriber:

resp, err := p.sendMsg(ctx, target.Address, msg, transformer.DeleteExtension(eventutil.HopsAttribute))

The only way to get out of that function without calling p.StatsReporter.ReportEventDispatchTime, which is what increments the event_count metric, is if that line returns a non-nil error that is not a timeout.

resp, err := p.sendMsg(ctx, target.Address, msg, transformer.DeleteExtension(eventutil.HopsAttribute))
if err != nil {
    var result *url.Error
    if errors.As(err, &result) && result.Timeout() {
        // If the delivery is cancelled because of timeout, report event dispatch time without resp status code.
        p.StatsReporter.ReportEventDispatchTime(ctx, time.Since(startTime))
    }
    return err
}

But even if that happens, the Trigger should retry the event, which results in the same code being called again; once the event is sent successfully, p.StatsReporter.ReportEventDispatchTime is called, which increments event_count.

@quentin-cha (Member)

Reran the test (TestEventingUpgrades) a couple of times and confirmed the issue.
Observations so far:

  • There is always a difference between the Broker event count and the Trigger event count.
  • The Trigger event count is always less than the Broker event count.
  • The delta varies from run to run.

To investigate:

  • Make sure Triggers have a chance to report all counts before test teardown.
  • Add logging to figure out how many times the Broker and Trigger event counts are actually recorded (a counting-wrapper sketch follows below).
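
A hypothetical sketch of what such counting/logging could look like: wrap the reporter and log every invocation. The StatsReporter interface below is inferred from the single call shape quoted earlier in this thread (ReportEventDispatchTime(ctx, duration)); the real knative-gcp interface and wiring may differ.

package metricsdebug

import (
    "context"
    "sync/atomic"
    "time"

    "go.uber.org/zap"
)

// StatsReporter mirrors the call seen in the dispatch code quoted above;
// it is a stand-in, not the real knative-gcp interface.
type StatsReporter interface {
    ReportEventDispatchTime(ctx context.Context, d time.Duration)
}

// countingReporter delegates to the real reporter and logs a running total,
// so the number of recorded dispatches can be compared with Stackdriver.
type countingReporter struct {
    delegate StatsReporter
    calls    int64
    logger   *zap.SugaredLogger
}

func (c *countingReporter) ReportEventDispatchTime(ctx context.Context, d time.Duration) {
    n := atomic.AddInt64(&c.calls, 1)
    c.logger.Infow("ReportEventDispatchTime invoked", "totalInvocations", n)
    c.delegate.ReportEventDispatchTime(ctx, d)
}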

@quentin-cha (Member)

Observation update. Conclusions so far:

  • The actual Broker and Trigger event counts are being recorded correctly, based on the log instance counts.
  • Pods do get the chance to invoke flushing before being killed.

@quentin-cha (Member)

After some more investigation, a possible theory is that:

When we record metrics, they do not get pushed into the bundlers right away; instead, they are stored in intermediaries called views:

v.addSample(cmd.tm, m.Value(), cmd.attachments, cmd.t)

There is a separate mechanism that periodically puts this view data into the bundlers:

if err := e.StartMetricsExporter(); err != nil {

func (e *statsExporter) startMetricsReader() error {

ir.reader.ReadAndExport(ir.exporter)

This is where the metrics actually get into the bundler to be exported:
func (se *statsExporter) ExportMetrics(ctx context.Context, metrics []*metricdata.Metric) error {

So after the metric is recorded and before the reader puts it into the bundler, the metric is essentially 'in-flight' within the system.
If the pod terminates, flushing of the bundlers is initiated:

func flushExporters(logger *zap.SugaredLogger) {

Notice how flushing only flushes the bundlers, so there is a chance that the reader has not yet pushed the metrics from the views into those bundlers (the 'in-flight' metrics). This likely results in up to one interval of metrics being discarded.
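
A sketch of one way to close that gap (an illustration only, not the eventual upstream fix): run the metric reader once by hand before flushing, so the last interval of view data reaches the bundlers first. This assumes the public go.opencensus.io/metric/metricexport API and that *stackdriver.Exporter satisfies metricexport.Exporter via its ExportMetrics method.

package metricsdebug

import (
    "contrib.go.opencensus.io/exporter/stackdriver"
    "go.opencensus.io/metric/metricexport"
)

// flushAll pushes the current ("in-flight") view data into the exporter's
// bundlers and then drains the bundlers, so the final interval of metrics is
// not discarded on shutdown.
func flushAll(exporter *stackdriver.Exporter) {
    // ReadAndExport reads the accumulated view/metric data and hands it to the
    // exporter's ExportMetrics, which stages it in the bundlers.
    metricexport.NewReader().ReadAndExport(exporter)

    // Flush then drains the bundlers to Stackdriver.
    exporter.Flush()
}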

@zhongduo (Contributor, Author) commented Dec 16, 2020 via email

@quentin-cha (Member)

Based on what I have seen during the investigation, Broker event counts can also be off, so the root cause is most likely in the common layers.

@zhongduo (Contributor, Author) commented Dec 16, 2020 via email

@Harwayne (Contributor) commented Feb 1, 2021

@quentin-cha found the root cause to be OpenCensus not flushing all metric buffers on process shutdown. census-instrumentation/opencensus-go#1248 was made to add an API to flush this buffer.

/unassign

@grantr (Contributor) commented Feb 3, 2021

Looks like that PR was merged. When can we get the fix into this repo?

@grantr (Contributor) commented Feb 3, 2021

Also thanks @quentin-cha for tracking this down!

@quentin-cha (Member)

No worries.
A couple more PRs need to happen for the change to propagate here.
The second PR just got approved: census-ecosystem/opencensus-go-exporter-stackdriver#282
There will be at least two more PRs, one in knative/pkg and one in this repo.
I will close this when all the PRs are merged.

@github-actions (bot) commented May 5, 2021

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.
