Possible bug in trigger event count #1920
The description of the bug is a bit ambiguous. Was the test set up such that every event sent to the Broker should pass the only Trigger's filter? Or that every event sent into the Broker passes exactly one Trigger's filter? Which was greater, the Broker's event_count or Sum(Trigger event_count)? The Broker's event_count and the sum of all the Trigger event_counts for that Broker do not need to equal each other:

- Broker event_count > Sum(Trigger event_count): I send 1 event to the Broker and that event does not pass the filter for any of my 5 Triggers.
- Broker event_count < Sum(Trigger event_count): I send 1 event to the Broker and that event passes the filter for all of my 5 Triggers.
- Broker event_count < Trigger event_count: I send 1 event to the Broker and that event passes the filter for my 1 Trigger. However, the initial delivery fails because the downstream service returns a 503. On the next attempt, the Trigger successfully sends the event to the downstream service, which returns a 202. Broker event_count = 1, Trigger event_count = 2.
In this case, received events == Broker event_count > Sum(Trigger event_count). The difference is about 300 missing events out of about 4200.
There are two triggers.
@zhongduo and I talked; this issue is reproducible via the upgrade test. There is a Broker which is being sent events continuously at roughly 30 QPS. Each event going into the Broker should match a single Trigger's filter, all of which point to a receiver Pod. At the end of the test, the Broker event_count is 4200, which exactly matches the number of events seen by the receiver Pod. However, Sum(Trigger event_count) is only 3900, which is very odd, as it should be at least as large as the number of events seen by the receiver Pod.
Because the Pods were restarting, I thought it might have something to do with metrics not being flushed properly in the fanout/retry Pods, but from what I can see, the ingress, fanout, and retry Pods all flush metrics in exactly the same way.
As far as I can tell, this is the only place a Trigger sends an event to the subscriber: knative-gcp/pkg/broker/handler/processors/deliver/processor.go, lines 151 to 159 (commit 2ef4ca1). The only way to get out of that function without recording the metric is if the delivery itself fails.
But even if that happens, the Trigger should retry the event, which would result in the same code being called again, and once the event is sent successfully the count would be recorded.
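To make that expectation concrete, here is a minimal, hypothetical Go sketch of the pattern described above. It is not the actual processor.go code; the names (deliverWithRetry, send, recordEventCount) and the retry policy are made up for illustration:

```go
package deliver

import (
	"context"
	"errors"
	"time"
)

// deliverWithRetry is a hypothetical stand-in for the deliver processor:
// it sends the event to the subscriber and records the trigger event_count
// only after a successful delivery. A failed attempt (e.g. a 503 from the
// subscriber) is retried, so the count should still be recorded eventually.
func deliverWithRetry(ctx context.Context, send func(context.Context) error, recordEventCount func()) error {
	backoff := 100 * time.Millisecond
	for attempt := 0; attempt < 5; attempt++ {
		if err := send(ctx); err != nil {
			// Delivery failed; wait and run the same code path again.
			select {
			case <-ctx.Done():
				return ctx.Err()
			case <-time.After(backoff):
			}
			backoff *= 2
			continue
		}
		// Delivery succeeded (e.g. a 202): record the trigger event_count.
		recordEventCount()
		return nil
	}
	return errors.New("delivery failed after retries")
}
```

Under this pattern, an event that the receiver Pod actually sees should always be counted at least once, which is why the undercount points away from the delivery path and toward the metrics pipeline.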
Reran the test.
Investigate
Observation update:
Conclusion so far:
After some more investigation, a possible theory is that when we record metrics, the metrics themselves do not get pushed into the bundlers right away; instead, they are stored in intermediaries called views:
There is a separate mechanism, the reader, that periodically puts this view data into the bundler:
This is where the metrics actually get into the bundler to be exported:
So after the metric is recorded and before the reader puts it into the bundler, the metric is essentially 'in-flight' within the system.
Notice how flushing only flushes the bundlers, so there is a chance that the reader has not yet pushed the metrics from the views into those bundlers ('in-flight metrics'). This will likely lead to one interval of metrics being essentially discarded.
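A minimal sketch of that pipeline, assuming the OpenCensus Go stats/view API and the Stackdriver exporter (the measure name, view name, and project ID below are made up for illustration), showing why flushing only the exporter can drop whatever was recorded after the last reporting tick:

```go
package main

import (
	"context"
	"log"
	"time"

	sd "contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
)

var eventCount = stats.Int64("example.com/event_count", "delivered events", stats.UnitDimensionless)

func main() {
	exporter, err := sd.NewExporter(sd.Options{ProjectID: "my-project"}) // hypothetical project
	if err != nil {
		log.Fatal(err)
	}
	view.RegisterExporter(exporter)
	// The reader only moves view data to the exporter once per period.
	view.SetReportingPeriod(60 * time.Second)

	if err := view.Register(&view.View{
		Name:        "example.com/event_count",
		Measure:     eventCount,
		Aggregation: view.Count(),
	}); err != nil {
		log.Fatal(err)
	}

	// Record a delivery. The measurement is aggregated into the view,
	// but it does not reach the exporter until the next reporting tick.
	stats.Record(context.Background(), eventCount.M(1))

	// Flush only drains the exporter's internal bundler. Anything still
	// sitting in the views (recorded after the last tick) is not exported.
	exporter.Flush()
}
```

If the process exits right after Flush(), whatever the reader had not yet moved out of the views never reaches Stackdriver, which matches roughly one interval of trigger counts going missing.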
Based on what I have seen during the investigation, broker event counts can also be off. So the root cause would most likely be in the common layers.
It might be a coincidence in this case then. My previous experience has always been that the broker is getting the right number *when the test succeeds.*
@quentin-cha found the root cause to be OpenCensus not flushing all metric buffers on process shutdown. census-instrumentation/opencensus-go#1248 was made to add an API to flush this buffer. /unassign
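For illustration, one way to force a final export before the process exits is sketched below, assuming the Stackdriver exporter and the go.opencensus.io/metric/metricexport reader. The exact API added by the linked PR may differ, so treat this as an assumption rather than the fix that actually landed:

```go
package shutdown

import (
	"log"

	sd "contrib.go.opencensus.io/exporter/stackdriver"
	"go.opencensus.io/metric/metricexport"
)

// flushOnShutdown drains metrics still buffered in the process:
// ReadAndExport reads the currently registered metric producers (which
// include the view data) and hands them to the exporter, and Flush then
// drains the exporter's own bundler before the Pod terminates.
func flushOnShutdown(exporter *sd.Exporter) {
	metricexport.NewReader().ReadAndExport(exporter)
	exporter.Flush()
	log.Println("metrics flushed before shutdown")
}
```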
Looks like that PR was merged. When can we get the fix into this repo?
Also thanks @quentin-cha for tracking this down! |
No worries. |
This issue is stale because it has been open for 90 days with no activity.
Describe the bug
During the upgrade test, I found that the broker event count is not the same as the sum of all the trigger event counts. Note that the receiver pod does receive as many events as the broker event count, so there is a possibility that we did not record all the trigger event counts.
Expected behavior
The broker event count should be the same as the sum of all trigger event counts in Stackdriver.
To Reproduce
Run the upgrade test, then check the event counts for the broker and the triggers.
Knative-GCP release version
Additional context
Add any other context about the problem here such as proposed priority