Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trace ids are mixed up between unrelated events #3913

Open
mauricecamphadrian opened this issue May 30, 2024 · 5 comments
Open

Trace ids are mixed up between unrelated events #3913

mauricecamphadrian opened this issue May 30, 2024 · 5 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Issues which should be fixed (post-triage)

Comments

@mauricecamphadrian
Copy link

Describe the bug
Trace ids appear to be mixed up by some knative component. We roughly have the following setup:

flowchart LR
AppA("App A")
AppB("App B")
Broker("KNative Kafka Broker")
subgraph Sequence
Channel("KNative Kafka Channel")
end

AppA --> Broker
Broker --> Trigger
Trigger --> Channel
Channel --> AppB
Loading

For each event sent by AppA a new root trace is started with a unique trace id. This is reflected in the traceparent headers of the events we see in the backing topic for the Kafka broker. However the trace ids we receive in AppB get mixed up according to our observability platform, many spans are being grouped together under a few trace ids. This is also reflected in the backing topic of the Kafka channel, which leads me to believe that some KNative component is causing the issue.

Expected behavior
We expect each event to retain its original trace id as it propagates through Knative components.

To Reproduce
I have managed to create a minimal example to reproduce this:

apiVersion: eventing.knative.dev/v1
kind: Broker
metadata:
  annotations:
    eventing.knative.dev/broker.class: Kafka
  name: broker
  namespace: default
spec:
  config:
    apiVersion: v1
    kind: ConfigMap
    name: kafka-broker-config
    namespace: knative-eventing
---
apiVersion: flows.knative.dev/v1
kind: Sequence
metadata:
  name: sequence
  namespace: default
spec:
  channelTemplate:
    apiVersion: messaging.knative.dev/v1beta1
    kind: KafkaChannel
    spec:
      numPartitions: 10
      replicationFactor: 1
  steps:
    - ref:
        apiVersion: serving.knative.dev/v1
        kind: Service
        name: app
      uri: /rejectOnce
---
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: app
  namespace: default
spec:
  template:
    spec:
      containers:
        - image: dev.local/printer:1.0.0 # This is a local test image I created that just prints out the event id and trace parent header as they come in.
          imagePullPolicy: Never
      containerConcurrency: 1 # Set container concurrency to 1 to rule out that the mixup is occuring within the service.
---
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata:
  name: sequence-trigger
  namespace: default
spec:
  broker: broker
  subscriber:
    ref:
      apiVersion: flows.knative.dev/v1
      kind: Sequence
      name: sequence
  1. Create a local kind environment using the development guide and apply the above config (after also installing knative-serving).
  2. Send 20 events with random EventId and random traceparent headers to the knative-eventing broker ingres (http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/broker)
  3. Compare the event ids + traceparent header that were sent with the values received by the knative service:
event id 01f3241d-b852-45d5-b993-d504849b9291, sent traceparent 00-a76aa4ff977e7ea935275cd5d9e29676-8ad811a32b3fa738-01, received traceparent 00-a76aa4ff977e7ea935275cd5d9e29676-25e4be4a7b2fb3ea-01
event id 03b11df6-f503-441c-a33c-5f2f03f07d3d, sent traceparent 00-7f59a9b0c19f17cf45fe669b1ece4137-66e673d96fdf4ac3-01, received traceparent 00-7f59a9b0c19f17cf45fe669b1ece4137-bd3dcbfcf514ad12-01
event id 0db02aed-f9c2-41e6-a176-2afdf62d5c86, sent traceparent 00-ed09636d307e13c17a3b58afc0b4d175-25132c6425ef1ae2-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-77aff79ac84d4bcf-01
event id 2e6e21d5-0e68-4eec-9a3b-0451a5e7e92c, sent traceparent 00-7dc9619f8aa46e74ef6f75ef902b9a8d-b272a264c55d992e-01, received traceparent 00-7dc9619f8aa46e74ef6f75ef902b9a8d-b1adee7e54d59269-01
event id 38242069-115e-4060-a2a1-19621baeeb7f, sent traceparent 00-7f022f328595f47710d01382e225d1b1-016224cc898d4577-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-2e4e015e7ff57262-01
event id 39409e02-40e5-460f-a5a3-e6c07f6fd56c, sent traceparent 00-93cc8642d4e78bf143b1c5316c767ffd-4aaa4c2fca829bf4-01, received traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-39724274340c1454-01
event id 44234477-d8f4-4716-a45e-8b9013bb861a, sent traceparent 00-d9206db7c69bc3192b90a7f9da070521-2d42ec9a043603af-01, received traceparent 00-d9206db7c69bc3192b90a7f9da070521-7e80a43776e2f1af-01
event id 4471e335-7703-4919-b8d3-470afda5988b, sent traceparent 00-821de555043da05edb4bc8c0db1bb300-085552f9cdd020a8-01, received traceparent 00-821de555043da05edb4bc8c0db1bb300-61af9c76a499633b-01
event id 725cd1e2-4ff5-47c3-a759-35d1cc9cd176, sent traceparent 00-269a4069c490fd03ab550dc10004947c-7dca840f73a26f2f-01, received traceparent 00-269a4069c490fd03ab550dc10004947c-31f7d43db3d63e97-01
event id 76b47330-10d3-410f-b7dc-cb2669cd2b84, sent traceparent 00-1204504a7b59e014605a9a9ee67cb63a-78567095817116e4-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-0f478741092b5efb-01
event id 971221f3-77c0-4934-9cbf-b66ba0daeb56, sent traceparent 00-2de4ece65d315d962067ece13d18e0ae-0940dfa1d118ca34-01, received traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-5902ac017f7a2508-01
event id 9c701ed1-2924-4d53-8ed8-6975b48786b1, sent traceparent 00-21c076235d8645fce69c8d784caffd84-35e3222d0ed3a358-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-baf596162236d5f4-01
event id bfd26072-28aa-4380-b724-4022578331a8, sent traceparent 00-65cc555834b1addee9676e0bbca211b3-92a1aca18384b01a-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-c314f1fbf6164764-01
event id c0d2ad5a-ed08-4585-9a7c-9efb110e6d5f, sent traceparent 00-b48e0b56763f4f2ea39f1e83d5292cbc-ed8e6e04c72c52fd-01, received traceparent 00-b48e0b56763f4f2ea39f1e83d5292cbc-e19f57ed19ace1d0-01
event id c14378af-f06a-46f8-a427-cba94b0c1ab3, sent traceparent 00-8644e43ccfa6e28f7984c21ab4517531-cb690ea8acaaba62-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-9e6fa2a8daf646d3-01
event id c6214d3e-88ef-4734-82e7-941b7609aba2, sent traceparent 00-70274871b312a40dc4cee26127955c18-5f26f15ebe464ff9-01, received traceparent 00-70274871b312a40dc4cee26127955c18-70e09b8cb3914b2c-01
event id e37879a9-7a47-407e-9208-3968af6ad2b0, sent traceparent 00-622500832c6ee74a3aed65d7d731c6aa-1f27220a9b643bb9-01, received traceparent 00-622500832c6ee74a3aed65d7d731c6aa-42ddc08f45827ffb-01
event id e8a1870c-8173-46bc-b28f-223c461b9b7c, sent traceparent 00-48be4d81478e6ab1a5a749fdbf49e974-614779bea1d0ee8c-01, received traceparent 00-48be4d81478e6ab1a5a749fdbf49e974-9912d19b8b2e6c6a-01
event id fb3c262d-dc3c-44a8-b594-7eed168468f9, sent traceparent 00-d28ba0ec48164e35b0454b20c1d7676b-bd941a4aef610d08-01, received traceparent 00-1204504a7b59e014605a9a9ee67cb63a-295fae61f85330ea-01
event id fd734c35-c403-4b97-8b5b-4cf3f870c98d, sent traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-94e242c3606ceb74-01, received traceparent 00-bb8c75c08b257670cac3cb2b4b36f686-f5a274f3909032e9-01

As you can see from my results there are several events that were received with trace id 1204504a7b59e014605a9a9ee67cb63a but only 1 was sent with this trace id.

Knative release version
1.14.1
Reproduction was done using commit 28b46159cf8fe6877756459198c0a8f49c2d6626

@mauricecamphadrian mauricecamphadrian added the kind/bug Categorizes issue or PR as related to a bug. label May 30, 2024
@Cali0707
Copy link
Member

cc @pierDipi

@pierDipi
Copy link
Member

/assign

@pierDipi
Copy link
Member

Send 20 events with random EventId and random traceparent headers to the knative-eventing broker ingres (http://kafka-broker-ingress.knative-eventing.svc.cluster.local/default/broker)

@mauricecamphadrian do you have a script I can run to do that or would you mind providing one please?

@pierDipi pierDipi added the triage/accepted Issues which should be fixed (post-triage) label Jun 20, 2024
@mauricecamphadrian
Copy link
Author

I used a small dotnet tool to generate the events since our producing service is also a dotnet service:

Program.cs:

using System.Collections.Concurrent;
using CloudNative.CloudEvents;
using CloudNative.CloudEvents.Http;
using CloudNative.CloudEvents.SystemTextJson;

const string RequestURI = "http://localhost:8080/default/broker";
const int WorkerParallism = 10;
const int RequestsPerWorker = 2;

var bag = new ConcurrentBag<KeyValuePair<Guid, string>>();

await Parallel.ForAsync(0, WorkerParallism, async (i, _) =>
{
    var client = new HttpClient();
    for (var x = 0; x < RequestsPerWorker; x++)
    {
        await MakeRequest(i, client);
    }
});

foreach (var (k, v) in bag)
{
    Console.WriteLine($"Sent event with id {k} and traceparent {v}");
}

async Task MakeRequest(int workerIdx, HttpClient client)
{
    var eventId = Guid.NewGuid();
    var cloudEvent = new CloudEvent
    {
        Data = new(),
        DataContentType = "application/json",
        Id = eventId.ToString(),
        Source = new Uri("tester", UriKind.Relative),
        Time = DateTimeOffset.UtcNow,
        Type = "test.event"
    };

    var traceparent = GenerateRandomTraceParentHeader(out var traceId, out var parentId);

    var content = cloudEvent.ToHttpContent(ContentMode.Binary, new JsonEventFormatter());
    content.Headers.Add("traceparent", traceparent);
    content.Headers.Add("ce-partitionkey", $"{workerIdx}");
    var response = await client.PostAsync(RequestURI, content);

    if (!response.IsSuccessStatusCode)
    {
        Console.WriteLine("[WRN] Unexpected status code received");
    }

    bag.Add(KeyValuePair.Create(eventId, traceparent));
}

static string GenerateRandomTraceParentHeader(out string traceId, out string parentId)
{
    var traceBytes = new byte[16];
    Random.Shared.NextBytes(traceBytes);
    traceId = Convert.ToHexString(traceBytes).ToLowerInvariant();

    var parentBytes = new byte[8];
    Random.Shared.NextBytes(parentBytes);
    parentId = Convert.ToHexString(parentBytes).ToLowerInvariant();

    return $"00-{traceId}-{parentId}-01";
}

Project file:

<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
  </PropertyGroup>

  <ItemGroup>
    <PackageReference Include="CloudNative.CloudEvents" Version="2.7.1" />
    <PackageReference Include="CloudNative.CloudEvents.SystemTextJson" Version="2.7.1" />
    <PackageReference Include="Microsoft.AspNetCore.Mvc" Version="2.2.0" />
  </ItemGroup>

</Project>

I spent some more time trying to find where this seems to be going wrong and I think it happens in the event sending logic in WebClientCloudEventSender.send():

  • Logging the trace id just before VertxMessageFactory.createWriter(req).writeBinary(event) in results in correct trace ids.
  • Logging the trace id in the onSuccess callback of writeBinary() results in an incorrect trace id that matches the trace id as seen in the kafka topic.

I tried to dive deeper but I'm too unfamiliar with java and Vert.x.

@mauricecamphadrian
Copy link
Author

Any update on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Categorizes issue or PR as related to a bug. triage/accepted Issues which should be fixed (post-triage)
Projects
None yet
Development

No branches or pull requests

3 participants