rfc(informational): Combined Replay Envelope Item #82

Merged · 9 commits · Jun 9, 2023
1 change: 1 addition & 0 deletions README.md
@@ -37,3 +37,4 @@ This repository contains RFCs and DACIs. Lost?
- [0072-kafka-schema-registry](text/0072-kafka-schema-registry.md): Kafka Schema Registry
- [0078-escalating-issues](text/0078-escalating-issues.md): Escalating Issues
- [0080-issue-states](text/0080-issue-states.md): Issue States
- [0082-combined-replay-envelope-item](text/0082-combined-replay-envelope-item.md): Combined Replay Envelope Item
102 changes: 102 additions & 0 deletions text/0082-combined-replay-envelope-item.md
@@ -0,0 +1,102 @@
- Start Date: 2023-03-24
- RFC Type: informational
- RFC PR: https://github.com/getsentry/rfcs/pull/82
- RFC Status: draft

# Summary

Right now, Session Replay data is sent from the SDK via two envelope item types:

ReplayEvent and ReplayRecording.

- ReplayEvent contains indexable metadata that is stored in our ClickHouse table.
- ReplayRecording contains a large, usually compressed blob that is written to our Filestore (GCS as of now).

These are always sent in the same envelope by the SDK. We'd like to combine these two envelope item types into one.
Review comment (Member): I think there are some very old SDKs from alpha still out there that sent two separate requests. As far as I'm aware, none of these SDKs belong to paying customers, and they may have stopped ingesting entirely.

I think it's worthwhile documenting what happens in the event of a missing envelope item. I assume we should return a 400 or otherwise reject in some meaningful way, something that tells the client to stop sending.


We'd first write a compatibility layer in Relay that takes the old format and converts the two items into a single item type.
We'd then modify the sentry-sdk to send the combined item type going forward.
Review comment (Member): We need to consider compatibility with Self-Hosted (and Single Tenant) before we change the SDK.


# Motivation

## The Current Flow:

```mermaid
graph
A([SDK Generated Event])-->|HTTP Request| Z{Relay}
Z-->|Published to Kafka topic|B[Replays Kafka Cluster]
B-->|ingest-replay|J[Replay Consumer]
J-->|Buffered Bulk Commit|K[(Clickhouse)]
B-->|ingest-replay-recordings|E[Recording Consumer]
E-->C[Recording Chunk Processor]
C-->|Chunk Stashed|D[(Redis)]
E-->Q[Recording Event Processor]
Q-->F[Chunk Assembly]
I[(Redis)]-->|Chunks Fetched|F
F-->|Assembled Chunks Stored|G[(Blob Storage)]
F-->|File Metadata Saved|H[(PostgreSQL)]
```

We'd like to combine these envelope payloads into one for the following reasons:

- Now that we're decompressing, parsing, and indexing data in the recording, the delineation between the two events no longer makes sense.
- Right now there exists a race condition between ReplayEvent and ReplayRecording: if a ReplayEvent makes it to ClickHouse and is stored before the ReplayRecording, it can result in a bad user experience, because a user can navigate to the replay while the replay's blobs may not be stored yet. We'd like to change it so the Snuba writes happen _downstream_ from the recording consumer, as we don't want to make a replay available for search until its corresponding recording blob has been stored.
- Right now our rate limits are separated per item type. It would be less confusing if the rate limit applied to a single event type.
- It is very hard to do telemetry on the recordings consumer now, as we do not have the SDK version. Combining the envelopes gives the recordings consumer easy access to this metadata.

# Options Considered

New Proposed Flow:

```mermaid
graph
A([SDK Generated Combined Replay Envelope Item])--> Z[Relay, Combines Envelope Items if needed]
AA([Legacy SDK Generated Separate Items]) --> Z
Z-->|Published to Kafka topic, using random partition key|B[Replays Kafka Cluster]
T-->|ingest-replay-events|J[Snuba Consumer]
J-->|Buffered Bulk Commit|K[(Clickhouse)]
B-->|ingest-replay-recordings|E[Recording Consumer]
E-->Q[Parse Replay for Click Events, etc.]
Q-->R[Emit Outcome if segment = 0]
Q-->S[Store Replay Blob in Storage, GCS in SaaS ]
S-->T[ Emit Kafka Message for Snuba Consumer]
```

### Relay Changes:

1. Create a new ItemType `CombinedReplayRecordingEvent`
2. In `processor.rs` (https://github.com/getsentry/relay/blob/606166fca57a84ca1b9240253013871d13827de3/relay-server/src/actors/processor.rs#L1134), replace the `process_replays` function with one that combines the envelope items using the `take_item_by` and `add_item` Envelope functions when they arrive in the old format, or leaves them alone when they are already in the new format. This function would then process the two items and set the item payload.
3. We can begin rejecting replay segments over 10 megabytes, which removes the need for chunking in our recording consumer.

- The combined payload will have the following format:

```
ReplayEventJSON\nCompressedReplayRecording
```

Review comment (Member): Instead of making this a newline-separated string, could we also send it as JSON, like:

```
{
  event: ReplayEventJSON,
  recording: CompressedReplayRecording
}
```

Not sure about the implications for parsing this, but from an SDK perspective this seems a bit nicer/safer than string-concatenating it. Not a hard feeling, just a thought.

Review comment (Member): Ah wait, I guess we can't put the compressed data into JSON 🤔 Now that I think about it, I'm also not sure we can concatenate the Uint8Array that is the compressed payload onto a string 🤔

Review comment (Member): If this merged type is generated on the client, then you will need to serialize with a binary format like msgpack.

Review comment (Member): Hmm, I see. I guess we can do that: convert the payload into a Uint8Array if it isn't compressed, then combine the two into a single Uint8Array.

One downside to note here is that it may make debugging harder. I currently tend to disable compression when debugging issues in order to see what data is sent; that would no longer be possible, as the payload would then always be a Uint8Array. Not a blocker, just something to be aware of.

The downstream consumer can easily parse this by splitting on the first newline.
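
A minimal sketch of that parsing step, assuming the combined payload reaches the consumer as raw bytes (the function name and the zlib compression used in the example are illustrative assumptions, not taken from the actual codebase):

```python
import json
import zlib  # assumption: the recording blob is zlib-compressed in this sketch


def parse_combined_payload(payload: bytes) -> tuple[dict, bytes]:
    """Split a combined replay payload into its event JSON and recording blob.

    Only the first newline is used as the delimiter, because the compressed
    recording that follows it may itself contain 0x0A bytes.
    """
    event_json, _, recording = payload.partition(b"\n")
    return json.loads(event_json), recording


# Example usage with a made-up payload:
combined = b'{"replay_id": "abc"}\n' + zlib.compress(b"[]")
event, recording = parse_combined_payload(combined)
assert event["replay_id"] == "abc"
assert zlib.decompress(recording) == b"[]"
```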

### SDK Changes

Emit a new item type "replay_recording_event" with the format

```
ReplayEventJSON\nCompressedReplayRecording
```

replacing the previous transport calls.
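
A rough sketch of how an SDK could assemble this combined item body before handing it to the transport (all names here are illustrative and the zlib compression is an assumption; the real change would live in the SDK's replay transport code):

```python
import json
import zlib


def build_replay_recording_event(replay_event: dict, recording: bytes) -> bytes:
    """Assemble the combined "replay_recording_event" item body.

    The event is serialized as compact JSON; a standard JSON encoder escapes
    newline characters inside strings, so the header itself contains no raw
    0x0A bytes and the first newline marks the boundary between the parts.
    """
    header = json.dumps(replay_event, separators=(",", ":")).encode("utf-8")
    return header + b"\n" + zlib.compress(recording)
```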

### Replay Recording Consumer Changes

1. Create a new ReplayRecording consumer that can run alongside the existing consumer, as there will be a changeover period
Review comment (Member): I don't think we need a new consumer. We can switch on the type field in the Kafka payload. Right now the struct name is converted to snake case and applied as the "type" value of the serialized object. We would update our consumer to accept the new type and then process accordingly.

Review comment (Member, Author): Got it. 👍🏼

2. This consumer will do all the same things as the previous recordings consumer, with the addition of:
   - emitting a JSON event to the snuba-replay-events Kafka topic (see the sketch below)
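
A condensed sketch of how that combined message could be handled, under the assumption that the blob store and Kafka producer expose interfaces roughly like the ones below (all names here are hypothetical):

```python
import json


def process_combined_message(payload: bytes, storage, producer) -> None:
    """Handle one combined replay message: store the blob, then notify Snuba.

    The snuba-replay-events message is produced only after the recording blob
    has been written, so a replay never becomes searchable before its
    recording is retrievable.
    """
    event_json, _, recording = payload.partition(b"\n")
    replay_event = json.loads(event_json)

    # Persist the (still compressed) recording blob first.
    storage.put(key=replay_event["replay_id"], data=recording)  # hypothetical storage API

    # Only then emit the indexable event for the Snuba consumer.
    producer.produce(  # hypothetical producer API
        topic="snuba-replay-events",
        value=json.dumps(replay_event).encode("utf-8"),
    )
```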

# Drawbacks

This is a decent chunk of engineering work.

# Unresolved questions

- Is there a better format for the combined envelope item type? If we split on \n, this means the replay_event should never contain a raw newline. I believe this is acceptable, but is there a better format for sending a combined JSON / binary piece of data? (See the short check at the end of this section.)
Review comment (Member): Caught this at the very end. I'm leaving my previous feedback in place :P

Review comment (Member): Tags could have a newline character.

- If rate limits are applied before processing, it seems like we'll need to add a new rate limit for this combined item. This should be okay; is there anything else to think about here?
Review comment (Member): I'm not sure I understand. If rate limits are applied before processing, then this merged type doesn't exist yet.

Review comment (Member, Author): Yeah, this would only be the case when people are on new SDKs that send the proposed new combined envelope format.
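
Returning to the first unresolved question above, the newline concern: a quick check, under the assumption that the SDK serializes the event with a standard JSON encoder, which escapes newline characters inside strings rather than emitting literal 0x0A bytes:

```python
import json

# A tag value containing a literal newline character.
event = {"replay_id": "abc", "tags": {"note": "line one\nline two"}}

serialized = json.dumps(event, separators=(",", ":")).encode("utf-8")

# The newline inside the tag value is escaped to the two characters "\" and "n",
# so the serialized document contains no raw newline bytes and the
# first-newline split between event JSON and recording stays unambiguous.
assert b"\n" not in serialized
assert json.loads(serialized)["tags"]["note"] == "line one\nline two"
```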