
rfc(informational): Combined Replay Envelope Item #82

Merged (9 commits, Jun 9, 2023)
1 change: 1 addition & 0 deletions README.md
@@ -43,6 +43,7 @@ This repository contains RFCs and DACIs. Lost?
- [0079-exception-groups](text/0079-exception-groups.md): Exception Groups
- [0080-issue-states](text/0080-issue-states.md): Issue States
- [0081-sourcemap-debugid](text/0081-sourcemap-debugid.md): Reliable JavaScript/SourceMap processing via `DebugId`
- [0082-combined-replay-envelope-item](text/0082-combined-replay-envelope-item.md): Combined Replay Envelope Item
- [0084-move-docs-to-sentry-repository](text/0084-move-docs-to-sentry-repository.md): Move onboarding docs from sentry-docs over to sentry repository
- [0086-sentry-bundler-plugins-api](text/0086-sentry-bundler-plugins-api.md): Sentry Bundler Plugins API
- [0087-graphql-errors](text/0087-graphql-errors.md): Request and Response bodies for GraphQL errors
100 changes: 100 additions & 0 deletions text/0082-combined-replay-envelope-item.md
@@ -0,0 +1,100 @@
- Start Date: 2023-03-24
- RFC Type: informational
- RFC PR: https://github.com/getsentry/rfcs/pull/82
- RFC Status: draft

# Summary

Right now, Session Replay data is sent from the SDK via two envelope item types: ReplayEvent and ReplayRecording.

- ReplayEvent contains indexable metadata that is stored in our ClickHouse table.
- ReplayRecording contains a large, usually compressed blob that is written to our Filestore (currently GCS).

These are always sent in the same envelope by the SDK. We'd like to combine these two envelope item types into one.
Review comment (Member):

I think there are some very old SDKs from alpha still out there that sent two separate requests. As far as I'm aware, none of these SDKs belong to paying customers, and they may have stopped ingesting entirely.

I think it's worthwhile documenting what happens in the event of a missing envelope item. I assume we should return a 400 or otherwise reject in some meaningful way: something that tells the client to stop sending.


We'd first write a compatibility layer in Relay that takes the old format and converts the two items into a single item type.
We'd then modify the Sentry SDKs to send the combined item type going forward.
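A minimal sketch of what this compatibility layer would produce. Relay is written in Rust; this Python sketch is purely illustrative, and all names here are hypothetical, not Relay's actual API:

```python
# Hypothetical sketch of the compatibility shim: given the two legacy
# envelope items, produce the single combined payload described in this RFC.
# Function and field names are illustrative, not Relay's real API.

def combine_legacy_items(replay_event: dict, replay_recording: bytes,
                         recording_headers: dict) -> dict:
    """Merge a legacy ReplayEvent + ReplayRecording pair into one item."""
    return {
        "replay_event": replay_event,
        "replay_recording": replay_recording,
        "replay_recording_headers": recording_headers,
        "version": 1,  # lets the consumer detect the new combined format
    }

# Legacy envelopes carry two items; new SDKs would send one combined item
# with this same shape directly.
combined = combine_legacy_items(
    {"replay_id": "a1b2", "segment_id": 0},  # hypothetical event metadata
    b"\x1f\x8b...",                          # compressed recording blob (truncated)
    {"segment_id": 0},
)
```

New SDKs would emit this combined shape themselves, so the shim only runs for envelopes containing the two separate legacy items.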
Review comment (Member):

We need to consider compatibility with Self-Hosted (and Single Tenant) before we change the SDK.


# Motivation

## The Current Flow:

```mermaid
graph
A([SDK Generated Event])-->|HTTP Request| Z{Relay}
Z-->|Published to Kafka topic|B[Replays Kafka Cluster]
B-->|ingest-replay-events|J[Snuba Consumer]
J-->|Buffered Bulk Commit|K[(Clickhouse)]
B-->|ingest-replay-recordings|E[Recording Consumer]
E-->C[Recording Chunk Processor]
C-->|Chunk Stashed|D[(Redis)]
E-->Q[Recording Event Processor]
Q-->F[Chunk Assembly]
I[(Redis)]-->|Chunks Fetched|F
F-->|Assembled Chunks Stored|G[(Blob Storage)]
F-->|File Metadata Saved|H[(PostgreSQL)]
```

We'd like to combine these envelope payloads into one for the following reasons:

- Now that we're decompressing, parsing, and indexing data in the recording, the delineation between the two events no longer makes sense.
- Right now there is a race condition between ReplayEvent and ReplayRecording: if a ReplayEvent makes it to ClickHouse and is stored before the ReplayRecording, a user can navigate to the replay before its recording blobs have been stored, which results in a bad user experience. We'd like to change it so the Snuba writes happen _downstream_ from the recording consumer, as we don't want to make a replay available for search until its corresponding recording blob has been stored.
- Right now our rate limits are separated per ItemType. It would be less confusing if the rate limit applied to a single event type.
- It is very hard to do telemetry on the recordings consumer today because we do not have the SDK version. Combining the envelopes gives the recordings consumer easy access to this metadata.

# Options Considered

### New Proposed Flow:

```mermaid
graph
A([SDK Generated Combined Replay Envelope Item])--> Z[Relay, Combines Envelope Items if needed]
AA([Legacy SDK Generated Separate Items]) --> Z
Z-->|Published to Kafka topic, using random partition key|B[Replays Kafka Cluster]
T-->|ingest-replay-events|J[Snuba Consumer]
J-->|Buffered Bulk Commit|K[(Clickhouse)]
B-->|ingest-replay-recordings|E[Recording Consumer]
E-->Q[Parse Replay for Click Events, etc.]
Q-->R[Emit Outcome if segment = 0]
Q-->S[Store Replay Blob in Storage, GCS in SaaS]
S-->T[Emit Kafka Message for Snuba Consumer]
```
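The ordering in the proposed flow can be sketched as follows. Note that the message for the Snuba consumer is only emitted after the blob has been stored, which removes the race condition described in the motivation. Every helper and field name here is a stand-in, not the real consumer code:

```python
# Illustrative ordering of the proposed recording consumer. The key point is
# step 3 happening before step 4: the replay only becomes searchable after
# its blob exists. All names are hypothetical stand-ins.

def process_combined_message(message: dict, storage: dict,
                             snuba_queue: list, outcomes: list) -> None:
    event = message["replay_event"]
    recording = message["replay_recording"]

    # 1. Parse the recording for click events, etc. (stubbed out here).
    clicks: list = []

    # 2. Emit an outcome only for the first segment of a replay.
    if event["segment_id"] == 0:
        outcomes.append({"replay_id": event["replay_id"], "outcome": "accepted"})

    # 3. Store the recording blob first...
    storage[event["replay_id"]] = recording

    # 4. ...and only then emit the message for the Snuba consumer.
    snuba_queue.append({"replay_event": event, "clicks": clicks})

storage, snuba_queue, outcomes = {}, [], []
process_combined_message(
    {"replay_event": {"replay_id": "r1", "segment_id": 0},
     "replay_recording": b"blob"},
    storage, snuba_queue, outcomes,
)
```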

### Relay Changes:

1. Create a new ItemType, `CombinedReplayRecordingEvent`.
2. In `processor.rs` (https://github.com/getsentry/relay/blob/606166fca57a84ca1b9240253013871d13827de3/relay-server/src/actors/processor.rs#L1134), add a new function that combines the processed ReplayEvent and ReplayRecording item types into one item.

The Replays Recording Consumer expects MessagePack bytes in the payload field. In our processor, we combine the ReplayEvent and ReplayRecording into a single MessagePack payload and change the content type to `ContentType::MessagePack`.

The MessagePack payload will be a key-value object like:

```
{
    "replay_event": {...},
    "replay_recording": <binary data or uncompressed data>,
    "replay_recording_headers": {...},
    "version": 1
}
```

In our recordings consumer, we'll parse the MessagePack payload, which will give us a dict with these values.

We can also take this opportunity to have Relay put the parsed recording headers in the item. This will remove another complication downstream in the consumer.

We will emit these combined messages onto the existing replay recordings Kafka topic, with a new `version` field added, which the consumer will use to know that the ReplayEvent exists on the message as well.
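A sketch of how the consumer could branch on the new `version` field once the MessagePack payload has been decoded into a dict. The decoding itself (e.g. via the `msgpack` library) is omitted, and the function name is hypothetical; field names follow the schema above:

```python
# Hypothetical consumer-side dispatch on the "version" field. Assumes the
# MessagePack payload has already been decoded (e.g. with msgpack.unpackb);
# decoding is omitted to keep the sketch self-contained.

def handle_payload(payload):
    if isinstance(payload, dict) and payload.get("version") == 1:
        # New combined format: event metadata rides along with the recording,
        # so SDK contextual data is finally available for telemetry.
        return {
            "recording": payload["replay_recording"],
            "headers": payload["replay_recording_headers"],
            "sdk_metadata": payload["replay_event"],
        }
    # Legacy format: the payload is just the bare recording, and no
    # ReplayEvent metadata is available in this consumer.
    return {"recording": payload, "headers": None, "sdk_metadata": None}
```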


### Replay Recording Consumer Changes

1. Look at the version of the payload to determine if it's the new event, and if so, handle additional work for the ReplayEvent, and load the SDK contextual data into the ReplayRecording for events emitted from it.

# Drawbacks

This is a decent chunk of engineering work.

# Unresolved questions

- If rate limits are applied before processing, it seems like we'll need to add a new rate limit for this combined item. This should be okay; is there anything else to think about here?
Review comment (Member):

I'm not sure I understand. If rate limits are applied before processing, then this merged type doesn't exist yet.

Reply (Member Author):

Yeah, this would only be the case when people are on new SDKs that send the proposed combined envelope format.

- We will be adding the new rate limit for the combined item.
- Does it ever make sense to do the combining on the frontend? For now we will not do so.