open-telemetry · carlosalberto · Jan 27, 2022 · Sep 8, 2021 · Sep 10, 2021 · Sep 10, 2021
diff --git a/text/trace/0174-http-semantic-conventions.md b/text/trace/0174-http-semantic-conventions.md
@@ -0,0 +1,174 @@
+# Scenarios and Open Questions for Tracing semantic conventions for HTTP
+
+This document aims to capture scenarios/open questions and a road map, both of
+which will serve as a basis for [stabilizing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable)
+the [existing semantic conventions for HTTP](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md),
+which are currently in an [experimental](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#experimental)
+state. The goal is to declare HTTP semantic conventions stable before the
+end of Q1 2022.
+
+## Motivation
+
+Most observability scenarios involve HTTP communication. For Distributed Tracing
+to be useful across the entire scenario, having good observability for
+HTTP is critical. To achieve this, OpenTelemetry must provide stable conventions
+and guidelines for instrumenting HTTP communication.
+
+Bringing the existing experimental semantic conventions for HTTP to a
+stable state is a crucial step for users and instrumentation authors, as it
+allows them to rely on [stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability),
+and thus to ship and use stable instrumentation.
+
+> NOTE. This OTEP captures a scope for changes should be done to existing
+experimental semantic conventions for HTTP, but does not propose solutions.
+
+## Roadmap for v1.0
+
+1. This OTEP, consisting of scenarios/open questions and a proposed roadmap, is
+   approved and merged.
+2. [Stability guarantees](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#not-defined-semantic-conventions-stability)
+   for semantic conventions are approved and merged. This is not strictly related
+   to semantic conventions for HTTP but is a prerequisite for stabilizing any
+   semantic conventions.
+3. Separate PRs addressing the scenarios and open questions listed in this
+   document are approved and merged.
+4. Proposed specification changes are verified by prototypes for the scenarios
+   and examples below.
+5. The [specification for HTTP semantic conventions for tracing](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md)
+   are updated according to this OTEP and are declared
+   [stable](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/versioning-and-stability.md#stable).
+
+The steps in the roadmap don't necessarily need to happen in the given order,
+some steps can be worked on in parallel.
+
+## Scope for v1.0: scenarios and open questions
+
+> NOTE. The scope defined here is subject for discussions and can be adjusted
+  until this OTEP is merged.
+
+Scenarios and open questions mentioned below must be addressed via separate PRs.
+
+### Error status defaults
+
+4xx responses are no longer create error status codes in case of
+`SpanKind.SERVER`. It seems reasonable to define the same/similar behavior
+for `SpanKind.CLIENT`.
+
+### Required attribute sets
+
+> At least one of the following sets of attributes is required:
+>
+> * `http.url`
+> * `http.scheme`, `http.host`, `http.target`
+> * `http.scheme`, [`net.peer.name`](span-general.md), [`net.peer.port`](span-general.md), `http.target`
+> * `http.scheme`, [`net.peer.ip`](span-general.md), [`net.peer.port`](span-general.md), `http.target`
+
+As a result, users that write queries against raw data or Zipkin/Jaeger don't
+have consistent story across instrumentations and languages. e.g. they'd need to
+write queries like
+`select * where (getPath(http.url) == "/a/b" || getPath(http.target) == "/a/b")`
+
+### Retries and redirects
+
+Each try/redirect/hedging request must have unique context to be traceable and
+to unambiguously ask for support from downstream service, which implies span
+per call.
+
+Redirects: users may need observability into what server hop had an error/took
+too long. E.g., was 500/timeout from the final destination or a proxy?
+
+### Context propagation needs explanation
+
+* Reusing instances of client HTTP requests between tries (it’s likely, so clean
+  up context before making a call).
+
+### Security concerns
+
+Some attributes can contain potentially sensitive information. Most likely, by
+default web frameworks/http clients should not expose that. For v1.0 these
+attributes can be explicitly called out.
+
+For example, `http.target` has a query string that may contain credentials.
+
+## Scope for vNext: scenarios and open questions
+
+### Error status configuration
+
+In many cases 4xx error criteria depends on the app (e.g., for 404/409). As an
+end user, I might want to have an ability to override existing defaults and
+define what HTTP status codes count as errors.
+
+### Optional attributes
+
+As a library owner, I don't understand the benefits of optional attributes:
+they create overhead, they don't seem to be generically useful (e.g. flavor),
+and are inconsistent across languages/libraries unless unified.
+
+### Sampling for noop case
+
+To make it efficient for noop case, need a hint for instrumentation
+(e.g., `GlobalOTel.isEnabled()`) that SDK is present and configured before
+creating pre-sampling attributes.
+
+### Long-polling and streaming
+
+Are there any specifics for these scenarios, e.g. from span duration or status
+code perspective? How to model multiple requests within the same logical
+session?
+
+### HTTP/2, gRPC, WebSockets
+
+Anything we can do better here? In many cases connection has app-lifetime,
+messages are independent - can we explain to users how to do manual tracing
+for individual messages? Do span events per message make sense at all?
+Need some real-life/expertize here.
+
+### Request/Response body capturing
+
+> NOTE. This is technically out-of-scope, but we should have an idea how to let
+  users do it
+
+There is a lot of user feedback that they want it, but
+
+* We can’t read body in generic instrumentation
+* We can let users collect them
+* Attaching to server span is trivial
+* Spec for client: we should have an approach to let users unambiguously
+  associate body with http client span (e.g. outer manual span that wraps HTTP
+  call and response reading and has event/log with body)
+* Reading/writing body may happen outside of HTTP client API (e.g. through
+  network streams) – how users can track it too?
+
+## Out of scope
+
+HTTP protocol is being widely used within many different platforms and systems,
+which brings a lot of intersections with a transmission protocol layer and an
+application layer. However, for HTTP Semantic Conventions specification we want
+to be strictly focused on HTTP-specific aspects of distributed tracing to keep
+the specification clear. Therefore, the following scenarios, including but not
+limited to, are considered out of scope for this workgroup:
+
+* Batch operations.
+* Fan-in and fan-out operations (e.g., GraphQL).
+* Hedging policies. Hedging enables aggressively sending multiple copies of a
+  single request without waiting for a response. Hedged RPCs may be be executed
+  multiple times on the server side, typically by different backends.
+* HTTP as a transport layer for other systems (e.g., Messaging system built on
+  top of HTTP).
+
+To address these scenarios, we might want to work with OpenTelemetry community
+to build instrumentation guidelines going forward.
+
+## General OpenTelemetry open questions
+
+There are several general OpenTelemetry open questions exist today which most
+likely will affect the way scenarios and open questions above are addressed:
+
+* What does a config language look like for overriding certain defaults.
+  For example, what HTTP status codes count as errors?
+* How to handle additional levels of detail for spans, such as retries and
+  redirects?
+  Should it even be designed as levels of detail or as layers reflecting logical
+  or physical interactions/transactions.
+* What is the data model for links? What would a reasonable storage
+  implementation look like?