[Misc] Respect trace headers in grpc server #49

Open · wants to merge 4 commits into base: main
11 changes: 8 additions & 3 deletions Dockerfile.ubi
@@ -163,7 +163,7 @@ RUN microdnf install -y \

ARG PYTHON_VERSION
# 0.4.2 is built for CUDA 12.1 and PyTorch 2.3.0
-ARG VLLM_WHEEL_VERSION=0.4.3
+ARG VLLM_WHEEL_VERSION=0.5.0.post1

RUN curl -Lo vllm.whl https://github.com/vllm-project/vllm/releases/download/v${VLLM_WHEEL_VERSION}/vllm-${VLLM_WHEEL_VERSION}-cp${PYTHON_VERSION//.}-cp${PYTHON_VERSION//.}-manylinux1_x86_64.whl \
&& unzip vllm.whl \
@@ -277,11 +277,16 @@ ENV VLLM_NCCL_SO_PATH=/usr/local/lib/libnccl.so.2
RUN --mount=type=cache,target=/root/.cache/pip \
pip3 install \
# additional dependencies for the TGIS gRPC server
-grpcio-tools==1.63.0 \
+grpcio-tools \
# additional dependencies for openai api_server
accelerate==0.30.0 \
# hf_transfer for faster HF hub downloads
-hf_transfer==0.1.6
+hf_transfer==0.1.6 \
+# additional dependencies for OpenTelemetry tracing
+opentelemetry-sdk \
+opentelemetry-api \
+opentelemetry-exporter-otlp \
+opentelemetry-semantic-conventions-ai

# Triton needs a CC compiler
RUN microdnf install -y gcc \
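The four opentelemetry-* distributions added above are what the tracing code and the example client below import from. A minimal, hypothetical smoke test for the built image (the module-to-distribution mapping is an assumption based on the usual OpenTelemetry packaging) could look like:

```python
# Hypothetical smoke test for the tracing dependencies installed above; the
# module names are assumptions based on standard OpenTelemetry packaging.
from opentelemetry import trace                       # opentelemetry-api
from opentelemetry.sdk.trace import TracerProvider    # opentelemetry-sdk
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
    OTLPSpanExporter)                                  # opentelemetry-exporter-otlp
from opentelemetry.semconv_ai import SpanAttributes    # opentelemetry-semantic-conventions-ai

print("tracing imports OK:",
      trace.__name__, TracerProvider.__name__,
      OTLPSpanExporter.__name__, SpanAttributes.__name__)
```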
19 changes: 17 additions & 2 deletions examples/production_monitoring/Otel.md
@@ -32,21 +32,36 @@
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
```
-Then set vLLM's service name for OpenTelemetry, enable insecure connections to Jaeger and run vLLM:
+Then set vLLM's service name for OpenTelemetry, enable insecure connections to Jaeger and run vLLM with the OpenAI endpoint:
```
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m" --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT"
```
or run vLLM with the grpc endpoint:
```
export OTEL_SERVICE_NAME="vllm-server"
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
python -m vllm.entrypoints.openai.api_server --model="facebook/opt-125m" --otlp-traces-endpoint="$OTEL_EXPORTER_OTLP_TRACES_ENDPOINT" --grpc-port 50051
```

-1. In a new shell, send requests with trace context from a dummy client
+1. In a new shell, send requests with trace context from a dummy http client
```
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
export OTEL_SERVICE_NAME="client-service"
python dummy_client.py
```
or a dummy grpc client:
```
export JAEGER_IP=$(docker inspect --format '{{ .NetworkSettings.IPAddress }}' jaeger)
export OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=grpc://$JAEGER_IP:4317
export OTEL_EXPORTER_OTLP_TRACES_INSECURE=true
export OTEL_SERVICE_NAME="client-service"
python dummy_client_grpc.py
```


1. Open Jaeger webui: http://localhost:16686/

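For the last step, opening the Jaeger webui is enough, but the check can also be scripted. The snippet below is a sketch that assumes Jaeger's internal query API (/api/traces) and its usual JSON shape; neither is a stable, documented interface.

```python
# Sketch: ask Jaeger which traces it has stored for the two service names used
# in Otel.md. The /api/traces path and the "data"/"spans"/"operationName"
# fields are assumptions about Jaeger's internal query API.
import json
import urllib.request

QUERY = "http://localhost:16686/api/traces?service={service}&limit=5"

for service in ("vllm-server", "client-service"):
    with urllib.request.urlopen(QUERY.format(service=service)) as resp:
        traces = json.load(resp).get("data") or []
    print(f"{service}: {len(traces)} recent trace(s)")
    for t in traces:
        print("  spans:", [s["operationName"] for s in t.get("spans", [])])
```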
41 changes: 41 additions & 0 deletions examples/production_monitoring/dummy_client_grpc.py
@@ -0,0 +1,41 @@
import grpc
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import (
OTLPSpanExporter)
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import (BatchSpanProcessor,
ConsoleSpanExporter)
from opentelemetry.trace import SpanKind, set_tracer_provider
from opentelemetry.trace.propagation.tracecontext import (
TraceContextTextMapPropagator)

from vllm.entrypoints.grpc.pb import generation_pb2, generation_pb2_grpc

trace_provider = TracerProvider()
set_tracer_provider(trace_provider)

trace_provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter()))
trace_provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
tracer = trace_provider.get_tracer("dummy-client")

with grpc.insecure_channel("localhost:50051") as channel:
stub = generation_pb2_grpc.GenerationServiceStub(channel)

with tracer.start_as_current_span("client-span",
kind=SpanKind.CLIENT) as span:
prompt = "San Francisco is a"
span.set_attribute("prompt", prompt)

# Inject the current context into the gRPC metadata
headers = {}
TraceContextTextMapPropagator().inject(headers)
metadata = list(headers.items())

reqs = [generation_pb2.GenerationRequest(text=prompt, )]

req = generation_pb2.BatchedGenerationRequest(
model_id="facebook/opt-125m",
requests=reqs,
params=generation_pb2.Parameters(
sampling=generation_pb2.SamplingParameters(temperature=0.0),
stopping=generation_pb2.StoppingCriteria(max_new_tokens=10)))
response = stub.Generate(req, metadata=metadata)
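
What makes the trace propagate is the inject() call: it writes the W3C trace-context keys (traceparent and, when present, tracestate) into the dict, and those travel to the server as ordinary gRPC metadata. A few extra lines appended to the script make that visible; the responses[0].text access is an assumption about the generation.proto schema used by the TGIS gRPC server.

```python
# Illustration only, reusing the objects created by the client script above.
print("injected metadata:", metadata)
# e.g. [('traceparent', '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01')]

# The trace id inside the traceparent value matches the client span's own
# trace id, which is what lets Jaeger stitch client and server spans together.
print("client trace id:", format(span.get_span_context().trace_id, "032x"))

# Assumed response shape: a BatchedGenerationResponse with repeated responses.
print("generated text:", response.responses[0].text)
```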
10 changes: 10 additions & 0 deletions vllm/entrypoints/grpc/grpc_server.py
@@ -43,6 +43,8 @@
TypicalLogitsWarperWrapper)
from vllm.tgis_utils.metrics import (FailureReasonLabel, ServiceMetrics,
TGISStatLogger)
from vllm.tracing import (contains_trace_headers, extract_trace_headers,
log_tracing_disabled_warning)
from vllm.transformers_utils.tokenizer_group import BaseTokenizerGroup

logger = init_logger(__name__)
Expand Down Expand Up @@ -168,12 +170,20 @@ async def Generate(self, request: BatchedGenerationRequest,
prompt=req.text,
prompt_token_ids=input_ids
)
is_tracing_enabled = await self.engine.is_tracing_enabled()
headers = dict(context.invocation_metadata())
trace_headers = None
if is_tracing_enabled:
trace_headers = extract_trace_headers(headers)
if not is_tracing_enabled and contains_trace_headers(headers):
log_tracing_disabled_warning()
generators.append(
# prompt is supplied for observability, the text is not
# re-tokenized when `prompt_token_ids` is supplied
self.engine.generate(inputs=inputs,
sampling_params=sampling_params,
request_id=f"{request_id}-{i}",
trace_headers=trace_headers,
**adapter_kwargs),
)

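The extract_trace_headers, contains_trace_headers, and log_tracing_disabled_warning helpers come from vllm.tracing and are not part of this diff. A rough sketch of what they plausibly do, assuming upstream vLLM's W3C trace-context handling, is shown below; note that gRPC lower-cases metadata keys, which is why dict(context.invocation_metadata()) can be matched directly against the lower-case header names.

```python
# Sketch of the vllm.tracing helpers referenced in grpc_server.py above,
# assuming W3C trace-context headers; the real module may differ in detail.
import logging
from typing import Mapping

logger = logging.getLogger(__name__)

TRACE_HEADERS = ["traceparent", "tracestate"]


def extract_trace_headers(headers: Mapping[str, str]) -> Mapping[str, str]:
    # Forward only the trace-context keys, not arbitrary request metadata.
    return {h: headers[h] for h in TRACE_HEADERS if h in headers}


def contains_trace_headers(headers: Mapping[str, str]) -> bool:
    return any(h in headers for h in TRACE_HEADERS)


_warned_once = False


def log_tracing_disabled_warning() -> None:
    # Warn (once per process) when a client sends trace context but the
    # engine was started without an OTLP traces endpoint configured.
    global _warned_once
    if not _warned_once:
        _warned_once = True
        logger.warning(
            "Received a request with trace context but tracing is disabled")
```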