Skip to content

Core Features

Arthur Gonigberg edited this page Apr 27, 2018 · 17 revisions

Service Discovery

Zuul is built to work seamlessly with Eureka but can also be configured to work with static server lists or a discovery service of your choice.

The standard approach with a Eureka server would look like this:

### Load balancing backends with Eureka

eureka.shouldUseDns=true
eureka.eurekaServer.context=discovery/v2
eureka.eurekaServer.domainName=discovery${environment}.netflix.net
eureka.eurekaServer.gzipContent=true

eureka.serviceUrl.default=http://${region}.${eureka.eurekaServer.domainName}:7001/${eureka.eurekaServer.context}

api.ribbon.NIWSServerListClassName=com.netflix.niws.loadbalancer.DiscoveryEnabledNIWSServerList
api.ribbon.DeploymentContextBasedVipAddresses=api-test.netflix.net:7001

In this configuration you have to specify your Eureka context and location. Given that, Zuul will automatically select the server list from Eureka with the given VIP for the api Ribbon client. You can find more info about Ribbon configuration here.

To configure Zuul with a static server list or a different discovery provider, you'll have to keep the the listOfServers property up to date:

### Load balancing backends without Eureka

eureka.shouldFetchRegistry=false

api.ribbon.listOfServers=100.66.23.88:7001,100.65.155.22:7001
api.ribbon.client.NIWSServerListClassName=com.netflix.loadbalancer.ConfigurationBasedServerList
api.ribbon.DeploymentContextBasedVipAddresses=api-test.netflix.net:7001

Notice in this configuration the server list class name is ConfigurationBasedServerList instead of DiscoveryEnabledNIWSServerList.

Load Balancing

By default Zuul load balances using the ZoneAwareLoadBalancer from Ribbon. The algorithm is a round robin of the instances available in discovery, with availability zone success tracking for resiliency. The load balancer will keep stats for each zone and will drop a zone if the failure rates are above a configurable threshold.

If you want to use your own custom load balancer you can set the NFLoadBalancerClassName property for that Ribbon client namespace or override the getLoadBalancerClass() method in the DefaultClientChannelManager. Note that your class should extend DynamicServerListLoadBalancer.

Ribbon also allows you to configure the load balancing rule. For example, you can swap the RoundRobinRule for the WeightedResponseTimeRule, AvailabilityFilteringRule, or your own rule.

You can find more details here.

Connection Pooling

Zuul does not use Ribbon for making outgoing connections and instead uses its own connection pool, using a Netty client. Zuul creates a connection pool per host, per event loop. It does this in order to reduce context switching between threads and to ensure sanity for both the inbound event loops and outbound event loops. The result is that the entire request is run on the same thread, regardless of which event loop is running it.

One of the side-effects of this strategy is that the minumum amount of connections made to each back-end server can be quite high if you have a lot of Zuul instances running, with a lot of event loops in each. This is important to keep in mind when configuring the connection pool.

Some useful settings, and their default values, for tweaking the connection pool are:

Ribbon Client Config Properties

<originName>.ribbon.ConnectionTimeout                // default: 500 (ms)
<originName>.ribbon.MaxConnectionsPerHost            // default: 50
<originName>.ribbon.ConnIdleEvictTimeMilliSeconds    // default: 60000 (ms)
<originName>.ribbon.ReceiveBufferSize                // default: 32 * 1024
<originName>.ribbon.SendBufferSize                   // default: 32 * 1024
<originName>.ribbon.UseIPAddrForServer               // default: true

Zuul Properties

# Max amount of requests any given connection will have before forcing a close
<originName>.netty.client.maxRequestsPerConnection    // default: 1000

# Max amount of connection per server, per event loop
<originName>.netty.client.perServerWaterline          // default: 4

# Netty configuration connection
<originName>.netty.client.TcpKeepAlive                // default: false
<originName>.netty.client.TcpNoDelay                  // default: false
<originName>.netty.client.WriteBufferHighWaterMark    // default: 32 * 1024
<originName>.netty.client.WriteBufferLowWaterMark     // default: 8 * 1024
<originName>.netty.client.AutoRead                    // default: false

The connection pool also outputs a lot of metrics, so take a look at the Spectator registry if you want to collect them.

Status Categories

Although HTTP statuses are universal they don't provide a lot of granularity. In order to have more specific failure modes, we've created an enumeration of possible failures.

StatusCategory Definition
SUCCESS Successful request
SUCCESS_NOT_FOUND Succesfully proxied but status was 404
SUCCESS_LOCAL_NOTSET Successful request but no StatusCategory was set
SUCCESS_LOCAL_NO_ROUTE Technically successful, but no routing found for the request
FAILURE_LOCAL Local Zuul failure (e.g. exception thrown)
FAILURE_LOCAL_THROTTLED_ORIGIN_SERVER_MAXCONN Request throttled due to max connection limit reached to origin server
FAILURE_LOCAL_THROTTLED_ORIGIN_CONCURRENCY Request throttled due to origin concurrency limit
FAILURE_LOCAL_IDLE_TIMEOUT Request failed due to idle connection timeout
FAILURE_CLIENT_CANCELLED Request failed because client cancelled
FAILURE_CLIENT_PIPELINE_REJECT Request failed because client attempted to send pipelined HTTP request
FAILURE_CLIENT_TIMEOUT Request failed due to read timeout from the client (e.g. truncated POST body)
FAILURE_ORIGIN The origin returned a failure (i.e. 500 status)
FAILURE_ORIGIN_READ_TIMEOUT The request to the origin timed out
FAILURE_ORIGIN_CONNECTIVITY Could not connect to origin
FAILURE_ORIGIN_THROTTLED Origin throttled the request (i.e. 503 status)
FAILURE_ORIGIN_NO_SERVERS Could not find any servers to connect to for the origin
FAILURE_ORIGIN_RESET_CONNECTION Origin reset the connection before the request could complete

You can get or set the status using the StatusCategoryUtils class. For example:

// set
StatusCategoryUtils.setStatusCategory(request.getContext(), ZuulStatusCategory.SUCCESS)
// get
StatusCategoryUtils.getStatusCategory(response)

Retries

One of the key features Netflix uses for resiliency is retries. In Zuul, we take retries seriously and make extensive use of them. We use the following logic to determine when to retry a request:

Retry on errors

  • If error is read timeout, reset connection or connect error

Retry on status codes

  • If the status code is 503
  • If the status code is a configurable idempotent status (see below) and method is one of: GET, HEAD or OPTIONS

We don't retry if we are in a transient state, more specifically:

  • If we have started to send the response back to the client
  • If we have lost any body chunks (partially buffered or truncated bodies)

Associated properties:

# Sets a retry limit for both error and status code retries
<originName>.ribbon.MaxAutoRetriesNextServer            // default: 0

# This is a comma-delimited list of status codes
zuul.retry.allowed.statuses.idempotent                  // default: 500

Request Passport

One of our best tools for debugging is the request passport. It is a time-ordered set of all of the states that a request transitioned through, with the associated timestamps in nanoseconds.

Example of successful request

This is a simple request that runs some filters, does some IO, proxies the request, runs filters on the response and then writes it out to the client.

CurrentPassport {start_ms=1523578203359,
[+0=IN_REQ_HEADERS_RECEIVED,
+260335=FILTERS_INBOUND_START,
+310862=IN_REQ_LAST_CONTENT_RECEIVED,
+1053435=MISC_IO_START,
+2202112=MISC_IO_STOP,
+3917598=FILTERS_INBOUND_END,
+4157288=ORIGIN_CH_CONNECTING,
+4218319=ORIGIN_CONN_ACQUIRE_START,
+4443588=ORIGIN_CH_CONNECTED,
+4510115=ORIGIN_CONN_ACQUIRE_END,
+4765495=OUT_REQ_HEADERS_SENDING,
+4799545=OUT_REQ_LAST_CONTENT_SENDING,
+4820669=OUT_REQ_HEADERS_SENT,
+4822465=OUT_REQ_LAST_CONTENT_SENT,
+4830443=ORIGIN_CH_ACTIVE,
+20811792=IN_RESP_HEADERS_RECEIVED,
+20961148=FILTERS_OUTBOUND_START,
+21080107=IN_RESP_LAST_CONTENT_RECEIVED,
+21109342=ORIGIN_CH_POOL_RETURNED,
+21539032=FILTERS_OUTBOUND_END,
+21558317=OUT_RESP_HEADERS_SENDING,
+21575084=OUT_RESP_LAST_CONTENT_SENDING,
+21594236=OUT_RESP_HEADERS_SENT,
+21595122=OUT_RESP_LAST_CONTENT_SENT,
+21659271=NOW]}

Example of a timeout

This is an example of a timeout. It's similar to the previous example but not the time gap between the outbound request and timeout event.

CurrentPassport {start_ms=1523578490446,
[+0=IN_REQ_HEADERS_RECEIVED,
+139712=FILTERS_INBOUND_START,
+1364667=MISC_IO_START,
+2235393=MISC_IO_STOP,
+3686560=FILTERS_INBOUND_END,
+3823010=ORIGIN_CH_CONNECTING,
+3891023=ORIGIN_CONN_ACQUIRE_START,
+4242502=ORIGIN_CH_CONNECTED,
+4311756=ORIGIN_CONN_ACQUIRE_END,
+4401724=OUT_REQ_HEADERS_SENDING,
+4453035=OUT_REQ_HEADERS_SENT,
+4461546=ORIGIN_CH_ACTIVE,
+45004599181=ORIGIN_CH_READ_TIMEOUT,
+45004813647=FILTERS_OUTBOUND_START,
+45004920343=ORIGIN_CH_CLOSE,
+45004945985=ORIGIN_CH_CLOSE,
+45005052026=ORIGIN_CH_INACTIVE,
+45005246081=FILTERS_OUTBOUND_END,
+45005359480=OUT_RESP_HEADERS_SENDING,
+45005379978=OUT_RESP_LAST_CONTENT_SENDING,
+45005399999=OUT_RESP_HEADERS_SENT,
+45005401335=OUT_RESP_LAST_CONTENT_SENT,
+45005486729=NOW]}

Example of a failed request

This is an example of a request that caused an exception. Again, it's similar to the previous ones, but note the retries and exception events.

CurrentPassport {start_ms=1523578533258,
[+0=IN_REQ_HEADERS_RECEIVED,
+161428=FILTERS_INBOUND_START,
+208805=IN_REQ_LAST_CONTENT_RECEIVED,
+934637=MISC_IO_START,
+1751747=MISC_IO_STOP,
+2606657=FILTERS_INBOUND_END,
+2734497=ORIGIN_CH_CONNECTING,
+2780877=ORIGIN_CONN_ACQUIRE_START,
+3181771=ORIGIN_CH_CONNECTED,
+3272876=ORIGIN_CONN_ACQUIRE_END,
+3376958=OUT_REQ_HEADERS_SENDING,
+3405924=OUT_REQ_LAST_CONTENT_SENDING,
+3557967=ORIGIN_RETRY_START,
+3590208=ORIGIN_CH_CONNECTING,
+3633635=ORIGIN_CONN_ACQUIRE_START,
+3663060=ORIGIN_CH_CLOSE,
+3664703=OUT_REQ_HEADERS_ERROR_SENDING,
+3674443=OUT_REQ_LAST_CONTENT_ERROR_SENDING,
+3681289=ORIGIN_CH_ACTIVE,
+3706176=ORIGIN_CH_INACTIVE,
+4022445=ORIGIN_CH_CONNECTED,
+4072050=ORIGIN_CONN_ACQUIRE_END,
+4144471=OUT_REQ_HEADERS_SENDING,
+4171228=OUT_REQ_LAST_CONTENT_SENDING,
+4186672=OUT_REQ_HEADERS_SENT,
+4187543=OUT_REQ_LAST_CONTENT_SENT,
+4192830=ORIGIN_CH_ACTIVE,
+4273401=ORIGIN_CH_EXCEPTION,
+4274124=ORIGIN_CH_EXCEPTION,
+4303020=ORIGIN_CH_IO_EX,
+4537569=FILTERS_OUTBOUND_START,
+4646348=ORIGIN_CH_CLOSE,
+4748074=ORIGIN_CH_INACTIVE,
+4957163=FILTERS_OUTBOUND_END,
+4968947=OUT_RESP_HEADERS_SENDING,
+4985532=OUT_RESP_LAST_CONTENT_SENDING,
+5003476=OUT_RESP_HEADERS_SENT,
+5004610=OUT_RESP_LAST_CONTENT_SENT,
+5062221=NOW]}

You can log the passport, add it to a header or ship it off to a persistent store for later debugging. To get it out of the request you can either use the channel or the session context. For example:

// from channel
CurrentPassport passport = CurrentPassport.fromChannel(channel);
// from context
CurrentPassport passport = CurrentPassport.fromSessionContext(context);

Request Attempts

Another very useful debugging feature is tracking the request attempts that Zuul makes. We typically add this as an internal-only header on every response, and it makes tracing and debugging requests much simpler for us and our internal partners.

Example of successful request

[{"status":200,"duration":192,"attempt":1,"region":"us-east-1","asg":"simulator-v154","instanceId":"i-061db2c67b2b3820c","vip":"simulator.netflix.net:7001"}]

Example of failed request

[{"status":503,"duration":142,"attempt":1,"error":"ORIGIN_SERVICE_UNAVAILABLE","exceptionType":"OutboundException","region":"us-east-1","asg":"simulator-v154","instanceId":"i-061db2c67b2b3820c","vip":"simulator.netflix.net:7001"},
{"status":503,"duration":147,"attempt":2,"error":"ORIGIN_SERVICE_UNAVAILABLE","exceptionType":"OutboundException","region":"us-east-1","asg":"simulator-v154","instanceId":"i-061db2c67b2b3820c","vip":"simulator.netflix.net:7001"}]

You can get the request attempts from the session context on an outbound filter. For example:

// from context
RequestAttempts attempts = RequestAttempts.getFromSessionContext(context);

HTTP/2

Zuul can run in HTTP/2 mode as demonstrated in the sample app. In this mode, it requires an SSL cert and if you're going to run Zuul behind an ELB you'll have to use the TCP listener.

Relevant HTTP/2 properties:

server.http2.max.concurrent.streams            // default: 100
server.http2.initialwindowsize                 // default: 5242880
server.http2.maxheadertablesize                // default: 65536
server.http2.maxheaderlistsize                 // default: 32768

Mutual TLS

Zuul can run in Mutual TLS mode as demonstrated in the sample app. In this mode, you'll have to have both an SSL cert and trust store for incoming certs. As with HTTP/2, you'll have to run this behind the TCP listener of the ELB.

Proxy Protocol

Proxy protocol is an important feature when using a TCP listener and can be enabled in Zuul using the follow server config properties:

// strip XFF headers since we can no longer trust them
channelConfig.set(CommonChannelConfigKeys.allowProxyHeadersWhen, StripUntrustedProxyHeadersHandler.AllowWhen.NEVER);

// prefer proxy protocol when available
channelConfig.set(CommonChannelConfigKeys.preferProxyProtocolForClientIp, true);

// enable proxy protocol
channelConfig.set(CommonChannelConfigKeys.withProxyProtocol, true);

The client IP will be set correctly on the HttpRequestMessage in the filters, and can also be retrieved directly from the channel:

String clientIp = channel.attr(SourceAddressChannelHandler.ATTR_SOURCE_ADDRESS).get();

GZip

Zuul comes with an outbound GZipResponseFilter that will gzip outgoing responses.

It makes the decision based on content type, body size and whether the request Accept-Encoding header contains gzip.