Core Features
Zuul is built to work seamlessly with Eureka but can also be configured to work with static server lists or a discovery service of your choice.
The standard approach with a Eureka server would look like this:
```
### Load balancing backends with Eureka
eureka.shouldUseDns=true
eureka.eurekaServer.context=discovery/v2
eureka.eurekaServer.domainName=discovery${environment}.netflix.net
eureka.eurekaServer.gzipContent=true

eureka.serviceUrl.default=http://${region}.${eureka.eurekaServer.domainName}:7001/${eureka.eurekaServer.context}

api.ribbon.NIWSServerListClassName=com.netflix.niws.loadbalancer.DiscoveryEnabledNIWSServerList
api.ribbon.DeploymentContextBasedVipAddresses=api-test.netflix.net:7001
```
In this configuration you have to specify your Eureka context and location. Given that, Zuul will automatically fetch the server list from Eureka with the given VIP for the `api` Ribbon client. You can find more info in the Ribbon configuration documentation.
To configure Zuul with a static server list or a different discovery provider, you'll have to keep the `listOfServers` property up to date:
```
### Load balancing backends without Eureka
eureka.shouldFetchRegistry=false

api.ribbon.listOfServers=100.66.23.88:7001,100.65.155.22:7001
api.ribbon.client.NIWSServerListClassName=com.netflix.loadbalancer.ConfigurationBasedServerList
api.ribbon.DeploymentContextBasedVipAddresses=api-test.netflix.net:7001
```
Notice that in this configuration the server list class name is `ConfigurationBasedServerList` instead of `DiscoveryEnabledNIWSServerList`.
By default Zuul load balances using the ZoneAwareLoadBalancer from Ribbon. The algorithm is a round robin of the instances available in discovery, with availability zone success tracking for resiliency. The load balancer will keep stats for each zone and will drop a zone if the failure rates are above a configurable threshold.
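To make the zone-tracking behavior concrete, here is a minimal sketch of the idea (this is illustrative only, not Ribbon's `ZoneAwareLoadBalancer` code; the class and threshold are assumptions): keep per-zone success stats, drop any zone whose failure rate exceeds a threshold, and round-robin across the servers in the remaining zones.

```java
import java.util.*;

// Illustrative sketch of zone-aware load balancing (NOT Ribbon's actual code):
// round-robin across all servers in healthy zones, dropping any zone whose
// tracked failure rate exceeds a configurable threshold.
public class ZoneAwareSketch {
    private final Map<String, int[]> stats = new HashMap<>(); // zone -> {failures, total}
    private final double maxFailureRate;
    private int index = 0;

    public ZoneAwareSketch(double maxFailureRate) {
        this.maxFailureRate = maxFailureRate;
    }

    // Record the outcome of a request routed to the given zone.
    public void record(String zone, boolean success) {
        int[] s = stats.computeIfAbsent(zone, z -> new int[2]);
        if (!success) s[0]++;
        s[1]++;
    }

    private boolean healthy(String zone) {
        int[] s = stats.get(zone);
        if (s == null || s[1] == 0) return true; // no data yet: assume healthy
        return ((double) s[0] / s[1]) <= maxFailureRate;
    }

    // Round-robin over servers in healthy zones only; null if none remain.
    public String choose(Map<String, List<String>> serversByZone) {
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : serversByZone.entrySet()) {
            if (healthy(e.getKey())) candidates.addAll(e.getValue());
        }
        if (candidates.isEmpty()) return null;
        return candidates.get(index++ % candidates.size());
    }
}
```

The real implementation also handles recovery (a dropped zone is re-admitted once its failure rate falls back under the threshold).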
If you want to use your own custom load balancer, you can set the `NFLoadBalancerClassName` property for that Ribbon client namespace, or override the `getLoadBalancerClass()` method in the `DefaultClientChannelManager`. Note that your class should extend `DynamicServerListLoadBalancer`.
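As a property-based sketch, wiring in a custom load balancer for the `api` client would look like this (the class name `com.example.MyLoadBalancer` is a placeholder for your own `DynamicServerListLoadBalancer` subclass):

```
# Hypothetical example: com.example.MyLoadBalancer is a placeholder for
# your own class extending DynamicServerListLoadBalancer
api.ribbon.NFLoadBalancerClassName=com.example.MyLoadBalancer
```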
Ribbon also allows you to configure the load balancing rule. For example, you can swap the `RoundRobinRule` for the `WeightedResponseTimeRule`, the `AvailabilityFilteringRule`, or your own rule. You can find more details in the Ribbon documentation.
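For example, swapping in the weighted response time rule for the `api` client is a one-line property change (this uses Ribbon's standard `NFLoadBalancerRuleClassName` key):

```
api.ribbon.NFLoadBalancerRuleClassName=com.netflix.loadbalancer.WeightedResponseTimeRule
```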
Zuul does not use Ribbon for making outgoing connections and instead uses its own connection pool, using a Netty client. Zuul creates a connection pool per host, per event loop. It does this in order to reduce context switching between threads and to ensure sanity for both the inbound event loops and outbound event loops. The result is that the entire request is run on the same thread, regardless of which event loop is running it.
One of the side-effects of this strategy is that the minimum number of connections made to each back-end server can be quite high if you have a lot of Zuul instances running, each with a lot of event loops. This is important to keep in mind when configuring the connection pool.
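A quick back-of-envelope check makes the point (the figures below are assumed for illustration): with a per-host pool per event loop, the connection count each backend server sees scales multiplicatively.

```java
// Rough sizing sketch: each Zuul instance keeps up to `perServerWaterline`
// connections per backend server on EACH event loop, so the total toward any
// one backend is roughly instances * eventLoops * waterline.
public class ConnectionFloor {
    public static int connectionsPerBackend(int zuulInstances, int eventLoopsPerInstance,
                                            int perServerWaterline) {
        return zuulInstances * eventLoopsPerInstance * perServerWaterline;
    }

    public static void main(String[] args) {
        // e.g. 100 Zuul instances, 16 event loops each, default waterline of 4
        System.out.println(connectionsPerBackend(100, 16, 4)); // prints 6400
    }
}
```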
Some useful settings, and their default values, for tweaking the connection pool are:
```
<originName>.ribbon.ConnectionTimeout                   // default: 500 (ms)
<originName>.ribbon.MaxConnectionsPerHost               // default: 50
<originName>.ribbon.ConnIdleEvictTimeMilliSeconds       // default: 60000 (ms)
<originName>.ribbon.ReceiveBufferSize                   // default: 32 * 1024
<originName>.ribbon.SendBufferSize                      // default: 32 * 1024
<originName>.ribbon.UseIPAddrForServer                  // default: true

# Max amount of requests any given connection will have before forcing a close
<originName>.netty.client.maxRequestsPerConnection      // default: 1000

# Max amount of connections per server, per event loop
<originName>.netty.client.perServerWaterline            // default: 4

# Netty connection configuration
<originName>.netty.client.TcpKeepAlive                  // default: false
<originName>.netty.client.TcpNoDelay                    // default: false
<originName>.netty.client.WriteBufferHighWaterMark      // default: 32 * 1024
<originName>.netty.client.WriteBufferLowWaterMark       // default: 8 * 1024
<originName>.netty.client.AutoRead                      // default: false
```
The connection pool also outputs a lot of metrics, so take a look at the Spectator registry if you want to collect them.
Although HTTP statuses are universal they don't provide a lot of granularity. In order to have more specific failure modes, we've created an enumeration of possible failures.
StatusCategory | Definition
---|---
SUCCESS | Successful request
SUCCESS_NOT_FOUND | Successfully proxied but status was 404
SUCCESS_LOCAL_NOTSET | Successful request but no StatusCategory was set
SUCCESS_LOCAL_NO_ROUTE | Technically successful, but no routing found for the request
FAILURE_LOCAL | Local Zuul failure (e.g. exception thrown)
FAILURE_LOCAL_THROTTLED_ORIGIN_SERVER_MAXCONN | Request throttled due to max connection limit reached to origin server
FAILURE_LOCAL_THROTTLED_ORIGIN_CONCURRENCY | Request throttled due to origin concurrency limit
FAILURE_LOCAL_IDLE_TIMEOUT | Request failed due to idle connection timeout
FAILURE_CLIENT_CANCELLED | Request failed because client cancelled
FAILURE_CLIENT_PIPELINE_REJECT | Request failed because client attempted to send pipelined HTTP request
FAILURE_CLIENT_TIMEOUT | Request failed due to read timeout from the client (e.g. truncated POST body)
FAILURE_ORIGIN | The origin returned a failure (i.e. 500 status)
FAILURE_ORIGIN_READ_TIMEOUT | The request to the origin timed out
FAILURE_ORIGIN_CONNECTIVITY | Could not connect to origin
FAILURE_ORIGIN_THROTTLED | Origin throttled the request (i.e. 503 status)
FAILURE_ORIGIN_NO_SERVERS | Could not find any servers to connect to for the origin
FAILURE_ORIGIN_RESET_CONNECTION | Origin reset the connection before the request could complete
You can get or set the status using the `StatusCategoryUtils` class. For example:

```java
// set
StatusCategoryUtils.setStatusCategory(request.getContext(), ZuulStatusCategory.SUCCESS);
// get
StatusCategoryUtils.getStatusCategory(response);
```
One of the key features Netflix uses for resiliency is retries. In Zuul, we take retries seriously and make extensive use of them. We use the following logic to determine when to retry a request:
- If the error is a read timeout, reset connection, or connect error
- If the status code is 503
- If the status code is a configurable idempotent status (see below) and the method is one of `GET`, `HEAD`, or `OPTIONS`
We don't retry if we are in a transient state, more specifically:
- If we have started to send the response back to the client
- If we have lost any body chunks (partially buffered or truncated bodies)
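The rules above can be sketched as a single predicate (this is a simplified illustration, not Zuul's actual retry code; the method and parameter names are assumptions):

```java
import java.util.Set;

// Sketch of the retry rules described above (NOT Zuul's implementation).
public class RetryDecision {
    private static final Set<String> SAFE_METHODS = Set.of("GET", "HEAD", "OPTIONS");

    public static boolean shouldRetry(boolean readTimeout, boolean connectError,
                                      boolean connectionReset, int status, String method,
                                      Set<Integer> idempotentStatuses,
                                      boolean responseStarted, boolean bodyChunksLost) {
        // Never retry once we are in a transient state.
        if (responseStarted || bodyChunksLost) return false;
        // Retry on connection-level errors.
        if (readTimeout || connectError || connectionReset) return true;
        // Retry on 503 from the origin.
        if (status == 503) return true;
        // Retry configurable idempotent statuses only for safe methods.
        return idempotentStatuses.contains(status) && SAFE_METHODS.contains(method);
    }
}
```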
Associated properties:
```
# Sets a retry limit for both error and status code retries
<originName>.ribbon.MaxAutoRetriesNextServer          // default: 0

# This is a comma-delimited list of status codes
zuul.retry.allowed.statuses.idempotent                // default: 500
```
One of our best tools for debugging is the request passport. It is a time-ordered set of all of the states that a request transitioned through, with the associated timestamps in nanoseconds.
This is a simple request that runs some filters, does some IO, proxies the request, runs filters on the response and then writes it out to the client.
```
CurrentPassport {start_ms=1523578203359,
[+0=IN_REQ_HEADERS_RECEIVED,
+260335=FILTERS_INBOUND_START,
+310862=IN_REQ_LAST_CONTENT_RECEIVED,
+1053435=MISC_IO_START,
+2202112=MISC_IO_STOP,
+3917598=FILTERS_INBOUND_END,
+4157288=ORIGIN_CH_CONNECTING,
+4218319=ORIGIN_CONN_ACQUIRE_START,
+4443588=ORIGIN_CH_CONNECTED,
+4510115=ORIGIN_CONN_ACQUIRE_END,
+4765495=OUT_REQ_HEADERS_SENDING,
+4799545=OUT_REQ_LAST_CONTENT_SENDING,
+4820669=OUT_REQ_HEADERS_SENT,
+4822465=OUT_REQ_LAST_CONTENT_SENT,
+4830443=ORIGIN_CH_ACTIVE,
+20811792=IN_RESP_HEADERS_RECEIVED,
+20961148=FILTERS_OUTBOUND_START,
+21080107=IN_RESP_LAST_CONTENT_RECEIVED,
+21109342=ORIGIN_CH_POOL_RETURNED,
+21539032=FILTERS_OUTBOUND_END,
+21558317=OUT_RESP_HEADERS_SENDING,
+21575084=OUT_RESP_LAST_CONTENT_SENDING,
+21594236=OUT_RESP_HEADERS_SENT,
+21595122=OUT_RESP_LAST_CONTENT_SENT,
+21659271=NOW]}
```
This is an example of a timeout. It's similar to the previous example, but note the time gap between the outbound request and the timeout event.
```
CurrentPassport {start_ms=1523578490446,
[+0=IN_REQ_HEADERS_RECEIVED,
+139712=FILTERS_INBOUND_START,
+1364667=MISC_IO_START,
+2235393=MISC_IO_STOP,
+3686560=FILTERS_INBOUND_END,
+3823010=ORIGIN_CH_CONNECTING,
+3891023=ORIGIN_CONN_ACQUIRE_START,
+4242502=ORIGIN_CH_CONNECTED,
+4311756=ORIGIN_CONN_ACQUIRE_END,
+4401724=OUT_REQ_HEADERS_SENDING,
+4453035=OUT_REQ_HEADERS_SENT,
+4461546=ORIGIN_CH_ACTIVE,
+45004599181=ORIGIN_CH_READ_TIMEOUT,
+45004813647=FILTERS_OUTBOUND_START,
+45004920343=ORIGIN_CH_CLOSE,
+45004945985=ORIGIN_CH_CLOSE,
+45005052026=ORIGIN_CH_INACTIVE,
+45005246081=FILTERS_OUTBOUND_END,
+45005359480=OUT_RESP_HEADERS_SENDING,
+45005379978=OUT_RESP_LAST_CONTENT_SENDING,
+45005399999=OUT_RESP_HEADERS_SENT,
+45005401335=OUT_RESP_LAST_CONTENT_SENT,
+45005486729=NOW]}
```
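Since the passport is just a time-ordered list of `+nanos=STATE` entries, gaps like the one above are easy to find mechanically. Here is a small sketch (the class and method names are ours, not part of Zuul) that parses such entries and reports the largest gap between consecutive states:

```java
import java.util.List;

// Sketch: scan "+nanos=STATE" passport entries and report the transition with
// the largest time gap -- handy for spotting timeouts in passport dumps.
public class PassportGaps {
    public static String largestGap(List<String> entries) {
        long prev = 0, best = -1;
        String prevState = "START", where = null;
        for (String e : entries) {
            int eq = e.indexOf('=');
            long t = Long.parseLong(e.substring(1, eq)); // strip the leading '+'
            String state = e.substring(eq + 1);
            if (t - prev > best) {
                best = t - prev;
                where = prevState + " -> " + state;
            }
            prev = t;
            prevState = state;
        }
        return where;
    }
}
```

Running this over the timeout passport above would flag the `ORIGIN_CH_ACTIVE -> ORIGIN_CH_READ_TIMEOUT` transition, which accounts for nearly all 45 seconds.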
This is an example of a request that caused an exception. Again, it's similar to the previous ones, but note the retries and exception events.
```
CurrentPassport {start_ms=1523578533258,
[+0=IN_REQ_HEADERS_RECEIVED,
+161428=FILTERS_INBOUND_START,
+208805=IN_REQ_LAST_CONTENT_RECEIVED,
+934637=MISC_IO_START,
+1751747=MISC_IO_STOP,
+2606657=FILTERS_INBOUND_END,
+2734497=ORIGIN_CH_CONNECTING,
+2780877=ORIGIN_CONN_ACQUIRE_START,
+3181771=ORIGIN_CH_CONNECTED,
+3272876=ORIGIN_CONN_ACQUIRE_END,
+3376958=OUT_REQ_HEADERS_SENDING,
+3405924=OUT_REQ_LAST_CONTENT_SENDING,
+3557967=ORIGIN_RETRY_START,
+3590208=ORIGIN_CH_CONNECTING,
+3633635=ORIGIN_CONN_ACQUIRE_START,
+3663060=ORIGIN_CH_CLOSE,
+3664703=OUT_REQ_HEADERS_ERROR_SENDING,
+3674443=OUT_REQ_LAST_CONTENT_ERROR_SENDING,
+3681289=ORIGIN_CH_ACTIVE,
+3706176=ORIGIN_CH_INACTIVE,
+4022445=ORIGIN_CH_CONNECTED,
+4072050=ORIGIN_CONN_ACQUIRE_END,
+4144471=OUT_REQ_HEADERS_SENDING,
+4171228=OUT_REQ_LAST_CONTENT_SENDING,
+4186672=OUT_REQ_HEADERS_SENT,
+4187543=OUT_REQ_LAST_CONTENT_SENT,
+4192830=ORIGIN_CH_ACTIVE,
+4273401=ORIGIN_CH_EXCEPTION,
+4274124=ORIGIN_CH_EXCEPTION,
+4303020=ORIGIN_CH_IO_EX,
+4537569=FILTERS_OUTBOUND_START,
+4646348=ORIGIN_CH_CLOSE,
+4748074=ORIGIN_CH_INACTIVE,
+4957163=FILTERS_OUTBOUND_END,
+4968947=OUT_RESP_HEADERS_SENDING,
+4985532=OUT_RESP_LAST_CONTENT_SENDING,
+5003476=OUT_RESP_HEADERS_SENT,
+5004610=OUT_RESP_LAST_CONTENT_SENT,
+5062221=NOW]}
```
You can log the passport, add it to a header or ship it off to a persistent store for later debugging. To get it out of the request you can either use the channel or the session context. For example:
```java
// from channel
CurrentPassport passport = CurrentPassport.fromChannel(channel);
// from context
CurrentPassport passport = CurrentPassport.fromSessionContext(context);
```
Another very useful debugging feature is tracking the request attempts that Zuul makes. We typically add this as an internal-only header on every response, and it makes tracing and debugging requests much simpler for us and our internal partners.
```json
[{"status":200,"duration":192,"attempt":1,"region":"us-east-1","asg":"simulator-v154","instanceId":"i-061db2c67b2b3820c","vip":"simulator.netflix.net:7001"}]
```

```json
[{"status":503,"duration":142,"attempt":1,"error":"ORIGIN_SERVICE_UNAVAILABLE","exceptionType":"OutboundException","region":"us-east-1","asg":"simulator-v154","instanceId":"i-061db2c67b2b3820c","vip":"simulator.netflix.net:7001"},
{"status":503,"duration":147,"attempt":2,"error":"ORIGIN_SERVICE_UNAVAILABLE","exceptionType":"OutboundException","region":"us-east-1","asg":"simulator-v154","instanceId":"i-061db2c67b2b3820c","vip":"simulator.netflix.net:7001"}]
```
You can get the request attempts from the session context on an outbound filter. For example:
```java
// from context
RequestAttempts attempts = RequestAttempts.getFromSessionContext(context);
```
Sometimes origins can get into trouble, particularly when the volume of requests exceeds their capacity. Given that we're a proxy, a bad origin could potentially affect other origins by saturating our connections and memory. In order to protect origins and Zuul, we have concurrency limits in place to help smooth over service interruptions.
There are two ways we manage origin concurrency:

```
# Overall concurrency limit for the origin
zuul.origin.<originName>.concurrency.max.requests     // default: 200
zuul.origin.<originName>.concurrency.protect.enabled  // default: true

# Per-host concurrency limit
<originName>.ribbon.MaxConnectionsPerHost             // default: 50
```
If an origin exceeds overall concurrency or per-host concurrency, Zuul will return a 503 to the client.
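The mechanism can be sketched as a simple in-flight counter (a simplified illustration, not Zuul's actual limiter; class and method names are ours):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Simplified sketch of per-origin concurrency protection: admit a request only
// while in-flight count is under the limit, otherwise reject with a 503.
public class ConcurrencyLimiter {
    private final int maxRequests;
    private final AtomicInteger inFlight = new AtomicInteger();

    public ConcurrencyLimiter(int maxRequests) {
        this.maxRequests = maxRequests;
    }

    // Returns 0 if the request is admitted, 503 if it should be throttled.
    public int tryAcquire() {
        if (inFlight.incrementAndGet() > maxRequests) {
            inFlight.decrementAndGet(); // roll back the optimistic increment
            return 503;
        }
        return 0;
    }

    // Must be called when an admitted request completes.
    public void release() {
        inFlight.decrementAndGet();
    }
}
```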
Zuul can run in HTTP/2 mode, as demonstrated in the sample app. In this mode it requires an SSL cert, and if you're going to run Zuul behind an ELB, you'll have to use the TCP listener.
Relevant HTTP/2 properties:

```
server.http2.max.concurrent.streams   // default: 100
server.http2.initialwindowsize        // default: 5242880
server.http2.maxheadertablesize       // default: 65536
server.http2.maxheaderlistsize        // default: 32768
```
Zuul can run in Mutual TLS mode as demonstrated in the sample app. In this mode, you'll have to have both an SSL cert and trust store for incoming certs. As with HTTP/2, you'll have to run this behind the TCP listener of the ELB.
Proxy protocol is an important feature when using a TCP listener, and can be enabled in Zuul using the following server channel configuration:
```java
// strip XFF headers since we can no longer trust them
channelConfig.set(CommonChannelConfigKeys.allowProxyHeadersWhen, StripUntrustedProxyHeadersHandler.AllowWhen.NEVER);
// prefer proxy protocol when available
channelConfig.set(CommonChannelConfigKeys.preferProxyProtocolForClientIp, true);
// enable proxy protocol
channelConfig.set(CommonChannelConfigKeys.withProxyProtocol, true);
```
The client IP will be set correctly on the `HttpRequestMessage` in the filters, and can also be retrieved directly from the channel:

```java
String clientIp = channel.attr(SourceAddressChannelHandler.ATTR_SOURCE_ADDRESS).get();
```
Zuul comes with an outbound `GZipResponseFilter` that will gzip outgoing responses. It makes the decision based on content type, body size, and whether the request's `Accept-Encoding` header contains `gzip`.
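The decision can be sketched as a predicate over those three inputs (an illustrative simplification, not the filter's actual criteria; the size threshold and compressible-type check are assumptions):

```java
// Sketch of a gzip decision over the three inputs the text describes:
// Accept-Encoding, body size, and content type. Thresholds are assumed.
public class GzipDecision {
    public static boolean shouldGzip(String contentType, int bodySize,
                                     String acceptEncoding, int minBodySize) {
        // Client must advertise gzip support.
        if (acceptEncoding == null || !acceptEncoding.toLowerCase().contains("gzip")) return false;
        // Tiny bodies aren't worth compressing.
        if (bodySize < minBodySize) return false;
        // Only compress text-like content types.
        return contentType != null && (contentType.startsWith("text/")
                || contentType.contains("json") || contentType.contains("xml"));
    }
}
```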