
在 Reddit 中代码部署的演进 (The Evolution of Code Deploys at Reddit) #1777

Merged: 4 commits, merged into xitu:master on Jun 21, 2017
Conversation

steinliber (Contributor):

Completed the article translation.

steinliber (Contributor, Author):

Translation is done @sqrthree

steinliber changed the title from 完成文章的翻译 to 在 Reddit 中代码部署的演进 on Jun 17, 2017
zaraguo (Contributor) commented on Jun 19, 2017:

@sqrthree Requesting to proofread.

linhe0x0 (Member):

@zaraguo Sure thing.

CACppuccino (Contributor):

@sqrthree Claiming a proofreading slot.

linhe0x0 (Member):

@CACppuccino OK.

linhe0x0 (Member):

@zaraguo Don't forget to come proofread.

zaraguo (Contributor) left a review:

@steinliber @sqrthree Proofreading complete.


**We’re constantly deploying code at Reddit**. Every engineer writes code, gets it reviewed, checks it in, and rolls it out to production regularly. This happens as often as 200 times each week and a deploy usually takes fewer than 10 minutes end-to-end.
**Reddit 我们仍然不断的部署代码**。每个工程师都会编写代码,再让其他人审查这份代码,合并代码之后再定期把代码推到生产环境。这种情形每周经常会发生200次而且每次部署从开始到结束都不会超过 10 分钟。
Review comment (zaraguo):

仍然不断的 => 仍然不断地 (的 → 地 for the adverbial particle); 生200次 => 生 200 次 (add spaces around the numeral).


One thing that hasn’t changed over the years is that requests are classified at the load balancer and assigned to specific “pools” of otherwise identical application servers. For example, [listing](https://www.reddit.com/r/rarepuppers/) and [comment](https://www.reddit.com/r/AskReddit/comments/cq1q2/help_reddit_turned_spanish_and_i_cannot_undo_it/) pages are served from separate pools of servers. While any given r2 process could handle any kind of request, individual pools are isolated from spikes of traffic to other pools and can fail independently when they have different dependencies.
一直到现在都没有改变的事是请求在负载均衡器上会被分类并被分配到其它独立应用服务器特殊的 "请求池" 中。比如说,[listing](https://www.reddit.com/r/rarepuppers/) [comment](https://www.reddit.com/r/AskReddit/comments/cq1q2/help_reddit_turned_spanish_and_i_cannot_undo_it/) 页面是在不同的请求池里处理。虽然任何给定的 r2 进程都可以处理任何类型的请求,单个池与其他池的请求流量是隔离的,并且当它们有不同的依赖关系时,每个池的失败也是隔离的。
Review comment (zaraguo):

There are extra spaces around 和 [comment](xxx). "spikes of traffic" means traffic spikes, i.e. 请求高峰. "listing" and "comment" can be translated as 列表页 and 评论页.
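
For reference, the pool mechanism amounts to classifying each request at the load balancer and handing it to one group of interchangeable app servers. A minimal Python sketch of that idea (the path prefixes and pool names here are invented; Reddit's real classification rules lived in the load balancer itself):

```python
# Toy request classifier: map a request path to a pool of identical
# app servers. Prefixes and pool names are illustrative only.
POOL_BY_PREFIX = {
    "/comments/": "comment-pool",   # comment pages
    "/r/": "listing-pool",          # listing pages
}

def pick_pool(path: str, default: str = "generic-pool") -> str:
    """Return the pool that should serve this request path."""
    for prefix, pool in POOL_BY_PREFIX.items():
        if path.startswith(prefix):
            return pool
    return default   # any r2 process could serve any request anyway

assert pick_pool("/r/rarepuppers") == "listing-pool"
```

Isolating pools this way means a traffic spike on listing pages degrades only the listing pool, and a pool with different dependencies can fail on its own.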

```
foreach $h (@hostlist) {
    # ssh to $h, update the code checkout via git,
    # then restart the application processes (distilled, not real code)
}
```

The deploy was sequential. It worked its way through servers one by one. As simple as that sounds, this was actually a good thing: it allowed for a form of canary deploy. If you deployed to a few servers and noticed a new exception popping up, you’d know that you introduced a bug and could abort (Ctrl-C) and revert before affecting all requests. Because of the ease of deploys, it was easy to try things out in production and low friction to revert if it didn’t work out. It also meant it was necessary to do only one deploy at a time to ensure that new errors were from *your *deploy and not *that other one* so it was easier to know when and what to revert.
整个部署过程是顺序的。它一个接一个的在服务器上完成它的工作。就像听起来那么简单,这实际上是一件很棒的事情:它允许一定形式的金丝雀部署。如果你部署了少数服务器的时候一个新的异常突然出现,这时你知道引入了一个 bug 就可以马上中断(Ctrl-C)部署并且回滚之前已经部署的服务器,这样就不会影响全部的请求。因为部署的简单性,我们可以很轻易的在生产环境尝试新事物并且在它不工作的情况下也可以很轻松的还原到之前状态。这也意味着在同一时间内只执行一次部署是很有必要的,这可以保证新的错误是源自于**你**的部署而不是**其他人**的部署,从而可以很简单的知道何时以及如何恢复之前的状态。
Review comment (zaraguo):

"revert before affecting all requests", i.e. 在影响所有的请求前回滚 (roll back before all requests are affected). Also: 恢复 => 恢复到.
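
Because the loop touched one server at a time, the first few hosts acted as canaries. A hypothetical Python rendering of that behavior (host names and remote commands are stand-ins; the real tool was a Perl script driving ssh):

```python
import subprocess
import sys

# Sequential deploy with a manual abort, as the paragraph describes.
# Host list and remote commands are illustrative stand-ins.
hostlist = ["app-01", "app-02", "app-03"]

def deploy_one(host: str) -> None:
    """Update the code on one host, then restart its app processes."""
    subprocess.run(
        ["ssh", host, "git -C /opt/app pull && restart-app"],
        check=True,
    )

for host in hostlist:
    try:
        deploy_one(host)  # early hosts serve as canaries for the rest
    except (subprocess.CalledProcessError, KeyboardInterrupt):
        # Ctrl-C (or a failing host) stops the rollout before the bug
        # reaches every server; hosts already updated can be reverted.
        print(f"deploy aborted at {host}; revert the updated hosts")
        sys.exit(1)
```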


Then we hired a bunch, growing to six whole engineers, and now fit into a [somewhat larger conference room](https://redditblog.com/2011/07/06/its-time-for-us-to-pack-up-and-move-on-to-bigger-and-better-things/). We started to feel a need for better coordination around deploys, particularly when individuals were working from home. We modified the *push* tool to announce when deploys started and ended via an IRC chatbot. The bot just sat in IRC and announced events. The process for actually doing the deploy looked the same, but now the system did the work for you and told everyone what you were doing.
这是我们在部署工作流中第一次使用开始使用聊天机器人。在这段时间里,有很多管理部署系统的会话都**源自于**聊天机器人,但是因为我们使用的是第三方的 IRC 服务器,所以我们对于生产环境的控制中并不能完全信任这个聊天室,所以它仍然是单向的信息。
Review comment (zaraguo):

The phrase "第一次使用开始使用" duplicates itself. Also: 所以它仍然是单向的信息 => 所以它只保留了单向的信息流.
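
The bot's job was purely one-way announcements. A minimal sketch of such an announcer in Python (server, nick, and channel names are hypothetical, and a real bot would wait for the server's welcome reply before joining):

```python
import socket

# One-shot, one-way IRC announcement: the bot reports deploy events
# and accepts no commands back. All names below are made up.
def announce(message: str,
             server: str = "irc.example.com", port: int = 6667,
             nick: str = "deploybot", channel: str = "#deploys") -> None:
    with socket.create_connection((server, port)) as sock:
        def send(line: str) -> None:
            sock.sendall((line + "\r\n").encode())
        send(f"NICK {nick}")
        send(f"USER {nick} 0 * :{nick}")
        send(f"JOIN {channel}")
        send(f"PRIVMSG {channel} :{message}")
        send("QUIT :bye")

announce("deploy of r2 started")
```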


![](https://redditupvoted.files.wordpress.com/2017/06/unshuffled.png?w=720&h=104)

We used [uWSGI](https://uwsgi-docs.readthedocs.io/en/latest/) to manage worker processes and so when we told the application to restart it would kill the existing processes and spin up new ones. The new ones took some time to get ready to serve requests and, combined with incidentally targeting a single pool at a time, this would impact the capacity of that pool to serve requests. So we were limited in the rate we could safely deploy to servers. As the list of servers grew, so did the length of the deploys.
我们使用 [uWSGI](https://uwsgi-docs.readthedocs.io/en/latest/) 来管理工作进程,当我们通知这个应用重启时,它将会关闭已经存在的进程并且生成新的进程。这个新的进程需要一段时间才能准备好处理请求,并且我们是在同一时间内处理一个池,这将会影响池处理请求的能力。所以我们把部署速度限制到可以保证安全的速度。当服务器数量增多时,部署的时间也会变长。
Review comment (zaraguo):

There is an extra space after 我们使用.


Reddit’s infrastructure needs to support the team as it grows and constantly builds new things. The rate of growth of the company is the highest it’s ever been in Reddit’s history, and we’re working on bigger more interesting projects than ever before. The big issues facing us today are twofold: improving engineer autonomy while maintaining system security in the production infrastructure, and evolving a safety net for engineers to deploy quickly with confidence.
Reddit 的基础设施需要支持团队的扩大和新项目的构建。现在 Reddit 这家公司的发展速度比历史上的任何时候都要快,而且我们正在开发比以前更大,更有趣的项目。我们今天遇到的大问题有两个方面:首先要在保持生产环境基础设施安全的情况下提高工程师的自主权,还要逐步建立一个可以让工程师安全快速部署的安全网络。
Review comment (zaraguo):

安全 => 可以放心地 ("with confidence", rather than just "safely").


With the increased number of servers dedicated to the monolith, the deploy time grew. We wanted to deploy with a high parallel count but doing so would have caused too many simultaneous restarts to the app servers. Hence, we were below capacity and unable to serve incoming requests, overloading other app servers.
随着专用于单体应用的服务器数量增加,部署的时间也增长了。我们希望通过高并行数同时部署来解决这个问题,但是这样做会导致过多同时重新启动的应用服务器。这样我们服务器的容量就会不足,导致不足以处理接受的请求,使其它的应用服务器过载。
Review comment (zaraguo):

这样我们服务器 => 这样我们可用服务器.
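
The capacity argument is simple arithmetic: restarting p of a pool's N servers at once leaves (N - p)/N of its capacity serving. A back-of-the-envelope check with made-up numbers:

```python
# If a pool runs at `utilization`, its headroom bounds how many servers
# may be restarting at once. Numbers are illustrative, not Reddit's.
def max_safe_parallelism(n_servers: int, utilization: float) -> int:
    """Largest p with (n_servers - p) / n_servers >= utilization."""
    return int(n_servers * (1.0 - utilization))

# A 100-server pool at 75% utilization tolerates ~25 concurrent
# restarts; push past that and the surviving servers take the overload.
print(max_safe_parallelism(100, 0.75))  # -> 25
```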


Gunicorn’s main process used the same model as uWSGI and would restart all workers at once. While the new worker processes are booting, you are unable to serve any requests. The startup time of our monolith ranged from 10-30 seconds which meant during that period we would be unable to serve any requests. To work around this, we replaced the gunicorn master process with Stripe’s worker manager [Einhorn](https://github.com/stripe/einhorn), while [keeping gunicorn’s HTTP stack and WSGI container](https://github.com/reddit/reddit/blob/master/r2/r2/lib/einhorn.py). Einhorn restarts worker processes by spawning one new worker, waiting for it to declare itself ready, then reaping an old worker, and repeating until all are upgraded. This created a safety net and allowed us to be at capacity during a deploy.
Gunicorn 的主进程使用的是和 uWSGI 相同的模式,它将会同时重启所有的工作进程。在新的工作进程启动阶段,你都不能处理任何请求。我们单体应用的工作进程启动时间为 10-30 秒,这意味着在这段时间内,我们将无法处理任何请求。为了解决这个问题,我们用 Stripeworker 管理器 [Einhorn](https://github.com/stripe/einhorn) 取代了 gunicorn 的主进程,但是仍然[保存 gunicornHTTP 堆栈和 WSGI 容器](https://github.com/reddit/reddit/blob/master/r2/r2/lib/einhorn.py)Einhorn 通过产生一个新的工作进程来重启旧的工作进程,等到这个新的进程声明已经准备好处理请求之后关闭老的工作进程,重复前面的步骤直到全部服务器都升级好。这样创建了一个安全的网络可以让我们在部署期间仍能保证服务器的容量。
Review comment (zaraguo):

In "Einhorn restarts worker processes by spawning one new worker, waiting for it to declare itself ready, then reaping an old worker, and repeating until all are upgraded", everything after "by" is its object, so it can be translated as: 用 Einhorn 重启工作进程的方式是:先产生一个新的工作进程,等到这个新的进程声明已经准备好处理请求之后关闭老的工作进程,重复前面的步骤直到全部服务器都升级好。

"capacity" can be translated as 处理能力, consistent with the earlier usage.
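
The spawn-wait-reap cycle is what keeps capacity flat during a restart. A simplified Python model of that cycle (not Einhorn's actual code; the worker command and readiness check are stand-ins, since real Einhorn workers report readiness back over a control socket):

```python
import subprocess
import time

def spawn_worker(cmd: list[str]) -> subprocess.Popen:
    """Start one replacement worker process."""
    return subprocess.Popen(cmd)

def wait_until_ready(proc: subprocess.Popen, boot_time: float = 1.0) -> None:
    """Stand-in for Einhorn's readiness ACK: just allow boot time here."""
    time.sleep(boot_time)

def rolling_restart(old: list[subprocess.Popen],
                    cmd: list[str]) -> list[subprocess.Popen]:
    """Spawn one new worker, wait for it, reap one old worker, repeat.

    The count of serving workers never drops below its starting value.
    """
    new = []
    for worker in old:
        replacement = spawn_worker(cmd)
        wait_until_ready(replacement)  # only now is it safe to reap
        worker.terminate()
        worker.wait()
        new.append(replacement)
    return new
```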


This new mechanism allows us to deploy to a lot more machines concurrently, and deploy timings are down to 7 minutes for around 800 servers despite the extra waiting for safety.
这个新的机制让我们可以并行地部署更多的服务器,无视因为安全而等待的额外时间,对于将近 800 台服务器部署的时间最多为 7 分钟。
Review comment (zaraguo):

最多为 => 降至 ("down to", not "at most").


## In retrospect
## 忆古思今
Review comment (zaraguo):

Nice!

CACppuccino (Contributor) left a review:

@sqrthree @steinliber Proofreading complete. The translation quality is very good; I only did some polishing.


![](https://redditupvoted.files.wordpress.com/2017/06/pools.png?w=720&h=331)

The *push* tool had a hard-coded list of servers in it and was built around the monolith’s deploy process. It would iterate through all the application servers, SSH into the machine, run a pre-set sequence of commands to update the copy of code on the server via git, then restart all the application processes. In essence (heavily distilled, not real code):
**push** 工具在代码里有一个硬编码的服务器列表并且它是围绕单个应用部署的过程所构建的。它将会遍历所有的应用服务器,使用 SSH 登录到那台机器,运行一系列预设的命令来通过 git 更新服务器上的代码副本,然后重启所有的应用进程。实际上过程如下(大量简化,不是真实的代码):
Review comment (CACppuccino):

[push 工具在代码里有一个硬编码的服务器列表并且它是围绕单个应用部署的过程所构建的。] => [push 工具在代码里有一个硬编码的服务器列表并且它是围绕整个应用部署的过程所构建的。] Suggest translating "monolith's" as 整个, not 单个.

```
foreach $h (@hostlist) {
    # ssh to $h, update the code checkout via git,
    # then restart the application processes (distilled, not real code)
}
```

The deploy was sequential. It worked its way through servers one by one. As simple as that sounds, this was actually a good thing: it allowed for a form of canary deploy. If you deployed to a few servers and noticed a new exception popping up, you’d know that you introduced a bug and could abort (Ctrl-C) and revert before affecting all requests. Because of the ease of deploys, it was easy to try things out in production and low friction to revert if it didn’t work out. It also meant it was necessary to do only one deploy at a time to ensure that new errors were from *your *deploy and not *that other one* so it was easier to know when and what to revert.
整个部署过程是顺序的。它一个接一个的在服务器上完成它的工作。就像听起来那么简单,这实际上是一件很棒的事情:它允许一定形式的金丝雀部署。如果你部署了少数服务器的时候一个新的异常突然出现,这时你知道引入了一个 bug 就可以马上中断(Ctrl-C)部署并且回滚之前已经部署的服务器,这样就不会影响全部的请求。因为部署的简单性,我们可以很轻易的在生产环境尝试新事物并且在它不工作的情况下也可以很轻松的还原到之前状态。这也意味着在同一时间内只执行一次部署是很有必要的,这可以保证新的错误是源自于**你**的部署而不是**其他人**的部署,从而可以很简单的知道何时以及如何恢复之前的状态。
Review comment (CACppuccino):

[从而可以很简单的知道何时以及如何恢复之前的状态。] => [从而可以很简单的知道何时以及哪里需要回滚。]


Then we hired a bunch, growing to six whole engineers, and now fit into a [somewhat larger conference room](https://redditblog.com/2011/07/06/its-time-for-us-to-pack-up-and-move-on-to-bigger-and-better-things/). We started to feel a need for better coordination around deploys, particularly when individuals were working from home. We modified the *push* tool to announce when deploys started and ended via an IRC chatbot. The bot just sat in IRC and announced events. The process for actually doing the deploy looked the same, but now the system did the work for you and told everyone what you were doing.
这是我们在部署工作流中第一次使用开始使用聊天机器人。在这段时间里,有很多管理部署系统的会话都**源自于**聊天机器人,但是因为我们使用的是第三方的 IRC 服务器,所以我们对于生产环境的控制中并不能完全信任这个聊天室,所以它仍然是单向的信息。
Review comment (CACppuccino):

[所以我们对于生产环境的控制中并不能完全信任这个聊天室] => [所以我们并不能完全信任聊天室在生产环境的控制方面起到的帮助作用]


As traffic to the site grew, so did the infrastructure supporting it. We’d occasionally have to launch a new batch of application servers and put them into service. This was still a very manual process, including updating the list of hosts in *push*.
当我们需要添加服务器容量时,我们经常会一次增加数个服务器来增大一个池。其结果是,顺序地遍历服务器列表会快速地接触同一个池中的多个服务器,而不是不同池中的服务器。
Review comment (CACppuccino):

[顺序地遍历服务器列表会快速地接触同一个池中的多个服务器] => [顺序地遍历服务器列表会以快速连续的方式接触同一个池中的多个服务器]


The previous improvements that automatically fetched the hostlist from DNS made this a natural transition. The hostlist changed a lot more often than before, but it was no different to the tool. What started out as a quality of life thing became integral to being able to launch the autoscaler.
之前所做的自动从 DNS 获取主机列表的功能使这个变成了一个很自然的过渡。主机列表的更改频率比以前更加频繁,但是这对于工具来说并没有什么不同。这个一开始只是生活质量的东西现在成为了自动伸缩的一部分。
Review comment (CACppuccino):

[这个一开始只是生活质量的东西现在成为了自动伸缩的一部分。] => [这个一开始只是为了提高效率(或“提高工程师生活质量”)的东西现在成为了自动伸缩的一部分。]
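
Pulling the host list from DNS instead of a hard-coded file is only a few lines. A sketch using the standard library (the record name is hypothetical, and the post does not say which DNS record type Reddit used):

```python
import socket

def fetch_hostlist(name: str) -> list[str]:
    """Resolve a DNS name that fans out to every app server."""
    _, _, addresses = socket.gethostbyname_ex(name)
    return sorted(addresses)

# Re-resolving on every deploy means autoscaler churn in the fleet
# is picked up automatically, with no list to edit by hand.
hostlist = fetch_hostlist("apps.example.internal")
```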


With the increased number of servers dedicated to the monolith, the deploy time grew. We wanted to deploy with a high parallel count but doing so would have caused too many simultaneous restarts to the app servers. Hence, we were below capacity and unable to serve incoming requests, overloading other app servers.
随着专用于单体应用的服务器数量增加,部署的时间也增长了。我们希望通过高并行数同时部署来解决这个问题,但是这样做会导致过多同时重新启动的应用服务器。这样我们服务器的容量就会不足,导致不足以处理接受的请求,使其它的应用服务器过载。
Review comment (CACppuccino):

[我们希望通过高并行数同时部署来解决这个问题] => [我们希望通过提高并行数量来解决这个问题]


Gunicorn’s main process used the same model as uWSGI and would restart all workers at once. While the new worker processes are booting, you are unable to serve any requests. The startup time of our monolith ranged from 10-30 seconds which meant during that period we would be unable to serve any requests. To work around this, we replaced the gunicorn master process with Stripe’s worker manager [Einhorn](https://github.com/stripe/einhorn), while [keeping gunicorn’s HTTP stack and WSGI container](https://github.com/reddit/reddit/blob/master/r2/r2/lib/einhorn.py). Einhorn restarts worker processes by spawning one new worker, waiting for it to declare itself ready, then reaping an old worker, and repeating until all are upgraded. This created a safety net and allowed us to be at capacity during a deploy.
Gunicorn 的主进程使用的是和 uWSGI 相同的模式,它将会同时重启所有的工作进程。在新的工作进程启动阶段,你都不能处理任何请求。我们单体应用的工作进程启动时间为 10-30 秒,这意味着在这段时间内,我们将无法处理任何请求。为了解决这个问题,我们用 Stripeworker 管理器 [Einhorn](https://github.com/stripe/einhorn) 取代了 gunicorn 的主进程,但是仍然[保存 gunicornHTTP 堆栈和 WSGI 容器](https://github.com/reddit/reddit/blob/master/r2/r2/lib/einhorn.py)Einhorn 通过产生一个新的工作进程来重启旧的工作进程,等到这个新的进程声明已经准备好处理请求之后关闭老的工作进程,重复前面的步骤直到全部服务器都升级好。这样创建了一个安全的网络可以让我们在部署期间仍能保证服务器的容量。
Review comment (CACppuccino):

[我们单体应用的工作进程启动时间为 10-30 秒] => [我们整体应用的工作进程启动时间为 10-30 秒]


This new model introduced a different problem. As mentioned earlier, it could take up to 30 seconds for a worker to be replaced and booted up. This meant that if your code had a bug, it wouldn’t surface right away and you could roll through a lot of servers. To prevent that, we introduced a way to block the deploy from moving on to another server until all the worker process had been restarted. This was done by simply polling einhorn’s state and waiting until all new workers are ready. To keep up speed, we just increased the parallelism, which was now safe to do.
这个新模式引入了另一个问题。如前所述,一个工作进程可能需要长达 30 秒的时间来替换和启动。这意味着如果你的代码有一个 bug,它将不会立刻显露出来而且你继续会在很多服务器上做版本变更。为了防止这种情况,我们引入了一个部署方式,部署过程会阻塞一直到工作进程已经被重启才在另一个服务器上开始部署。这是通过简单的定时查询 einhorn 状态,一直到所有的新工作进程都准备好。为了保持部署的速度,我们只是增加了并行量,至少现在看这样做是安全的。
Review comment (CACppuccino):

[部署过程会阻塞一直到工作进程已经被重启才在另一个服务器上开始部署] => [部署过程会被阻止,直到所有工作进程已经被重启,才会在另一个服务器上开始部署] (The original is fine; a light edit for readability.)
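
Gating the deploy on worker readiness might look like the following sketch (ready_worker_count is a hypothetical stand-in for however the real tool polled Einhorn's state):

```python
import time

def ready_worker_count() -> int:
    """Stand-in: the real deploy tool polled Einhorn for the number of
    new workers that had declared themselves ready."""
    return 4

def wait_until_all_ready(expected: int, interval: float = 1.0,
                         timeout: float = 60.0) -> bool:
    """Block the rollout on this host until every new worker is ready."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if ready_worker_count() >= expected:
            return True   # safe to move on to the next server
        time.sleep(interval)
    return False          # timed out: halt the deploy and investigate

# Only proceed once this host's workers are all serving again; with
# this gate in place, raising the parallel count was safe.
assert wait_until_all_ready(expected=4)
```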

steinliber (Contributor, Author):

I've made the changes suggested in the proofreading @sqrthree. Many thanks to @zaraguo and @CACppuccino for proofreading 🙏🙏

linhe0x0 merged commit 9a61f83 into xitu:master on Jun 21, 2017
linhe0x0 (Member):

@steinliber Merged~ Please publish it to the Juejin column soon and send me the link so your points can be added promptly.
