<?xml version="1.0" encoding="UTF-8"?> <rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"> <channel> <title>Charles Xu</title> <description>Essays, books, wiki on technologies, career, markets, and more.</description> <link>/</link> <atom:link href="/feed.xml" rel="self" type="application/rss+xml"/> <pubDate>Sun, 12 May 2024 21:10:40 +0000</pubDate> <lastBuildDate>Sun, 12 May 2024 21:10:40 +0000</lastBuildDate> <generator>Jekyll v4.3.2</generator> <item> <title>Kube-proxy and mysterious DNS timeout</title> <description><p>This post reviews how iptables-mode kube-proxy works, why some DNS requests to <code class="language-plaintext highlighter-rouge">kube-dns</code> were blackholed, and how to mitigate the issue.</p> <h3 id="background-how-kube-proxy-works">Background: How kube-proxy works</h3> <p>The kube-dns <code class="language-plaintext highlighter-rouge">Service</code> uses a label selector to select all CoreDNS Pods. The <code class="language-plaintext highlighter-rouge">Service</code> has a <code class="language-plaintext highlighter-rouge">ClusterIP</code>. Requests to such <code class="language-plaintext highlighter-rouge">ClusterIP</code> will be DNAT-ed to one of the CoreDNS Pod IPs. The DNAT is performed by kube-proxy, which runs as a DaemonSet. Kube-proxy is not a real proxy (data plane) but configures the <code class="language-plaintext highlighter-rouge">iptables</code> rules and <code class="language-plaintext highlighter-rouge">conntrack</code> tables on the Node to implement the DNAT.</p> <p>DNS is primarily over UDP. Although UDP is a connectionless protocol, kube-proxy still uses <code class="language-plaintext highlighter-rouge">conntrack</code> for UDP to remember the NAT translations applied to each pair of the source and destination IP addresses and ports, ensuring that responses can be correctly routed back to the originating Pod.</p> <p>When the CoreDNS Deployment had a rolling restart, new CoreDNS Pods had new IPs, and old CoreDNS Pods were removed so their IPs became stale. Thus, kube-proxy needs to update the Node’s <code class="language-plaintext highlighter-rouge">iptables</code> rules and <code class="language-plaintext highlighter-rouge">conntrack</code> tables.</p> <h3 id="why-some-dns-requests-were-blackholed">Why Some DNS Requests Were Blackholed</h3> <p>The concurrent restart of the kube-proxy DaemonSet and the CoreDNS Deployment created a race condition. Any DaemonSet, including kube-proxy, can not perform surge upgrade (i.e. bring up a new Pod on the same Node then remove the old Pod), because a DaemonSet guarantees there will be at most one kube-proxy Pod on a Node. Thus, when kube-proxy was in upgrade, Kubelet must terminate the existing kube-proxy Pod, then start a new one. In between the delete-then-create, there may be new CoreDNS Pods coming up and old CoreDNS Pods removed. This creates two problems:</p> <ol> <li> <p>Until the new kube-proxy Pod was up and ensured iptables and conntrack were up to date, traffic to CoreDNS might be routed to a stale Pod IP that no longer exists (the IP of a deleted CoreDNS Pod). Pod IP works as secondary (alias) IP ranges on the Node. The destination Node’s iptables will just drop the packet if no matching Pod IP.</p> </li> <li> <p>In some cases, new kube-proxy Pod cannot remove stale rules. 
Some kube-proxy Pod’s log showed (reformatted to be more readable):</p> </li> </ol> <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 </pre></td><td class="rouge-code"><pre>"Failed to delete endpoint connections" error deleting conntrack entries for udp peer {172.20.0.10, 10.6.30.154}, conntrack command returned: conntrack v1.4.4 (conntrack-tools): Operation failed: such conntrack doesn't exist udp 17 2 src=10.6.12.242 dst=172.20.0.10 sport=42451 dport=53 src=10.6.30.154 dst=10.6.12.242 sport=53 dport=42451 mark=0 use=1 udp 17 28 src=10.6.5.121 dst=172.20.0.10 sport=53669 dport=53 src=10.6.30.154 dst=10.6.5.121 sport=53 dport=53669 mark=0 use=1 udp 17 1 src=10.6.10.175 dst=172.20.0.10 sport=36264 dport=53 src=10.6.30.154 dst=10.6.10.175 sport=53 dport=36264 mark=0 use=1 error message: exit status 1 servicePortName="kube-system/kube-dns:dns" </pre></td></tr></tbody></table></code></pre></div></div> <p>The source code that produces such an error message is <a href="https://github.com/kubernetes/kubernetes/blob/v1.28.9/pkg/proxy/conntrack/conntrack.go#L100-L103">here</a> and has the comment below. The “TODO” is still on the main branch of Kubernetes.</p> <div class="language-go highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 </pre></td><td class="rouge-code"><pre><span class="c">// TODO: Better handling for deletion failure.</span> <span class="c">// When failure occur, stale udp connection may not get flushed.</span> <span class="c">// These stale udp connection will keep black hole traffic.</span> <span class="c">// Making this a best effort operation for now, since it</span> <span class="c">// is expensive to baby sit all udp connections to kubernetes services.</span> <span class="k">return</span> <span class="n">fmt</span><span class="o">.</span><span class="n">Errorf</span><span class="p">(</span><span class="s">"error deleting conntrack entries for udp peer {%s, %s}, error: %v"</span><span class="p">,</span> <span class="n">origin</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">err</span><span class="p">)</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>Unfortunately, <code class="language-plaintext highlighter-rouge">conntrack</code> has no log files, and the kube-proxy log is not verbose enough to provide more insight into <code class="language-plaintext highlighter-rouge">conntrack</code>.</p> <p>Once kube-proxy got an error from <code class="language-plaintext highlighter-rouge">conntrack</code>, it does not retry, as shown in the source code (<a href="https://github.com/kubernetes/kubernetes/blob/44bd04c0cbddde69aaeb7a90d3bd3de4e417f27f/pkg/proxy/conntrack/cleanup.go#L97">1</a>, <a href="https://github.com/kubernetes/kubernetes/blob/v1.28.9/pkg/proxy/conntrack/conntrack.go#L95">2</a>, <a href="https://github.com/kubernetes/kubernetes/blob/v1.28.9/pkg/proxy/conntrack/conntrack.go#L61">3</a>).</p> <h3 id="mitigations">Mitigations</h3> <p><strong>Cordon and drain all affected Nodes</strong>. Find affected Notes by searching for logs <code class="language-plaintext highlighter-rouge">Failed to delete endpoint connections</code>. 
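If you cannot drain a Node right away, a stopgap worth trying is to flush the stale UDP conntrack entries for the kube-dns ClusterIP by hand, so that subsequent DNS packets are DNAT-ed using the current iptables rules. A minimal sketch, assuming the kube-dns ClusterIP is 172.20.0.10 as in the log above and that conntrack-tools is installed on the Node:</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># List UDP conntrack entries whose original destination is the kube-dns ClusterIP
sudo conntrack -L -p udp --orig-dst 172.20.0.10

# Delete them; new DNS requests then create fresh entries against the current CoreDNS Pod IPs
sudo conntrack -D -p udp --orig-dst 172.20.0.10
</code></pre></div></div> <p>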
To assist forensic analysis, prevent the node from being deleted by cluster-autoscaler by annotating the node with “cluster-autoscaler.kubernetes.io/scale-down-disabled=true” (<a href="https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-prevent-cluster-autoscaler-from-scaling-down-a-particular-node">doc</a>).</p> <p><strong>Run kube-proxy with verbose level 4 to get more details about UDP connections</strong>, such as why conntrack exited 1. See https://github.com/kubernetes/kubernetes/pull/95694</p> <p><strong>Per-node DNS monitoring</strong>. Deploy the <a href="https://github.com/kubernetes/node-problem-detector">node-problem-detector</a> as a DaemonSet on every cluster. Build a plugin for DNS monitoring. A bonus is that this agent becomes a generalized framework for node-local issue detection. For example, you may want to cover machine-learning use cases such as detecting bad TensorCore silicon and ECC errors.</p> <p><strong>Deploy node-local-dns</strong>. <a href="https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/">Node-local-dns</a> allows DNS lookups to skip iptables DNAT and connection tracking. Connections from the node-local caching agent to the kube-dns service are upgraded to TCP. TCP conntrack entries are removed on connection close, which reduces the tail latency attributed to dropped UDP packets. It also gives you node-level observability of DNS requests.</p> <p><em>If you run EKS, you may be self-managing kube-proxy, or use the EKS-managed one. Both options require you to manage the life cycle of kube-proxy. Consider hardening the kube-proxy lifecycle management process with:</em></p> <p><strong>Log-based metrics for kube-proxy errors</strong>. Alert the person on call about such errors.</p> <p><strong>Upgrade kube-proxy only on node pools of a new Kubernetes version</strong>. Don’t do in-place upgrades of kube-proxy. GKE and AKS consider kube-proxy a managed component similar to kubelet, and they upgrade kube-proxy only when the node version is upgraded. We should do the same.</p> <p><strong>Don’t upgrade CoreDNS and kube-proxy together</strong>. Define a strong dependency and ordering between the two.</p> </description> <pubDate>Sun, 12 May 2024 00:00:00 +0000</pubDate> <link>/kube-proxy-bug/</link> <guid isPermaLink="true">/kube-proxy-bug/</guid> <category>kubernetes</category> <category>cloud</category> <category>networking</category> <category>microservices</category> </item> <item> <title>Notes: The Lean Startup</title> <description><p>Careful planning and execution work for general management but not for startups. Perfect execution is futile if you end up building something nobody wants (waste). The real progress for startups is not how many JIRA tickets we closed but how fast we gain validated learning—what creates value for customers and what they are willing to pay for—while minimizing waste.</p> <p>Systematically break down a business plan and test each part—define a clear hypothesis (predictions about what is supposed to happen), then A/B test to prove the predictions. Two leap-of-faith assumptions:</p> <ul> <li>Value hypothesis: do customers find value</li> <li>Growth hypothesis: how new customers discover the product</li> </ul> <p>Test assumptions with an MVP that targets early adopters, in the Build-Measure-Learn loop. Any work beyond what was required to learn is waste, no matter how important it seemed. If we do not know who the customer is, we do not know what quality is. 
Even a “low-quality” MVP can act in service of building a great high-quality product. Plan the Build-Measure-Learn loop in reverse order: decide what we need to learn, then figure out what we need to measure, then see what product we need to build to experiment. Launch the MVP under a different brand name if you worry about branding risk. Commit to iteration: don’t give up hope due to bad news from MVPs, but experiment more, learn more, and maybe pivot.</p> <p>Do not blindly copy successful startups’ techniques. Charging customers from day one works for Stripe but not Facebook. Low-quality early prototypes work for Facebook but not for mission-critical industries. Always experiment to see what works in our unique circumstances.</p> <p>Eventually a successful startup will compete with fast followers. A head start is rarely large enough to matter. The only way to win is to learn faster than everyone else.</p> <p>Vanity metrics, such as the gross number of customers, are not actionable. We cannot conclude whether metrics growth is due to</p> <ol> <li>the latest product development, or</li> <li>decisions the team made earlier, meaning current initiatives have no impact.</li> </ol> <p>Use cohort-based metrics (e.g. among users who signed up in June, what percentage exhibits the behaviors we want), and use A/B tests to conclude causality. Measure the team’s productivity in units of validated learning, not the production of new features.</p> <p>Pivot: test a new fundamental hypothesis in the business model, product road map, partnership, customer segments, or engine of growth. The decision to pivot depends on data &amp; intuition. A misguided decision to persevere is value-destructive. Signs that it is time to pivot: the decreasing effectiveness of product experiments and the general feeling that product development should be more productive. A startup’s runway is the number of pivots it can still make. To extend the runway, achieve the same amount of validated learning at lower cost in shorter time. Schedule regular “pivot or persevere” meetings in advance. In pivots, don’t throw out everything and start over. Repurpose what has been built and learned.</p> <blockquote> <p>My personal thought: <br /> The search for PMF is like gradient descent, a combination of intuition and data. Gradient descent is an optimization algorithm: start at a point (your best guess), then iteratively descend in the direction of the “slope”. The process is almost mechanical, but you need to start with an intuitive guess. Pivot means finding a new starting point. You need to pivot when <br /> 1) the “slope” around the current point is flat in all directions. Nothing you do seems to improve the metrics, or <br /> 2) you are trapped in a local minimum (saturating early adopters) and want to find a better local minimum (mainstream customers).</p> </blockquote> <p>Once you have found success with early adopters, you want to sell to mainstream customers. The early-adopter market saturates quickly despite prior “up and to the right” results. Mainstream customers have different and more demanding requirements. This is a customer segment pivot. The actions we need to win mainstream customers are different from how we won early adopters. A pivot requires a new MVP. Just as lean manufacturing uses just-in-time production to reduce in-process inventory, Lean Startups practice just-in-time scalability, conducting product experiments with small batch sizes. Imagine that the letters didn’t fit in the envelopes. 
With the large-batch approach, we wouldn’t find that out until nearly the end. With small batches, we’d know almost immediately. A smaller batch size (a small diff in a product code change) means a shorter Build-Measure-Learn cycle and less WIP waste.</p> <p><b>Sustainable growth</b>: new customers come from the actions of past customers: word of mouth, product usage (wearing designer clothes), funded advertising, repeat purchase.</p> <p><b>Sticky Engine of Growth</b>: repeat usage; use the customer retention rate to test the growth hypothesis. Other metrics like activation rate and revenue per customer can test the value hypothesis but have little impact on growth. If the rate of new customer acquisition exceeds the churn rate, the product will grow.</p> <p><b>Viral Engine of Growth</b>: focus on increasing the viral coefficient. Many viral products do not charge customers but advertisers, because viral products cannot afford to have any friction impede the process of signing customers up.</p> <p><b>Paid Engine of Growth</b>: advertising, outbound sales, foot traffic. Use LTV/CAC to test the growth hypothesis. Over time, CAC is bid up by competition.</p> <p>A startup can evaluate PMF by evaluating each Build-Measure-Learn iteration using innovation accounting. Every engine is tied to a set of customers and their habits, preferences, channels, and interconnections, and thus eventually runs out of gas.</p> <p>If the boss tends to split the difference, the best way to influence the boss is to take the most extreme position. Your opponents will do the same. Over time, everyone will take the most polarized positions. Don’t split the difference. Instead, create a sandbox for innovation that will contain the impact, but not the methods, of the new innovation. It works as follows:</p> <ol> <li>Any team can create a true split-test experiment that affects only the sandboxed parts of the product or service (for a multipart product) or only certain customer segments or territories (for a new product).</li> <li>One team must see the whole experiment through from end to end.</li> <li>No experiment can run longer than a specified amount of time.</li> <li>No experiment can affect more than a specified percentage of customers.</li> <li>Every experiment has to be evaluated on the basis of a single standard report of five to ten actionable metrics.</li> <li>Every team that works inside the sandbox uses the same metrics to evaluate success.</li> <li>Any team that creates an experiment must monitor the metrics and customer reactions (support calls, social media reaction, forum threads, etc.) while the experiment is in progress and abort it if something catastrophic happens.</li> </ol> <p>If you like notes like this, check out my <a href="/bookshelf">bookshelf</a>.</p> </description> <pubDate>Sun, 21 Jan 2024 00:00:00 +0000</pubDate> <link>/lean-startup/</link> <guid isPermaLink="true">/lean-startup/</guid> <category>startup</category> </item> <item> <title>Notes: Venture Deals</title> <description><p><b>Before Fundraise</b>: Allow a minimum of three to six months to raise money. Have a clean cut from your last job to avoid IP disputes. Prepare the data site (Certificate of Incorporation, Bylaws, board minutes, cap table, customer list, product roadmap, org chart, employment agreements, budgets, financial statements). Some VC deals fail to close because of one missing IP assignment agreement. If you want money, ask for advice. Develop relationships with VCs before fundraising. Research &amp; make the outreach personal. 
Find mentors, not fundraising advisors asking for a cut.</p> <p><b>During Fundraise</b>: Do not email a teaser, hoping to share more in meeting. Whatever you send a VC may be your last, so send the full yet concise pitch. The presentation is to communicate the same info in the executive summary but with more examples and visuals. Aim for 10 slides or fewer. Offer a prototype or demo that VC can interact. Demo is the best way to show your vision. Watch VC reactions during demo. Did their eyes light up? Do they understand the domain? Raise enough to get to the next milestone, plus buffer. Hire experienced startup lawyer with fee cap. Don’t let VC talk you out of your lawyer choice. After you’ve had a second meeting, ask what the process is going forward. Get multiple term sheets to create competition. If a VC passes, insist on feedback, which improves your next pitch.</p> <p><b>Decision Maker</b>: Have a lead investor representing the entire syndicate. If party round, set up a special-purpose limited partnership, not to chase down 75 signatures. At every VC, find out &amp; talk to the decision makers. Reference check the VC: founders who went through hard times, like fired as CEO, learn how the VC handled tough situations. Add more to data site, but watch out for busywork that associates assigns, and the risk of leaking to a competing portco. Again, make sure decision makers are involved.</p> <p><b>Option Pool</b>: Typical size of early-stage company option pool is 10% to 20%. Smaller pools for later stage. “We have enough options to cover our needs. If we need to expand the pool before the next financing, we will provide full antidilution protection for you.” VCs want to minimize future dilution by enlarging the option pool up front. Founders should push back with an option budget that lists out futures hires until next financing and the option grant to land each hire.</p> <p><b>Liquidation Preference</b>: In early stage, it’s in the best interest of both VC and founder to have a simple liquidation preference and no participation. In future rounds, the terms are often inherited from the early stage terms. If the seed investor doesn’t invest in future rounds, his economics in many outcomes could be worse with participating and end up looking like the common holders (in terms of returns), since their preference amounts are so small.</p> <p><b>Pay to Play</b>: Investors must invest pro-ratably in future financings (paying) not to have their preferred stock converted to common (playing) or lose anti-dilution rights. Not a lifetime guarantee but an incentive to follow on, if other investors decide to invest in next round. Pay to play reduces liquidation preferences for the nonparticipating investors and ensures only committed investors have preferred stock. If VC pushes back, “Why? Are you not going to fund the company in the future if other investors agree to?” Avoid the pay-to-play scenario where VC has the right to force a recapitalization (e.g. financing at a $0 pre-money valuation) if fellow investors don’t play in the new round.</p> <p><b>Founder Vesting</b>: Negociate to treat vesting as a clawback with an 83(b) election. Single-trigger: accelerated vesting upon M&amp;A. Double-trigger: accelerated vesting upon M&amp;A and being fired. Balanced approach: double trigger with one-year acceleration.</p> <p><b>Anti-Dilution</b>: protect prior investors in a down round (equity issued at lower price). 
Flavors: weighted average (normal), or full ratchet (rare; reduce earlier round price to new price). Antidilution is often implemented as a price reduction for conversion to common. More exceptions in antidilution carve-outs, the better for founders.</p> <p><b>Board</b>: Be wary of observers. Question what values they bring. Often they are associates. Sometimes they disclose board topics to brag to their friends. Get a small board. Independent board members usually get stock options 0.25% to 0.5% vest over 2 to 4 years. Observers don’t get options. Instead of controlling the board, VC uses protective provisions (veto rights on certain actions). Next-round investors want protective provisions too, but founders should push for all Preferred voting as a single class, instead of each Series voting separately. VCs charge all expenses associated with board meetings to the company. Mandate frugality. Place a cap early on the percentage of directors who can be VCs (not independent). Preemptively offer observer rights to dethroned director, or establish an executive committee of the board that can meet without everyone else.</p> <p><b>Drag Along</b>: A compromise is to grant drag-along rights to the majority of the common stock, not the preferred. Preferred can convert some to common to force a majority at the cost of lowering the overall liquidation preference. IPO: preferred convert to common. Never give different automatic conversion terms for different series of preferred. Push for low threshold to conversion.</p> <p><b>Redemption</b>: Ensure dividends require board majority approval. Allow investors to sell shares back to the company for a guaranteed return. Never agree to “Adverse Change Redemption” because it is vague, punitive, and investors can act on arbitrary judgments.</p> <p><b>No-shop</b>: Do not agree to pay for legal fees until deal done. Avoid pre-financing contingence: 1. “Approval by Investors’ partnerships” means the term sheet has not been approved. 2. “Employment Agreements acceptable to investors”: review &amp; negotiate full terms (e.g. what happens on termination?) before signing term sheet. Limit the no-shop period to 30 days (worst case 60 days), automatically canceled if VC terminates. Commitment should be bidirectional. You agree not to shop deals, VC agrees to close timely. Ask for exception for acquisitions. Frequently financings and acquisitions follow each other.</p> <p><b>Registration Rights</b>: Always offered to investors. Lawyers often make innocuous edits on this section. Unnecessary. Upon IPO when the rights apply, investment bankers will restructure the deal.</p> <p><b>Right of First Refusal</b>: Define “major investor”, only give such right to them and only if they play in subsequent rounds. Enforce stock sale also transfers obligations the original owner signed up for.</p> <p><b>Co-sale Agreement</b>: This right says if a founder sells shares, investors can sell too. Hard to remove this right, but founders should ask for a floor. Why should VC hold it up if a founder just sells a small amount to pay off mortgage?</p> <p><b>SAFE</b>: Some VCs consider valuation cap as a price ceiling to the next round, so do not disclose seed-round terms until you have negociated new price. Legal fee for priced equity round has dropped, no more than SAFE.</p> <p><b>Zombie VC</b>: VC who past their investment period (usually 5 years) and did not raise a new fund. They can’t invest but still meet you. Waste time. Ask “when was your last investment” (more than a year = zombie). 
“How many more investments will you make out of the current fund?”</p> <p><b>Reserve</b>: A fund approaching end of life creates pressure for liquidity. An underreserved VC may resist new financings, limit the round size, or push for a sale of the company to limit dilution, even when more funding is right for the common holders. Pay-to-play creates more resistance in this case. Follow-on money might come from a new fund, or from a different fund vehicle (“Opportunity” funds).</p> <p><b>Corporate VC</b>: They look for more control, such as a right of first refusal on acquisition, which you should never give.</p> <p><b>Negotiation</b>: Goals: achieving a good and fair result, not killing your personal relationship. Preparation is key. Focus on valuation, option pool, liquidation preferences, board, and voting controls. Know what concessions you are OK with and when to walk away. Get to know the other side ahead of time; play to their strengths, weaknesses, biases, curiosities, and insecurities. A first-time founder has one advantage over a seasoned VC: time. They have family, LPs, and portcos to deal with. You have one company and this negotiation. Ask what the 3 most important terms are for the VC. Explain yours too. Call them out if they pound hard on minor points. Don’t make threats you can’t deliver. Don’t say who else you are pitching to. Never provide a term sheet from another VC. Don’t address deal points in order; focus on the whole picture. When told “that’s the way it is because it’s market,” probe why the market condition applies to you. Talk to other founders to get market intelligence. Push back with “Wait a minute, this term creates incentive misalignment. Let’s avoid a divisive relationship.” If stuck with terms you don’t love, next-round investors may fix them because they want your team happy and motivated. After big wins and some time together, renegotiate with existing investors for founder-friendly terms.</p> <p><b>Price</b>: A high valuation is risky because 1) VCs hold out for a higher exit (via a big preference stack, or by forbidding sales below $X), so founders can’t sell at a price they would have been happy with; 2) at a higher price, sophisticated investors demand more structure, resulting in significant outcome misalignment between early- and late-stage investors.</p> <p><b>Investment Banks</b>: Avoid them at the early stage. Hire them in an acquisition. They maximize exit value. Best source of bankers: your board members, investors, colleagues, and other senior executives you trust. Hire a banker who knows your sector, like “enterprise SaaS”.</p> <p><b>Daily Operation</b>: Hire an employment lawyer when a founder or exec leaves. Make sure equity &amp; IP are settled to protect future fundraising and acquisition. If a company used a professional valuation firm, the valuation would be assumed to be correct unless the IRS could prove otherwise. File an 83(b) election to start the clock earlier for long-term capital gains. 
Pay at least minimum-wage cash comp to full-time founders &amp; execs.</p> <p>If you like notes like this, check out my <a href="/bookshelf">bookshelf</a>.</p> </description> <pubDate>Fri, 19 Jan 2024 00:00:00 +0000</pubDate> <link>/venture-deals/</link> <guid isPermaLink="true">/venture-deals/</guid> <category>startup</category> <category>investment</category> </item> <item> <title>Scaling Istio</title> <description><p>In a large, busy cluster, how do you scale Istio to address Istio-proxy Container being OOM-Killed and Istiod crashes if too many connected istio-proxies?</p> <h3 id="istio-proxy-container-oom-killed">Istio-proxy Container OOM-Killed</h3> <h4 id="problem">Problem</h4> <p>If istio-proxy dies, Pod disconnects from the world, because istio routes the Pod’s ingress and egress through the istio-proxy container. Thus, the main application container cannot communicate with other services, and clients cannot reach the application either. This disrupts existing connections and risks cascading failure when loads shift to other replicas.</p> <p>Out-of-memory kill is #1 reason for the istio-proxy death. The istio-proxy is configured with resource limits for CPU and memory, to avoid starving other workloads sharing the k8s Node. The istio-proxy is killed once it exceeds the memory limit.</p> <p>Restarting istio-proxy won’t help: By default, Kubernetes uses the restart policy “Always” for Pods. Thus, if the istio-proxy container is OOM killed, Kubernetes will restart it. However, because the usage pattern has not changed, istio-proxy will enter OOMKilled again. This forms a crash loop and continued disruption to applications.</p> <p>To keep bumping the memory limit is expensive and whack-a-mole. Overtime, you have increased the memory limit from 256Mi to 2Gi, which is per istio-proxy container. Given tens of thousands of Pods in istio mesh cross the hundreds of clusters, it is expensive to keep raising the limit. Furthermore, many people only increase the limit is when the oncall got paged about crash-looping Pods, which already impact customer traffic.</p> <h4 id="solution">Solution</h4> <h5 id="use-sidecar-object-to-trim-unused-xds-config">Use <code class="language-plaintext highlighter-rouge">Sidecar</code> object to trim unused xDS config</h5> <p>By default, Istio programs all sidecar proxies with the configuration to reach every workload in the mesh, as well as accept traffic on all the ports associated with the workload.</p> <p>But if you have a locked down Istio mesh, and if a tenant must request for allow-listing such source namespace using some onboarding config, then the istio-proxy container does not need the full mesh config.</p> <p>The <code class="language-plaintext highlighter-rouge">Sidecar</code> API object can restrict the set of services that the proxy can reach. Adopting the Sidecar objects will reduce the number of xDS pushes and overall xDS config size. 
You could templatize the <code class="language-plaintext highlighter-rouge">Sidecar</code> objects and render them based on the per-namespace onboarding configs.</p> <p>Below is an example Sidecar, which allows istio-proxies in the namespace “observability-cortex” to egress to four other namespaces.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 </pre></td><td class="rouge-code"><pre><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">networking.istio.io/v1beta1</span> <span class="na">kind</span><span class="pi">:</span> <span class="s">Sidecar</span> <span class="na">metadata</span><span class="pi">:</span> <span class="na">name</span><span class="pi">:</span> <span class="s">default</span> <span class="na">namespace</span><span class="pi">:</span> <span class="s">myapp</span> <span class="na">spec</span><span class="pi">:</span> <span class="na">egress</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">istio-system/*"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">my-upstream-ns/*"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">kube-system/*"</span> <span class="pi">-</span> <span class="s2">"</span><span class="s">observability/*"</span> </pre></td></tr></tbody></table></code></pre></div></div> <h5 id="use-telemetry-object-to-reduce-metrics-generation">Use <code class="language-plaintext highlighter-rouge">Telemetry</code> object to reduce metrics generation</h5> <p>Istio collects and exports a wide range of Prometheus metrics. Metrics collection impacts memory usage. Istio-proxy doesn’t need to generate all metrics but only those we use. 
Consider customizing the metrics that Istio collects and exports.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 </pre></td><td class="rouge-code"><pre><span class="nn">---</span> <span class="na">apiVersion</span><span class="pi">:</span> <span class="s">telemetry.istio.io/v1alpha1</span> <span class="na">kind</span><span class="pi">:</span> <span class="s">Telemetry</span> <span class="na">metadata</span><span class="pi">:</span> <span class="na">name</span><span class="pi">:</span> <span class="s">drop-unused-metrics-and-tags</span> <span class="na">namespace</span><span class="pi">:</span> <span class="s">istio-system</span> <span class="na">spec</span><span class="pi">:</span> <span class="c1"># no selector specified, applies to all workloads in the namespace</span> <span class="na">metrics</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">providers</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">prometheus</span> <span class="na">overrides</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">match</span><span class="pi">:</span> <span class="na">metric</span><span class="pi">:</span> <span class="s">ALL_METRICS</span> <span class="na">tagOverrides</span><span class="pi">:</span> <span class="na">connection_security_policy</span><span class="pi">:</span> <span class="na">operation</span><span class="pi">:</span> <span class="s">REMOVE</span> <span class="na">destination_app</span><span class="pi">:</span> <span class="na">operation</span><span class="pi">:</span> <span class="s">REMOVE</span> <span class="na">destination_canonical_service</span><span class="pi">:</span> <span class="na">operation</span><span class="pi">:</span> <span class="s">REMOVE</span> <span class="na">destination_canonical_revision</span><span class="pi">:</span> <span class="na">operation</span><span class="pi">:</span> <span class="s">REMOVE</span> <span class="na">destination_principal</span><span class="pi">:</span> <span class="na">operation</span><span class="pi">:</span> <span class="s">REMOVE</span> <span class="s">...</span> <span class="pi">-</span> <span class="na">match</span><span class="pi">:</span> <span class="na">metric</span><span class="pi">:</span> <span class="s">REQUEST_DURATION</span> <span class="na">disabled</span><span class="pi">:</span> <span class="kc">true</span> <span class="pi">-</span> <span class="na">match</span><span class="pi">:</span> <span class="na">metric</span><span class="pi">:</span> <span class="s">REQUEST_SIZE</span> <span class="na">disabled</span><span class="pi">:</span> <span class="kc">true</span> <span class="pi">-</span> <span class="na">match</span><span class="pi">:</span> <span class="na">metric</span><span class="pi">:</span> <span class="s">RESPONSE_SIZE</span> <span class="na">disabled</span><span class="pi">:</span> <span class="kc">true</span> <span class="pi">-</span> <span class="na">match</span><span class="pi">:</span> <span class="na">metric</span><span class="pi">:</span> <span class="s">TCP_CLOSED_CONNECTIONS</span> <span class="na">disabled</span><span class="pi">:</span> <span class="kc">true</span> </pre></td></tr></tbody></table></code></pre></div></div> <h5 
id="istio-ambient-mesh">Istio Ambient Mesh</h5> <p>We can solve sidecar problems if we don’t run sidecar at all. Istio <a href="https://istio.io/latest/blog/2022/introducing-ambient-mesh/">ambient mesh</a> is a sidecar-less approach to service mesh, replacing sidecar proxies with per-node and (not always necessary) per-namespace proxies. With fewer proxies, it will save us lots of money in CPU/Memory and provide shorter latency.</p> <p>The general problems with sidecars and benefits of ambient mesh:</p> <ul> <li>Kubernetes does not have first-class support for sidecars (<a href="https://kubernetes.io/blog/2023/08/25/native-sidecar-containers/">until k8s 1.28</a>). App container might start before proxy ready, decide itself is unhealthy, and be in a restart loop. Short-lived Pods (Job) need to explicitly kill proxy for Pod to complete.</li> <li>Istio upgrade requires restarting every pod to inject newer-version Istio proxies</li> <li>Sidecar resources are underutilized</li> <li>Difficult to calculate namespace quotas (<code class="language-plaintext highlighter-rouge">ResourceQuotas</code>) because sidecars are transparent to tenants but consume namespace quotas.</li> </ul> <p>If you use Calico to enforce L4 NetworkPolicy for Pods, you might face a blocker to adopting ambient mesh because of conflicting IPTables rules that Calico owned (GitHub <a href="https://github.com/istio/istio/issues/40973">issue</a> still open). But I encourage you to do another proof of concept, because someone (GitHub <a href="https://github.com/istio/istio/issues/43871">issue</a>) used eBPF instead of IPTables to redirect traffic to ambient-mode proxies, thus working around the conflicting Calico IPTables rules.</p> <h3 id="istiod-crash-if-too-many-connected-istio-proxies">Istiod crash if too many connected istio-proxies</h3> <h4 id="problem-1">Problem</h4> <p>Istiod is the control plane of istio. All istio-proxies connect to istiod. Istiod may crash when there were too many connected istio-proxies, specifically if they all were added at the same time by a tenant workload scaling out.</p> <p>Most people run Istiod as a <code class="language-plaintext highlighter-rouge">Deployment</code> with a <code class="language-plaintext highlighter-rouge">HorizontalPodAutoscaler</code> (HPA). You could mitigate the scaling issue by setting a high minimum for HPA, but doing so leads to low resource utilization at night and weekends, at odds with the very purpose of autoscaling. Moreover, istiod is still at risk when the tenants scale out aggressively.</p> <h4 id="solution-1">Solution</h4> <h5 id="use-discoveryselectors-to-watch-in-mesh-namespaces-only">Use <code class="language-plaintext highlighter-rouge">discoverySelectors</code> to watch in-mesh Namespaces only</h5> <p>The <code class="language-plaintext highlighter-rouge">discoverySelectors</code> configuration enables us to dynamically restrict the set of namespaces that are part of the mesh. The <code class="language-plaintext highlighter-rouge">discoverySelectors</code> configuration declares what Istio control plane watches and processes. Not all tenant namespaces enable istio, so istiod could benefit from having to process less k8s events.</p> <h5 id="fine-tune-hpa">Fine-tune HPA</h5> <p>The default scale-up stabilization window is 300 seconds. 
We should reduce it to 10 seconds to be more responsive, but keep the scale-down stabilization window at 300s to avoid threshing.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 </pre></td><td class="rouge-code"><pre> apiVersion: autoscaling/v2beta2 kind: HorizontalPodAutoscaler metadata: name: istiod namespace: istio-system labels: app: istiod release: istio istio.io/rev: system install.operator.istio.io/owning-resource: unknown operator.istio.io/component: "Pilot" spec: maxReplicas: 48 <span class="gd">- minReplicas: 32 </span><span class="gi">+ minReplicas: 3 </span> scaleTargetRef: apiVersion: apps/v1 kind: Deployment name: istiod <span class="gi">+ behavior: + scaleUp: + stabilizationWindowSeconds: 10s # default is 300s + scaleDown: + stabilizationWindowSeconds: 300s </span> metrics: - type: Resource resource: name: cpu target: type: Utilization averageUtilization: 65 </pre></td></tr></tbody></table></code></pre></div></div> <h5 id="distribute-istio-proxy-connections-across-istiod-pods">Distribute istio-proxy connections across Istiod Pods</h5> <p>Istio doesn’t explicitly set a default maximum connection time between istio-proxy sidecars and istiod. Typically, the connections from the sidecars to istiod are long-lived gRPC connections used for service discovery, configuration updates, and certificate rotation, and they are expected to be maintained as long as istiod and the sidecars are running. This creates uneven distribution of loads on istiod Pods over time.</p> <p>One idea is to set a max connection idle timeout for the istio-proxy to istiod connections, so the proxy will reconnect over time, hopefully landing on a new istiod Pods.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 </pre></td><td class="rouge-code"><pre><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">networking.istio.io/v1alpha3</span> <span class="na">kind</span><span class="pi">:</span> <span class="s">EnvoyFilter</span> <span class="na">metadata</span><span class="pi">:</span> <span class="na">name</span><span class="pi">:</span> <span class="s">istio-proxy-to-istiod-timeouts</span> <span class="na">namespace</span><span class="pi">:</span> <span class="s">istio-system</span> <span class="na">spec</span><span class="pi">:</span> <span class="na">workloadSelector</span><span class="pi">:</span> <span class="na">labels</span><span class="pi">:</span> <span class="pi">{}</span> <span class="na">configPatches</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">applyTo</span><span class="pi">:</span> <span class="s">HTTP_ROUTE</span> <span class="na">match</span><span class="pi">:</span> <span class="na">context</span><span class="pi">:</span> <span class="s">SIDECAR_OUTBOUND</span> <span class="na">routeConfiguration</span><span class="pi">:</span> <span class="na">vhost</span><span class="pi">:</span> <span class="na">name</span><span class="pi">:</span> <span class="s">istiod.istio-system.svc.cluster.local:443</span> <span class="na">patch</span><span class="pi">:</span> <span class="na">operation</span><span class="pi">:</span> <span 
class="s">MERGE</span> <span class="na">value</span><span class="pi">:</span> <span class="na">typed_config</span><span class="pi">:</span> <span class="s1">'</span><span class="s">@type'</span><span class="err">:</span> <span class="s">type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager</span> <span class="s">common_http_protocol_options</span><span class="err">:</span> <span class="na">idle_timeout</span><span class="pi">:</span> <span class="s">300s</span> </pre></td></tr></tbody></table></code></pre></div></div> </description> <pubDate>Sun, 22 Oct 2023 00:00:00 +0000</pubDate> <link>/scaling-istio/</link> <guid isPermaLink="true">/scaling-istio/</guid> <category>kubernetes</category> <category>cloud</category> <category>networking</category> <category>istio</category> <category>microservices</category> </item> <item> <title>Work Around Max Count of Security Group Rules on EKS</title> <description><p>AWS EKS on VPC networks need AWS Security Group Rules (SG) to receipt ingress traffic. But what if you reach the max rules count in your SG?</p> <h3 id="background">Background</h3> <h4 id="loadbalancer-type-service-and-security-group-rules">LoadBalancer-type Service and Security Group Rules</h4> <p>Kubernetes users can expose a Service in two ways:</p> <ul> <li>Register with the Istio ingress gateways—the golden path for most tenants</li> <li>Create a dedicated LoadBalancer-type Service object, which tells the cloud provider to create a load balancer and set up health checks.</li> </ul> <p>EKS recommends <a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.5/">aws-load-balancer-controller</a> to react to updates to <code class="language-plaintext highlighter-rouge">LoadBalancer</code>-type Service objects and set up NLB accordingly. 
For example, suppose a Service object exposes ports 80 and 443, the controller will create five Security Group (SG) Rules on EKS worker Nodes:</p> <ol> <li>allow ingress source <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code> to the corresponding <a href="https://kubernetes.io/docs/concepts/services-networking/service/#type-nodeport">NodePort</a> for port <code class="language-plaintext highlighter-rouge">80</code></li> <li>allow ingress source <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code> to the corresponding NodePort for port <code class="language-plaintext highlighter-rouge">443</code></li> <li>allow EKS zonal subnet in <code class="language-plaintext highlighter-rouge">us-west-2a</code> to ingress to the health-check NodePort.</li> <li>allow EKS zonal subnet in <code class="language-plaintext highlighter-rouge">us-west-2b</code> to ingress to the health-check NodePort</li> <li>allow EKS zonal subnet in <code class="language-plaintext highlighter-rouge">us-west-2c</code> to ingress to the health-check NodePort</li> </ol> <p>Note: health-check will fail if a) the Node does not host any target Pods or b) none of the target Pods on this Node is ready, determined by the Pod’s readiness probe</p> <p>The SG Rules are added to an SG attached to all worker Nodes in the given EKS.</p> <h4 id="security-group-limits">Security Group Limits</h4> <p>For each AWS account, there are two quota limits on Security Groups:</p> <ol> <li>Max number of inbound rules per SG</li> <li>Max number of SGs per network interface</li> </ol> <p>These limits can be adjusted subject to the constraint that the product of the two quotas cannot exceed 1000 (AWS <a href="https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-limits.html#vpc-limits-security-groups">doc</a>). It means a network interface can not have more than 1000 SG rules.</p> <h3 id="problem">Problem</h3> <p>Once your EKS cluster approaches the limit of SG rules, it restricts your ability to create new load balancers. It means you won’t be able to perform blue-green upgrade of the load balancer, because you need to provision two sets of load balancers simultaneously. The lack of headroom also means you can no longer onboard more applications that requires a dedicated load balancer.</p> <h3 id="solutions">Solutions</h3> <p>The following solutions are not mutually exclusive. They can be used together.</p> <h4 id="second-dedicated-sg-for-each-node-pools">Second dedicated SG for each node pools</h4> <p>Suppose your current setup is that all worker Nodes, regardless node pool, has a shared SG attached named “worker”. The <code class="language-plaintext highlighter-rouge">aws-load-balancer-controller</code> adds new rules to the “worker” SG.</p> <p>You can keep the shared “worker” SG to store common rules but create a new SG for each node pool, and use the new SG for NLBs ingress. You need to change the node pool launch template to attach the new SG.</p> <p>If you decide to continue letting the AWS LB controller manage SG rules for us, you should tag the new SG with <code class="language-plaintext highlighter-rouge">kubernetes.io/cluster/{{ .ClusterName }}: shared</code>. This is necessary when there are multiple security groups attached to an ENI, so that the controller knows which SG to add new rules to. Because the existing “worker” SG has this tag already, we need to create a duplicate SG, say “worker2”, which does NOT have the SG tag for NLB. 
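</p> <p>A minimal sketch of creating and tagging a per-pool SG with the AWS CLI (the names, VPC ID, SG ID, and cluster name are illustrative):</p> <div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Per-pool SG that will hold the NLB ingress rules
aws ec2 create-security-group \
  --group-name pool-a-nlb \
  --description "NLB ingress rules for node pool a" \
  --vpc-id vpc-0123456789abcdef0

# The cluster tag lets aws-load-balancer-controller pick this SG among those attached to the ENI
aws ec2 create-tags \
  --resources sg-0123456789abcdef0 \
  --tags Key=kubernetes.io/cluster/my-cluster,Value=shared
</code></pre></div></div> <p>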
Then, we will attach to the node pool the “worker2” SG and the per-pool SG.</p> <h4 id="optimize-sg-rules-outside-of-aws-lb-controller">Optimize SG rules outside of aws LB controller</h4> <p>Recall the <code class="language-plaintext highlighter-rouge">aws-load-balancer-controller</code> implementation creates 5 inbound SG rules per envoy-ingress Service. We can optimize this by managing the SG rules ourselves and asking the controller to skip SG rules creation. We can reduce the need to 2 inbound SG rules per envoy-ingress Service.</p> <p>Add the <code class="language-plaintext highlighter-rouge">service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: false</code>` annotation to the LoadBalancer-type Service object. Documentation about this annotation is <a href="https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.4/guide/service/annotations/#manage-backend-sg-rules">here</a>.</p> <p><strong>Reserve 3 static NodePorts for each Service.</strong> One for NLB to health check the EKS nodes. One for frontend port 80. One for frontend port 443. You <a href="https://kubernetes.io/docs/tasks/access-application-cluster/create-external-load-balancer/#preserving-the-client-source-ip">can choose</a> a static <code class="language-plaintext highlighter-rouge">healthCheckNodePort</code> if you set <code class="language-plaintext highlighter-rouge">externalTrafficPolicy: Local</code> (which comes with the benefits to preserve source IP address). The two regular NodePorts can be static regardless.</p> <p><strong>The two regular NodePorts should be consecutive</strong>, so one SG rule can cover both. The <code class="language-plaintext highlighter-rouge">healthCheckNodePort</code> does not need to be consecutive, because the source IP range in the SG rule is different (i.e. 
only allow NLB to healthcheck the nodes).</p> <p>Consider the following example:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 </pre></td><td class="rouge-code"><pre> apiVersion: v1 kind: Service metadata: annotations: external-dns.alpha.kubernetes.io/hostname: acmecorp.com service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing service.beta.kubernetes.io/aws-load-balancer-type: external <span class="gi">+ service.beta.kubernetes.io/aws-load-balancer-manage-backend-security-group-rules: false </span> name: myapp namespace: myapp spec: externalTrafficPolicy: Local <span class="gi">+ healthCheckNodePort: 30218 </span> ports: - name: https <span class="gi">+ nodePort: 30212 </span> port: 443 protocol: TCP targetPort: 8095 - name: http <span class="gi">+ nodePort: 30213 </span> port: 80 protocol: TCP targetPort: 8089 selector: app: myapp type: LoadBalancer </pre></td></tr></tbody></table></code></pre></div></div> <p>The optimized SG rules would be:</p> <ul> <li><del style="color: #9c9c9c">allow ingress source <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code> to the corresponding NodePort for port <code class="language-plaintext highlighter-rouge">80</code></del></li> <li><del style="color: #9c9c9c">allow ingress source <code class="language-plaintext highlighter-rouge">0.0.0.0/0</code> to the corresponding NodePort for port <code class="language-plaintext highlighter-rouge">443</code></del></li> <li> <p>allow source <code class="language-plaintext highlighter-rouge">0.0.0.0/</code>0 to ingress to NodePort range from <code class="language-plaintext highlighter-rouge">30212</code> to <code class="language-plaintext highlighter-rouge">30213</code></p> </li> <li><del style="color: #9c9c9c">allow EKS zonal subnet in <code class="language-plaintext highlighter-rouge">us-west-2a</code> to ingress to the health-check NodePort</del></li> <li><del style="color: #9c9c9c">allow EKS zonal subnet in <code class="language-plaintext highlighter-rouge">us-west-2b</code> to ingress to the health-check NodePort</del></li> <li><del style="color: #9c9c9c">allow EKS zonal subnet in <code class="language-plaintext highlighter-rouge">us-west-2c</code> to ingress to the health-check NodePort</del></li> <li>allow EKS VPC network in region <code class="language-plaintext highlighter-rouge">us-west-2</code> to ingress to the health-check NodePort</li> </ul> <h4 id="raise-max-inbound-rules-per-sg-by-reducing-sg-count-per-eni">Raise max inbound rules per SG by reducing SG count per ENI</h4> <p>The solution picks a different point on the trade-off spectrum between #Inbound rules per SG and #SG per ENI.</p> <p>SG quota is set for each and whole AWS account, so any adjustment will affect other workloads in the same account. Thus, we need to verify whether the existing AWS account has ENI with max number of SGs attached already.</p> <h4 id="build-eks-clusters-in-a-separate-aws-account">Build EKS clusters in a separate AWS account</h4> <p>Building new clusters and shifting tenants over are expensive. 
Try other solutions first.</p> </description> <pubDate>Tue, 26 Sep 2023 00:00:00 +0000</pubDate> <link>/eks-sg/</link> <guid isPermaLink="true">/eks-sg/</guid> <category>kubernetes</category> <category>cloud</category> <category>networking</category> </item> <item> <title>Layer-4 Load Balancer & Zero-downtime Autoscaling and Upgrade</title> <description><p>Your Kubernetes cluster probably has a shared ingress for north-south traffic, coming from a cloud load balancer and lands on your favorite proxies like Envoy, or Istio gateways, or Nginx.</p> <p>If you</p> <ul> <li>use a LoadBalancer-type <code class="language-plaintext highlighter-rouge">Service</code> to create a Layer-4 Load Balancer fronting your Kubernetes ingress</li> <li>retain source IP address by setting <code class="language-plaintext highlighter-rouge">externalTrafficPolicy: Local</code></li> </ul> <p>Then horizontal autoscaling (scale-in) and rolling upgrade will incur some downtime for you.</p> <p>This post</p> <ul> <li>explains why there is partial disruption, and how much disruption to expect</li> <li>discusses several options to achieve zero downtime upgrade and autoscaling</li> </ul> <p>For simplicity, the rest of the doc assumes Envoy as the ingress gateway.</p> <h3 id="background">Background</h3> <h4 id="layer-4-cloud-load-balancer">Layer-4 cloud load balancer</h4> <p>The routing of traffic to Envoy is facilitated by a layer-4 (L4) cloud load balancer, known as Network Load Balancer (NLB) in AWS terminology. The <a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller">aws-load-balancer-controller</a> provisions such load balancer (LB) by watching LoadBalancer-type Service objects in Kubernetes. Each Service object opens dedicated NodePort on all Nodes in selected Envoy node pools. Traffic to Envoy will first be routed to NodePort on the Node hosting Envoy Pod, then DNAT-ed (iptables) to the Pod on the same Node, as shown in the following diagram.</p> <div style="text-align: center"> <p><img src="/assets/images/source-ip-autoscale/1.png" width="680" /></p> </div> <p>(image <a href="https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/#traffic-loss-from-load-balancers-during-rolling-updates">source</a>)</p> <p>The LB periodically check the HealthCheck NodePort. The HealthCheck NodePort will fail if</p> <ul> <li>the Node does not host any target Pods, or</li> <li>none of the target Pods on this Node is ready, determined by the Pod’s readiness probe</li> </ul> <h4 id="externaltrafficpolicy-local">externalTrafficPolicy: Local</h4> <p>Suppose the Kubernetes Service object is configured with <code class="language-plaintext highlighter-rouge">externalTrafficPolicy: Local</code>. Then, the kube-proxy directs packets exclusively to Envoy Pods residing on the same Node, even if there are other Nodes running Envoy. This setup has two benefits: one less hop (lower latency) and preserving source IP address (for allowlist or rate limiting).</p> <p>But <code class="language-plaintext highlighter-rouge">externalTrafficPolicy: Local</code> is problematic during rolling upgrades or scale-in. The reason is that traffic arriving at NodePort will be dropped by kube-proxy if node has no ready Envoy Pods. LB will keep forwarding traffic to this Node until LB detects the HealthCheck NodePort is failing. 
Then, LB will mark the Node as unhealthy.</p> <p>There is a certain delay between two key events in this setup:</p> <ul> <li>An Envoy Pod becoming NotReady (for example, if it enters the “Terminating” state during a rolling upgrade).</li> <li>The subsequent periodic health check carried out by the load balancer.</li> </ul> <p>During such delay, client traffic to this Node is blackholed.</p> <div style="text-align: center"> <p><img src="/assets/images/source-ip-autoscale/2.png" width="680" /></p> </div> <p>(image <a href="https://kubernetes.io/blog/2022/12/30/advancements-in-kubernetes-traffic-engineering/#traffic-loss-from-load-balancers-during-rolling-updates">source</a>)</p> <h3 id="partial-downtime-during-upgrade-and-autoscale-in">Partial downtime during upgrade and autoscale-in</h3> <h4 id="why-is-there-some-downtime">Why is there some downtime</h4> <p>As discussed in the previous section, client traffic to an Envoy Node is blackholed during the time between the envoy Pod on such Node enters the <code class="language-plaintext highlighter-rouge">Terminating</code> state and the LB performs the next health check. Kube-proxy will remove forwarding rules from NodePort to the Pod once the Pod enters the <code class="language-plaintext highlighter-rouge">Terminating</code> state. Kubernetes 1.24 and 1.25 considers the <code class="language-plaintext highlighter-rouge">Terminating</code> state as not ready.</p> <p>For the same reason, horizontal scale-in will also cause downtime. For a while, I was just running Envoy as a DaemonSet on a node pool that does not autoscale.</p> <h4 id="why-is-the-downtime-partial">Why is the downtime partial</h4> <p>This downtime only affects one Node at a time, because currently, Envoy DaemonSet has the following upgrade strategy:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 </pre></td><td class="rouge-code"><pre> <span class="na">updateStrategy</span><span class="pi">:</span> <span class="na">type</span><span class="pi">:</span> <span class="s">RollingUpdate</span> <span class="na">rollingUpdate</span><span class="pi">:</span> <span class="na">maxUnavailable</span><span class="pi">:</span> <span class="m">1</span> <span class="na">maxSurge</span><span class="pi">:</span> <span class="m">0</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>Thus, Kubernetes will terminate one Pod at a time, then create new Pod on the same Node. There are 6 Pods in each DaemonSet, so not all envoy Pods are down at the same time.</p> <p>The reason for <code class="language-plaintext highlighter-rouge">maxSurge: 0</code> is that envoy-ingress Pods run on host networking. It means we cannot have 2 envoy Pods running on the same Node, because they both bind to the same ports. Thus, the current update strategy is to kill a Pod, then start a new one.</p> <h4 id="why-host-networking">Why host networking</h4> <p>Running Envoy in host networking means traffic bypasses the Pod overlay network (Normally, each Pod runs in separate network namespaces). Thus, host networking reduces the overhead of network hops and encapsulation due to overlays. This results in lower latency and higher throughput.</p> <p>But how much performance gain exactly? It depends on many factors like hardware and bandwidth. 
Cilium published a <a href="https://cilium.io/blog/2021/05/11/cni-benchmark">benchmark</a> (marketing, so take it with a grain of salt) suggesting that host networking could improve throughput by 20% and latency by 25%. They didn’t say how many iptables rules (which are evaluated linearly) were on the given hosts.</p> <h4 id="how-much-downtime">How much downtime</h4> <p>After the NLB detects an unhealthy instance in its target group, it will stop creating new connections to that target. However, existing connections are not immediately terminated until the default 300s draining (deregistration) timeout expires, or until clients or Envoy send an <code class="language-plaintext highlighter-rouge">RST</code>. Thus, in the worst case, the blackhole period per Pod is 310 seconds.</p> <p>In practice, the startup time of a new Envoy Pod on the same Node will be shorter than 300s. The NLB continues health-checking the unhealthy Node, and will mark the Node as healthy again once the new Pod is ready. But for the worst-case analysis, let’s assume the blackhole period per Node is 310 seconds.</p> <p>Given 6 Nodes, the Envoy DaemonSet will exhibit a 16.7% error rate for a total of 310 * 6 seconds, which is 1860 seconds, or 31 minutes in the worst case.</p> <p>The 16.7% error rate comes from the fact that 1 of the 6 Pods is in the <code class="language-plaintext highlighter-rouge">Terminating</code> state. Still, 16.7% is an approximation, because another downside of <code class="language-plaintext highlighter-rouge">externalTrafficPolicy: Local</code> is that connections may not be distributed evenly, especially if there are long-running connections on the Terminating Pod. NLB does not support the least-connections load balancing scheme.</p> <h3 id="solutions">Solutions</h3> <h4 id="use-pod-ips-as-lb-backends">Use Pod IPs as LB backends</h4> <p>In this case, the NLB sends traffic directly to the Pods selected by the k8s Service. The benefits are:</p> <ul> <li>Eliminate the extra network hop (NodePort) through the worker Nodes</li> <li>Allow the NLB to keep sending traffic to Pods in the <code class="language-plaintext highlighter-rouge">Terminating</code> state but mark the target as Draining</li> </ul> <p>The AWS load balancer controller supports this feature natively with “NLB IP-mode”. On other clouds, you can implement such a controller yourself, watching Pod events and reconciling them with the L4 LB target groups.</p> <p>To enable IP-mode, we just need to update the Service annotations:</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 </pre></td><td class="rouge-code"><pre><span class="gd">-service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: instance </span><span class="gi">+service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip </span><span class="err"> </span># Health check the Pods directly <span class="gi">+service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http +service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9901" +service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ready </span><span class="err"> </span># NLB with IP targets by default does not pass the client source IP address, # unless we specifically configure the target group attributes.
<span class="gi">+service.beta.kubernetes.io/aws-load-balancer-target-group-attributes: preserve_client_ip.enabled=true </span></pre></td></tr></tbody></table></code></pre></div></div> <p>To achieve zero-downtime upgrades, we additionally need to configure a <code class="language-plaintext highlighter-rouge">preStop</code> hook on the envoy Pod, like below. When the Pod enters the Terminating state, k8s will execute the <code class="language-plaintext highlighter-rouge">preStop</code> hook and keep the Pod in <code class="language-plaintext highlighter-rouge">Terminating</code> until the <code class="language-plaintext highlighter-rouge">preStop</code> hook completes.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 </pre></td><td class="rouge-code"><pre><span class="c1"># We must define a longer terminationGracePeriodSeconds, which by default</span> <span class="c1"># is 30s, upon which the Pod is killed even if preStop has not completed.</span> <span class="na">terminationGracePeriodSeconds</span><span class="pi">:</span> <span class="m">305</span> <span class="na">containers</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">envoy</span> <span class="na">lifecycle</span><span class="pi">:</span> <span class="na">preStop</span><span class="pi">:</span> <span class="na">exec</span><span class="pi">:</span> <span class="c1"># The default target group attribute</span> <span class="c1"># “deregistration_delay.timeout_seconds” is 300s, configurable</span> <span class="c1"># through Service annotation.</span> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/bin/sh</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">curl -X POST http://localhost:9901/healthcheck/fail &amp;&amp; sleep </span><span class="m">300</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>By failing the envoy health check but keeping envoy running in the <code class="language-plaintext highlighter-rouge">Terminating</code> state, envoy can still process traffic. Once the NLB deems the Envoy Pod unhealthy, it halts new request routing to the Pod but maintains existing connections. Consequently, active TCP connections persist, with client requests continuing to the now-unhealthy NLB target (the Envoy Pod) until either the client or Envoy closes the connection, or until the idle timeout expires, which defaults to 300 seconds for NLB.</p> <h4 id="proxyterminatingendpoints">ProxyTerminatingEndpoints</h4> <p><a href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-network/1669-proxy-terminating-endpoints/README.md">ProxyTerminatingEndpoints</a> is a new beta feature in Kubernetes version 1.26. It is enabled by default.</p> <p>When there is a rolling update and a Node only contains terminating Pods, kube-proxy will route traffic to the terminating Pods based on their readiness. At the same time, kube-proxy will actively fail the health check NodePort if there are only terminating Pods available.
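</p> <p>Kube-proxy can make this distinction because EndpointSlices track terminating endpoints separately from ready ones. A terminating Envoy endpoint that still passes its readiness probe would look roughly like the sketch below (names, ports, and addresses are illustrative):</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  name: envoy-ingress-abc12                  # illustrative
  labels:
    kubernetes.io/service-name: envoy-ingress
addressType: IPv4
ports:
- name: https
  port: 8443
  protocol: TCP
endpoints:
- addresses: ["10.0.1.23"]                   # illustrative Pod IP
  nodeName: node-a
  conditions:
    ready: false        # terminating endpoints are never reported as ready
    serving: true       # but this one still passes its readiness probe
    terminating: true   # so kube-proxy may use it as a last resort on this Node
</code></pre></div></div> <p>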
By doing so, kube-proxy alerts the external load balancer that new connections should not be sent to that Node but will gracefully handle requests for existing connections.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 </pre></td><td class="rouge-code"><pre><span class="c1"># We must define a longer terminationGracePeriodSeconds, which by default</span> <span class="c1"># is 30s, upon which the Pod is killed even if preStop has not completed.</span> <span class="na">terminationGracePeriodSeconds</span><span class="pi">:</span> <span class="m">305</span> <span class="na">containers</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">envoy</span> <span class="na">lifecycle</span><span class="pi">:</span> <span class="na">preStop</span><span class="pi">:</span> <span class="na">exec</span><span class="pi">:</span> <span class="c1"># The default target group attribute</span> <span class="c1"># “deregistration_delay.timeout_seconds” is 300s, configurable</span> <span class="c1"># through Service annotation.</span> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/bin/sh</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">sleep </span><span class="m">300</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>Note that here we must NOT call <code class="language-plaintext highlighter-rouge">POST http://localhost:9901/healthcheck/fail</code> on Envoy, different from what NLB IP-mode needs. The reason is that <code class="language-plaintext highlighter-rouge">Terminating</code> Pods need to pass the readiness probe to continue receiving traffic, so we cannot fail the envoy health check. Since kube-proxy will actively fail the health check NodePort if there are only terminating Pods available on the Node, NLB will start the draining process.</p> <h4 id="customize-nlb-keep-host-networking">Customize NLB, keep host networking</h4> <p>Forget about the NodePort and HealthCheck NodePort opened by kube-proxy. We can create the NLB not through k8s Service object, but using infra-as-code tools such as pulumi. This bypasses the kube-proxy. The NLB will look like this</p> <table> <thead> <tr> <th style="text-align: left">NLB frontend port</th> <th style="text-align: left">Target port (=NodePort =Pod port because of host networking)</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">443</td> <td style="text-align: left">8443</td> </tr> <tr> <td style="text-align: left">80</td> <td style="text-align: left">8080</td> </tr> </tbody> </table> <p>The NLB will find all Nodes running envoy Pods using the autoscaling group for the envoy-ingress node pool. Yes, we can autoscale with solution 4.3. This setup is similar to Section 4.1.1 NLB IP-mode, except the NLB is not created by Kubernetes. 
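</p> <p>Concretely, the NLB defined outside Kubernetes could look roughly like the CloudFormation-style sketch below (the same thing can be expressed in Pulumi; resource names, ports, subnets, and the VPC reference are illustrative, and the target group would be attached to the envoy-ingress autoscaling group via its <code class="language-plaintext highlighter-rouge">TargetGroupARNs</code>):</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Resources:
  EnvoyNlb:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Type: network                       # layer-4 NLB
      Scheme: internet-facing
      Subnets: [subnet-aaa, subnet-bbb]   # illustrative
  EnvoyTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      TargetType: instance                # Nodes from the envoy-ingress autoscaling group
      Protocol: TCP
      Port: 8443                          # = Pod port, thanks to host networking
      VpcId: vpc-ccc                      # illustrative
      HealthCheckProtocol: HTTP
      HealthCheckPort: "9901"
      HealthCheckPath: /ready             # Envoy admin readiness endpoint
  HttpsListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      LoadBalancerArn: !Ref EnvoyNlb
      Protocol: TCP
      Port: 443
      DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref EnvoyTargetGroup
</code></pre></div></div> <p>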
We need the following Pod spec change.</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 </pre></td><td class="rouge-code"><pre><span class="c1"># We must define a longer terminationGracePeriodSeconds, which by default</span> <span class="c1"># is 30s, upon which the Pod is killed even if preStop has not completed.</span> <span class="na">terminationGracePeriodSeconds</span><span class="pi">:</span> <span class="m">305</span> <span class="na">containers</span><span class="pi">:</span> <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">envoy</span> <span class="na">lifecycle</span><span class="pi">:</span> <span class="na">preStop</span><span class="pi">:</span> <span class="na">exec</span><span class="pi">:</span> <span class="c1"># The default target group attribute</span> <span class="c1"># “deregistration_delay.timeout_seconds” is 300s, configurable</span> <span class="c1"># through Service annotation.</span> <span class="na">command</span><span class="pi">:</span> <span class="pi">-</span> <span class="s">/bin/sh</span> <span class="pi">-</span> <span class="s">-c</span> <span class="pi">-</span> <span class="s">curl -X POST http://localhost:9901/healthcheck/fail &amp;&amp; sleep </span><span class="m">300</span> </pre></td></tr></tbody></table></code></pre></div></div> <p>We also need to expose the “/ready” endpoint from envoy to the host. Then, we need to update the Service annotations like the following.</p> <div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><table class="rouge-table"><tbody><tr><td class="rouge-gutter gl"><pre class="lineno">1 2 3 4 </pre></td><td class="rouge-code"><pre># Health check the Pods directly through NodePort 9901 <span class="gi">+service.beta.kubernetes.io/aws-load-balancer-healthcheck-protocol: http +service.beta.kubernetes.io/aws-load-balancer-healthcheck-port: "9901" +service.beta.kubernetes.io/aws-load-balancer-healthcheck-path: /ready </span></pre></td></tr></tbody></table></code></pre></div></div> </description> <pubDate>Sun, 06 Aug 2023 00:00:00 +0000</pubDate> <link>/source-ip-autoscale/</link> <guid isPermaLink="true">/source-ip-autoscale/</guid> <category>kubernetes</category> <category>cloud</category> <category>networking</category> </item> <item> <title>Enterprise Sales</title> <description><h3 id="how-to-do-product-led-growth-and-hands-on-outbound-sales-at-the-same-time">How to do product-led growth and hands-on outbound sales at the same time?</h3> <blockquote> <p>Every PLG company eventually has to embrace enterprise.</p> <p>– Annie Pearl, Chief Product Officer at Calendly</p> </blockquote> <p>The upper limit of PLG seems to be $100M to $200M ARR (e.g. DataDog around IPO). Beyond triple-digit million ARR, you quickly saturate the market of users who buy things by pulling out their credit cards. The growth of the PLG channel naturally slows at some point.</p> <p>Most companies start with self-serve PLG and then layer in enterprise sales. This may not work well. Your entry-level product could cannibalize your enterprise product. <mark>Think hard about differentiation that justifies the premium. 
</mark> You cannot just add single-sign-on (SSO) and call it enterprise edition.</p> <p>Some common enterprise features are:</p> <table> <thead> <tr> <th style="text-align: left">Feature</th> <th style="text-align: left">Description</th> </tr> </thead> <tbody> <tr> <td style="text-align: left">Customer supports</td> <td style="text-align: left">Support is what makes or breaks the enterprise product. Your company success is defined solely by customer success. 24/7 or business hours? Email, chat, or phone support? SLA on response time?</td> </tr> <tr> <td style="text-align: left">Dedicated account manager</td> <td style="text-align: left">Evangelize best practices and champion support.</td> </tr> <tr> <td style="text-align: left">Fine-grain authorization policy</td> <td style="text-align: left">Roles and permissions.</td> </tr> <tr> <td style="text-align: left">Audit logs</td> <td style="text-align: left">User and workload identity, actions, timestamp, easy export.</td> </tr> <tr> <td style="text-align: left">Compliance</td> <td style="text-align: left">For example, FedRAMP, HIPAA, GDPR, SOC2.</td> </tr> <tr> <td style="text-align: left">High availability</td> <td style="text-align: left">For example, multi-region disaster recovery, cross-region replication, four-nine uptime SLA.</td> </tr> <tr> <td style="text-align: left">Private network</td> <td style="text-align: left">Private link, VPC peering, single-tenant deployment.</td> </tr> <tr> <td style="text-align: left">Self-host</td> <td style="text-align: left">User-managed private deployment</td> </tr> <tr> <td style="text-align: left">Data retention period</td> <td style="text-align: left">For example, DataDog allows free accounts to view metrics from last 24h</td> </tr> <tr> <td style="text-align: left">Performance</td> <td style="text-align: left">For example, run the same pytorch model 100x faster.</td> </tr> </tbody> </table> <p>Another pro tip: <mark>within minutes of self-serve signups, directly reach out to that new user</mark>, assuming they pass certain criteria about company size and title. The reason is, your product is top of mind for the user right now, who clearly has been doing research about this space. If they give you the phone number, call them directly.</p> <h3 id="when-to-layer-in-enterprise-sales">When to layer in enterprise sales?</h3> <p>Similarly, when to make the first sales hire?</p> <p>You need product-market fit (PMF) and annual recurring revenue (ARR). PMF is shown through retention, exponential and organic customer growth, and net promoter score (NPS, or how upset would you be if this product disappears? How likely would you recommend this product to your friends?). To learn more about PMF, I love this <a href="https://www.lennysnewsletter.com/p/how-to-know-if-youve-got-productmarket">post</a> by Lenny Rachitsky.</p> <p>Rule of thumb: you need at least $1M ARR to show you can sell outside of your network (friends and family).</p> <h3 id="should-your-first-sales-hire-be-head-of-sales-or-junior-sales-rep">Should your first sales hire be head of sales or junior sales rep?</h3> <p>You probably don’t yet need a head of sales because</p> <ul> <li>The target customers and the deal size remain uncertain, so it is hard to know if a candidate is the right person.</li> <li>Head of sales is expensive and often commands a structure in marketing, finance, legal, etc. 
Big overhead.</li> <li>Head of sales probably has not been down in the weeds for a while.</li> </ul> <p>You should not hire anyone junior because it takes your time to train them, such as how to write emails and pricing proposal.</p> <p>An ideal candidate would be an account executive (AE) at a hypergrowth company with great track record, or someone recently moved into front-line management who used to be an AE and who desires the opportunity to grow into the head of sales role.</p> <p>Some revealing questions to ask:</p> <ul> <li>Tell me about a deal that you lost.</li> <li>Let’s do a mock discovery call together.</li> <li>Out of the 8 SDRs on your team, where did you come up on the leaderboard? What are the folks front-running you doing differently?</li> </ul> <h3 id="when-do-you-start-outbound">When do you start outbound?</h3> <p>According to Maggie Hott, Director of Sales at Webflow, <mark>the answer is always now. The goal of cold outbound is not to close deals but to build brand awareness and educate potential customers</mark>. You almost never hit the prospects at the right time, but when they are ready to buy, prior outbounds make you part of the evaluation. Whereas marketing targets a market segment, outbounds are personalized to each prospect.</p> <p>Maggie recommends a 10-80-10 outbound strategy:</p> <ul> <li>First 10% is personalized and explains how the prospect could relate to your product. <ul> <li>For example, “Hey John, amazing talk at KubeCon last week with such great lessons for multi-cloud adoption. I am curious how your team approaches multi-cloud observability and cost attribution. All the folks I talked to are struggling with this. I think our product can really help.”</li> </ul> </li> <li>Mid 80% is repeatable product marketing.</li> <li>Last 10% is to ask for that call or meeting, and end with a playful closing.</li> </ul> <p>Lauren Schwartz, VP of Sales at Fivetran, stressed the importance of having multiple sponsors in the prospect’s organization. People change jobs and have different types of influence.</p> <h3 id="should-we-give-discounts-for-testimonials">Should we give discounts for testimonials?</h3> <p>In the early days, <mark>worry less about the contract size but focus on getting those logos</mark>. The bulk of the enterprise market is the early majority and late majority. They are almost never early adopters. They need testimony and success stories.</p> <p>Maybe you could get more testimonials through discounts, but be judicious about discounts, because people talk, and soon more customers will ask for discounts. <mark>The best reason to give discounts is to control the close date and payment structure of the deal</mark>. You can always bring up the ask for testimonials if the prospect comes back asking for more discounts.</p> <!-- Frank: don't have CS team. / Add notes from Annie Pearl --> </description> <pubDate>Sun, 01 Jan 2023 00:00:00 +0000</pubDate> <link>/sales-lessons/</link> <guid isPermaLink="true">/sales-lessons/</guid> <category>startup</category> </item> <item> <title>More Career Advices</title> <description><p><em>Make sure to check out the previous post: <a href="/advices/">Advices I wish I got at the start of my career</a>.</em></p> <h3 id="ask-for-help">Ask for help</h3> <blockquote> <p>Son, your ego is writing checks your body can’t cash.</p> <p>– Captain Tom “Stinger” Jordan, Top Gun</p> </blockquote> <p>The number one reason why senior people fail is that they do not ask for help. We are all shareholders. 
You do whatever you can to unblock yourself. It is about time to market and showing results.</p> <h3 id="promote-thought-leadership">Promote thought leadership</h3> <blockquote> <p>If you want to build a ship, don’t drum up the men to gather wood, divide the work and give orders. Instead, teach them to yearn for the vast and endless sea.</p> <p>– Antoine de Saint-Exupery, author of The Little Prince</p> </blockquote> <p>In tech, most decisions are based on influence, not hierarchy. Thought leadership is a great way to gain influence. To be a leader, you should start by acting like one. Be confident, give tech talks, and voice your opinions in planning, design reviews, and postmortems. Doing so makes you the obvious choice for the next big project or the next leadership role.</p> <p>The outcome and how you drove that outcome are both important. Be careful with your reputation.</p> <h3 id="befriend-jeff-bezos-before-he-gets-rich">Befriend Jeff Bezos before he gets rich</h3> <p>The best time to become friends with Jeff Bezos is before he becomes rich and famous. Networking does not mean you must reach upward. Invest in your peers, who are more receptive to getting to know you. Imagine you are Stripe in 2012. Rather than running into walls with F500 enterprises, you should onboard hundreds of startups, because among them are the next Airbnb and Lyft. Relationships compound. Start early.</p> <p>Do informational interviews. Ask people at other firms what they like and don’t like about their job. Always follow up to maintain weak ties, such as</p> <ul> <li>Saying hello to someone you met at a conference last year. Or asking if they’ll be attending after this year’s agenda is published.</li> <li>Sharing interesting news about your old company with a former colleague.</li> <li>Sending them news, events, or commentary related to their interests. Examples: <ul> <li>“Tom, I just read this great white paper on blockchain. I know you’d get a lot out of this.”</li> <li>“Alice, congrats on the new job. Enclosed please find a copy of the best book I’ve read on starting a new job, The First 90 Days. Call me if you want to compare notes.”</li> <li>“James, I just got an invite to a private class at this new gym but can’t go. You mentioned you love Crossfit — want my ticket?”</li> </ul> </li> </ul> <h3 id="stop-productivity-porn-bias-towards-action">Stop productivity porn. Bias towards action.</h3> <p>Watching others lift weights is not going to make you fit. Many of us spend so much time collecting books we want to read but haven’t, or studying how others slice up their days to get more done. You get more done by doing and by starting now.</p> <h3 id="develope-relationship-skills">Develop relationship skills</h3> <blockquote> <p>A major reason change efforts so often fail is that successful implementation eventually requires people to have difficult conversations … With everyone taking for granted that their own view is right, and readily assuming that others’ opposition is self-interested, progress quickly grinds to a halt. Decisions are delayed, and when finally made they are often imposed without buy-in from those who have to implement them. Relationships sour. Eventually people give up in frustration, and those driving the effort get distracted by new challenges or the next next big thing. 
The ability to manage difficult conversations effectively is foundational, then, to achieving almost any significant change.</p> <p>– Douglas Stone, author of “Difficult Conversations: How to Discuss What Matters Most”</p> </blockquote> <p>Relationship problems are at the heart of every organization. Take product managers (PM) and engineering managers (EM) for example. PM &amp; EM have overlapping scopes by definition. When you seek better scoping, you don’t have a scoping problem, you have a relationship problem.</p> <p>Here are some tips:</p> <p><strong>Be a good listener</strong>. People rarely change. People just want to be understood. Listening is more persuasive than talking. Listening fosters a reciprocal relationship. Listening is not just about paraphrasing back. Ask deep and relevant questions, take notes, and maintain eye contact. For many, listening needs to be a trained response. When you are frustrated, you are the least curious: there is so much noise in your own head that it leaves little room for what is on the other person’s mind. Learn to lean into conflicts, just like firefighters learn to run towards the fire.</p> <p><strong>Lean into conflicts</strong>. The absence of conflict is a poor measure of relationship health. Staying quiet creates resentment. You must confront, but do so with skill and preparation. It is mature to share your feelings and inquire about others’. Doing so builds trust. For example: “Bob, I feel frustrated. It seems this conversation is not getting anywhere, and I want to understand why.” Acknowledge the differences between the two parties, not who is right or better. I love this example: “Jill, you and I seem to have different preferences about when code reviews should be done. I wonder if that’s something we could talk about?”</p> <p><strong>Resolve email conflicts in person</strong>. If a conflict starts on email, it is hard to solve on email. In this case, just meet in person or pick up the phone. It is hard to communicate emotions through email: no tone, no voice, no facial expression, no body language.</p> <p><strong>Apology defuses the tension</strong>. In most conflicts, blaming does not help. Most of the time, you share part of the blame, even if just 5%. Apologizing and acknowledging the fault on your side can really defuse the tension. Apology is an underrated and underutilized skill. An apology needs to be genuine. Saying “I am sorry that you feel that way” is not genuine. Here is how to make a good apology:</p> <ol> <li>Acknowledge the harm. “I am sorry that I interrupted you in the meeting.”</li> <li>Say why it is wrong. “It was disrespectful and discouraged the full exchange of ideas.”</li> <li>Say what you <em>will</em> do next time, not what you won’t do. “I will make sure to let you finish before I chime in.”</li> <li>Ask for forgiveness. Bring cupcakes.</li> </ol> <h3 id="go-straight-to-the-job-you-want">Go straight to the job you want</h3> <p>Don’t let inertia drive you. Take some risks when you are young. No one in their 40s said they took too much risk.</p> <p>If you are unhappy with your job, move on. With every job change you make, you always wish you had made it 6 months earlier. Life is so short. Do not spend time on jobs that you do not like. You will be so productive in jobs you like. Don’t assume that you have to do this job, then get that job and then that job, and then you can do what you really want. Go direct.</p> <p>Know your alternative. 
Negotiate hard on your second best offer, then negotiate with your first choice, knowing what you can walk away towards. Pay attention to details. Be specific about equity grant date, vesting schedule, etc.</p> <h3 id="what-to-look-for-in-the-next-job-growth">What to look for in the next job: Growth</h3> <p>With the right opportunities, you can 10x your impact every decade. Because of compounding, what seems to be golden handcuffs today is dwarfed by the opportunity to accelerate growth. The exponential curve actually consists of many little S-curves. If you find yourself approaching the flattening end of the S-curve, it is time for a change.</p> </description> <pubDate>Tue, 06 Dec 2022 00:00:00 +0000</pubDate> <link>/more-advices/</link> <guid isPermaLink="true">/more-advices/</guid> <category>career</category> </item> <item> <title>Interviewing Adrien Treuille, Founder CEO of Streamlit</title> <description><p><em>Streamlit, about to raise its Series-C, was acquired by Snowflake for $800M in March 2022. In this conversation with Adrien, we chatted about OSS metrics, licenses, open-core vs freemium vs free trial, PLG vs sales motion, third party contributions, and lessons from building Streamlit. Insights belong to Adrien. Errors and omissions are my own.</em></p> <h4 id="given-streamlit-is-an-open-source-product-what-are-the-most-important-metrics-you-watch-for-while-you-build-this-product-why">Given Streamlit is an open-source product, what are the most important metrics you watch for while you build this product? Why?</h4> <p>Open-source telemetry is a gray area in the open-source world. Because the things that you’d like to track are typically not the things that like open-source projects are supposed to track like utilization. There are two kinds of utilization metrics:</p> <p><strong>Indirect measure of utilization</strong>: downloads, GitHub stars, and engagement metrics on forums (slack, stackoverflow). <br /> <strong>Direct measure of utilization</strong>: which features were used when. This is a SaaS-like approach, people don’t always like this</p> <p>Streamlit did the latter. We made it very clear when you install Streamlit that we’re going to collect the statistics, and here’s how you turn off the data collection. We wanted to be good citizens in that regard. This opt-out feature means we may not be aware of all utilization patterns. Conversely, we were able to better visibility into the Streamlit community, such as the monthly active developers and viewers.</p> <p>Active users are trailing metrics, not leading metrics. They don’t really inform product decisions but are a overall score. You brought up a good point about these metrics are more common in consumer software. Streamlit may consider itself as a B2D company, D as in developers. B2D is not too different from B2C, so I want to optimize for virality and engagement. Taking some members of the community and making them famous is a really key strategy. We were doing all that stuff like crazy.</p> <h4 id="have-you-considered-freemium-or-free-trial-what-makes-open-source-a-better-fit-for-streamlit">Have you considered freemium or free trial? What makes open source a better fit for Streamlit?</h4> <p>If you target an existing workload at companies, focus on exactly that customer set, make them as happy as possible, do better than the competition, and you might not have to open source. 
HEX is an example, which is saying like, hey, we’re gonna make this about notebook, but it’s like super annoying, so we’re going to improve on it in like, six, seven ways.</p> <p>However, for Streamlit, it was clear that we were inventing a new workload. Early adopters were usually groups working on super high-tech things that like they themselves, their processes were so wide open that they could determine everything we fashion and instrument to work perfectly. These early adopters convinced me to start a company. For example, Uber was using Streamlit to figure out where to put chargers for the electric bikes. If you’re inventing new workloads, then the strategy is you have to become universal, we just had to open source.</p> <p>Charles: It’s a very similar approach, especially in the infrastructure world, where you really have to be the de facto standard. Thus, you need to earn users’ trust so they’re willing to invest in this platform to get the kind of snowball effect rolling.</p> <p>Adrien: Exactly. It’s all like famous for being famous.</p> <h4 id="streamlit-uses-an-apache2-license-have-you-considered-mongodb-and-elastics-licensing-model-why-not">Streamlit uses an Apache2 license. Have you considered MongoDB and Elastic’s licensing model? Why not?</h4> <p>We could always transition to a model like that. Mongo is an example of a company that changed licenses. But the truth is that we never really got to a point where the Apache2 license was an issue, as we were out there trying to win the community. We did in some ways pull away from the pack of people who were doing similar things two or three years ago.</p> <p>Our next big challenge was to monetize. We had a theory for how to do so, though it was certainly not proven. We were literally onboarding our first paying customers, when snowflake approached us for acquisition. And we said no actually, because we had great term sheets from amazing investors and we had the revenue. Snowflake said, we don’t want you to figure out how to make money, because if you do, you’re gonna get way too expensive. Snowflake matched our term sheet valuation and went over a bit to catch the projected revenue. The term sheet we had was to raise $95 million, which buys years of runways, so we would have figured out the business problems along the way.</p> <h4 id="is-it-possible-to-do-both-plg-and-sales-motion-at-the-same-time">Is it possible to do both PLG and sales motion at the same time?</h4> <p>Charles: Integration is always a challenge with any acquisition. Specifically, Streamlit started as an open-source project, and it’s about to get into monetization with product lead growth, which is different from snowflake’s sales-driven model. How do we best integrate the two products together?</p> <p>Adrien: The quick answer is yes. The question is, what does that actually look like?</p> <p>I think what it looks like is perhaps less PLG. For me, true PLG looks like this: we’re gonna convince you to pay up like $1,000 a year for us, and then before you know it, you’ll be paying like a million dollars a year for us because we’re just gonna prove our worth to the entire organization and the adoption growth is bottom-up.</p> <p>Our ambition at snowflake is not to turn Snowflake sales motion into a PLG motion but to piggyback on snowflake’s unbelievably successful sales motion. What we can be is beloved by developers and be a reason why a deal cuts in Snowflake direction. This is the lower ambition. 
The higher ambition is the above plus driving a ton of credit consumption and indulgence. Snowflake and Streamlit have a lot of joint users. If the next prospect, who is doing diligence on Snowflake, asks internally that Snowflake comes with this Streamlit thing, who has heard of it? And the data science teams all say that would be awesome. This totally goes for snowflake, right? And then all of a sudden, a massive workload moves over to snowflake. That is success. Whether you call it PLG or not, I think it’s completely compatible with Snowflake’s sales motion.</p> <h4 id="how-do-you-prioritize-community-feature-requests-vs-your-product-roadmap-what-to-do-with-voluntary-and-unsolicited-contributions">How do you prioritize community feature requests vs your product roadmap? What to do with voluntary and unsolicited contributions?</h4> <p>Charles: I have this question because I get common feedback from open-source maintainers that contributions from individual community members are great, but once we accept their contributions, we have to maintain the features going forward in all future releases. By then the original contributors are gone. However, rejecting their contributions would be such a blow to their love of your products. Would you prioritize differently based on the feedback and contributions you got?</p> <p>Adrien: The good news about your story about the contributor release is that in practice, it never works that way. For all the serious open-source projects that I know of, there are no nontrivial yet drive-by contributions. The actual so-called community contributions are more things like adding a comma in README. For Streamlit, if someone wants to merge a fundamentally new feature that lets you, for example, parse URL parameters and do whatever, we would just take a look at it and say no, because it was not in our roadmap, and it wasn’t the way we would have done it. We are not gonna let people check random things into Streamlit.</p> <p>But, the thing that the community does do extremely well, which is kind of evergreen in its own way, is to provide a ton of IP around the project. For example, every StackOverflow answer and every example code in the public repos. GitHub Copilot writes fantastic Streamlit code, which is amazing. You can literally add a comment like show Yahoo stock pricing stream, and Copilot popped out a beautiful app. All those are community contributions.</p> <h4 id="what-remains-the-biggest-challenge-in-data-infra">What remains the biggest challenge in data infra?</h4> <p>Streamlit is just like one piece of a huge ecosystem of data infrastructure, all of which is changing really quickly. Whether snowflake keeps up with and leaves the pack is a question that far transcends Streamlit. We are just going to play a role in a positive direction. In many ways, the promise is still ahead of us, in the sense that the actual number of companies—that are really committed to using us and have like amazing results—wasn’t that big, but those that did got really solid results. The challenge is replicating that experience, like 10x 100x 1000x.</p> <h4 id="if-you-were-to-start-streamlit-again-what-would-you-do-differently">If you were to start Streamlit again, what would you do differently?</h4> <p>The biggest mistake I had was not hiring leaders fast enough and growing the organization’s maturity. I was worried that a hiring mistake could backfire for us, but in reality, the executives we hired worked out extremely well. 
When I interviewed them, I wished I had met them sooner because they could truly bring lots of value to the team. They really increased the execution velocity by making the organization scalable.</p> </description> <pubDate>Mon, 21 Nov 2022 00:00:00 +0000</pubDate> <link>/streamlit-interview/</link> <guid isPermaLink="true">/streamlit-interview/</guid> <category>oss</category> <category>startup</category> </item> <item> <title>Kubernetes Networking From the First Principles</title> <description><p>We go from containers and network namespaces to Pod-to-Pod, Pod-to-Service, and external-client-to-Service networking.</p> <h3 id="pods">Pods</h3> <p>Containers of the same Pod share the same Linux network namespace, isolated from the host network namespace.</p> <p>Each Pod gets a separate network namespace and is assigned a cluster-wide unique IP address from the cluster’s Pod CIDR range. Many managed Kubernetes offerings use Host-local IPAM (IP Address Management), so that each Node is first assigned a subnet of the Pod CIDR. Then, each Pod gets its IP address from the subnet of the Node it is on.</p> <p>The Kubernetes networking model requires that a container in Pod A can reach a container in Pod B, crossing network namespaces, regardless of whether Pods A and B are on the same Node or not.</p> <h4 id="pod-to-pod-on-the-same-node">Pod to Pod on the same Node</h4> <p>For each Pod, kubelet will create a VETH (Virtual Ethernet Device) pair in the host network namespace. Packets transmitted on one device in the pair are immediately received on the other device. Then kubelet will move one device of the pair into the Pod’s network namespace and rename this device to <code class="language-plaintext highlighter-rouge">eth0</code> in the Pod’s namespace; the Pod’s IP address is assigned to this <code class="language-plaintext highlighter-rouge">eth0</code> device. The VETH device that remains in the host network namespace will be connected to a software bridge <code class="language-plaintext highlighter-rouge">cbr0</code>.</p> <div style="text-align: center"> <p><img src="/assets/images/k8s-net/p1.png" width="400" /></p> </div> <h4 id="pod-to-pod-on-another-node">Pod to Pod on another Node</h4> <div style="text-align: center"> <p><img src="/assets/images/k8s-net/p2.png" width="1000" /></p> </div> <p>On <code class="language-plaintext highlighter-rouge">Node-1</code>, IP packets (whose source and destination addresses are the Pod IPs) sent by <code class="language-plaintext highlighter-rouge">pod-1</code> will be encapsulated as Ethernet frames being sent to <code class="language-plaintext highlighter-rouge">cbr0</code>.</p> <p>The <code class="language-plaintext highlighter-rouge">cbr0</code> switch has a match-all forwarding rule for all packets destined to anything but <code class="language-plaintext highlighter-rouge">Node-1</code>’s subnet of the Pod CIDR. The CNI plugin in VXLAN mode will encapsulate the Ethernet frames as UDP packets. These UDP packets’ source and destination addresses are the Node IPs.</p> <h3 id="clusterip-type-services">ClusterIP-type Services</h3> <p>Even though Pod IPs are routable, Pods (and hence the Pod IPs) are ephemeral by design. Hence, it is more reliable and recommended to use the Kubernetes <code class="language-plaintext highlighter-rouge">Service</code>, which provides a static cluster IP and load balancing over a group of Pods. 
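</p> <p>For illustration, a minimal Service selecting those Pods might look like the sketch below (the name, label, and ports are illustrative):</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: Service
metadata:
  name: service-1           # illustrative name
spec:
  type: ClusterIP           # the default Service type
  selector:
    app: backend            # selects pod-0 and pod-2 via their labels
  ports:
  - port: 80                # the port exposed on the cluster IP
    targetPort: 8080        # the port the backing Pods listen on
</code></pre></div></div> <p>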
The most basic Service type is <code class="language-plaintext highlighter-rouge">ClusterIP</code>, which represents a service available within the cluster but not exposed to the internet.</p> <p>ClusterIP-type Services are implemented using kube-proxy, which is not a real proxy (data plane) but configures iptables to capture and NAT traffic to the cluster IP of the Service.</p> <p>Below is an example of <code class="language-plaintext highlighter-rouge">pod-1</code> sending a request to <code class="language-plaintext highlighter-rouge">service-1</code> backed by <code class="language-plaintext highlighter-rouge">pod-0</code> and <code class="language-plaintext highlighter-rouge">pod-2</code>. Encapsulation details covered in the previous section are omitted from the diagram below.</p> <div style="text-align: center"> <p><img src="/assets/images/k8s-net/p3.png" width="1000" /></p> </div> <p>Kube-proxy will choose at random one of the backing Pods to serve the request, by DNAT-ing the destination IP from the Service’s cluster IP to the IP of the chosen Pod, which in this example is <code class="language-plaintext highlighter-rouge">pod-2</code>. Note that the response from <code class="language-plaintext highlighter-rouge">pod-2</code> will be SNAT-ed back to the Service’s cluster IP, so that kube-proxy remains transparent to workloads. Otherwise, <code class="language-plaintext highlighter-rouge">pod-1</code> only has connection state for <code class="language-plaintext highlighter-rouge">service-1</code>, not <code class="language-plaintext highlighter-rouge">pod-2</code>, and thus would reset the connection.</p> <h3 id="nodeport-type-services">NodePort-type Services</h3> <p>To expose a service to clients outside of the cluster, use the NodePort-type Service, which reserves the same port on each Node such that the client can access the service by hitting the NodePort on any Node.</p> <p>The example below assumes <code class="language-plaintext highlighter-rouge">externalTrafficPolicy</code> is set to <code class="language-plaintext highlighter-rouge">Cluster</code>, which means traffic can be routed to a backing Pod on a different Node. Here, the cluster-external client sends requests to the NodePort of <code class="language-plaintext highlighter-rouge">service-1</code> on <code class="language-plaintext highlighter-rouge">Node-0</code>, and kube-proxy chooses <code class="language-plaintext highlighter-rouge">pod-2</code> to serve the request.</p> <div style="text-align: center"> <p><img src="/assets/images/k8s-net/p4.png" width="1000" /></p> </div> <p>Obviously, the destination address will be DNAT-ed from <code class="language-plaintext highlighter-rouge">Node-0</code> to <code class="language-plaintext highlighter-rouge">pod-2</code>, but notice that the source address is also masqueraded, from the client IP to <code class="language-plaintext highlighter-rouge">Node-0</code>’s IP. Source-NATing is necessary, because otherwise <code class="language-plaintext highlighter-rouge">pod-2</code> would respond directly to the client, who assumes it is maintaining a connection to <code class="language-plaintext highlighter-rouge">Node-0</code>.</p> </description> <pubDate>Tue, 01 Mar 2022 00:00:00 +0000</pubDate> <link>/k8s-net/</link> <guid isPermaLink="true">/k8s-net/</guid> <category>kubernetes</category> <category>networking</category> </item> </channel> </rss>