From bd2cf14243d2a3a0024f6aa20ab19a05e8b4c319 Mon Sep 17 00:00:00 2001
From: Ben Lower
Date: Mon, 17 Jul 2023 16:15:57 -0700
Subject: [PATCH] Update Guide for Performance (#191)

content and formatting changes

---------

Co-authored-by: Nick Heiner
---
 packages/docs/docs/guides/performance.md | 70 ++++++++++++++----------
 1 file changed, 40 insertions(+), 30 deletions(-)

diff --git a/packages/docs/docs/guides/performance.md b/packages/docs/docs/guides/performance.md
index 041bd73c1..9d5afc6e3 100644
--- a/packages/docs/docs/guides/performance.md
+++ b/packages/docs/docs/guides/performance.md
@@ -6,34 +6,36 @@ sidebar_position: 6
 :::note See Also
-- [Architecture](./architecture.mdx)
-- [AI+UI](./ai-ui.md)
+- [Deployment Architectures](./architecture.mdx) - Discusses the various options (and the trade-offs) for how to build and deploy your AI.JSX powered applications.
+- [AI+UI](./ai-ui.md) - Learn how AI.JSX can free you up from having to sweat the details of your UI and instead have the AI generate it dynamically at runtime.
 :::
-AI programming brings a new set of performance considerations. The fundamental difference is that model calls (e.g. to GPT-4) are an order of magnitude slower than traditional API calls. Generating a few sentences can take a few seconds.
+AI programming brings a new set of performance considerations. The fundamental difference is that model calls (e.g. to GPT-4) are an order of magnitude slower than traditional API calls. Generating a few sentences can take a few seconds. In this guide, we will explain the key strategies you can use to improve the performance (actual and perceived) of your apps and how AI.JSX helps you along the way.
-The key strategies are:
+## Five Strategies for AI App Performance
-- [Stream](#stream)
-- [Minimize how long the output needs to be](#minimize-how-long-the-output-needs-to-be)
-- [Avoid waterfalls / roundtrips](#avoid-waterfalls--roundtrips)
-- [Defer execution](#defer-execution)
-- [Use a faster model](#use-a-faster-model)
+The key performance strategies are:
-Fortunately, AI.JSX helps you with all of these.
+1. [Streaming Responses](#strategy-1--streaming-responses)
+1. [Minimizing Output Length](#strategy-2--minimizing-output-length)
+1. [Avoiding Waterfalls + Roundtrips](#strategy-3--avoiding-waterfalls--roundtrips)
+1. [Deferring Execution](#strategy-4--deferring-execution)
+1. [Using a Faster Model](#strategy-5--using-a-faster-model)
 :::note Performance vs. Reliability
-Aside from "stream", all these strategies trade off between performance and reliability of correctness. You'll have to find the tradeoff that works for your application. In general, we recommend starting with correctness, then trading off for performance while keeping correctness above an acceptable threshold. No one cares how fast your app is if the results are bad. :smile:
+Aside from "streaming responses", all these strategies trade off performance against the reliability of correctness. You'll have to find the trade-offs that make sense for your application. In general, we recommend starting with correctness, then making trade-offs for performance while keeping correctness at or above an acceptable threshold. No one cares how fast your app is if the results are bad. :smile:
 :::
-## Stream
+## Strategy #1: Streaming Responses
-When you do a model call, you can either wait until it's fully complete before showing the user anything, or you can stream the output as it arrives. Streaming greatly improves perceived performance, and gives the user a chance to cancel the generation if it's heading down the wrong path.
+When you make a call to an LLM, you can either wait until the model has fully completed its response before showing the user anything, or you can stream the output as it arrives from the model. Streaming greatly improves perceived performance, and gives the user a chance to cancel the generation if it's heading down the wrong path.
+:::info
 By default, AI.JSX streams responses for you.
+:::
-### From UI
+### Streaming From the UI
 ```tsx file="app.tsx"
 // React component
@@ -49,7 +51,7 @@ By default, AI.JSX streams responses for you.
 This will call the chat model and automatically stream the results into the DOM.
-### From the API
+### Streaming From the API
 ```tsx
 function MyComponent() {
@@ -70,9 +72,9 @@ const finalResult = await AI.createRenderContext().render(<MyComponent />, {
 For more detail, see [Intermediate Results](./rules-of-jsx.md#intermediate-results).
-## Minimize how long the output needs to be
+## Strategy #2: Minimizing Output Length
-A model generation takes linearly more time as the output length increases, so the shorter your output can be, the faster it'll be. However, this inherently means the model is spending less time "thinking" about your result, [which could degrade accuracy](./brand-new.md#thinking-out-loud). You'll need to balance these tradeoffs.
+A model generation takes linearly more time as the output length increases, so the shorter your output can be, the faster the response completes. However, this inherently means the model is spending less time "thinking" about your result, [which could degrade accuracy](./brand-new.md#thinking-out-loud). You'll need to balance these trade-offs.
 For example, you may have a scenario where you instruct the model to give you output in a structured format, like JSON. If you can come up with a simpler JSON format that takes fewer characters, the model will produce it faster.
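As a concrete illustration of that point, here is a small self-contained TypeScript sketch (purely illustrative; the field names and records are invented, and this is plain TypeScript rather than AI.JSX code) comparing a verbose JSON shape against a terser encoding of the same information:

```typescript
// Illustrative only: the data and key names below are made up.
// Fewer output characters means fewer tokens for the model to generate,
// which translates roughly linearly into lower latency.

// A verbose shape with long, repeated keys.
const verbose = JSON.stringify([
  { characterName: 'Ada', characterRole: 'hero', characterIsNice: true },
  { characterName: 'Mal', characterRole: 'villain', characterIsNice: false },
]);

// The same records as positional tuples: far fewer characters to generate.
const compact = JSON.stringify([
  ['Ada', 'hero', true],
  ['Mal', 'villain', false],
]);

console.log(verbose.length, compact.length);
```

If both shapes carry the same information, prompting the model for the compact one trims generation time, at the cost of a less self-describing format that your parsing code must document.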
@@ -84,9 +86,9 @@ Additionally, if you know the limit of how long you want your response to be, yo
 ```
-This forces the model to end its response within 200 tokens. If you only want short responses, this both improves your correctness, and prevents the model from droning on by producing an unwanted long response.
+This forces the model to end its response within 200 tokens. If you only want short responses, this both improves your correctness and prevents the model from droning on by producing an unwanted, lengthy response.
-## Avoid waterfalls / roundtrips
+## Strategy #3: Avoiding Waterfalls + Roundtrips
 Just like with UI engineering, waterfalls can hurt your performance. Sometimes they're unavoidable, but be mindful when introducing them.
@@ -125,17 +127,17 @@ function CharacterGenerator() {
 }
 ```
-If you can get the model to do what you want in a single shot, that'll be more performant. However, asking the model to do more at once decreases reliability. It's more robust to split your workload into simple, discrete tasks for the model. So, there are tradeoffs to balance here. You want the task size to be as complicated as the model can reliably do in a single pass, but no more complicated.
+If you can get the model to do what you want in a single shot, that'll be more performant. However, asking the model to do more at once decreases reliability. It's more robust to split your workload into simple, discrete tasks for the model. So, there are trade-offs to balance here. You want the task size to be as complicated as the model can reliably handle in a single pass, but no more complicated.
 :::note What are roundtrips?
 A roundtrip is when your client needs to make a connection to your backend. Depending on the quality of the user's network connection, this can have a big negative impact on performance. As a result, many performance strategies involve minimizing roundtrips. Any clientside app can have roundtrips (calling out to APIs, etc).
 With AI apps, we add a new type of roundtrip: calling out to a model provider (e.g. OpenAI).
-The amount of roundtrips in your logic depends on how you structure your AI.JSX program. A program with many sequential calls out to a model will have more roundtrips than one that does a single shot. (Of course, unlike traditional API calls, model calls are so slow that the client/server latency is a less important contributor to the overall performance profile.)
+The number of roundtrips in your logic depends on how you structure your AI.JSX program. A program with many sequential calls out to a model will have more roundtrips than one that does a single shot. (Of course, unlike traditional API calls, model calls are so slow that the client/server latency is a less important contributor to the overall performance profile.)
 :::
-## Defer Execution
+## Strategy #4: Deferring Execution
 AI.JSX's engine will maximally defer execution, giving you optimal parallelism. To take full advantage of this, avoid manual `render`s within a component:
 ```tsx
 function StoryGenerator(props, { render }) {
 return (
- Write a story about a hero named {heroName} and a villain named named {villainName}.{heroName} should be nice.
+ Write a story about a hero named {heroName} and a villain named {villainName}. {heroName} should be nice.
 );
 }
 ```
-If you run this example, you'll see that you have two different `heroName`s generated, because each time you include the component in the JSX, it's a new instance. To get around this, you might reach for `render` to burn the results into a string:
+If you run this example, you'll see that you have two different `heroName`s generated. This isn't what we want, but it happens because each time you include `{heroName}` in the JSX, a new instance is created. To get around this, you might be tempted to reach for `render` to get the name of our hero and burn the results into a string:
+
+:::caution
+This code illustrates an anti-pattern.
+Do not do this.
+:::
 ```tsx
 function StoryGenerator(props, { render }) {
@@ -166,7 +172,7 @@ function StoryGenerator(props, { render }) {
 return (
- Write a story about a hero named {heroName} and a villain named named {villainName}.{heroName} should be nice.
+ Write a story about a hero named {heroName} and a villain named {villainName}. {heroName} should be nice.
 );
@@ -175,7 +181,11 @@ function StoryGenerator(props, { render }) {
 This hurts parallelism, because the `StoryGenerator` component can't be returned until both those renders complete. (And, if you had the `heroName` behind a conditional, the render might be wasted.)
-Instead, use `memo`:
+:::tip
+Use AI.JSX's built-in support for memoization instead.
+:::
+
+Instead, use [`memo`](/api/modules/core_memoize#memo):
 ```tsx
 import { memo } from 'ai-jsx/core/memoize';
@@ -188,7 +198,7 @@ function StoryGenerator(props, { render }) {
 return (
- Write a story about a hero named {heroName} and a villain named named {villainName}.{heroName} should be nice.
+ Write a story about a hero named {heroName} and a villain named {villainName}. {heroName} should be nice.
 );
@@ -197,10 +207,10 @@ function StoryGenerator(props, { render }) {
 With this approach, you're still deferring execution optimally, but you also ensure that each instance of `heroName` will resolve to the same generated result.
-## Use a faster model
+## Strategy #5: Using a Faster Model
-Different models have different performance profiles. GPT-4 is slower than GPT-3.5-Turbo, for instance. But, unfortunately, the slower models tend to be more correct. So you'll have to find the tradeoff that works for you.
+Different models have different performance profiles. GPT-4 is slower than GPT-3.5-Turbo, for instance. Unfortunately, the slower models tend to be more correct. So you'll have to find the trade-off that works for your app.
 OpenAI's recommendation is to start with the most powerful model, get your app working, then move to faster models if it's possible to do so without sacrificing correctness.
-You may also want to consider different model providers; OpenAI vs. Anthropic may have different performance and uptime profiles. AI.JSX makes it easy to switch model providers for a part or whole of your app via [context](./rules-of-jsx.md#context).
+You may also want to consider different model providers; OpenAI and Anthropic, for example, may have different performance and uptime profiles. AI.JSX makes it easy to switch model providers for any individual part of your app (or for the whole app) via [context](./rules-of-jsx.md#context).
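To illustrate the kind of scoped override that makes this convenient, here is a minimal self-contained TypeScript sketch. This is not AI.JSX's actual context API; the stack mechanics and model identifiers are stand-ins meant only to show how a subtree can opt into a faster model while the rest of the app keeps the default:

```typescript
// Hypothetical sketch (not the real AI.JSX API): a stack-based "context"
// in which nested scopes can override the model provider, and the outer
// provider is restored when the scope exits.
const providerStack: string[] = ['gpt-4']; // app-wide default: most capable model

function currentProvider(): string {
  return providerStack[providerStack.length - 1];
}

function withModel<T>(model: string, body: () => T): T {
  providerStack.push(model); // override for this subtree only
  try {
    return body();
  } finally {
    providerStack.pop(); // restore the outer provider
  }
}

// The app defaults to gpt-4, but a latency-sensitive part opts into a faster model.
const outer = currentProvider();
const inner = withModel('gpt-3.5-turbo', () => currentProvider());
const restored = currentProvider();
console.log(outer, inner, restored);
```

The design point mirrors the prose above: the override is lexically scoped, so swapping providers for one component never leaks into the rest of the app.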