Update text splitter docs (#5424)

langchain-ai · May 16, 2024 · 06045b9
1 parent 141efd5

Showing 6 changed files with 295 additions and 168 deletions.
diff --git a/docs/core_docs/.gitignore b/docs/core_docs/.gitignore
@@ -59,6 +59,8 @@ docs/how_to/structured_output.md
 docs/how_to/structured_output.mdx
 docs/how_to/streaming.md
 docs/how_to/streaming.mdx
+docs/how_to/split_by_token.md
+docs/how_to/split_by_token.mdx
 docs/how_to/sequence.md
 docs/how_to/sequence.mdx
 docs/how_to/recursive_text_splitter.md
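
The notebook changed in the next diff describes splitting on a single separator and measuring chunk size in characters. As a rough illustration of that idea only, here is a minimal sketch. It assumes a hypothetical helper named `splitByCharacter`, which is not part of `@langchain/textsplitters` and is not the library's implementation. It ignores `chunkOverlap` and keeps any piece longer than `chunkSize` whole, since there is only one separator to split on.

```typescript
// Hypothetical helper for illustration; NOT the CharacterTextSplitter implementation.
// Splits on one separator, then greedily merges pieces while the merged
// chunk stays within chunkSize characters. Overlap is intentionally omitted.
function splitByCharacter(
  text: string,
  separator: string = "\n\n",
  chunkSize: number = 1000,
): string[] {
  const pieces = text.split(separator).filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current = "";

  for (const piece of pieces) {
    const candidate = current === "" ? piece : current + separator + piece;
    if (current === "" || candidate.length <= chunkSize) {
      // Either start a new chunk (even if oversized) or extend the current one.
      current = candidate;
    } else {
      // Adding this piece would exceed chunkSize: close the chunk and start another.
      chunks.push(current);
      current = piece;
    }
  }
  if (current !== "") chunks.push(current);
  return chunks;
}

// Usage: three 4-character paragraphs with a 10-character budget.
const exampleChunks = splitByCharacter("aaaa\n\nbbbb\n\ncccc", "\n\n", 10);
console.log(exampleChunks); // [ "aaaa\n\nbbbb", "cccc" ]
```

The first two paragraphs plus the separator total exactly 10 characters, so they merge; adding the third would exceed the budget, so it starts a new chunk.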

diff --git a/docs/core_docs/docs/how_to/character_text_splitter.ipynb b/docs/core_docs/docs/how_to/character_text_splitter.ipynb
@@ -7,19 +7,27 @@
    "source": [
     "# How to split by character\n",
     "\n",
-    "This is the simplest method. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
+    ":::info Prerequisites\n",
+    "\n",
+    "This guide assumes familiarity with the following concepts:\n",
+    "\n",
+    "- [Text splitters](/docs/concepts#text-splitters)\n",
+    "\n",
+    ":::\n",
+    "\n",
+    "This is the simplest method for splitting text. It splits on a given character sequence, which defaults to `\"\\n\\n\"`, and measures chunk length in characters.\n",
     "\n",
     "1. How the text is split: by single character separator.\n",
     "2. How the chunk size is measured: by number of characters.\n",
     "\n",
-    "To obtain the string content directly, use `.split_text`.\n",
+    "To obtain the string content directly, use `.splitText()`.\n",
     "\n",
-    "To create LangChain [Document](https://api.js.langchain.com/classes/langchain_core_documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments`."
+    "To create LangChain [Document](https://api.js.langchain.com/classes/langchain_core_documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments()`."
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 2,
+   "execution_count": 1,
    "id": "313fb032",
    "metadata": {},
    "outputs": [
@@ -35,16 +43,18 @@
     }
    ],
    "source": [
-    "import { CharacterTextSplitter } from \"@langchain/textsplitters\"\n",
+    "import { CharacterTextSplitter } from \"@langchain/textsplitters\";\n",
+    "import * as fs from \"node:fs\";\n",
     "\n",
     "// Load an example document\n",
-    "const stateOfTheUnion = await Deno.readTextFile(\"../../../../examples/state_of_the_union.txt\");\n",
+    "const rawData = fs.readFileSync(\"../../../../examples/state_of_the_union.txt\");\n",
+    "const stateOfTheUnion = rawData.toString();\n",
     "\n",
     "const textSplitter = new CharacterTextSplitter({\n",
     "    separator: \"\\n\\n\",\n",
     "    chunkSize: 1000,\n",
     "    chunkOverlap: 200,\n",
-    "})\n",
+    "});\n",
     "const texts = await textSplitter.createDocuments([stateOfTheUnion]);\n",
     "console.log(texts[0])"
    ]
@@ -54,12 +64,12 @@
    "id": "dadcb9d6",
    "metadata": {},
    "source": [
-    "Use `.createDocuments` to propagate metadata associated with each document to the output chunks:"
+    "You can also propagate metadata associated with each document to the output chunks:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 4,
+   "execution_count": 2,
    "id": "1affda60",
    "metadata": {},
    "outputs": [
@@ -75,10 +85,12 @@
     }
    ],
    "source": [
-    "const metadatas = [{ document: 1 }, { document: 2 }]\n",
+    "const metadatas = [{ document: 1 }, { document: 2 }];\n",
+    "\n",
     "const documents = await textSplitter.createDocuments(\n",
     "    [stateOfTheUnion, stateOfTheUnion], metadatas\n",
     ")\n",
+    "\n",
     "console.log(documents[0])"
    ]
   },
@@ -87,12 +99,12 @@
    "id": "ee080e12-6f44-4311-b1ef-302520a41d66",
    "metadata": {},
    "source": [
-    "Use `.splitText` to obtain the string content directly:"
+    "To obtain the string content directly, use `.splitText()`:"
    ]
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": 3,
    "id": "2a830a9f",
    "metadata": {},
    "outputs": [
@@ -102,13 +114,27 @@
        "\u001b[32m\"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"\u001b[39m... 839 more characters"
       ]
      },
-     "execution_count": 6,
+     "execution_count": 3,
      "metadata": {},
      "output_type": "execute_result"
     }
    ],
    "source": [
-    "(await textSplitter.splitText(stateOfTheUnion))[0]"
+    "const chunks = await textSplitter.splitText(stateOfTheUnion);\n",
+    "\n",
+    "chunks[0];"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cd4dd67a",
+   "metadata": {},
+   "source": [
+    "## Next steps\n",
+    "\n",
+    "You've now learned a method for splitting text by character.\n",
+    "\n",
+    "Next, check out a [more advanced way of splitting by character](/docs/how_to/recursive_text_splitter), or the [full tutorial on retrieval-augmented generation](/docs/tutorials/rag)."
    ]
   }
  ],