Update text splitter docs (#5424)
jacoblee93 authored May 16, 2024
1 parent 141efd5 commit 06045b9
Showing 6 changed files with 295 additions and 168 deletions.
2 changes: 2 additions & 0 deletions docs/core_docs/.gitignore
@@ -59,6 +59,8 @@ docs/how_to/structured_output.md
docs/how_to/structured_output.mdx
docs/how_to/streaming.md
docs/how_to/streaming.mdx
docs/how_to/split_by_token.md
docs/how_to/split_by_token.mdx
docs/how_to/sequence.md
docs/how_to/sequence.mdx
docs/how_to/recursive_text_splitter.md
54 changes: 40 additions & 14 deletions docs/core_docs/docs/how_to/character_text_splitter.ipynb
@@ -7,19 +7,27 @@
"source": [
"# How to split by character\n",
"\n",
"This is the simplest method. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
":::info Prerequisites\n",
"\n",
"This guide assumes familiarity with the following concepts:\n",
"\n",
"- [Text splitters](/docs/concepts#text-splitters)\n",
"\n",
":::\n",
"\n",
"This is the simplest method for splitting text. This splits based on a given character sequence, which defaults to `\"\\n\\n\"`. Chunk length is measured by number of characters.\n",
"\n",
"1. How the text is split: by single character separator.\n",
"2. How the chunk size is measured: by number of characters.\n",
"\n",
"To obtain the string content directly, use `.split_text`.\n",
"To obtain the string content directly, use `.splitText()`.\n",
"\n",
"To create LangChain [Document](https://api.js.langchain.com/classes/langchain_core_documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments`."
"To create LangChain [Document](https://api.js.langchain.com/classes/langchain_core_documents.Document.html) objects (e.g., for use in downstream tasks), use `.createDocuments()`."
]
},
{
"cell_type": "code",
"execution_count": 2,
"execution_count": 1,
"id": "313fb032",
"metadata": {},
"outputs": [
@@ -35,16 +43,18 @@
}
],
"source": [
"import { CharacterTextSplitter } from \"@langchain/textsplitters\"\n",
"import { CharacterTextSplitter } from \"@langchain/textsplitters\";\n",
"import * as fs from \"node:fs\";\n",
"\n",
"// Load an example document\n",
"const stateOfTheUnion = await Deno.readTextFile(\"../../../../examples/state_of_the_union.txt\");\n",
"const rawData = fs.readFileSync(\"../../../../examples/state_of_the_union.txt\");\n",
"const stateOfTheUnion = rawData.toString();\n",
"\n",
"const textSplitter = new CharacterTextSplitter({\n",
" separator: \"\\n\\n\",\n",
" chunkSize: 1000,\n",
" chunkOverlap: 200,\n",
"})\n",
"});\n",
"const texts = await textSplitter.createDocuments([stateOfTheUnion]);\n",
"console.log(texts[0])"
]
@@ -54,12 +64,12 @@
"id": "dadcb9d6",
"metadata": {},
"source": [
"Use `.createDocuments` to propagate metadata associated with each document to the output chunks:"
"You can also propagate metadata associated with each document to the output chunks:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 2,
"id": "1affda60",
"metadata": {},
"outputs": [
@@ -75,10 +85,12 @@
}
],
"source": [
"const metadatas = [{ document: 1 }, { document: 2 }]\n",
"const metadatas = [{ document: 1 }, { document: 2 }];\n",
"\n",
"const documents = await textSplitter.createDocuments(\n",
" [stateOfTheUnion, stateOfTheUnion], metadatas\n",
")\n",
"\n",
"console.log(documents[0])"
]
},
@@ -87,12 +99,12 @@
"id": "ee080e12-6f44-4311-b1ef-302520a41d66",
"metadata": {},
"source": [
"Use `.splitText` to obtain the string content directly:"
"To obtain the string content directly, use `.splitText()`:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"execution_count": 3,
"id": "2a830a9f",
"metadata": {},
"outputs": [
@@ -102,13 +114,27 @@
"\u001b[32m\"Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and th\"\u001b[39m... 839 more characters"
]
},
"execution_count": 6,
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"(await textSplitter.splitText(stateOfTheUnion))[0]"
"const chunks = await textSplitter.splitText(stateOfTheUnion);\n",
"\n",
"chunks[0];"
]
},
{
"cell_type": "markdown",
"id": "cd4dd67a",
"metadata": {},
"source": [
"## Next steps\n",
"\n",
"You've now learned a method for splitting text by character.\n",
"\n",
"Next, check out a [more advanced way of splitting by character](/docs/how_to/recursive_text_splitter), or the [full tutorial on retrieval-augmented generation](/docs/tutorials/rag)."
]
}
],
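
For readers of this diff, the behavior the updated notebook describes (split on a separator, measure chunks by character count, carry some overlap between chunks) can be sketched without the library. This is a simplified illustration under stated assumptions: `splitByCharacter` and `SplitOptions` are hypothetical names, the merge strategy is a plain greedy one, and it is not the `CharacterTextSplitter` implementation from `@langchain/textsplitters`, which may differ in edge cases.

```typescript
// Hypothetical helper for illustration only; NOT the library's implementation.
interface SplitOptions {
  separator: string; // string to split on, e.g. "\n\n"
  chunkSize: number; // maximum chunk length, in characters
  chunkOverlap: number; // maximum characters carried over between chunks
}

function splitByCharacter(text: string, opts: SplitOptions): string[] {
  const { separator, chunkSize, chunkOverlap } = opts;
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }

  // Length of `parts` once joined back together with the separator.
  const joinedLength = (parts: string[]): number =>
    parts.reduce((n, p) => n + p.length, 0) +
    separator.length * Math.max(parts.length - 1, 0);

  const pieces = text.split(separator).filter((p) => p.length > 0);
  const chunks: string[] = [];
  let current: string[] = [];

  for (const piece of pieces) {
    if (current.length > 0 && joinedLength([...current, piece]) > chunkSize) {
      // Adding this piece would exceed chunkSize: emit the current chunk.
      chunks.push(current.join(separator));
      // Drop leading pieces until the carried-over tail fits in chunkOverlap
      // and still leaves room for the new piece.
      while (
        current.length > 0 &&
        (joinedLength(current) > chunkOverlap ||
          joinedLength([...current, piece]) > chunkSize)
      ) {
        current.shift();
      }
    }
    // A single piece longer than chunkSize is kept whole in this sketch.
    current.push(piece);
  }
  if (current.length > 0) {
    chunks.push(current.join(separator));
  }
  return chunks;
}

// Usage: four 4-character paragraphs, chunks of at most 10 characters.
const sampleText = "aaaa\n\nbbbb\n\ncccc\n\ndddd";
const sampleChunks = splitByCharacter(sampleText, {
  separator: "\n\n",
  chunkSize: 10,
  chunkOverlap: 4,
});
console.log(sampleChunks);
// → [ "aaaa\n\nbbbb", "bbbb\n\ncccc", "cccc\n\ndddd" ]
```

Because the split happens only at separator boundaries, overlap is carried in whole pieces: with `chunkOverlap: 4`, each chunk repeats the last 4-character paragraph of the previous one, and with `chunkOverlap: 0` the chunks are disjoint.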