Skip to content

Commit

Permalink
docs[minor]: Unstructured MD loader doc (#5489)
Browse files Browse the repository at this point in the history
* docs[minor]: Unstructured MD loader doc

* chore: lint files

* html loader

* update firecrawl loader

* fix firecrawl loader

* cr

* Add links to docs from how to index

* cr

* chore: lint files
  • Loading branch information
bracesproul authored May 21, 2024
1 parent 62275b1 commit a4e3bc4
Show file tree
Hide file tree
Showing 5 changed files with 481 additions and 4 deletions.
176 changes: 176 additions & 0 deletions docs/core_docs/docs/how_to/document_loader_html.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "0c6c50fc-15e1-4767-925a-53a37c430b9b",
"metadata": {},
"source": [
"# How to load HTML\n",
"\n",
"The HyperText Markup Language or [HTML](https://en.wikipedia.org/wiki/HTML) is the standard markup language for documents designed to be displayed in a web browser.\n",
"\n",
"This covers how to load `HTML` documents into a LangChain [Document](https://v02.api.js.langchain.com/classes/langchain_core_documents.Document.html) objects that we can use downstream.\n",
"\n",
"Parsing HTML files often requires specialized tools. Here we demonstrate parsing via [Unstructured](https://unstructured-io.github.io/unstructured/). Head over to the integrations page to find integrations with additional services, such as [FireCrawl](/docs/integrations/document_loaders/web_loaders/firecrawl).\n",
"\n",
":::info Prerequisites\n",
"\n",
"This guide assumes familiarity with the following concepts:\n",
"\n",
"- [Documents](/docs/concepts#document)\n",
"- [Document Loaders](/docs/concepts#document-loaders)\n",
"\n",
":::\n",
"\n",
"## Installation\n",
"\n",
"```{=mdx}\n",
"import Npm2Yarn from \"@theme/Npm2Yarn\"\n",
"\n",
"<Npm2Yarn>\n",
" @langchain/community\n",
"</Npm2Yarn>\n",
"```"
]
},
{
"cell_type": "markdown",
"id": "868cfb85",
"metadata": {},
"source": [
"## Setup\n",
"\n",
"Although Unstructured has an open source offering, you're still required to provide an API key to access the service. To get everything up and running, follow these two steps:\n",
"\n",
"1. Download & start the Docker container:\n",
" \n",
"```bash\n",
"docker run -p 8000:8000 -d --rm --name unstructured-api quay.io/unstructured-io/unstructured-api:latest --port 8000 --host 0.0.0.0\n",
"```\n",
"\n",
"2. Get a free API key & API URL [here](https://unstructured.io/api-key), and set it in your environment (as per the Unstructured website, it may take up to an hour to allocate your API key & URL.):\n",
"\n",
"```bash\n",
"export UNSTRUCTURED_API_KEY=\"...\"\n",
"# Replace with your `Full URL` from the email\n",
"export UNSTRUCTURED_API_URL=\"https://<ORG_NAME>-<SECRET>.api.unstructuredapp.io/general/v0/general\" \n",
"```"
]
},
{
"cell_type": "markdown",
"id": "a4d93b2e",
"metadata": {},
"source": [
"## Loading HTML with Unstructured"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "7d167ca3-c7c7-4ef0-b509-080629f0f482",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[\n",
" Document {\n",
" pageContent: 'Word of the Day',\n",
" metadata: {\n",
" category_depth: 0,\n",
" languages: [Array],\n",
" filename: 'wordoftheday.html',\n",
" filetype: 'text/html',\n",
" category: 'Title'\n",
" }\n",
" },\n",
" Document {\n",
" pageContent: ': April 10, 2023',\n",
" metadata: {\n",
" emphasized_text_contents: [Array],\n",
" emphasized_text_tags: [Array],\n",
" languages: [Array],\n",
" parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',\n",
" filename: 'wordoftheday.html',\n",
" filetype: 'text/html',\n",
" category: 'UncategorizedText'\n",
" }\n",
" },\n",
" Document {\n",
" pageContent: 'foible',\n",
" metadata: {\n",
" category_depth: 1,\n",
" languages: [Array],\n",
" parent_id: 'b845e60d85ff7d10abda4e5f9a37eec8',\n",
" filename: 'wordoftheday.html',\n",
" filetype: 'text/html',\n",
" category: 'Title'\n",
" }\n",
" },\n",
" Document {\n",
" pageContent: 'play',\n",
" metadata: {\n",
" category_depth: 0,\n",
" link_texts: [Array],\n",
" link_urls: [Array],\n",
" link_start_indexes: [Array],\n",
" languages: [Array],\n",
" filename: 'wordoftheday.html',\n",
" filetype: 'text/html',\n",
" category: 'Title'\n",
" }\n",
" },\n",
" Document {\n",
" pageContent: 'noun',\n",
" metadata: {\n",
" category_depth: 0,\n",
" emphasized_text_contents: [Array],\n",
" emphasized_text_tags: [Array],\n",
" languages: [Array],\n",
" filename: 'wordoftheday.html',\n",
" filetype: 'text/html',\n",
" category: 'Title'\n",
" }\n",
" }\n",
"]\n"
]
}
],
"source": [
"import { UnstructuredLoader } from \"@langchain/community/document_loaders/fs/unstructured\";\n",
"\n",
"const filePath = \"../../../../libs/langchain-community/src/tools/fixtures/wordoftheday.html\"\n",
"\n",
"const loader = new UnstructuredLoader(filePath, {\n",
" apiKey: process.env.UNSTRUCTURED_API_KEY,\n",
" apiUrl: process.env.UNSTRUCTURED_API_URL,\n",
"});\n",
"\n",
"const data = await loader.load()\n",
"console.log(data.slice(0, 5));"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "TypeScript",
"language": "typescript",
"name": "tslab"
},
"language_info": {
"codemirror_mode": {
"mode": "typescript",
"name": "javascript",
"typescript": true
},
"file_extension": ".ts",
"mimetype": "text/typescript",
"name": "typescript",
"version": "3.7.2"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Loading

0 comments on commit a4e3bc4

Please sign in to comment.