Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add embaas document extraction api endpoints #6048

Merged
merged 7 commits into from
Jun 13, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
167 changes: 167 additions & 0 deletions docs/modules/indexes/document_loaders/examples/embaas.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,167 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Embaas\n",
"[embaas](https://embaas.io) is a fully managed NLP API service that offers features like embedding generation, document text extraction, document to embeddings and more. You can choose a [variety of pre-trained models](https://embaas.io/docs/models/embeddings).\n",
"\n",
"### Prerequisites\n",
"Create a free embaas account at [https://embaas.io/register](https://embaas.io/register) and generate an [API key](https://embaas.io/dashboard/api-keys)\n",
"\n",
"### Document Text Extraction API\n",
"The document text extraction API allows you to extract the text from a given document. The API supports a variety of document formats, including PDF, mp3, mp4 and more. For a full list of supported formats, check out the API docs (link below)."
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Set API key\n",
"embaas_api_key = \"YOUR_API_KEY\"\n",
"# or set environment variable\n",
"os.environ[\"EMBAAS_API_KEY\"] = \"YOUR_API_KEY\""
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "markdown",
"source": [
"#### Using a blob (bytes)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from langchain.document_loaders.embaas import EmbaasBlobLoader\n",
"from langchain.document_loaders.blob_loaders import Blob"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"blob_loader = EmbaasBlobLoader()\n",
"blob = Blob.from_path(\"example.pdf\")\n",
"documents = blob_loader.load(blob)"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# You can also directly create embeddings with your preferred embeddings model\n",
"blob_loader = EmbaasBlobLoader(params={\"model\": \"e5-large-v2\", \"should_embed\": True})\n",
"blob = Blob.from_path(\"example.pdf\")\n",
"documents = blob_loader.load(blob)\n",
"\n",
"print(documents[0][\"metadata\"][\"embedding\"])"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-06-12T22:19:48.366886Z",
"end_time": "2023-06-12T22:19:48.380467Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"#### Using a file"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"from langchain.document_loaders.embaas import EmbaasLoader"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"file_loader = EmbaasLoader(file_path=\"example.pdf\")\n",
"documents = file_loader.load()"
],
"metadata": {
"collapsed": false
}
},
{
"cell_type": "code",
"execution_count": 15,
"outputs": [],
"source": [
"# Disable automatic text splitting\n",
"file_loader = EmbaasLoader(file_path=\"example.mp3\", params={\"should_chunk\": False})\n",
"documents = file_loader.load()"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-06-12T22:24:31.880857Z",
"end_time": "2023-06-12T22:24:31.894665Z"
}
}
},
{
"cell_type": "markdown",
"source": [
"For more detailed information about the embaas document text extraction API, please refer to [the official embaas API documentation](https://embaas.io/api-reference)."
],
"metadata": {
"collapsed": false
}
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.6"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
93 changes: 39 additions & 54 deletions docs/modules/models/text_embedding/examples/embaas.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -2,142 +2,127 @@
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embaas\n",
"\n",
"[embaas](https://embaas.io) is a fully managed NLP API service that offers features like embedding generation, document text extraction, document to embeddings and more. You can choose a [variety of pre-trained models](https://embaas.io/docs/models/embeddings).\n",
"\n",
"In this tutorial, we will show you how to use the embaas Embeddings API to generate embeddings for a given text.\n",
"\n",
"### Prerequisites\n",
"Create your free embaas account at [https://embaas.io/register](https://embaas.io/register) and generate an [API key](https://embaas.io/dashboard/api-keys)."
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set API key\n",
"embaas_api_key = \"YOUR_API_KEY\"\n",
"# or set environment variable\n",
"os.environ[\"EMBAAS_API_KEY\"] = \"YOUR_API_KEY\""
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain.embeddings import EmbaasEmbeddings"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"embeddings = EmbaasEmbeddings()"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-10T11:17:55.940265Z",
"start_time": "2023-06-10T11:17:55.938517Z"
}
},
"outputs": [],
"source": [
"# Create embeddings for a single document\n",
"doc_text = \"This is a test document.\"\n",
"doc_text_embedding = embeddings.embed_query(doc_text)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-06-10T11:17:55.938517Z",
"end_time": "2023-06-10T11:17:55.940265Z"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print created embedding\n",
"print(doc_text_embedding)"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-10T11:19:25.237161Z",
"start_time": "2023-06-10T11:19:25.235320Z"
}
},
"outputs": [],
"source": [
"# Create embeddings for multiple documents\n",
"doc_texts = [\"This is a test document.\", \"This is another test document.\"]\n",
"doc_texts_embeddings = embeddings.embed_documents(doc_texts)"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-06-10T11:19:25.235320Z",
"end_time": "2023-06-10T11:19:25.237161Z"
}
}
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Print created embeddings\n",
"for i, doc_text_embedding in enumerate(doc_texts_embeddings):\n",
" print(f\"Embedding for document {i + 1}: {doc_text_embedding}\")"
],
"metadata": {
"collapsed": false
}
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2023-06-10T11:22:26.139769Z",
"start_time": "2023-06-10T11:22:26.138357Z"
}
},
"outputs": [],
"source": [
"# Using a different model and/or custom instruction\n",
"embeddings = EmbaasEmbeddings(model=\"instructor-large\", instruction=\"Represent the Wikipedia document for retrieval\")"
],
"metadata": {
"collapsed": false,
"ExecuteTime": {
"start_time": "2023-06-10T11:22:26.138357Z",
"end_time": "2023-06-10T11:22:26.139769Z"
}
}
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more detailed information about the embaas Embeddings API, please refer to [the official embaas API documentation](https://embaas.io/api-reference)."
],
"metadata": {
"collapsed": false
}
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
Expand All @@ -155,5 +140,5 @@
}
},
"nbformat": 4,
"nbformat_minor": 0
"nbformat_minor": 1
}
3 changes: 3 additions & 0 deletions langchain/document_loaders/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,7 @@
OutlookMessageLoader,
UnstructuredEmailLoader,
)
from langchain.document_loaders.embaas import EmbaasBlobLoader, EmbaasLoader
from langchain.document_loaders.epub import UnstructuredEPubLoader
from langchain.document_loaders.evernote import EverNoteLoader
from langchain.document_loaders.excel import UnstructuredExcelLoader
Expand Down Expand Up @@ -250,4 +251,6 @@
"WikipediaLoader",
"YoutubeLoader",
"SnowflakeLoader",
"EmbaasLoader",
"EmbaasBlobLoader",
]
Loading