feat: Add custom embedder #2236
base: master
Conversation
```python
    self,
    inputCol=None,
    outputCol=None,
    useTRTFlag=None,
```
nit: `useTRTFlag` -> `runtime`: "cpu", "gpu", "tensorrt"; default "cpu"
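To make the suggestion concrete, here is a minimal sketch of validating such a `runtime` parameter; the `VALID_RUNTIMES` tuple and `validate_runtime` helper are hypothetical names, not part of the PR:

```python
# Hypothetical helper illustrating a runtime parameter with the three
# suggested values and a default of "cpu".
VALID_RUNTIMES = ("cpu", "gpu", "tensorrt")

def validate_runtime(runtime: str = "cpu") -> str:
    """Return the runtime if it is one of the supported values."""
    if runtime not in VALID_RUNTIMES:
        raise ValueError(f"runtime must be one of {VALID_RUNTIMES}, got {runtime!r}")
    return runtime
```

In a Spark ML transformer this would typically live behind a `Param` plus a `self._setDefault(runtime="cpu")` call rather than a free function.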
```python
# Define additional parameters
useTRT = Param(Params._dummy(), "useTRT", "True if use TRT acceleration")
driverOnly = Param(
```
nit: remove the `driverOnly` code
```python
inputCol="combined",
outputCol="embeddings",
```
look at other examples in the library of proper defaults for these columns
```python
for batch_size in [64, 32, 16, 8, 4, 2, 1]:
    for sentence_length in [20, 300, 512]:
        yield (batch_size, sentence_length)
```
make these magic numbers parameters with defaults
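One way to apply this, keeping the old literals as the defaults; the function name `batch_configs` is illustrative, not from the PR:

```python
def batch_configs(batch_sizes=(64, 32, 16, 8, 4, 2, 1),
                  sentence_lengths=(20, 300, 512)):
    # Defaults reproduce the previously hard-coded values; callers can override.
    for batch_size in batch_sizes:
        for sentence_length in sentence_lengths:
            yield (batch_size, sentence_length)
```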
```python
"""
Create a data loader with synthetic data using Faker.
"""
faker = Faker()
```
nit: let's try to remove this dependency
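A possible stdlib-only replacement: build synthetic sentences from a small word list with `random` instead of Faker. The word list and helper name here are made up purely for illustration:

```python
import random

# Tiny stand-in vocabulary; any handful of words works for synthetic input.
_WORDS = ["food", "review", "tasty", "bland", "service", "fresh", "price", "quality"]

def fake_sentences(n, words_per_sentence=8, seed=42):
    """Return n deterministic pseudo-random sentences, no Faker required."""
    rng = random.Random(seed)
    return [
        " ".join(rng.choice(_WORDS) for _ in range(words_per_sentence)).capitalize() + "."
        for _ in range(n)
    ]
```

Because the generator is seeded, the synthetic data is reproducible across runs, which Faker does not give you for free.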
```python
    for sentence_length in [20, 300, 512]:
        yield (batch_size, sentence_length)


def get_dataloader(repeat_times: int = 2):
```
nit: _get_dataloader
```python
    func, dataloader=tqdm(get_dataloader(), total=total_batches), config=conf
)


def run_on_driver(self, queries, spark):
```
likewise, prefix with `_`
```python
    """
    return self._defaultCopy(extra)


def load_data_food_reviews(self, spark, path=None, limit=1000):
```
move this code into the demo
```python
class SuppressLogging:
    def __init__(self):
        self._original_stderr = None

    def start(self):
        """Start suppressing logging by redirecting sys.stderr to /dev/null."""
        if self._original_stderr is None:
            self._original_stderr = sys.stderr
            sys.stderr = open('/dev/null', 'w')

    def stop(self):
        """Stop suppressing logging and restore sys.stderr."""
        if self._original_stderr is not None:
            sys.stderr.close()
            sys.stderr = self._original_stderr
            self._original_stderr = None
```
remove
```python
    FloatType,
)


class EmbeddingTransformer(Transformer, HasInputCol, HasOutputCol):
```
nit: HuggingFaceSentenceEmbedder
Also name the file HuggingFaceSentenceEmbedder.py
```python
modelName="intfloat/e5-large-v2",
moduleName="e5-large-v2",
```
nit: no defaults here, and try to make this `moduleName` thing go away
```python
    Initialize the EmbeddingTransformer with input/output columns and optional TRT flag.
    """
    super(EmbeddingTransformer, self).__init__()
    self._setDefault(
```
try it on some other models from: https://sbert.net/docs/sentence_transformer/pretrained_models.html
tools/init_scripts/init_retriever.sh (Outdated)
```shell
/databricks/python/bin/pip install --extra-index-url https://pypi.nvidia.com cudf-cu11~=${RAPIDS_VERSION} cuml-cu11~=${RAPIDS_VERSION} pylibraft-cu11~=${RAPIDS_VERSION} rmm-cu11~=${RAPIDS_VERSION}

# install model navigator
/databricks/python/bin/pip install --extra-index-url https://pypi.nvidia.com onnxruntime-gpu==1.16.3 "tensorrt==9.3.0.post12.dev1" "triton-model-navigator<1" "sentence_transformers~=2.2.2" "faker" "urllib3<2"
```
nit: remove faker
```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import batch_to_device
from pyspark.ml.functions import predict_batch_udf
from faker import Faker
```
can we remove this dep as previously discussed? You can add a little fake data passage and replicate it if you need to.
```python
logging.getLogger("sentence_transformers.SentenceTransformer").setLevel(logging.ERROR)
mlflow.autolog(disable=True)

# Record the start time
start_time = datetime.now()

print(f"Demo started")
```
probably don't need this stuff for a minimal demo
```python
warnings.filterwarnings("ignore", category=UserWarning, module="tritonclient.grpc")
import logging

logging.getLogger("py4j").setLevel(logging.ERROR)
```
do you need this line?
```python
number_of_input_rows = 999
# Shuffle the DataFrame with a fixed seed
seed = 42

# Check if the row count is less than 10
if number_of_input_rows <= 0 or number_of_input_rows >= 1000000:
    raise ValueError(f"Limit is {number_of_input_rows}, which should be less than 1M.")

if number_of_input_rows > 1000:

    # Cross-join the DataFrame with itself to create n x n pairs for string concatenation (synthetic data)
    cross_joined_df = df.crossJoin(
        df.withColumnRenamed("combined", "combined_")
    )

    # Create a new column 'result_vector' by concatenating the two source vectors
    tmp_df = cross_joined_df.withColumn(
        "result_vector",
        F.concat(F.col("combined"), F.lit(". \n"), F.col("combined_")),
    )

    # Select only the necessary columns and show the result
    tmp_df = tmp_df.select("result_vector")
    df = tmp_df.withColumnRenamed("result_vector", "combined").withColumn(
        "id", monotonically_increasing_id()
    )

df = df.limit(number_of_input_rows).orderBy(rand(seed)).repartition(10).cache()

print(f"Loaded: {number_of_input_rows} rows")
```
we probably can remove the cross-join stuff for the demo; I would rather use a large dataset and subset it than a small dataset and augment it
```python
# dataTransformer = HuggingFaceSentenceEmbedder(modelName="intfloat/e5-large-v2", inputCol="combined", outputCol="embeddings", runtime="tensorrt")
dataTransformer = HuggingFaceSentenceEmbedder(modelName="sentence-transformers/all-MiniLM-L6-v2", inputCol="combined", outputCol="embeddings", runtime="tensorrt")
```
nit: dataTransformer -> embedder
```python
# dataTransformer = HuggingFaceSentenceEmbedder(modelName="intfloat/e5-large-v2", inputCol="combined", outputCol="embeddings", runtime="tensorrt")
dataTransformer = HuggingFaceSentenceEmbedder(modelName="sentence-transformers/all-MiniLM-L6-v2", inputCol="combined", outputCol="embeddings", runtime="tensorrt")

all_embeddings = dataTransformer.transform(df).cache()
```
nit: all_embeddings -> embeddings
```python
queries = ["desserts", "disgusting"]
ids = [1, 2]

# Combine the data into a list of tuples
data = list(zip(ids, queries))

# Define the schema for the DataFrame
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("query", StringType(), nullable=False)
])

# Create the DataFrame
qDf = spark.createDataFrame(data, schema)
```
you can probably make this smaller by saying:

```python
test_data = spark.createDataFrame([("desserts", 1), ("disgusting", 2)], ["query", "id"])
```
```python
qDf = spark.createDataFrame(data, schema)

# queryTransformer = HuggingFaceSentenceEmbedder(modelName="intfloat/e5-large-v2", inputCol="query", outputCol="embeddings", runtime="cpu")
queryTransformer = HuggingFaceSentenceEmbedder(modelName="sentence-transformers/all-MiniLM-L6-v2", inputCol="query", outputCol="embeddings", runtime="cpu")
```
nit: use the embedder you already made above
```python
# dataTransformer = HuggingFaceSentenceEmbedder(modelName="intfloat/e5-large-v2", inputCol="combined", outputCol="embeddings", runtime="tensorrt")
dataTransformer = HuggingFaceSentenceEmbedder(modelName="sentence-transformers/all-MiniLM-L6-v2", inputCol="combined", outputCol="embeddings", runtime="tensorrt")
```
nit: make the model name a param and pass it in; give people a few models to try
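For instance, the demo could expose a few candidate models near the top. The list below combines the model already used in the PR with two entries from the sbert pretrained-models page linked earlier; the variable names are illustrative, not from the PR:

```python
# Candidate sentence-transformers models for users to try; pick one below.
CANDIDATE_MODELS = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2",
    "intfloat/e5-large-v2",
]

model_name = CANDIDATE_MODELS[0]  # swap the index to try a different model
```

The embedder construction further down would then take `modelName=model_name` instead of a hard-coded string.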
```python
rapids_knn = ApproximateNearestNeighbors(k=5)
rapids_knn.setInputCol("embeddings").setIdCol("id")

rapids_knn_model = rapids_knn.fit(all_embeddings.select("id", "embeddings"))
```
nit: you can make this a single statement with parentheses and dot chaining
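The fluent shape the comment has in mind, shown against a tiny stand-in class. The real `ApproximateNearestNeighbors` comes from spark-rapids-ml; this stub exists only so the chaining pattern, where each setter returns `self`, can run standalone:

```python
class ApproximateNearestNeighbors:
    # Minimal stub mimicking the spark-rapids-ml builder API for illustration.
    def __init__(self, k):
        self.k = k

    def setInputCol(self, col):
        self.input_col = col
        return self  # returning self is what enables dot chaining

    def setIdCol(self, col):
        self.id_col = col
        return self

# Single statement with parentheses and dot chaining, as suggested:
rapids_knn = (
    ApproximateNearestNeighbors(k=5)
    .setInputCol("embeddings")
    .setIdCol("id")
)
```

With the real class one would append `.fit(all_embeddings.select("id", "embeddings"))` to the same chain.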
## Step 6: Find top k Nearest Neighbors

We will use the fast IVF-Flat ANN algorithm from RAPIDS
let's link to the page explaining this algorithm
```python
print(f"Demo finished")

# Record the end time
end_time = datetime.now()

# Calculate the duration
duration = end_time - start_time

# Optionally, display the duration in seconds
duration_in_seconds = duration.total_seconds()
print(f"Application duration: {duration_in_seconds:.2f} seconds")
```
don't worry about timing for the demo; instead add a markdown cell with your timing results, maybe up top or down here. If you want the takeaway to be that this is ultra fast, here's where you can show people the results
/azp run

Azure Pipelines successfully started running 1 pipeline(s).
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@            Coverage Diff             @@
##           master    #2236      +/-   ##
==========================================
+ Coverage   84.43%   85.36%   +0.92%
==========================================
  Files         327      327
  Lines       16715    16742      +27
  Branches     1495     1509      +14
==========================================
+ Hits        14114    14291     +177
+ Misses       2601     2451     -150
```

View full report in Codecov by Sentry.
Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Briefly describe the changes included in this Pull Request.

How is this patch tested?

Does this PR change any dependencies?

Does this PR add a new feature? If so, have you added samples on website?

- Add samples to the `website/docs/documentation` folder.
- Make sure you choose the correct class (`estimators`/`transformers`) and namespace.
- Make sure the `DocTable` points to the correct API link.
- Run `yarn run start` to make sure the website renders correctly.
- Add `<!--pytest-codeblocks:cont-->` before each python code block to enable auto-tests for python samples.
- Make sure the `WebsiteSamplesTests` job passes in the pipeline.