[OPIK-38] Deduplicate items before inserting them in a dataset #340
Conversation
Very nice work. Nice abstraction in compute_content_hash, making it reusable across different object types, and thanks for adding tests for it!
As discussed, this approach:
- Doesn't handle concurrent scenarios properly.
- Adds round trips that affect scalability.
However, we agreed to compromise on those points and that it's good enough.
Overall the implementation is heading in the right direction, but one bug has slipped through the cracks.
I recommend a follow-up PR to fix it, as it's low-hanging fruit.
self._hash_to_id: Dict[str, str] = {}
self._id_to_hash: Dict[str, str] = {}
Minor: there's no need to use two dictionaries to track this; a single set of the seen hashes is enough to detect duplicated items. As it stands, it just makes the logic a bit less straightforward and uses more space.
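For illustration, a minimal sketch of the set-based alternative (the class and method names here are illustrative, not the SDK's):

```python
from typing import Set


class DeduplicationTracker:
    """Minimal sketch: a single set of content hashes is enough to detect duplicates."""

    def __init__(self) -> None:
        self._seen_hashes: Set[str] = set()

    def is_duplicate(self, item_hash: str) -> bool:
        """Return True if this content hash was seen before; otherwise record it."""
        if item_hash in self._seen_hashes:
            return True
        self._seen_hashes.add(item_hash)
        return False
```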
# Remove duplicates if they already exist
deduplicated_items = []
for item in items:
    item_hash = utils.compute_content_hash(item)
Minor: you don't need to compute a hash for this; you could compare the content payloads directly, assuming Python equality works correctly for this particular payload. The hashing approach is valid anyway; it's just a matter of space vs. computation.
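A minimal sketch of the equality-based variant, assuming the payload is a plain dict (illustrative helper, not the SDK's code):

```python
from typing import Any, Dict, List


def deduplicate_by_payload(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Sketch: deduplicate by comparing payloads directly via Python equality.

    Dicts are not hashable, so seen payloads are kept in a list and membership
    checks are linear per item: more computation, but no hashing and no extra
    hash storage, which is the space-vs-computation trade-off mentioned above.
    """
    seen: List[Dict[str, Any]] = []
    unique: List[Dict[str, Any]] = []
    for item in items:
        if item in seen:  # relies on dict equality being well-defined for the payload
            continue
        seen.append(item)
        unique.append(item)
    return unique
```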
if item_hash in self._hash_to_id:
    if item.id is None or self._hash_to_id[item_hash] == item.id:  # type: ignore
There's a bug here: two items with the same payload but different IDs aren't deduplicated. Not a big deal, but it should be fixed in a follow-up PR.
def _sync_hashes(self) -> None:
    """Updates all the hashes in the dataset"""
    LOGGER.debug("Start hash sync in dataset")
    all_items = self.get_all_items()
Minor: this call seems to have worked well per your testing, which is great because it's internally paginated. In the future it might consume too much memory, but it's fine for now and there's no action point yet.
    assert len(inserted_items) == 1, "Only one item should be inserted"


def test_insert_deduplication_with_different_items():
There's a missing test for the case of the same payload with a different item id.
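A rough sketch of what that test could look like, mirroring the assertion style of the existing tests; the dataset fixture and the item shape are assumptions, and per the bug noted above this test would currently fail:

```python
def test_insert_deduplication__same_payload_different_ids__only_one_item_inserted(
    dataset,  # assumed fixture providing an empty dataset, as in the other tests
):
    payload = {"input": {"question": "What is the capital of France?"}}

    dataset.insert([{**payload, "id": "id-1"}])
    dataset.insert([{**payload, "id": "id-2"}])

    inserted_items = dataset.get_all_items()
    assert len(inserted_items) == 1, "Only one item should be inserted"
```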
# Remove duplicates if they already exist
deduplicated_items = []
for item in items:
    item_hash = utils.compute_content_hash(item)

    if item_hash in self._hash_to_id:
        if item.id is None or self._hash_to_id[item_hash] == item.id:  # type: ignore
            LOGGER.debug(
                "Duplicate item found with hash: %s - ignored the event",
                item_hash,
            )
            continue

    deduplicated_items.append(item)
    self._hash_to_id[item_hash] = item.id  # type: ignore
    self._id_to_hash[item.id] = item_hash  # type: ignore
I would suggest extracting this code into a separate method. This would slightly reduce the method's complexity (McCabe/Cyclomatic Complexity) and improve testability by allowing you to test just the deduplication logic without needing to mock the Opik client or other dependencies.
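One possible shape of that extraction, as a sketch: a pure function that takes the tracking maps and hash function explicitly, so the deduplication logic can be unit-tested on its own (names other than those quoted from the diff are illustrative):

```python
import logging
from typing import Any, Callable, Dict, List

LOGGER = logging.getLogger(__name__)


def deduplicate_items(
    items: List[Any],
    hash_to_id: Dict[str, str],
    id_to_hash: Dict[str, str],
    compute_content_hash: Callable[[Any], str],
) -> List[Any]:
    """Return only items whose content hash is new, updating the tracking maps in place."""
    deduplicated_items = []
    for item in items:
        item_hash = compute_content_hash(item)

        if item_hash in hash_to_id:
            if item.id is None or hash_to_id[item_hash] == item.id:
                LOGGER.debug(
                    "Duplicate item found with hash: %s - ignored the event",
                    item_hash,
                )
                continue

        deduplicated_items.append(item)
        hash_to_id[item_hash] = item.id
        id_to_hash[item.id] = item_hash

    return deduplicated_items
```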
content = item

# Convert the dictionary to a JSON string with sorted keys for consistency
json_string = json.dumps(content, sort_keys=True)
I'd like to point out that if a key's value is a collection, such as an array or a set, and the order of its elements differs, the items will be considered different even though their content is nearly identical. Though it might not be relevant to our use cases.
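For illustration (the hash function below is an assumption; only the json.dumps(..., sort_keys=True) step comes from the diff): sort_keys normalizes dictionary key order but preserves element order inside lists, so payloads that differ only in list order hash differently:

```python
import hashlib
import json


def content_hash(content: dict) -> str:
    # sort_keys=True normalizes dict key order only; list element order is kept as-is.
    json_string = json.dumps(content, sort_keys=True)
    return hashlib.sha256(json_string.encode("utf-8")).hexdigest()


a = {"tags": ["red", "blue"]}
b = {"tags": ["blue", "red"]}

assert content_hash(a) != content_hash(b)  # nearly identical content, different hashes
```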
* Implemented deduplication
* Added documentation
* Added update unit test
* Add support for delete method
* Fix linter
* Fix python 3.8 tests
Details
The SDK has been updated so that items are not duplicated if they already exist in the dataset. This assumes that only one user is updating a dataset at a time; we can move the hash_sync function to be called before each insert if we want to resolve this edge case, but it would come at a small performance cost.
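If that edge case ever needs addressing, the trade-off could look roughly like this sketch (only _sync_hashes is a name from this PR; the rest of the class, including _send_to_backend, is illustrative):

```python
class Dataset:
    def _sync_hashes(self) -> None:
        """Refresh the local hash state from the backend (as in this PR)."""
        ...

    def _deduplicate_items(self, items: list) -> list:
        ...

    def insert(self, items: list) -> None:
        # Stricter variant: re-sync before every insert so concurrent writers are
        # tolerated, at the cost of one extra round trip per call.
        self._sync_hashes()
        deduplicated_items = self._deduplicate_items(items)
        if deduplicated_items:
            self._send_to_backend(deduplicated_items)  # hypothetical backend call

    def _send_to_backend(self, items: list) -> None:
        ...
```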
Documentation
The documentation has been updated accordingly.