
[OPIK-38] Deduplicate items before inserting them in a dataset #340

Merged: 6 commits into main from jacques/sdk-deduplication on Oct 4, 2024

Conversation

jverre
Collaborator

@jverre jverre commented Oct 3, 2024

Details

The SDK has been updated so that items are not duplicated if they already exist in the dataset. This assumes that only one user is updating a dataset at a time; we could move the hash-sync call to run before each insert to resolve that edge case, but it would come at a small performance cost:

import opik

client = opik.Opik()
dataset = client.create_dataset("deduplicated_dataset")

item = {
    "input": {"question": "What is the capital of France?"},
    "expected_output": {"output": "Paris"},
}
dataset.insert([item, item])

Documentation

The documentation has been updated.

@jverre jverre requested a review from a team as a code owner October 3, 2024 18:59
Collaborator

@ferc ferc left a comment


Very nice work! The abstraction in compute_content_hash makes it reusable across different object types, and thanks for adding tests for it!

@ferc ferc merged commit e702307 into main Oct 4, 2024
21 checks passed
@ferc ferc deleted the jacques/sdk-deduplication branch October 4, 2024 11:53
Collaborator

@andrescrz andrescrz left a comment


As discussed, this approach:

  1. Doesn't handle the concurrent scenario properly.
  2. The extra round trips affect scalability.

However, we agreed to compromise on those points, and it's good enough for now.

Generally the implementation is headed in the right direction, but one bug has slipped through the cracks.

I recommend a follow-up PR to fix it, as it's low-hanging fruit.

Comment on lines +29 to +30
self._hash_to_id: Dict[str, str] = {}
self._id_to_hash: Dict[str, str] = {}
Collaborator

Minor: no need to use two dictionaries to track this; you just need a set to track the duplicated items. As it stands, it makes the logic a bit less straightforward and uses more space.
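For illustration, the set-based alternative could look roughly like this (a hypothetical sketch; `compute_content_hash` below is a stand-in for the SDK's `utils.compute_content_hash`, approximated as a SHA-256 of the sorted-key JSON payload):

```python
import hashlib
import json


def compute_content_hash(item: dict) -> str:
    """Stand-in for utils.compute_content_hash: SHA-256 of the sorted-key JSON payload."""
    return hashlib.sha256(json.dumps(item, sort_keys=True).encode()).hexdigest()


def deduplicate(items: list, seen_hashes: set) -> list:
    """Keep only items whose content hash has not been seen before."""
    deduplicated = []
    for item in items:
        item_hash = compute_content_hash(item)
        if item_hash in seen_hashes:
            continue  # duplicate content: skip it
        seen_hashes.add(item_hash)
        deduplicated.append(item)
    return deduplicated
```

A single set gives the same dedup behavior with one lookup per item and no reverse mapping to maintain.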

# Remove duplicates if they already exist
deduplicated_items = []
for item in items:
    item_hash = utils.compute_content_hash(item)
Collaborator

Minor: you don't need to compute a hash for this; you can just compare the content payloads directly, assuming Python equality works correctly for this particular payload. This approach is valid anyway; it's just a trade-off of space vs. computation.
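A sketch of what payload-based comparison could look like (hypothetical; note that dict payloads are unhashable, so membership here is a linear scan over a list rather than a set lookup):

```python
def deduplicate_by_payload(items: list) -> list:
    """Drop items whose payload compares equal to one already seen."""
    seen = []  # a list, not a set: dict payloads are unhashable
    deduplicated = []
    for item in items:
        if item in seen:  # Python structural equality on dicts
            continue
        seen.append(item)
        deduplicated.append(item)
    return deduplicated
```

This trades the hashing cost for O(n) equality checks per item, which is the space-vs-computation trade-off mentioned above.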

Comment on lines +62 to +63
if item_hash in self._hash_to_id:
    if item.id is None or self._hash_to_id[item_hash] == item.id:  # type: ignore
Collaborator

There's a bug here: two items with the same payload but different IDs aren't deduplicated. Not a big deal, but it should be fixed in a follow-up PR.
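To make the bug concrete, here is a sketch (not the actual patch) contrasting the current condition with a possible follow-up fix that decides on the content hash alone:

```python
def is_duplicate_buggy(item_id, item_hash, hash_to_id):
    # Current logic: an item with a new id slips through even if the payload matches.
    return item_hash in hash_to_id and (
        item_id is None or hash_to_id[item_hash] == item_id
    )


def is_duplicate_fixed(item_id, item_hash, hash_to_id):
    # Possible fix: the content hash alone decides; a different id no longer bypasses dedup.
    return item_hash in hash_to_id
```

With `hash_to_id = {"h1": "id-a"}`, an incoming item with hash `"h1"` and id `"id-b"` is not flagged by the buggy version but is flagged by the fixed one.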

def _sync_hashes(self) -> None:
    """Updates all the hashes in the dataset"""
    LOGGER.debug("Start hash sync in dataset")
    all_items = self.get_all_items()
Collaborator

Minor: this call seems to have worked well in your testing, which is great because it's internally paginated. In the future it might consume too much memory, but this is fine for now and there is no action point.

assert len(inserted_items) == 1, "Only one item should be inserted"


def test_insert_deduplication_with_different_items():
Collaborator

A test is missing for the case of the same payload with a different item id.
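The missing test could look roughly like this (a self-contained sketch: the test name and the inline `content_hash` helper are hypothetical stand-ins, modeling the hash-based dedup loop without the Opik client):

```python
import hashlib
import json


def content_hash(payload: dict) -> str:
    # Stand-in for utils.compute_content_hash
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


def test_insert_deduplication_with_same_content_different_ids():
    """Same payload under two different ids should collapse to a single item."""
    payload = {"input": {"question": "q"}, "expected_output": {"output": "a"}}
    items = [{"id": "id-1", **payload}, {"id": "id-2", **payload}]

    seen, inserted = set(), []
    for item in items:
        # Hash only the content, excluding the id field
        h = content_hash({k: v for k, v in item.items() if k != "id"})
        if h in seen:
            continue
        seen.add(h)
        inserted.append(item)

    assert len(inserted) == 1, "Only one item should be inserted"
```

This exercises exactly the case the current condition misses: identical content arriving under two ids.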

Comment on lines +57 to +72
# Remove duplicates if they already exist
deduplicated_items = []
for item in items:
    item_hash = utils.compute_content_hash(item)

    if item_hash in self._hash_to_id:
        if item.id is None or self._hash_to_id[item_hash] == item.id:  # type: ignore
            LOGGER.debug(
                "Duplicate item found with hash: %s - ignored the event",
                item_hash,
            )
            continue

    deduplicated_items.append(item)
    self._hash_to_id[item_hash] = item.id  # type: ignore
    self._id_to_hash[item.id] = item_hash  # type: ignore
Contributor

I would suggest extracting this code into a separate method. This would slightly reduce the method's complexity (McCabe/Cyclomatic Complexity) and improve testability by allowing you to test just the deduplication logic without needing to mock the Opik client or other dependencies.
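The extraction could look like this (a sketch with hypothetical names; the dedup step is written as a free function so it can be unit-tested with plain dicts, and `compute_hash` is injected in place of `utils.compute_content_hash`):

```python
import logging

LOGGER = logging.getLogger(__name__)


def deduplicate_items(items, hash_to_id, id_to_hash, compute_hash):
    """Pure dedup step extracted from insert(); testable without an Opik client.

    `compute_hash` stands in for utils.compute_content_hash.
    Items are modeled as dicts with "id" and "content" keys for the sketch.
    """
    deduplicated = []
    for item in items:
        item_hash = compute_hash(item["content"])
        if item_hash in hash_to_id:
            if item["id"] is None or hash_to_id[item_hash] == item["id"]:
                LOGGER.debug("Duplicate item found with hash: %s - ignored", item_hash)
                continue
        deduplicated.append(item)
        hash_to_id[item_hash] = item["id"]
        id_to_hash[item["id"]] = item_hash
    return deduplicated
```

Testing this function only needs in-memory dicts and a trivial hash function; no client mocking is required.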

content = item

# Convert the dictionary to a JSON string with sorted keys for consistency
json_string = json.dumps(content, sort_keys=True)
Contributor

I'd like to point out that if the value for a key is a collection, such as an array or a set, and the order of elements differs, the items will be considered different even though their content is effectively identical. That said, it might not be relevant to our use cases.
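For example, two payloads that differ only in list order serialize differently, so their hashes would differ too:

```python
import json

a = {"tags": ["x", "y"]}
b = {"tags": ["y", "x"]}

# sort_keys orders only dictionary keys, not list elements, so the two
# serializations (and therefore their content hashes) differ.
same = json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
print(same)  # False
```

Whether this matters depends on whether order inside those collections is semantically meaningful for the dataset.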

dsblank pushed a commit that referenced this pull request Oct 4, 2024
* Implemented deduplication

* Added documentation

* Added update unit test

* Add support for delete method

* Fix linter

* Fix python 3.8 tests