Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[v1] Allow upsert to support batchSize for batched vector upserts #90

Merged
merged 2 commits into from
Sep 5, 2023

Conversation

austin-denoble
Copy link
Contributor

@austin-denoble austin-denoble commented Aug 30, 2023

Problem

In v0 there was a utility called chunkedUpsert() which allowed users to upsert vectors in chunks:

const chunkedUpsert = async (
index: VectorOperationsApi,
vectors: Vector[],
namespace: string,
chunkSize = 10
) => {
// Split the vectors into chunks
const chunks = sliceIntoChunks<Vector>(vectors, chunkSize);

We want to offer this functionality in v1 as well to ease usage around the 1000 vector limit for upserting, and give end users more control over their upserts.

Solution

  • Add batchUpsert() and sliceArrayToBatches() helper functions to /data/upsert.ts. Because of how tightly coupled these helpers are to upsert specifically it felt like it made sense to keep them close to the file itself and test the inherent functionality in the existing test file.
  • Update tyepbox schemas for upsert to support consuming either an array of vectors, or an object containing vectors and batchSize.
  • Update logic in upsert() to handle checking whether we've received an array vs. object, and branching depending on the arguments.
  • Add new unit test to upsert.test.ts to validate batching behavior making the appropriate calls to upsert API.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

Test Plan

Make sure to manually test both cases:

  • Traditional upserts: pass an array of vectors to index.upsert( ... ) directly.
  • Batched upserts: pass an object with a specified batchSize: { vectors: [ ... ], batchSize: 10}

Note: For easing testing I generated a set of 100 vectors. You can find them in the gist here if it's helpful: https://gist.github.com/austin-denoble/41cfd5c8857bac314435bb21f6caa25d

$ npm run repl
$ await init()

// Test standard upserting via array arg
$ await client.index('test-index').upsert([{ id: '111', values: [1,2,3] }, { id: '222', values: [3,2,1] }, { id: '333', values: [2,3,1] }])
undefined

// Validate upsert
$ await client.index('test-index').fetch(['111', '222', '333'])
{
  vectors: {
    '111': {
      id: '111',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    '222': {
      id: '222',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    '333': {
      id: '333',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    }
  },
  namespace: ''
}

// Test batched upsert via object arg
$ await client.index('test-index').upsert({ vectors: [{ "id": "test-create-0", "values": [1, 2, 3] }, { "id": "test-create-1", "values": [1, 2, 3] }, { "id": "test-create-2", "values": [1, 2, 3] }, { "id": "test-create-3", "values": [1, 2, 3] }, { "id": "test-create-4", "values": [1, 2, 3] }, { "id": "test-create-5", "values": [1, 2, 3] }, { "id": "test-create-6", "values": [1, 2, 3] }, { "id": "test-create-7", "values": [1, 2, 3] }, { "id": "test-create-8", "values": [1, 2, 3] }, { "id": "test-create-9", "values": [1, 2, 3] }, { "id": "test-create-10", "values": [1, 2, 3] }], batchSize: 5 })
undefined

// Validate all of the batches uploaded
$ await client.fetch(['test-create-0', 'test-create-1', 'test-create-2', 'test-create-3', 'test-create-4', 'test-create-5', 'test-create-6', 'test-create-7', 'test-create-8', 'test-create-9'])

{
  vectors: {
    'test-create-0': {
      id: 'test-create-0',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-3': {
      id: 'test-create-3',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-8': {
      id: 'test-create-8',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-9': {
      id: 'test-create-9',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-2': {
      id: 'test-create-2',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-4': {
      id: 'test-create-4',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-7': {
      id: 'test-create-7',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-5': {
      id: 'test-create-5',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-1': {
      id: 'test-create-1',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    },
    'test-create-6': {
      id: 'test-create-6',
      values: [Array],
      sparseValues: undefined,
      metadata: undefined
    }
  },
  namespace: ''
}

@austin-denoble austin-denoble marked this pull request as ready for review August 30, 2023 22:16
Copy link
Collaborator

@jhamon jhamon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general approach looks good to me, but I can tell it's going to be a big pain resolving merge conflicts in this PR. Sorry about that 😬

Didn't do that much manual testing since I'm thinking we'll want to do that after you rebase and resolve conflicts.

src/data/__tests__/upsert.test.ts Outdated Show resolved Hide resolved
src/data/upsert.ts Outdated Show resolved Hide resolved
src/data/upsert.ts Outdated Show resolved Hide resolved
@austin-denoble austin-denoble force-pushed the adenoble/chunked-upsert branch from 079dfbc to a3f301d Compare September 4, 2023 18:26
@austin-denoble austin-denoble changed the title [v1] Allow upsert to support chunkSize for chunked vector upserts [v1] Allow upsert to support batchSize for batched vector upserts Sep 4, 2023
@austin-denoble austin-denoble mentioned this pull request Sep 4, 2023
1 task
…er and update typebox validation, update unit test coverage, export types
…to new pattern, export sliceArrayToBatches and test explicitly
@austin-denoble austin-denoble force-pushed the adenoble/chunked-upsert branch from a3f301d to 3fd5c29 Compare September 5, 2023 14:16
@austin-denoble austin-denoble merged commit 0238f3a into main Sep 5, 2023
@austin-denoble austin-denoble deleted the adenoble/chunked-upsert branch September 5, 2023 15:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants