
fix: issue with max call stack when search #437

Merged
merged 4 commits into main from fix/max-call-stack-length on Sep 6, 2023

Conversation

@H4ad (Contributor) commented Jul 5, 2023

Closes #301

The root cause is a simple issue with .push.

I added a helper function instead of just an if statement, because other parts of Orama probably have the same issue.
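For context, a minimal sketch of the failure mode (my own repro, not code from this PR; the exact threshold depends on the JS engine):

// Spreading a large array into push() passes every element as a call argument.
const ids: number[] = new Array(200_000).fill(0)
const results: number[] = []

// Throws "RangeError: Maximum call stack size exceeded" on V8 for arrays this large.
results.push(...ids)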

PS: to be able to test/run the script from #301 without it taking several minutes or hours, merge this branch with #434.


@allevo (Collaborator) commented Jul 6, 2023

Hi!
Thanks for this contribution! Will this affect the search performance when the result count is less than the limit?
How was the value 10_000 chosen?

@H4ad (Contributor, Author) commented Jul 6, 2023

@allevo I describe the reasoning in more detail here: https://github.com/oramasearch/orama/pull/437/files#diff-eeb84b4af585397ddd81a97513980b6a568f6b3e82a0a597effba24233873a40R13-R18

But essentially, it was just the max value divided by 10.

About the search time: from what I can see, the performance for more than 10K items could improve, but not by much; both operations are very fast compared to the time it takes to search that many items.

Also, push is more performant than concat: https://www.measurethat.net/Benchmarks/Show/4223/0/array-concat-vs-spread-operator-vs-push
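For a rough local comparison, a micro-benchmark sketch (my own code, not the linked benchmark; absolute numbers vary by engine and array size):

// Compare push(...spread) with concat for merging two arrays.
const a = Array.from({ length: 5_000 }, (_, i) => i)
const b = Array.from({ length: 5_000 }, (_, i) => i)

console.time("push")
for (let run = 0; run < 1_000; run++) {
  const out: number[] = []
  out.push(...a, ...b)
}
console.timeEnd("push")

console.time("concat")
for (let run = 0; run < 1_000; run++) {
  const out = ([] as number[]).concat(a, b)
}
console.timeEnd("concat")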

@allevo (Collaborator) commented Jul 11, 2023

I ran some benchmarks for this PR, and apparently it introduces a slowdown.

I used this code:

import { create as createDev, insert as insertDev } from "orama-dev"
import { create as createStable, insert as insertStable } from "@orama/orama"
import { Benchmark } from "kelonio"
import { faker } from '@faker-js/faker'

async function createSearch() {
  const n = 30000
  const data1 = Array.from({ length: n }).map(() => faker.string.sample())
  const data2 = [...data1]

  const benchmark = new Benchmark()
  const db1 = await createStable({ schema: { name: "string" } })
  for (const name of data1) {
    await insertStable(db1, { name })
  }  
  
  const db2 = await createDev({ schema: { name: "string" } })
  for (const name of data2) {
    await insertDev(db2, { name })
  }  

  await benchmark.record("search - stable", async () => await insertStable(db1, { name: data1.pop() }), {
    iterations: n,
  })
  await benchmark.record("search - dev", async () => await insertDev(db2, { name: data2.pop() }), {
    iterations: n,
  })
  printResult(benchmark)
}


function printResult(benchmark: Benchmark) {
  const data = benchmark.data
  const names = Object.keys(data)
  const results = names.map(name => {
    const result = data[name]
    const mean = result.durations.reduce((a, b) => a + b, 0) / result.durations.length

    return { name, mean, totalDuration: result.totalDuration }
  })

  results.sort((a, b) => a.mean - b.mean)

  console.log(
    results.map(result => `${result.name}: ${result.mean} ms`).join("\n")
  )
  console.log('\n\n\n')
}

Prints

search - stable: 1.754566665599999 ms
search - dev: 2.1923525156000085 ms

@H4ad (Contributor, Author) commented Jul 11, 2023

Are you not testing just the insert? This PR will only affect the search.

@allevo (Collaborator) commented Jul 11, 2023

Sorry, my bad. I posted the wrong test. The corrected one benchmarks search (assuming search is imported from each package as searchStable/searchDev):

import { search as searchStable } from "@orama/orama"
import { search as searchDev } from "orama-dev"

async function createSearch() {
  const n = 30000
  const data1 = Array.from({ length: n }).map(() => faker.string.sample())
  const data2 = [...data1]

  const benchmark = new Benchmark()
  const db1 = await createStable({ schema: { name: "string" } })
  for (const name of data1) {
    await insertStable(db1, { name })
  }

  const db2 = await createDev({ schema: { name: "string" } })
  for (const name of data2) {
    await insertDev(db2, { name })
  }

  await benchmark.record("search - stable", async () => await searchStable(db1, { term: data1.pop() }), {
    iterations: n,
  })
  await benchmark.record("search - dev", async () => await searchDev(db2, { term: data2.pop() }), {
    iterations: n,
  })
  printResult(benchmark)
}

@H4ad (Contributor, Author) commented Jul 18, 2023

Hey @allevo, do you have any suggestions on this change?

@allevo (Collaborator) commented Jul 18, 2023

Something like this:

const MAX_ARGUMENT_FOR_STACK = 10_000 // batch size, kept well below the engine's argument limit

export function safeAddNewItems<T>(arr: T[], newArr: T[]) {
  if (newArr.length < MAX_ARGUMENT_FOR_STACK) {
    // Small enough to spread in a single call.
    arr.push(...newArr)
  } else {
    // Push in fixed-size batches so no single call exceeds the argument limit.
    for (let i = 0; i < newArr.length; i += MAX_ARGUMENT_FOR_STACK) {
      arr.push(...newArr.slice(i, i + MAX_ARGUMENT_FOR_STACK))
    }
  }
}
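Usage would look something like this (my example, not from the comment above):

const all: number[] = []
const huge: number[] = new Array(200_000).fill(1)

safeAddNewItems(all, huge) // no RangeError: pushed in MAX_ARGUMENT_FOR_STACK-sized batches
console.log(all.length)    // 200000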

I didn't test the performance differences, though.

@rawpixel-vincent (Contributor) commented Jul 19, 2023

Hi,
I've hit this issue and tested this PR. It fixes my problem, but I had to use safeAddNewItems in two more places inside searchByWhereClause for my search query to work.

Attached is the patch I use against 1.0.10 (it also includes a fix in removeDocument, which crashes when the node doesn't exist, as part of updateMultiple):
@orama+orama+1.0.10.patch

Search query I use:

await SearchDB.search('program', {
    sortBy: { property: 'date', order: 'DESC' },
    exact: false,
    where: {
      business: businessIds?.length ? ['none', ...businessIds] : 'none',
      published: true,
    },
    properties: ['title', 'author', 'tags'],
    term: matchKeys,
    limit: pagesize,
    offset: offset || 0,
    threshold: 0.6,
  });

Schema:

const SCHEMA = {
  id: 'string',
  title: 'string',
  author: 'string',
  tags: 'string',
  published: 'boolean',
  date: 'number',
  business: 'string',
};

The dpack size is around 400MB (900k docs).

PS: Performance is really good; I love the idea of this project ❤️

@H4ad (Contributor, Author) commented Jul 20, 2023

@allevo I changed the code to use your suggestion; it's way better than cloning the array every time.

And @rawpixel-vincent, thanks for your suggestions. I ended up applying the helper to every .push call in the code, just to be sure. I also added you as co-author of the fix in avl.ts, thanks.

@allevo (Collaborator) commented Jul 20, 2023

@H4ad Could you align this branch with main? I'm curious to see the performance difference.

@micheleriva (Member) left a comment

LGTM

@allevo (Collaborator) commented Jul 20, 2023

This PR introduces an overhead:

search string - stable: 2.092553788 ms
search string - dev: 2.301789926000001 ms

I wonder whether we want to fix this at all: a typical search term is very short, and the problem only arises when an index returns more than 100K elements; in that case, an error is thrown.

@micheleriva WDYT?

@micheleriva (Member) commented

By "an index returns more than 100K elements", do you mean when limit is set to 100_000? If so, we shouldn't return 100k results in the first place.

@H4ad (Contributor, Author) commented Jul 20, 2023

Orama gathers the IDs of all matching documents in order to filter and sort them, and then limits that list to the number of items the user wants, usually 10.

Now, whenever a list would push more than 10K elements, we push in batches instead of pushing all the items at once, to avoid the max call stack issue.
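To make that concrete, here is a hypothetical sketch of the flow (the function name and shape are mine, not Orama internals; it reuses the safeAddNewItems helper from above):

// Merge per-index ID lists in batches, then apply offset/limit.
function collectResults(idLists: string[][], limit = 10, offset = 0): string[] {
  const all: string[] = []
  for (const ids of idLists) {
    safeAddNewItems(all, ids) // batched push instead of one huge spread call
  }
  return all.slice(offset, offset + limit)
}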

@micheleriva (Member) commented

I think that looks good. @H4ad, could you resolve the conflicts before merging? Thanks!

@H4ad (Contributor, Author) commented Jul 24, 2023

@micheleriva Done!

@micheleriva (Member) left a comment

LGTM

@micheleriva (Member) commented

> This PR introduces an overhead:
>
> search string - stable: 2.092553788 ms
> search string - dev: 2.301789926000001 ms
>
> I wonder whether we want to fix this at all: a typical search term is very short, and the problem only arises when an index returns more than 100K elements; in that case, an error is thrown.
>
> @micheleriva WDYT?

Hi @H4ad, I'm a bit concerned about this performance degradation. I'm not sure it's worth adding such an overhead for this edge case.

@H4ad (Contributor, Author) commented Aug 17, 2023

@micheleriva The overhead is about 0.3ms for 30k items; in most cases it won't even be noticed.

I think the change is worth it, since being a little slower is better than crashing.

Also, we can experiment with increasing the number of items per slice, as sketched below: the engine limit is above 100k and we are using an arbitrary value of 10k, so we could try 20k, 40k, and so on.
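For example, a quick timing sketch over a few candidate batch sizes (my code; the sizes are the values discussed in this thread):

// Time the batched push for several batch sizes.
const source: number[] = new Array(90_000).fill(0)

for (const batch of [10_000, 20_000, 40_000, 65_000]) {
  const out: number[] = []
  const start = performance.now()
  for (let i = 0; i < source.length; i += batch) {
    out.push(...source.slice(i, i + batch))
  }
  console.log(`batch ${batch}: ${(performance.now() - start).toFixed(3)} ms`)
}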

@allevo (Collaborator) commented Aug 18, 2023

The crash only happens when the engine's limit is reached. What happens if we set MAX_ARGUMENT_FOR_STACK to 99_000?
Are there performance gains even when the limit is that high?

@H4ad (Contributor, Author) commented Aug 27, 2023

@allevo I think we can increase it. @micheleriva, can you run more tests with different values of MAX_ARGUMENT_FOR_STACK? It just needs to be less than 100K.

@H4ad (Contributor, Author) commented Aug 27, 2023

According to this link, WebKit has a lower limit: 65k items.
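If useful, here is a small probe to find the engine's actual limit empirically (my sketch; the threshold also depends on how deep the current call stack already is):

// Doubling probe: find roughly how many elements can be spread into push().
function maxSpreadArgs(): number {
  let n = 1
  for (;;) {
    try {
      const target: number[] = []
      target.push(...new Array(n * 2).fill(0))
      n *= 2 // n is always the last size that worked
    } catch {
      return n // the real limit lies between n and 2n
    }
  }
}

console.log(maxSpreadArgs())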

@allevo (Collaborator) commented Sep 4, 2023

It's OK to set the magic number to 65K: this avoids throwing while limiting the performance impact to searches with large result sets.

@micheleriva (Member) commented

Let's use 65k as the limit, then. I'll merge this as soon as the conflicts are fixed.

@micheleriva (Member) left a comment

LGTM

@micheleriva micheleriva merged commit cdb2d5d into oramasearch:main Sep 6, 2023
2 checks passed
@H4ad H4ad deleted the fix/max-call-stack-length branch September 6, 2023 00:59
@BanderasPRO (Contributor) commented

This is still present: see #469 (comment).

Should we reopen this, or open a new issue?

Successfully merging this pull request may close these issues.

200k documents search leads to maximum call stack exceeded