High memory usage #180

Open
JoshKeegan opened this issue Mar 28, 2021 · 4 comments

Comments

@JoshKeegan
Owner

Unsure if this is a memory leak or just Kestrel behaviour, as there are currently no memory limits on the container.
Repro: do a lookup on the "pi" digits for the digit "1".

I would expect memory usage to go up, as that search uses the precomputed digits for the suffix array range but still needs to scan ~10% of the suffix array to get the result, whereas a count would just hit the precomputed file. What I wasn't expecting is that it doesn't then release the memory afterwards.

@JoshKeegan
Owner Author

JoshKeegan commented Mar 28, 2021

Profiled memory usage and it looks like it's not a memory leak, or Kestrel. The search algorithm loads the full range of suffix array data into system memory to sort it, which for a search like "1" is ~500 million x 64-bit entries (~3.7 GiB). Unless you force a GC it just doesn't naturally return that memory to the system and leaves the heap larger (unless the system were low on memory, I guess, in which case it'd be forced to do the heap compaction or get OOM killed).
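A quick back-of-envelope check of that figure (assuming 8 bytes per suffix-array entry):

```python
# Sanity check on the numbers above: ~500 million suffix-array entries
# at 64 bits (8 bytes) each, loaded into memory for a "1" search.
entries = 500_000_000
bytes_per_entry = 8
gib = entries * bytes_per_entry / 2**30
print(round(gib, 1))  # -> 3.7
```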

Because the sort happens in memory, though, we can't just put a memory limit on the container, as it would then run out of memory when loading the digits...

Actions:

  • Look at whether the sorting is necessary. Isn't the suffix array ordered..? Even if not, we would only need to keep n elements in memory to find the nth digit sorted, so improving the selection/sorting approach may be worthwhile (it still has the same worst case, but if someone is searching for the 5 billionth occurrence of "1" then they're just trying to break the system :D)
  • The mechanism for loading all the digits required for the sort is slow. It loads each digit individually, accessing the stream one byte at a time, and since the digits aren't byte-aligned this is costly. Perhaps we can optimise that, e.g. by adding methods to load ranges of digits rather than one at a time.

Note: this logic is ported from the original .aspx API, which got around this issue by validating and enforcing a minimum length for the search string. Since we load some digits onto the client this is reasonable, but it prevents users searching for later occurrences of a single digit. Ideally we'd optimise this problem away, but if not we should at least add validation to prevent someone accidentally DoSing the site.
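On the range-loading action above, here's a rough sketch of what decoding a range of digits in one pass could look like, assuming a hypothetical packed format of two decimal digits per byte, high nibble first (the real on-disk format may differ):

```python
def read_digit_range(packed: bytes, start: int, count: int) -> list[int]:
    """Decode digits [start, start + count) from a buffer that packs two
    decimal digits per byte, high nibble first. One pass over the buffer,
    instead of re-reading the stream once per digit."""
    out = []
    for i in range(start, start + count):
        byte = packed[i // 2]
        out.append(byte >> 4 if i % 2 == 0 else byte & 0x0F)
    return out

# Digits "314159" packed as nibbles: 0x31, 0x41, 0x59
buf = bytes([0x31, 0x41, 0x59])
print(read_digit_range(buf, 1, 4))  # -> [1, 4, 1, 5]
```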

@JoshKeegan
Owner Author

JoshKeegan commented Mar 29, 2021

Sorting is required because the suffix array is ordered by suffix contents, not by position. E.g. the string "32141" as a suffix array would be:

suffix   i
1        5
141      3
2141     2
32141    1
41       4

So in the above array, if you searched for "1" you would get a start index of 1 and an end index of 2 (in the suffix array). Those entries give indices 5 & 3 in the original string, so you can see a sort would still be required, as the first occurrence (3) was not returned first.
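The example above in code (0-based indices here, so the matches come out as 4 & 2 rather than 5 & 3):

```python
def build_suffix_array(s: str) -> list[int]:
    """Suffix start positions, ordered by suffix contents.
    Naive O(n^2 log n) construction -- fine for a demo."""
    return sorted(range(len(s)), key=lambda i: s[i:])

s = "32141"
sa = build_suffix_array(s)
print(sa)  # -> [4, 2, 1, 0, 3]  (suffixes "1", "141", "2141", "32141", "41")

# The "1" matches occupy the contiguous suffix-array range [0, 1], but
# their original-string positions come out as 4 then 2: ordered by suffix
# contents, not by position -- hence the extra sort.
matches = [i for i in sa if s[i:].startswith("1")]
print(matches)  # -> [4, 2]
```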

@JoshKeegan
Owner Author

JoshKeegan commented Mar 29, 2021

We may also want to avoid using the suffix array and perform a naive linear search for very short lookup strings, which would avoid the additional memory usage entirely...
Since we return a count & suffix array ranges as part of the lookup response, we would still want to perform the suffix array search (or precomputed digits lookup) first. That means we know up front how many digits we'd load into memory, and can base the decision on that.
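A hypothetical sketch of that decision (the names and threshold are made up, not the project's actual API):

```python
MAX_SORTABLE_MATCHES = 10_000_000  # hypothetical memory budget

def find_nth_occurrence(digits: str, pattern: str, n: int,
                        match_count: int) -> int:
    """Find the 0-based index of the nth (1-based) occurrence of pattern.
    match_count comes from the suffix-array search / precomputed counts,
    so we know up front how much we'd otherwise load and sort."""
    if match_count > MAX_SORTABLE_MATCHES:
        # Too many matches to sort in memory: naive linear scan instead.
        # O(1) extra memory, and occurrences arrive already in order.
        seen = 0
        for i in range(len(digits) - len(pattern) + 1):
            if digits.startswith(pattern, i):
                seen += 1
                if seen == n:
                    return i
        return -1
    raise NotImplementedError("suffix-array range + sort path elided")

print(find_nth_occurrence("3141592653", "1", 2, match_count=20_000_000))  # -> 3
```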

@JoshKeegan
Owner Author

Looking back over this: the data is loaded into memory and sorted so that the nth smallest element can be selected, which means sorting the entire collection is doing more work than necessary. Rather than worrying about sorting, we can focus on the more specific problem of selection...

  • If memory weren't the issue but CPU were, we could switch to a selection algorithm rather than a sort, e.g. Quickselect (average O(n) rather than O(n log n)).
  • To reduce memory usage in the average case, we could scan the data on disk, keeping only the smallest n elements in memory.

I do wonder if my previous idea of switching to a naive linear search for very short lookup strings is still a better way to control the memory problem, but even if that were done, this sort could still be replaced with Quickselect.
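A sketch of the second bullet above: scanning a stream while holding only the smallest n elements, via a bounded max-heap (in a real implementation the stream would come from the on-disk suffix array range rather than an in-memory iterable):

```python
import heapq

def smallest_n(stream, n: int) -> list[int]:
    """Return the n smallest elements of stream in ascending order,
    holding at most n elements in memory at once (max-heap via negation)."""
    heap = []  # stores negated values, so -heap[0] is the current max
    for x in stream:
        if len(heap) < n:
            heapq.heappush(heap, -x)
        elif -heap[0] > x:
            heapq.heapreplace(heap, -x)  # evict the current max, keep x
    return sorted(-v for v in heap)

print(smallest_n(iter([9, 2, 7, 1, 8, 3]), 3))  # -> [1, 2, 3]
```

The nth smallest element (the nth occurrence's position) is then the last element of the returned list.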
