High memory usage #180
Profiled memory usage and it looks like this isn't a memory leak, or Kestrel. The search algorithm loads the full range of suffix array data into system memory to sort it, which for a search like "1" will be ~500 million × 64-bit entries (~3.7 GiB). Unless you force a GC it doesn't naturally return that memory to the system; it just leaves the heap larger (unless the system were low on memory, I guess, in which case it would be forced to compact the heap or get OOM killed). Because the sort happens in memory, though, we can't just put a memory limit on the container, as it would then run out when loading the digits... Actions:
Note: this logic is ported from the original .aspx API, which got around this issue by just validating and enforcing a minimum string length to search for. Since we load some digits onto the client this is reasonable, but it prevents them searching for later occurrences of a digit. Ideally we'd optimise this problem away, but if not we should at least add validation to prevent someone DoSing the site accidentally.
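A minimal sketch of what that validation could look like in an ASP.NET Core controller (the controller name, route, and minimum length are all illustrative, not taken from the real code):

```csharp
using Microsoft.AspNetCore.Mvc;

public class DigitSearchController : ControllerBase // hypothetical controller name
{
    // Assumed threshold; the original .aspx API's actual minimum isn't stated in this issue.
    private const int MinSearchLength = 3;

    [HttpGet("api/digits/search")] // illustrative route
    public IActionResult Search(string searchString)
    {
        if (string.IsNullOrEmpty(searchString) || searchString.Length < MinSearchLength)
        {
            return BadRequest($"Search string must be at least {MinSearchLength} digits.");
        }

        // ...hand off to the existing suffix array lookup as today...
        return Ok();
    }
}
```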
Sorting is required as the suffix array is ordered by the suffix contents, not the positions, e.g. the string "32141" as a suffix array would be:

[5, 3, 2, 1, 4] (suffixes in lexicographic order: "1" at position 5, "141" at 3, "2141" at 2, "32141" at 1, "41" at 4)
So in the above array, if you searched for "1" you would get a start index of 1 and an end index of 2 (in the suffix array). Since those entries give indices 5 and 3 in the original string, you can see that a sort would still be required, as the first occurrence (3) is not first in the range.
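To make that concrete, here's a rough sketch of the lookup flow on the "32141" example (illustrative only; the real implementation works against the on-disk suffix array and would binary search for the range rather than scan):

```csharp
using System;
using System.Linq;

class SuffixArrayLookupSketch
{
    static void Main()
    {
        // Illustrative only: the real data is billions of digits with the
        // suffix array stored on disk; here we just use the "32141" example.
        string text = "32141";

        // Suffix array for "32141" (1-based positions, as in the comment above):
        // suffixes in lexicographic order are "1"(5), "141"(3), "2141"(2), "32141"(1), "41"(4).
        int[] suffixArray = { 5, 3, 2, 1, 4 };

        string query = "1";

        // Find the contiguous block of suffix-array entries whose suffix starts
        // with the query. A real implementation would binary search; a scan is
        // enough for the sketch.
        var block = suffixArray
            .Where(pos => text.Substring(pos - 1).StartsWith(query))
            .ToArray(); // for "1" this is { 5, 3 } -- suffix array slots 1..2

        // The positions come out in suffix (lexicographic) order, not text order,
        // so finding the first occurrence still requires a sort (or a selection).
        Array.Sort(block); // { 3, 5 } -> first occurrence is at position 3

        Console.WriteLine($"First occurrence of \"{query}\" is at position {block[0]}");
    }
}
```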
We may also want to avoid using the suffix array entirely and perform a naive linear search for very short lookup strings, which would avoid the additional memory usage altogether...
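A sketch of that fallback, assuming the digits are available as a span and that "very short" means something like 1-2 characters (both the cutoff and the names are made up for illustration; in practice the digits would presumably be streamed from the file in chunks):

```csharp
using System;

static class ShortQueryFallback
{
    // For very short patterns, scan the digits directly instead of touching the
    // suffix array at all -- no large block of 64-bit offsets ever gets loaded.
    public static long FindNthOccurrence(ReadOnlySpan<char> digits, ReadOnlySpan<char> pattern, int n)
    {
        int seen = 0;
        for (int i = 0; i + pattern.Length <= digits.Length; i++)
        {
            if (digits.Slice(i, pattern.Length).SequenceEqual(pattern) && ++seen == n)
            {
                return i; // 0-based offset into the digit stream
            }
        }
        return -1; // fewer than n occurrences
    }
}
```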
Looking back over this: the data is loaded into memory and sorted so that the nth smallest element can be selected, which means sorting the entire collection is overkill. Instead of worrying about the sort, we can focus on the more specific problem of selection...
I do wonder if my previous idea of switching to a naive linear search for very short lookup strings is still a better way to control the memory problem, but even if that were done, this sort could still be replaced with Quickselect.
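For illustration, a minimal in-place Quickselect over the block of positions (not the project's code): it finds the k-th smallest position in expected linear time without ever fully ordering the block, so `SelectKth(block, 0)` would give the first occurrence and larger `k` values the later ones:

```csharp
using System;

static class QuickselectSketch
{
    // Returns the k-th smallest value (k is 0-based) from the block of suffix
    // positions, partially partitioning in place instead of fully sorting.
    public static long SelectKth(long[] positions, int k)
    {
        var rng = new Random();
        int lo = 0, hi = positions.Length - 1;
        while (true)
        {
            if (lo == hi) return positions[lo];

            // Random pivot keeps the expected running time linear.
            int pivotIndex = lo + rng.Next(hi - lo + 1);
            long pivot = positions[pivotIndex];
            (positions[pivotIndex], positions[hi]) = (positions[hi], positions[pivotIndex]);

            // Lomuto partition: everything smaller than the pivot ends up left of 'store'.
            int store = lo;
            for (int i = lo; i < hi; i++)
            {
                if (positions[i] < pivot)
                {
                    (positions[i], positions[store]) = (positions[store], positions[i]);
                    store++;
                }
            }
            (positions[store], positions[hi]) = (positions[hi], positions[store]);

            if (k == store) return positions[store];
            if (k < store) hi = store - 1;
            else lo = store + 1;
        }
    }
}
```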
Unsure if this is a memory leak or just Kestrel behaviour, as there are currently no memory limits on the container.
Repro: Do a lookup on "pi" digits for digit "1".
I would expect memory usage to go up, as that will use the precomputed digits for the suffix array range but still needs to search through ~10% of the suffix array to get the result, whereas a count would just hit the precomputed file. What I wasn't expecting is that it doesn't then release the memory.
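For what it's worth, that matches default .NET GC behaviour: a block that large lands on the large object heap, and the runtime tends to keep the grown heap reserved rather than handing it straight back to the OS. If we wanted to confirm the memory is reclaimable rather than leaked, the standard LOH compaction knobs could be used as a one-off test, e.g.:

```csharp
using System;
using System.Runtime;

// After the large positions array goes out of scope, ask the GC to compact the
// large object heap on the next blocking collection and return memory to the OS.
GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;
GC.Collect();
GC.WaitForPendingFinalizers();
GC.Collect();
```

Not something we'd necessarily want in the request path, just a way to verify the heap growth isn't a genuine leak.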