It seems that the method fetch() in selfservice/strategy/password/validator.go allocates memory without limits. Looking briefly at the code, I can see that it caches a counter for each password validated against the HIBP API. This could be problematic in the long run, since all validated password hashes are kept in memory indefinitely, causing steady memory growth and eventually exhausting all available memory.
I believe this implementation stores these values to save calls to the API, which is good. However, we would need to limit the amount of data stored in this "cache" so it does not "leak" memory and add unnecessary memory pressure on the host. In a previous app with a similar cache concept we used https://pkg.go.dev/github.com/hashicorp/golang-lru?utm_source=godoc and it worked well. We could also think of implementing something simpler, but I'm not sure that would be reliable for a larger set of records.
Another thing: while this local cache is fine for local testing, where there is only one instance of Kratos, in production with multiple instances the cache might not be as effective, since requests are spread across the instances and each one has to populate its cache individually.
Let me know what you think. This is not a critical bug, but it should be fixed rather quickly to save others some unnecessary memory-leak hunting :-)
Reproducing the bug
Run Kratos instance for a substantial amount of time (24hrs+)
Generate substantial traffic to Kratos endpoints (login/registration)
Observe memory usage and notice it is steadily increasing over time
Relevant log output
Here is the 24-hour graph: https://pasteboard.co/9Iw7qL5NHZb5.png (24+1 instances; the largest graph shows cumulative memory usage across all instances).
Relevant configuration
No response
Version
0.8.0
On which operating system are you observing this issue?
Linux
In which environment are you deploying?
Kubernetes
Additional Context
No response
It is, however, entirely possible that the cache is at fault here. It's astonishing that there's that much data being cached! Improving the cache is a good idea; we could probably use ristretto, or rather https://github.com/outcaste-io/ristretto (which is maintained), as we usually use that lib across all our projects :)
We are using our custom (legacy) hasher, which is based on bcrypt and AES-256. I ran the profiler, and it seems that this method is causing the excessive allocations.
I can take a look at ristretto to see if that could help with the memory allocation and if it does I will create a PR with the proposed implementation.
Our initial tests on PROD look promising. We will leave it running over the weekend, and if memory usage is still reasonable on Monday, we will create a PR with these changes.