Slow and seemingly unreasonably large memory usage #69
@Goblinlordx There is a significant amount of overhead when crossing the JS <--> Rust boundary in either direction. Polars is best used when reading/writing directly from the filesystem, or when using the lazy APIs as much as possible. Unfortunately, the upfront cost of converting JS values to Rust is quite expensive. However, by simply using a TypedArray we can reduce the footprint dramatically:

```js
const data = Array(50000)
  .fill(null)
  .map((_, i) =>
    Int32Array.from(
      Array(100)
        .fill(0)
        .map((_, ii) => i * 100 + ii)
    )
  )
```

Without TypedArrays: [benchmark screenshot]

With TypedArrays: [benchmark screenshot]
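As a rough illustration of the point above, here is a minimal sketch comparing a column built from a plain array with one built from a TypedArray. It assumes the `pl.Series(name, values)` constructor of nodejs-polars; the column name and sizes are made up.

```js
// Minimal sketch (assumed API usage, not taken from the original thread):
// a contiguous TypedArray avoids the per-element conversion that plain JS
// number arrays incur when crossing into Rust.
const pl = require("nodejs-polars");

const plain = Array.from({ length: 1000000 }, (_, i) => i);      // plain JS numbers
const typed = Int32Array.from({ length: 1000000 }, (_, i) => i); // contiguous Int32 buffer

console.time("series-from-plain-array");
const s1 = pl.Series("a", plain);
console.timeEnd("series-from-plain-array");

console.time("series-from-typed-array");
const s2 = pl.Series("a", typed);
console.timeEnd("series-from-typed-array");
```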
Ah, I can confirm that resolves this specific case. However, it doesn't really cover a more realistic case, like the ones given in the examples in the documentation:

Is there a way to accomplish this when the arrays contain strings? How would I recreate this efficiently for the kind of code that builds a large dataset in JS? Edit: I updated the code I provided to use a typed array as well as a string array:
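The updated snippet referenced here did not survive extraction; below is a sketch of what mixing a typed-array column with a string-array column might look like. The column names and the `pl.DataFrame({ name: values })` call are illustrative assumptions.

```js
// Reconstructed sketch, not the original snippet: one Int32Array column plus
// one plain string-array column (strings have no TypedArray equivalent).
const pl = require("nodejs-polars");

const n = 50000;
const ids = Int32Array.from({ length: n }, (_, i) => i);
const labels = Array.from({ length: n }, (_, i) => `label-${i % 100}`);

const df = pl.DataFrame({ id: ids, label: labels });
console.log(df.shape);
```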
Actually, the memory usage seems somewhat acceptable (roughly 2x). It is still relatively slow, but I assume that can't be helped. The memory usage does have somewhat odd behavior, though: the initial usage is about what I'd expect, but memory does not appear to be freed between operations, so usage slowly grows to about 3-4x over many iterations. I think it would be useful if there were a way to prevent this kind of growth. If there isn't, that's okay, but it should probably be noted somewhere for users.

I think this is the effect I was observing. The way I had this set up before, I was finding and fetching relevant data files (thousands of them, some over a GB in size). These were gzipped files that would be stream-gunzipped and transformed into dataframes, some preliminary operations were performed on them with Polars, and chunks were finally emitted as Parquet files.

In the meantime, I have removed Polars from this pipeline because the memory usage is prohibitive: my containers get killed by the orchestrator after some amount of time. There seems to be some condition that prevents memory from being freed; in some cases I observed memory usage growing to over 68GB. It was hard to isolate the specific cause because memory just grew in an unbounded way with real data, and I couldn't recreate the behavior with synthetic data sets.
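For reference, a minimal sketch of the kind of pipeline described above. The file paths and JSON layout are assumptions; `pl.readJSON` on a buffer appears later in this thread, and `writeParquet` is assumed to be available on the DataFrame.

```js
// Hypothetical sketch of the described pipeline: stream-gunzip a file, buffer
// the JSON payload, build a DataFrame, and emit Parquet. Paths are placeholders.
const fs = require("fs");
const zlib = require("zlib");
const pl = require("nodejs-polars");

async function processFile(gzPath, parquetPath) {
  const chunks = [];
  const gunzip = fs.createReadStream(gzPath).pipe(zlib.createGunzip());
  for await (const chunk of gunzip) chunks.push(chunk);

  const df = pl.readJSON(Buffer.concat(chunks));
  // ...preliminary transformations would go here...
  df.writeParquet(parquetPath);
}

processFile("data/input.json.gz", "out/part-0.parquet").catch(console.error);
```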
So there is not much we can do about JS <--> Rust overhead in hot loops, which is the case when instantiating dataframes. The optimal solution would be to use Apache Arrow to create the values in pure JS, then pass the buffer off to Polars for processing. This is notably faster than going through the node-api (napi). However, we currently can't interop with arrow-js because they don't yet support LargeUtf8. In the meantime, one workaround is to serialize the data to JSON and read it through a buffer:

```js
const data = [{}, {}, {}]
const json_data = JSON.stringify(data)
const buff = Buffer.from(json_data)
const df = pl.readJSON(buff)
```

This definitely seems like a potential memory leak, or the GC not triggering properly. I'll look into this further.
As a note on the memory issue I had specifically: this was strictly an increase in

I really appreciate the info and feedback. We can close this if you like, or you can keep it open if you want to use it for tracking down the mysterious memory leak (😢). I will move forward with other options for the moment and will take your suggestions under advisement. Again, I really appreciate the insight and feedback. 👍 Thanks a ton~
Looks like the Node.js team has been looking into creating some fast paths for initializing objects via node-api.
Concerning memory for buffers not getting freed ("observed the memory usage growing to over 68GB"), the reason may be that the "finalize" callbacks which free the memory are scheduled to run at a later tick rather than immediately. If that is the problem, waiting a tick after each turn of the loop would allow the finalize callbacks to fire, and memory to be freed.

I hit the same problem in a different context: napi-rs/napi-rs#1171 (comment)

For discussion of this problem, and the reasons behind it, see the conversation from this comment onwards: nodejs/node-addon-api#917 (comment)

Hope that's helpful.
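A minimal sketch of the "wait a tick" workaround described above; the per-iteration work is a hypothetical placeholder, and the point is the `setImmediate` yield that gives deferred finalizers a chance to run.

```js
// Sketch: yield to the event loop after each iteration so deferred finalize
// callbacks can run and native memory can be released. processChunk is a
// hypothetical placeholder for the polars work done per iteration.
async function run(chunks, processChunk) {
  for (const chunk of chunks) {
    processChunk(chunk);
    await new Promise((resolve) => setImmediate(resolve)); // wait a tick
  }
}
```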
FYI, Arrow added LargeUtf8 in version 15. apache/arrow#35780
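Given that, a sketch of the arrow-js route described earlier might look like the following. It assumes `tableFromArrays`/`tableToIPC` from the apache-arrow package and that `pl.readIPC` accepts an IPC buffer, so treat it as an illustration of the idea rather than a confirmed code path.

```js
// Hypothetical sketch: build an Arrow table in pure JS, serialize it to the
// Arrow IPC format, and hand the buffer to polars instead of converting JS
// values element by element through node-api.
const { tableFromArrays, tableToIPC } = require("apache-arrow");
const pl = require("nodejs-polars");

const table = tableFromArrays({
  id: Int32Array.from({ length: 1000 }, (_, i) => i),
  name: Array.from({ length: 1000 }, (_, i) => `row-${i}`),
});

const ipcBuffer = Buffer.from(tableToIPC(table)); // Uint8Array -> Buffer
const df = pl.readIPC(ipcBuffer);                 // assumed to accept an IPC buffer
console.log(df.shape);
```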
Have you tried the latest version of polars?
What version of polars are you using?
0.7.3
What operating system are you using polars on?
macOS 13.0
What Node version are you using?
node 18.6.0
Describe your bug.
Memory usage seems unreasonably large, and performance is slow, when passing data to `nodejs-polars`. With the example code below, it takes about 20 seconds and balloons memory usage to around 3.5GB. When doing a simple copy of the data in JS only, the operation takes 10-20ms and peak memory usage is around 325MB.
What are the steps to reproduce the behavior?
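The original reproduction snippet did not survive extraction; below is a reconstructed sketch based on the data shape quoted in the first reply (50,000 x 100 integers), with the column names and the `pl.DataFrame({ name: values })` call as illustrative assumptions.

```js
// Reconstructed sketch (not the original snippet): roughly 50,000 x 100 plain
// JS numbers handed to nodejs-polars as columns. Column names are assumptions.
const pl = require("nodejs-polars");

const nRows = 50000;
const nCols = 100;
const columns = {};
for (let c = 0; c < nCols; c++) {
  columns["col_" + c] = Array.from({ length: nRows }, (_, r) => r * nCols + c);
}

console.time("polars");
const df = pl.DataFrame(columns);
console.timeEnd("polars");
console.log(df.shape);
```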
Example JS replacement for comparison:
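The comparison snippet is likewise missing; presumably it was something like a plain copy of the same nested arrays, e.g.:

```js
// Reconstructed sketch (not the original snippet): copy the same data in pure
// JS to establish the 10-20ms / ~325MB baseline mentioned above.
const nRows = 50000;
const nCols = 100;
const data = Array.from({ length: nRows }, (_, r) =>
  Array.from({ length: nCols }, (_, c) => r * nCols + c)
);

console.time("js-copy");
const copy = data.map((row) => row.slice());
console.timeEnd("js-copy");
console.log(copy.length, copy[0].length);
```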
What is the actual behavior?
Code runs as expected; it just uses what seems to be an unreasonable amount of memory.
What is the expected behavior?
Memory usage should be somewhat comparable to the data's footprint when not using the library, maybe around 2x.