Fix memory issue in multiprocessing: Don't pickle table index #2264
Conversation
The code quality check is going to be fixed by #2265
I guess the source of the problem was `state.copy()` and `self.__dict__.copy()`... I am just wondering if we are missing some items in the state dict, or if it is enough with just `path`, `table`, `replays` and `blocks`...
We should also implement regression tests (either in this PR or in another).
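For context, here is a minimal sketch of the pickling pattern under discussion: a `__getstate__` built from `self.__dict__.copy()` ships every attribute, so heavy derived attributes have to be dropped explicitly and rebuilt after unpickling. The class and attribute handling below are illustrative only, not the actual `datasets.table.Table` code:

```python
import pyarrow as pa


class IndexedTable:
    """Illustrative wrapper only, not the library's Table class."""

    def __init__(self, table: pa.Table):
        self.table = table
        # Derived "index" attributes; cheap to rebuild from the table locally.
        self._batches = table.to_batches()
        self._schema = table.schema

    def __getstate__(self):
        # __dict__.copy() alone would also pickle _batches and _schema,
        # duplicating the record batch data in every pickled copy.
        state = self.__dict__.copy()
        for attr in ("_batches", "_schema"):
            state.pop(attr, None)
        return state

    def __setstate__(self, state):
        # Re-run __init__ so the dropped attributes are rebuilt after unpickling.
        self.__init__(state["table"])
```

With this pattern, the question above boils down to which keys `__getstate__` actually needs to keep.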
```diff
 with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp_file:
     filename = tmp_file.name
     logger.debug(
         f"Attempting to pickle a table bigger than 4GiB. Writing it on the disk instead at {filename}"
     )
     _write_table_to_file(table=table, filename=filename)
-    state["path"] = filename
-    return state
+    return {"path": filename}
```
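For the unpickling side (not shown in this hunk), the `{"path": filename}` state presumably gets turned back into a table by reading that file, ideally through a memory map so the data stays on disk. A rough sketch, assuming the file was written in Arrow stream format (the file/IPC format would use `pa.ipc.open_file` instead):

```python
import pyarrow as pa


def table_from_arrow_file(path: str) -> pa.Table:
    # Memory-map the file so reloading does not eagerly copy the data into RAM.
    memory_mapped_stream = pa.memory_map(path)
    return pa.ipc.open_stream(memory_mapped_stream).read_all()
```

In `__setstate__`, the path alone is then enough to rebuild the table and, from it, the index attributes.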
Are there other items (previously in state) that we are missing now? Or is just `path` enough?
We only need "path" in this case:
- There were other items, yes: `_batches`, `_schema` and `_offsets`. However we don't want to pickle `_batches` or it causes memory issues. Therefore all these attributes are reloaded when `Table.__init__` is called.
- Then the only thing left to pickle is `table`. But if it's bigger than 4GB we write it on disk, since pickle doesn't support pickling objects bigger than 4GB by default. In this case we just pickle the path to the table on disk (a rough sketch of that logic follows below).
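That size-dependent behaviour could be sketched roughly like this; the threshold constant and helper name are illustrative, not the library's exact code:

```python
import tempfile

import pyarrow as pa

MAX_NBYTES_FOR_PICKLING = 4 << 30  # ~4 GiB, illustrative threshold


def state_for_pickle(table: pa.Table) -> dict:
    # Small enough: let the table itself travel through pickle.
    if table.nbytes < MAX_NBYTES_FOR_PICKLING:
        return {"table": table}
    # Too big for pickle by default: write it to disk and only pickle the path.
    with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp_file:
        filename = tmp_file.name
    with pa.OSFile(filename, "wb") as sink, pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    return {"path": filename}
```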
The memory issue didn't come from
I'm still investigating why we didn't catch this issue in the tests (see lines 350 to 353 in 3db67f5).
I'll focus on the patch release and fix the test in another PR after the release.
Yes, I think it is better that way...
The table index is currently being pickled when doing multiprocessing, which brings all the record batches of the dataset in memory.
I fixed that by not pickling the index attributes. Therefore each process has to rebuild the index when unpickling the table.
Fix issue #2256
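To make the effect concrete, here is a toy measurement (not the library's code) of how much extra data ends up in the pickle payload when the record batches are included in the state; with a memory-mapped dataset the same batches would additionally be read from disk into memory. The payload is what each worker process has to receive and deserialize:

```python
import pickle

import pyarrow as pa

table = pa.table({"ids": list(range(500_000))})

# State that still carries the record batches vs. state without them.
with_index = {"table": table, "_batches": table.to_batches()}
without_index = {"table": table}

print(len(pickle.dumps(with_index)))     # roughly the table data twice
print(len(pickle.dumps(without_index)))  # the table data once
```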
We'll do a patch release ASAP!