
Fix memory issue in multiprocessing: Don't pickle table index #2264

Merged
merged 2 commits into master on Apr 26, 2021

Conversation

lhoestq (Member) commented Apr 26, 2021

The table index is currently being pickled when doing multiprocessing, which brings all the record batches of the dataset in memory.

I fixed that by not pickling the index attributes. Therefore each process has to rebuild the index when unpickling the table.
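
Conceptually, the change looks like the sketch below: a simplified, hypothetical table wrapper (not the actual datasets classes) whose pickled state leaves the index attributes out, so __setstate__ has to rebuild them.

import pyarrow as pa

class IndexedTable:
    # Simplified, hypothetical sketch: a wrapper that keeps an index over
    # the table's record batches.

    def __init__(self, table: pa.Table):
        self.table = table
        # Index attributes derived from the table: cheap to rebuild,
        # but pickling them would serialize every record batch.
        self._batches = table.to_batches()
        self._offsets = [0]
        for batch in self._batches:
            self._offsets.append(self._offsets[-1] + batch.num_rows)

    def __getstate__(self):
        # Leave the index attributes out of the pickled state.
        return {"table": self.table}

    def __setstate__(self, state):
        # Each process rebuilds the index when unpickling the table.
        self.__init__(state["table"])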

Fix issue #2256

We'll do a patch release ASAP!

lhoestq (Member Author) commented Apr 26, 2021

The code quality check is going to be fixed by #2265

albertvillanova (Member) left a comment

I guess the source of the problem was state.copy() and self.__dict__.copy()...

I am just wondering if we are missing some items in the state dict... Or is it enough with just path, table, replays and blocks?

We should also implement regression tests (either in this PR or in another).

     with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp_file:
         filename = tmp_file.name
     logger.debug(
         f"Attempting to pickle a table bigger than 4GiB. Writing it on the disk instead at {filename}"
     )
     _write_table_to_file(table=table, filename=filename)
-    state["path"] = filename
-    return state
+    return {"path": filename}
Member:

Are there other items (previously in state) that we are missing now? Or is path alone enough?

Member Author:

We only need "path" in this case:

  • Yes, there were other items: _batches, _schema and _offsets. However we don't want to pickle _batches because it causes memory issues. Therefore all these attributes are reloaded when Table.__init__ is called.

  • Then the only thing left to pickle is table. But if it's bigger than 4GiB we write it to disk instead, since pickle doesn't support pickling objects bigger than 4GiB by default. In this case we just pickle the path to the table on disk (see the sketch below).
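
Put together, the logic described in these two points looks roughly like the following sketch. It is a simplified, hypothetical class rather than the actual datasets implementation; the temporary-file pattern and the 4GiB threshold come from the snippet quoted above, and _write_table_to_file is sketched here as a plain Arrow IPC writer.

import tempfile
import pyarrow as pa

IN_MEMORY_PICKLE_LIMIT = 4 * 1024**3  # pickle can't handle objects > 4GiB by default

def _write_table_to_file(table: pa.Table, filename: str) -> None:
    # Hypothetical helper: dump the table to an Arrow IPC stream file on disk.
    with pa.OSFile(filename, "wb") as sink:
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)

class PicklableTable:
    def __init__(self, table: pa.Table):
        self.table = table

    def __getstate__(self):
        if self.table.nbytes > IN_MEMORY_PICKLE_LIMIT:
            # Too big to pickle directly: write it to disk and pickle only the path.
            with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp_file:
                filename = tmp_file.name
            _write_table_to_file(table=self.table, filename=filename)
            return {"path": filename}
        return {"table": self.table}

    def __setstate__(self, state):
        if "path" in state:
            # Reload the table from disk, memory-mapped so the data stays on disk.
            source = pa.memory_map(state["path"])
            table = pa.ipc.open_stream(source).read_all()
        else:
            table = state["table"]
        self.__init__(table)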

lhoestq (Member Author) commented Apr 26, 2021

The memory issue didn't come from self.__dict__.copy() itself but from the fact that this dict contains _batches, which holds all the batches of the table.
Therefore, for a MemoryMappedTable, all the data in _batches was copied into memory when pickling, and that is the issue.
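
In other words, the problematic pattern was roughly the following (a simplified sketch of the behaviour described above, not the exact previous code):

def __getstate__(self):
    state = self.__dict__.copy()  # still contains self._batches
    # Pickling this state serializes every record batch, so for a
    # MemoryMappedTable the memory-mapped data is copied into RAM.
    return state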

lhoestq (Member Author) commented Apr 26, 2021

I'm still investigating why we didn't catch this issue in the tests.
This test should have caught it but didn't:

def test_memory_mapped_table_pickle_doesnt_fill_memory(arrow_file):
    with assert_arrow_memory_doesnt_increase():
        table = MemoryMappedTable.from_file(arrow_file)
    assert_pickle_without_bringing_data_in_memory(table)
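
As a purely hypothetical illustration (not a helper from the datasets test suite), a check based on the process resident set size instead of the Arrow memory pool could also flag data being materialized during pickling, assuming psutil is available:

import pickle
import psutil

def assert_pickle_doesnt_increase_rss(obj, tolerance_bytes=50 * 1024**2):
    # Hypothetical check: compare the process RSS before and after pickling.
    # Pickling a memory-mapped table should not grow it by more than a small tolerance.
    process = psutil.Process()
    rss_before = process.memory_info().rss
    pickle.dumps(obj)
    rss_after = process.memory_info().rss
    assert rss_after - rss_before < tolerance_bytes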

lhoestq (Member Author) commented Apr 26, 2021

I'll focus on the patch release and fix the test in another PR after the release

lhoestq merged commit 22c5928 into master on Apr 26, 2021
lhoestq deleted the fix-multiprocessing-memory-issue branch on April 26, 2021 10:08
albertvillanova (Member):
Yes, I think it is better that way...
