
Fix memory issue in multiprocessing: Don't pickle table index #2264

Merged
merged 2 commits into master on Apr 26, 2021

Conversation

lhoestq (Member) commented Apr 26, 2021

The table index is currently being pickled when doing multiprocessing, which brings all the record batches of the dataset in memory.

I fixed that by not pickling the index attributes. Therefore each process has to rebuild the index when unpickling the table.
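
Conceptually, the change looks like the sketch below: a simplified, hypothetical table wrapper (not the actual datasets classes) whose pickled state leaves the index attributes out, so __setstate__ has to rebuild them.

import pyarrow as pa

class IndexedTable:
    # Simplified, hypothetical sketch: a wrapper that keeps an index over
    # the table's record batches.

    def __init__(self, table: pa.Table):
        self.table = table
        # Index attributes derived from the table: cheap to rebuild,
        # but pickling them would serialize every record batch.
        self._batches = table.to_batches()
        self._offsets = [0]
        for batch in self._batches:
            self._offsets.append(self._offsets[-1] + batch.num_rows)

    def __getstate__(self):
        # Leave the index attributes out of the pickled state.
        return {"table": self.table}

    def __setstate__(self, state):
        # Each process rebuilds the index when unpickling the table.
        self.__init__(state["table"])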

Fix issue #2256

We'll do a patch release ASAP!

lhoestq (Member Author) commented Apr 26, 2021

The code quality check is going to be fixed by #2265

albertvillanova (Member) left a comment

I guess the source of the problem was state.copy() and self.__dict__.copy()...

I am just wondering if we are missing some items in the state dict... Or is it enough with just path, table, replays and blocks?

We should also implement regression tests (either in this PR or in another).

     with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp_file:
         filename = tmp_file.name
     logger.debug(
         f"Attempting to pickle a table bigger than 4GiB. Writing it on the disk instead at {filename}"
     )
     _write_table_to_file(table=table, filename=filename)
-    state["path"] = filename
-    return state
+    return {"path": filename}
Member:

Are there other items (previously in state) that we are missing now? Or is path alone enough?

Member Author:

We only need "path" in this case:

  • Yes, there were other items: _batches, _schema and _offsets. However we don't want to pickle _batches because it causes memory issues. Therefore all these attributes are reloaded when Table.__init__ is called.

  • Then the only thing left to pickle is table. But if it's bigger than 4GiB we write it to disk instead, since pickle doesn't support pickling objects bigger than 4GiB by default. In this case we just pickle the path to the table on disk (see the sketch below).
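
Put together, the logic described in these two points looks roughly like the following sketch. It is a simplified, hypothetical class rather than the actual datasets implementation; the temporary-file pattern and the 4GiB threshold come from the snippet quoted above, and _write_table_to_file is sketched here as a plain Arrow IPC writer.

import tempfile
import pyarrow as pa

IN_MEMORY_PICKLE_LIMIT = 4 * 1024**3  # pickle can't handle objects > 4GiB by default

def _write_table_to_file(table: pa.Table, filename: str) -> None:
    # Hypothetical helper: dump the table to an Arrow IPC stream file on disk.
    with pa.OSFile(filename, "wb") as sink:
        with pa.ipc.new_stream(sink, table.schema) as writer:
            writer.write_table(table)

class PicklableTable:
    def __init__(self, table: pa.Table):
        self.table = table

    def __getstate__(self):
        if self.table.nbytes > IN_MEMORY_PICKLE_LIMIT:
            # Too big to pickle directly: write it to disk and pickle only the path.
            with tempfile.NamedTemporaryFile("wb", delete=False, suffix=".arrow") as tmp_file:
                filename = tmp_file.name
            _write_table_to_file(table=self.table, filename=filename)
            return {"path": filename}
        return {"table": self.table}

    def __setstate__(self, state):
        if "path" in state:
            # Reload the table from disk, memory-mapped so the data stays on disk.
            source = pa.memory_map(state["path"])
            table = pa.ipc.open_stream(source).read_all()
        else:
            table = state["table"]
        self.__init__(table)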

lhoestq (Member Author) commented Apr 26, 2021

The memory issue didn't come from self.__dict__.copy() itself but from the fact that this dict contains _batches, which holds all the batches of the table.
Therefore, for a MemoryMappedTable, all the data in _batches was copied into memory when pickling, and that is the issue.
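
In other words, the problematic pattern was roughly the following (a simplified sketch of the behaviour described above, not the exact previous code):

def __getstate__(self):
    state = self.__dict__.copy()  # still contains self._batches
    # Pickling this state serializes every record batch, so for a
    # MemoryMappedTable the memory-mapped data is copied into RAM.
    return state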

lhoestq (Member Author) commented Apr 26, 2021

I'm still investigating why we didn't catch this issue in the tests.
This test should have caught it but didn't:

def test_memory_mapped_table_pickle_doesnt_fill_memory(arrow_file):
    with assert_arrow_memory_doesnt_increase():
        table = MemoryMappedTable.from_file(arrow_file)
    assert_pickle_without_bringing_data_in_memory(table)
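
As a purely hypothetical illustration (not a helper from the datasets test suite), a check based on the process resident set size instead of the Arrow memory pool could also flag data being materialized during pickling, assuming psutil is available:

import pickle
import psutil

def assert_pickle_doesnt_increase_rss(obj, tolerance_bytes=50 * 1024**2):
    # Hypothetical check: compare the process RSS before and after pickling.
    # Pickling a memory-mapped table should not grow it by more than a small tolerance.
    process = psutil.Process()
    rss_before = process.memory_info().rss
    pickle.dumps(obj)
    rss_after = process.memory_info().rss
    assert rss_after - rss_before < tolerance_bytes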

lhoestq (Member Author) commented Apr 26, 2021

I'll focus on the patch release and fix the test in another PR after the release

lhoestq merged commit 22c5928 into master on Apr 26, 2021
lhoestq deleted the fix-multiprocessing-memory-issue branch on April 26, 2021 10:08
albertvillanova (Member):
Yes, I think it is better that way...
