
ToDo: Serialize pickle byte code in numpy array, reduce multiprocessing memory usage #55

Open
boeddeker opened this issue Jan 13, 2023 · 0 comments


boeddeker commented Jan 13, 2023

Reference [1] demonstrates how to address the memory-consumption issue that arises when multiprocessing is used. Although we don't use multiprocessing (it's implemented, but threads are usually faster for audio data and avoid the fork issues mentioned in [1]), the idea can be integrated into our dataset implementation and yields a small memory-consumption improvement for large datasets. We already have a pickle-based serialization, so there will be no additional overhead.

Code from [1]:

import pickle
from typing import Any, List

import numpy as np


class NumpySerializedList:
    def __init__(self, lst: List[Any]):
        # Pickle each element and store the bytes in one flat uint8 array,
        # so the list is backed by a single buffer instead of many objects.
        lst = [np.frombuffer(pickle.dumps(x), dtype=np.uint8) for x in lst]
        self._addr = np.cumsum([len(x) for x in lst])
        self._lst = np.concatenate(lst)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx: int):
        # Locate the byte range of element idx and unpickle it on demand.
        start = 0 if idx == 0 else self._addr[idx - 1]
        end = self._addr[idx]
        return pickle.loads(memoryview(self._lst[start:end]))

[1] https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader
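A minimal usage sketch of the idea (the example records are hypothetical; the class is repeated here so the snippet runs standalone):

```python
import pickle
from typing import Any, List

import numpy as np


class NumpySerializedList:
    """Store a list as one pickled byte buffer, unpickling items on access."""

    def __init__(self, lst: List[Any]):
        lst = [np.frombuffer(pickle.dumps(x), dtype=np.uint8) for x in lst]
        self._addr = np.cumsum([len(x) for x in lst])
        self._lst = np.concatenate(lst)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx: int):
        start = 0 if idx == 0 else self._addr[idx - 1]
        end = self._addr[idx]
        return pickle.loads(memoryview(self._lst[start:end]))


# Hypothetical dataset examples (dict-per-example, as is common for audio datasets)
examples = [{"example_id": i, "audio_path": f"/data/{i}.wav"} for i in range(3)]
serialized = NumpySerializedList(examples)

assert len(serialized) == 3
assert serialized[1] == {"example_id": 1, "audio_path": "/data/1.wav"}
```

Items round-trip through pickle transparently, while the backing store is a single numpy array, so forked workers share one buffer via copy-on-write instead of touching per-object refcounts.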

@boeddeker boeddeker changed the title ToDo: ToDo: Serialize pickle byte code in numpy array, reduce multiprocessing memory usage Jan 13, 2023