
ToDo: Serialize pickle byte code in numpy array, reduce multiprocessing memory usage #55

Open
boeddeker opened this issue Jan 13, 2023 · 0 comments


boeddeker commented Jan 13, 2023

Reference [1] demonstrates how to address the memory-consumption issue that arises when multiprocessing is used. Although we don't use multiprocessing (it's implemented, but threads are usually faster for audio data and avoid the fork issues mentioned in [1]), the idea can be integrated into our dataset implementation and yields a small memory-consumption improvement for large datasets. We already have a pickle-based serialization, so there will be no additional overhead.

Code from [1]:

import pickle
from typing import Any, List

import numpy as np


class NumpySerializedList:
    def __init__(self, lst: List[Any]):
        # Pickle each element and store the bytes in one flat uint8 array,
        # so the list is backed by a single buffer instead of many objects.
        lst = [np.frombuffer(pickle.dumps(x), dtype=np.uint8) for x in lst]
        self._addr = np.cumsum([len(x) for x in lst])
        self._lst = np.concatenate(lst)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx: int):
        # Locate the byte range of element idx and unpickle it on demand.
        start = 0 if idx == 0 else self._addr[idx - 1]
        end = self._addr[idx]
        return pickle.loads(memoryview(self._lst[start:end]))

[1] https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader
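A minimal usage sketch of the idea (the example records are hypothetical; the class is repeated here so the snippet runs standalone):

```python
import pickle
from typing import Any, List

import numpy as np


class NumpySerializedList:
    """Store a list as one pickled byte buffer, unpickling items on access."""

    def __init__(self, lst: List[Any]):
        lst = [np.frombuffer(pickle.dumps(x), dtype=np.uint8) for x in lst]
        self._addr = np.cumsum([len(x) for x in lst])
        self._lst = np.concatenate(lst)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx: int):
        start = 0 if idx == 0 else self._addr[idx - 1]
        end = self._addr[idx]
        return pickle.loads(memoryview(self._lst[start:end]))


# Hypothetical dataset examples (dict-per-example, as is common for audio datasets)
examples = [{"example_id": i, "audio_path": f"/data/{i}.wav"} for i in range(3)]
serialized = NumpySerializedList(examples)

assert len(serialized) == 3
assert serialized[1] == {"example_id": 1, "audio_path": "/data/1.wav"}
```

Items round-trip through pickle transparently, while the backing store is a single numpy array, so forked workers share one buffer via copy-on-write instead of touching per-object refcounts.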

@boeddeker boeddeker changed the title ToDo: ToDo: Serialize pickle byte code in numpy array, reduce multiprocessing memory usage Jan 13, 2023