Issue with too much memory usage when running in parallel with mx.distributed #1220
sck-at-ucy started this conversation in General
I am trying to refactor my physics-informed transformer model to run in parallel. The idea is to create the various dataset arrays on rank 0, create empty (zero) counterpart arrays of the same shape on ranks > 0, and then use mx.distributed.all_sum so that every rank ends up with a copy of the dataset. I then slice the dataset arrays to create local copies for each rank. Once I no longer need the original complete dataset, I delete it, which I thought would free the memory, but apparently it does not: the code uses far more memory than running the entire dataset on a single node.
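Roughly, the pattern looks like this (a simplified sketch, not my actual code; `load_full_dataset`, `full_shape`, and the even split along the first axis are placeholders):

```python
import mlx.core as mx

world = mx.distributed.init()
rank, size = world.rank(), world.size()

# Rank 0 builds the real dataset; every other rank allocates zeros of the
# same shape and dtype, so all_sum effectively broadcasts rank 0's data.
if rank == 0:
    full_data = load_full_dataset()     # placeholder for the actual loading code
else:
    full_data = mx.zeros(full_shape)    # placeholder shape/dtype matching rank 0

full_data = mx.distributed.all_sum(full_data)
mx.eval(full_data)

# Each rank keeps only its own slice of the broadcast dataset.
chunk = full_data.shape[0] // size
local_data = full_data[rank * chunk : (rank + 1) * chunk]
mx.eval(local_data)

# Drop the reference to the complete dataset, expecting its memory to be
# released. This is the step that does not seem to help.
del full_data
```

One thing I am unsure about is whether the slices end up as views into the full arrays, keeping the full buffers alive even after the `del`; if so, perhaps an explicit copy of each local slice would be needed before deleting the original.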
What am I doing wrong? Is there a better memory-management approach when using mx.distributed, given the absence of a distributed broadcast primitive?
I would be thankful for any advice from @awni @angeloskath :)