AccelerateMixin error with RandomizedSearchCV #944

Closed
Raphaaal opened this issue Mar 28, 2023 · 21 comments · Fixed by #947
Comments

@Raphaaal

Raphaaal commented Mar 28, 2023

Hi,

Thanks a lot for the great tool!

I tried the recently added HuggingFace Accelerate integration. I want to perform hyperparameter optimization using skorch with Accelerate and scikit-learn's RandomizedSearchCV.

However, the two do not seem to play nicely together when RandomizedSearchCV scores the model.

Reproducible example, named skorch_accelerate_issue.py:

import torch
import numpy as np
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from accelerate import Accelerator 
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
import torch.nn as nn

# FYI: Accelerate also requires the `transformers` packages from HuggingFace

# Generate data
X, y = make_classification(10_000, 100, n_informative=5, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)

# PyTorch module
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.ReLU()

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X

# Skorch wrapper
class AcceleratedNeuralNetClassifier(
    AccelerateMixin, 
    NeuralNetClassifier
):
    """NeuralNetClassifier with HuggingFace Accelerate support"""

accelerator = Accelerator()
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)

# HPO
rs = RandomizedSearchCV(
    estimator=model,
    param_distributions={
        "lr": [0.0001, 0.001, 0.01, 0.1],
        "batch_size": [10, 20, 30, 40],
    },
    n_iter=10,
    scoring="average_precision",
    n_jobs=1,
    refit=False,
    cv=2,
    verbose=1,
)
rs.fit(X, y)

Accelerate config to run this script on 2 GPUs on the same machine:

(/home/razorin/conda_envs/backup) razorin@ML:~/tmp$ accelerate config                                                                         

In which compute environment are you running?
This machine

Which type of machine are you using?
multi-GPU

How many different machines will you use (use more than 1 for multi-node training)? [1]: 
1 

Do you wish to optimize your script with torch dynamo?[yes/NO]:NO
Do you want to use DeepSpeed? [yes/NO]: NO
Do you want to use FullyShardedDataParallel? [yes/NO]: NO
Do you want to use Megatron-LM ? [yes/NO]: NO 

How many GPU(s) should be used for distributed training? [1]:2 

What GPU(s) (by id) should be used for training on this machine as a comma-separated list? [all]:0,5 

Do you wish to use FP16 or BF16 (mixed precision)? no 

accelerate configuration saved at /home/razorin/.cache/huggingface/accelerate/default_config.yaml

I ran the code using: accelerate launch skorch_accelerate_issue.py

And here is the error:

The following values were not passed to `accelerate launch` and had defaults used instead: 
--dynamo_backend was set to a value of no 
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`. 

Fitting 2 folds for each of 10 candidates, totalling 20 fits 
Fitting 2 folds for each of 10 candidates, totalling 20 fits 
epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------ 
1        5.4880       0.6060        4.5132  0.3304 
2        4.0846       0.6520        3.3801  0.2789 

/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:778: UserWarning: Scoring failed. The score on this train-test partition for these parameters will be set to nan. Details:                                                                                                                           

Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
   scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
   return self._score( 
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score                                                            
   return self._sign * self._score_func(y, y_pred, **self._kwargs) 
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score 
   return _average_binary_score( 
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score                                                
   return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
   precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve                                           
   fps, tps, thresholds = _binary_clf_curve( 
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve                                                
   check_consistent_length(y_true, y_score, sample_weight) 
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length                                          
  raise ValueError( 

ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]

FYI, when training starts, I can see that the two GPUs are indeed occupied. Also, when I get rid of the RandomizedSearchCV and just perform model.fit(X, y), training occurs as expected on 2 GPUs.
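
The two numbers in the ValueError may be informative on their own. With 10,000 samples and cv=2, each test fold holds 5,000 samples, and 2,500 is exactly that fold divided by the 2 configured processes. The sketch below only reproduces this arithmetic. That each process returns just its own share of the predictions is an assumption made for illustration, not something the trace confirms, and `expected_lengths` is a hypothetical helper.

```python
def expected_lengths(n_samples, n_folds, n_processes):
    """Return (len(y_true), len(y_pred)) for one test fold.

    ASSUMPTION: predictions are split evenly across processes and are
    not gathered before scoring. This is a guess, not a confirmed cause.
    """
    test_fold_size = n_samples // n_folds        # what the scorer receives as y_true
    per_process = test_fold_size // n_processes  # what one process would predict
    return test_fold_size, per_process


# Values taken from the example: 10_000 samples, cv=2, 2 GPUs
print(expected_lengths(10_000, 2, 2))  # (5000, 2500)
```

If the assumption holds, the same script launched on a single process would produce matching lengths.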

Many thanks in advance for your help.

@BenjaminBossan
Collaborator

Thanks for this great summary and code example. I don't have a multi-GPU setup to test this on, and with a single GPU I couldn't reproduce the error. But I suspect that it could be due to caching. Could you please check whether turning off caching solves the issue for you? To do that, initialize your net like this:

model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    callbacks__valid_acc__use_caching=False,  # <= added line
)

Of course, if you have more scoring callbacks than the default ones, turn off caching for those too.
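
For a net with several scoring callbacks, the keyword arguments can be built in a loop. This is a minimal sketch: `disable_caching_kwargs` is a hypothetical helper and `valid_f1` is a made-up callback name, since only `valid_acc` appears in this thread. It relies on the same `callbacks__<name>__use_caching` pattern as the line added above.

```python
def disable_caching_kwargs(callback_names):
    """Build keyword arguments that turn off caching for each named scoring callback."""
    return {f"callbacks__{name}__use_caching": False for name in callback_names}


# "valid_f1" is a hypothetical extra scoring callback, used only for illustration
kwargs = disable_caching_kwargs(["valid_acc", "valid_f1"])
print(kwargs)
# {'callbacks__valid_acc__use_caching': False, 'callbacks__valid_f1__use_caching': False}
```

The resulting dict could then be unpacked into the constructor, for example AcceleratedNeuralNetClassifier(MyModule, accelerator=accelerator, **kwargs).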

If that doesn't help, please test disabling callbacks completely, using:

model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    callbacks="disable",
)

Please report your findings back.

> Thanks a lot for the great tool!

Always happy to hear that :)

@Raphaaal
Author

Thanks for the quick reply.

Unfortunately it did not solve the issue. However, the error trace changed.
Please find below the various traces (I increased the RandomizedSearchCV verbosity with verbose=3).

  • Turning off caching
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    callbacks__valid_acc__use_caching=False,  # <= added line
)

Error:

The following values were not passed to accelerate launch and had defaults used instead:
dynamo_backend was set to a value of no
To avoid this warning pass in values for each of the problematic parameters or run accelerate config
Fitting 2 folds for each of 10 candidates, totalling 20 
Fitting 2 folds for each of 10 candidates, totalling 20 
[CV 1/2] END .............batch_size=30, lr=0.001;, score=nan total time=   4.5s
[CV 2/2] END .............batch_size=30, lr=0.001;, score=nan total time=   0.0s

As you can see, the score is still NaN. At this point execution freezes and the RandomizedSearchCV fit does not terminate. Note that the second fold fit time is 0.0s.

  • Disabling callbacks
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    callbacks="disable",  # <= added line
)

Error:

The following values were not passed to accelerate launch and had defaults used instead
dynamo_backend was set to a value of 
To avoid this warning pass in values for each of the problematic parameters or run accelerate config
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits                                                                                                                

/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:778: UserWarning:
Scoring failed. The score on this train-test partition for these parameters will be set to nan. 
Details:

Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
   scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__                                                          
   return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score                                                            
   return self._sign * self._score_func(y, y_pred, **self._kwargs)                                                                                                           
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
   return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
   return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
   precision, recall, _ = precision_recall_curve(                                                                                                                            
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
   fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve                                                 
   check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
   raise ValueError(

ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
warnings.warn(
[CV 1/2] END ..............batch_size=20, lr=0.01;, score=nan total time=   5.8s
[CV 2/2] END ..............batch_size=20, lr=0.01;, score=nan total time=   0.0s

As you can see, the score is still NaN. At this point execution freezes and the RandomizedSearchCV fit does not terminate. Note that the second fold fit time is 0.0s.

  • No change (verbose=3)
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)

Error:

The following values were not passed to accelerate launch and had defaults used instead:
dynamo_backend was set to a value of no
To avoid this warning pass in values for each of the problematic parameters or run accelerate config.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
epoch    train_loss    valid_acc    valid_loss     dur
------------  -----------  ------------  ------
1       -0.4437       0.8000       -1.6170  0.1552
2       -2.1335       0.8180       -2.1231  0.1443
3       -0.2566       0.5800        2.0079  0.1521
4        1.5909       0.5840        1.9191  0.1305
5        1.5185       0.5880        1.7096  0.1742
6        1.3397       0.5960        1.5191  0.1502
7        1.1890       0.6080        1.3841  0.1607
8        1.0627       0.6080        1.3157  0.1336
9        0.9933       0.6080        1.2165  0.1481
10        0.9050       0.6120        1.1093  0.1277                                                                                                                    

/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py:778: UserWarning:
Scoring failed. The score on this train-test partition for these parameters will be set to nan. 
Details:

Traceback (most recent call last):
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
   scores = scorer(estimator, X_test, y_test)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__                                                          
   return self._score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score                                                            
   return self._sign * self._score_func(y, y_pred, **self._kwargs)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
   return _average_binary_score(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
   return binary_metric(y_true, y_score, sample_weight=sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
   precision, recall, _ = precision_recall_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
   fps, tps, thresholds = _binary_clf_curve(
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve                                                
   check_consistent_length(y_true, y_score, sample_weight)
File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
   raise ValueError(

ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]
warnings.warn(
[CV 1/2] END ...............batch_size=20, lr=0.1;, score=nan total time=   6.0s

As you can see, the score is NaN but only the first fold completed. At this point execution freezes and the RandomizedSearchCV fit does not terminate.

Thanks a lot in advance for your guidance

@BenjaminBossan
Collaborator

Hmm, this does not look good.

Whether the search fails early or runs for a while and fails later is probably unrelated to the specific settings you posted; it is more likely caused by some combination of randomly drawn hyper-parameters. Since RandomizedSearchCV is not seeded, that combination sometimes occurs earlier and sometimes later. The nan score could be because the output non-linearity is not correct for a classification task (ReLU should be softmax). Still, we shouldn't see that ValueError.
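
To illustrate the seeding point, the sketch below draws 10 of the 16 lr/batch_size combinations with Python's standard `random` module. It is not scikit-learn's sampler, and `sample_candidates` is a hypothetical helper. It only shows that a fixed seed yields the same candidate order on every run, whereas an unseeded draw may differ between runs.

```python
import itertools
import random

PARAM_GRID = {
    "lr": [0.0001, 0.001, 0.01, 0.1],
    "batch_size": [10, 20, 30, 40],
}


def sample_candidates(grid, n_iter, seed=None):
    """Draw n_iter distinct parameter combinations; seed=None means unseeded."""
    rng = random.Random(seed)
    keys = sorted(grid)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[key] for key in keys))
    ]
    return rng.sample(combos, n_iter)


seeded_a = sample_candidates(PARAM_GRID, 10, seed=42)
seeded_b = sample_candidates(PARAM_GRID, 10, seed=42)
print(seeded_a == seeded_b)  # True: same seed, same order

# Without a seed, the order of candidates can change from run to run
unseeded = sample_candidates(PARAM_GRID, 10)
print(unseeded[0])
```

In RandomizedSearchCV itself, the corresponding setting is the random_state parameter.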

I'm sorry that I have to ask you to try a few more things, but as mentioned I cannot replicate this locally:

  • Could you please turn off the skorch internal validation? Pass train_split=False to the net to do so.
  • Could you please check if this is related to a specific batch size? To test that, don't use RandomizedSearchCV but just fit the net directly, initializing it with the different batch sizes you tested. Is it possible to trigger the error consistently with a specific batch size?

@Raphaaal
Author

Thanks again for the reply.

I implemented your suggestions (seeding for reproducibility and a softmax output). I also changed error_score from its default of nan to error_score="raise" in RandomizedSearchCV, because I suspected that the nan scores came from the scoring error.

I also fitted the net directly, without RandomizedSearchCV, using each of the four batch_size values from the example. This worked without any problem under Accelerate.

Full code:

import torch
import numpy as np
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from accelerate import Accelerator 
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
import torch.nn as nn
import random
# FYI: Accelerate also requires the `transformers` packages from HuggingFace


# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# Generate data
X, y = make_classification(10_000, 100, n_informative=5, random_state=0)
X = X.astype(np.float32)
y = y.astype(np.int64)

# PyTorch module
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.Softmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X

# Skorch wrapper
class AcceleratedNeuralNetClassifier(
    AccelerateMixin, 
    NeuralNetClassifier
):
    """NeuralNetClassifier with HuggingFace Accelerate support"""

accelerator = Accelerator()
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)

# HPO
rs = RandomizedSearchCV(
    estimator=model,
    param_distributions={
        "lr": [0.0001, 0.001, 0.01, 0.1],
        "batch_size": [10, 20, 30, 40],
    },
    n_iter=10,
    scoring="average_precision",
    n_jobs=1,
    refit=False,
    cv=2,
    verbose=3,
    random_state=SEED,
    error_score="raise"
)
rs.fit(X, y)

print(f"{rs.cv_results_}")

  • Using Accelerate

The trace now includes error messages from torch.distributed because the execution now finishes.

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.8021       0.5000        0.7604  0.2809
      2        0.7660       0.5240        0.7290  0.2804
      3        0.7338       0.5560        0.7011  0.2634
      4        0.7052       0.5820        0.6765  0.2743
      5        0.6798       0.6200        0.6546  0.2696
      6        0.6571       0.6400        0.6352  0.2517
      7        0.6369       0.6520        0.6178  0.2470
      8        0.6187       0.6660        0.6024  0.2482
      9        0.6024       0.6880        0.5885  0.2399
     10        0.5876       0.6920        0.5760  0.2459

Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
    rs.fit(X, y)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
    self._run_search(evaluate_candidates)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
    evaluate_candidates(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
    out = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]

Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
    rs.fit(X, y)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
    self._run_search(evaluate_candidates)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
    evaluate_candidates(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
    out = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 45560) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-28_16:23:17
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 45561)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-28_16:23:17
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 45560)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

  • Passing train_split=False

Same trace:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Fitting 2 folds for each of 10 candidates, totalling 20 fits
Fitting 2 folds for each of 10 candidates, totalling 20 fits
  epoch    train_loss     dur
-------  ------------  ------
      1        0.7932  0.3126
      2        0.7503  0.2607
      3        0.7133  0.2838
      4        0.6814  0.2720
      5        0.6537  0.2740
      6        0.6296  0.2714
      7        0.6086  0.2735
      8        0.5902  0.2781
      9        0.5740  0.2641
     10        0.5597  0.2656

Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
    rs.fit(X, y)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
    self._run_search(evaluate_candidates)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
    evaluate_candidates(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
    out = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]

Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 69, in <module>
    rs.fit(X, y)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 874, in fit
    self._run_search(evaluate_candidates)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 1768, in _run_search
    evaluate_candidates(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_search.py", line 821, in evaluate_candidates
    out = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 47975) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-28_16:25:23
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 47976)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-28_16:25:23
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 47975)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
  • Turning off caching
    Same trace

  • Disabling callbacks
    Same trace

  • All-at-once (callbacks disabled + caching turned off + train split disabled)
    Same trace

Thanks a lot in advance for your time.

@BenjaminBossan
Collaborator

BenjaminBossan commented Mar 29, 2023

Thanks for your detailed experiments. IIUC, all the conditions work, except for using accelerate together with RandomizedSearchCV (I assume it's the same for GridSearchCV etc.). This narrows down the possibilities, but I still don't see why one would affect the other.

Could you please do some more tests:

# check cross_validate
from sklearn.model_selection import cross_validate
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)
cross_validate(model, X, y)

# check cloning
from sklearn.base import clone
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    # also test with different hyper-parameter settings, esp. batch size
)
model_cloned = clone(model)
model_cloned.fit(X, y)

# checking joblib
from joblib import parallel_backend

backend = 'loky'  # also test 'threading' and 'multiprocessing'
with parallel_backend(backend, n_jobs=1):
    model = ...  # check different hyper-parameters
    model.fit(X, y)

Can any of those conditions reproduce the error?
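
Since the failure is a length mismatch raised at scoring time, another quick check is to compare the number of predictions with the number of inputs directly. The sketch below is only an illustration: `prediction_length_mismatch` and the stand-in `HalfPredictor` are hypothetical names, not part of skorch or sklearn, and the stand-in merely mimics an estimator that returns predictions for half of its inputs.

```python
def prediction_length_mismatch(estimator, X):
    """Return (n_inputs, n_predictions) if the lengths differ, else None."""
    n_inputs = len(X)
    n_predictions = len(estimator.predict_proba(X))
    if n_inputs != n_predictions:
        return n_inputs, n_predictions
    return None


class HalfPredictor:
    """Hypothetical stand-in that only predicts for the first half of X."""

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X[: len(X) // 2]]


# Demo with the stand-in; with a fitted net, pass the net and X instead.
print(prediction_length_mismatch(HalfPredictor(), list(range(10))))  # (10, 5)
```

Running the same helper on a fitted `model_cloned` and `X` would show whether the mismatch already appears outside of any sklearn search or scoring code.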

I suspect it could be a weird interaction with joblib. I could ask the accelerate devs if they have ever seen anything like this. To do so, could you please give detailed info about your environment (hardware, OS, versions of all packages, Python, etc.)?
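
A minimal sketch for collecting the version information, assuming Python 3.8+ (for `importlib.metadata`). The helper name `collect_versions` and the package list are only illustrative; hardware and driver details still need to be reported separately.

```python
import platform
import sys
from importlib import metadata


def collect_versions(packages):
    """Map 'python', 'platform' and each package name to a version string."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


# Illustrative package list; extend as needed.
packages = ["torch", "skorch", "accelerate", "scikit-learn", "joblib"]
for key, value in collect_versions(packages).items():
    print(f"{key}: {value}")
```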

@Raphaaal
Author

Raphaaal commented Mar 29, 2023

Thanks for your reply.

IIUC, all the conditions work, except for using accelerate together with RandomizedSearchCV

Indeed.

Could you please do some more tests:

  • Check cross-validate
from sklearn.model_selection import cross_validate
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)
cross_validate(
    model, X, y, 
    cv=2, scoring="average_precision", error_score="raise"
)

Interestingly, this reproduces almost the same error. I say "almost" because the reported sample counts differ slightly ([5000, 2560] versus [5000, 2500]).

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.6996       0.6738        0.6160  0.0532
      2        0.5616       0.7441        0.5305  0.0325
      3        0.4985       0.7637        0.4866  0.0325
      4        0.4641       0.7891        0.4606  0.0353
      5        0.4428       0.8047        0.4437  0.0403
      6        0.4286       0.8105        0.4320  0.0339
      7        0.4184       0.8105        0.4234  0.0345
      8        0.4108       0.8105        0.4169  0.0342
      9        0.4050       0.8203        0.4118  0.0344
     10        0.4004       0.8262        0.4078  0.0334

Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 68, in <module>
    cross_validate(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2560]

Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 68, in <module>
    cross_validate(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 2560]

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 46440) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_14:11:45
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 46441)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_14:11:45
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 46440)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
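
The two sample counts seen so far ([5000, 2500] with RandomizedSearchCV and [5000, 2560] with cross_validate) can at least be checked against one hypothesis: that each of the 2 processes (ranks 0 and 1 in the log) only predicts on its own half of the 5000 test samples, with the count sometimes rounded up to whole batches. This is an assumption, not something established at this point in the thread. The helper name is hypothetical, and 128 is skorch's default batch size, which the `cross_validate` call above does not override.

```python
import math


def predictions_per_process(n_test, n_processes, batch_size=None):
    """Predictions one process would return if the test set were sharded.

    With batch_size=None the shard is returned as-is; otherwise the shard
    is rounded up to a whole number of batches (i.e. padded).
    """
    shard = math.ceil(n_test / n_processes)
    if batch_size is None:
        return shard
    return math.ceil(shard / batch_size) * batch_size


print(predictions_per_process(5000, 2))       # 2500
print(predictions_per_process(5000, 2, 128))  # 2560
```

Both observed values fit this picture, but that is only a consistency check; it does not show which component does the splitting or padding.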
  • Check cloning
from sklearn.base import clone
for b_size in [10, 20, 30, 40]:
    accelerator = Accelerator()
    model = AcceleratedNeuralNetClassifier(
        MyModule,
        accelerator=accelerator,
        batch_size=b_size
    )
    model_cloned = clone(model)
    model_cloned.fit(X, y)

Training OK.

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4272       0.8470        0.3837  0.5196
      2        0.3804       0.8550        0.3810  0.5237
      3        0.3778       0.8530        0.3815  0.5245
      4        0.3772       0.8510        0.3819  0.5087
      5        0.3770       0.8500        0.3821  0.4756
      6        0.3769       0.8510        0.3822  0.4767
      7        0.3769       0.8510        0.3823  0.5115
      8        0.3769       0.8510        0.3823  0.4704
      9        0.3768       0.8510        0.3823  0.5193
     10        0.3768       0.8510        0.3823  0.4821
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4801       0.8390        0.4085  0.2739
      2        0.3975       0.8440        0.3992  0.2672
      3        0.3896       0.8400        0.3986  0.2590
      4        0.3871       0.8370        0.3991  0.2563
      5        0.3862       0.8370        0.3998  0.2627
      6        0.3858       0.8380        0.4003  0.2649
      7        0.3856       0.8370        0.4007  0.2629
      8        0.3855       0.8370        0.4009  0.2542
      9        0.3855       0.8370        0.4011  0.2465
     10        0.3854       0.8370        0.4013  0.2557
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4989       0.8363        0.4242  0.2241
      2        0.3972       0.8451        0.4008  0.1792
      3        0.3843       0.8500        0.3941  0.1748
      4        0.3799       0.8510        0.3916  0.1724
      5        0.3779       0.8539        0.3905  0.1749
      6        0.3769       0.8529        0.3900  0.1722
      7        0.3764       0.8520        0.3898  0.1840
      8        0.3761       0.8510        0.3898  0.1778
      9        0.3760       0.8500        0.3899  0.1813
     10        0.3759       0.8510        0.3899  0.1816
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.5122       0.8130        0.4207  0.1464
      2        0.4112       0.8300        0.3918  0.1458
      3        0.3964       0.8370        0.3836  0.1447
      4        0.3915       0.8410        0.3804  0.1391
      5        0.3893       0.8420        0.3791  0.1361
      6        0.3883       0.8400        0.3785  0.1410
      7        0.3878       0.8430        0.3783  0.1393
      8        0.3875       0.8440        0.3783  0.1383
      9        0.3874       0.8430        0.3783  0.1373
     10        0.3874       0.8460        0.3784  0.1355
  • Check joblib
from joblib import parallel_backend
for backend in ['loky', 'threading', 'multiprocessing']:
    print(f"\nUsing backend {backend}")
    with parallel_backend(backend, n_jobs=1):
        for b_size in [10, 20, 30, 40]:
            accelerator = Accelerator()
            model = AcceleratedNeuralNetClassifier(
                MyModule,
                accelerator=accelerator,
                batch_size=b_size
            )
            model.fit(X, y)

Training OK

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.

Using backend loky

Using backend loky
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4272       0.8470        0.3837  0.5170
      2        0.3804       0.8550        0.3810  0.5128
      3        0.3778       0.8530        0.3815  0.5058
      4        0.3772       0.8510        0.3819  0.5218
      5        0.3770       0.8500        0.3821  0.5017
      6        0.3769       0.8510        0.3822  0.4915
      7        0.3769       0.8510        0.3823  0.5379
      8        0.3769       0.8510        0.3823  0.5746
      9        0.3768       0.8510        0.3823  0.5320
     10        0.3768       0.8510        0.3823  0.5005
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4801       0.8390        0.4085  0.2769
      2        0.3975       0.8440        0.3992  0.2822
      3        0.3896       0.8400        0.3986  0.2683
      4        0.3871       0.8370        0.3991  0.2673
      5        0.3862       0.8370        0.3998  0.2607
      6        0.3858       0.8380        0.4003  0.3066
      7        0.3856       0.8370        0.4007  0.3167
      8        0.3855       0.8370        0.4009  0.2673
      9        0.3855       0.8370        0.4011  0.2614
     10        0.3854       0.8370        0.4013  0.2647
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4989       0.8363        0.4242  0.1936
      2        0.3972       0.8451        0.4008  0.1856
      3        0.3843       0.8500        0.3941  0.1895
      4        0.3799       0.8510        0.3916  0.2117
      5        0.3779       0.8539        0.3905  0.2053
      6        0.3769       0.8529        0.3900  0.1819
      7        0.3764       0.8520        0.3898  0.1928
      8        0.3761       0.8510        0.3898  0.1832
      9        0.3760       0.8500        0.3899  0.1861
     10        0.3759       0.8510        0.3899  0.1942
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.5122       0.8130        0.4207  0.1546
      2        0.4112       0.8300        0.3918  0.1588
      3        0.3964       0.8370        0.3836  0.1705
      4        0.3915       0.8410        0.3804  0.1777
      5        0.3893       0.8420        0.3791  0.1431
      6        0.3883       0.8400        0.3785  0.1561
      7        0.3878       0.8430        0.3783  0.1457
      8        0.3875       0.8440        0.3783  0.1614
      9        0.3874       0.8430        0.3783  0.1470
     10        0.3874       0.8460        0.3784  0.1527

Using backend threading

Using backend threading
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4261       0.8450        0.3846  0.5254
      2        0.3805       0.8500        0.3816  0.5058
      3        0.3777       0.8480        0.3817  0.5006
      4        0.3771       0.8490        0.3820  0.5077
      5        0.3769       0.8500        0.3822  0.5350
      6        0.3769       0.8510        0.3822  0.5379
      7        0.3768       0.8510        0.3822  0.5068
      8        0.3768       0.8510        0.3823  0.5018
      9        0.3768       0.8510        0.3823  0.5917
     10        0.3768       0.8510        0.3823  0.5405
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4615       0.8300        0.4188  0.3085
      2        0.3944       0.8380        0.4059  0.2704
      3        0.3880       0.8420        0.4028  0.2707
      4        0.3863       0.8400        0.4019  0.3165
      5        0.3857       0.8380        0.4016  0.2664
      6        0.3855       0.8370        0.4015  0.2624
      7        0.3854       0.8380        0.4015  0.2673
      8        0.3854       0.8390        0.4015  0.2867
      9        0.3854       0.8390        0.4015  0.2866
     10        0.3854       0.8380        0.4015  0.2755
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4711       0.8284        0.4159  0.1916
      2        0.3936       0.8373        0.3964  0.2079
      3        0.3829       0.8422        0.3915  0.2168
      4        0.3793       0.8451        0.3900  0.1939
      5        0.3777       0.8490        0.3897  0.1833
      6        0.3769       0.8490        0.3897  0.1863
      7        0.3765       0.8500        0.3898  0.2119
      8        0.3763       0.8500        0.3899  0.2143
      9        0.3762       0.8490        0.3901  0.2452
     10        0.3761       0.8500        0.3902  0.1894
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.5508       0.8100        0.4390  0.1565
      2        0.4187       0.8380        0.4014  0.1456
      3        0.4004       0.8420        0.3900  0.1927
      4        0.3939       0.8440        0.3850  0.1590
      5        0.3909       0.8480        0.3825  0.1454
      6        0.3893       0.8470        0.3811  0.1493
      7        0.3885       0.8460        0.3802  0.1676
      8        0.3880       0.8440        0.3798  0.1418
      9        0.3877       0.8430        0.3795  0.1457
     10        0.3875       0.8430        0.3793  0.1608

Using backend multiprocessing

Using backend multiprocessing
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4196       0.8480        0.3868  0.5061
      2        0.3805       0.8520        0.3827  0.5080
      3        0.3780       0.8510        0.3825  0.5078
      4        0.3774       0.8500        0.3825  0.5109
      5        0.3771       0.8490        0.3825  0.5234
      6        0.3770       0.8490        0.3825  0.5452
      7        0.3770       0.8480        0.3825  0.5594
      8        0.3769       0.8500        0.3824  0.5044
      9        0.3769       0.8500        0.3824  0.5021
     10        0.3769       0.8510        0.3824  0.5518
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4627       0.8330        0.4141  0.2845
      2        0.3969       0.8390        0.4032  0.2606
      3        0.3895       0.8360        0.4011  0.3675
      4        0.3871       0.8350        0.4007  0.2591
      5        0.3862       0.8370        0.4007  0.2517
      6        0.3858       0.8360        0.4009  0.2732
      7        0.3856       0.8370        0.4011  0.2692
      8        0.3855       0.8360        0.4012  0.2773
      9        0.3855       0.8370        0.4013  0.2745
     10        0.3855       0.8370        0.4014  0.2972
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.4894       0.8353        0.4170  0.2135
      2        0.3962       0.8520        0.3967  0.2107
      3        0.3844       0.8529        0.3915  0.1853
      4        0.3802       0.8510        0.3899  0.2013
      5        0.3783       0.8500        0.3894  0.1808
      6        0.3773       0.8500        0.3893  0.1910
      7        0.3767       0.8500        0.3894  0.1899
      8        0.3764       0.8510        0.3896  0.1952
      9        0.3762       0.8500        0.3898  0.2132
     10        0.3761       0.8500        0.3899  0.2240
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.5386       0.8200        0.4245  0.1483
      2        0.4170       0.8360        0.3927  0.1550
      3        0.3999       0.8420        0.3836  0.1409
      4        0.3938       0.8430        0.3801  0.1403
      5        0.3910       0.8410        0.3785  0.1733
      6        0.3895       0.8410        0.3779  0.1465
      7        0.3886       0.8470        0.3777  0.1506
      8        0.3882       0.8450        0.3777  0.1417
      9        0.3879       0.8450        0.3778  0.1496
     10        0.3877       0.8460        0.3779  0.1471

> could you please give detailed info about your environment (hardware, OS, versions of all packages, Python, etc.)?

  • GPUs: Tesla V100-PCIE-16GB
  • NVIDIA-SMI: 515.43.04
  • Driver Version: 515.43.04
  • CUDA Version: 11.7
  • OS: Ubuntu 18.04.3 LTS
  • Python: 3.9.15
  • All packages:
Name Version
_ipython_minor_entry_point 8.7.0
_libgcc_mutex 0.1
_openmp_mutex 5.1
absl-py 1.3.0
accelerate 0.17.0
alembic 1.8.1
anyio 3.6.2
argon2-cffi 21.3.0
argon2-cffi-bindings 21.2.0
asttokens 2.2.1
astunparse 1.6.3
attrs 22.1.0
autopage 0.5.1
babel 2.11.0
backcall 0.2.0
backports 1
backports.functools_lru_cache 1.6.4
beautifulsoup4 4.11.1
bleach 5.0.1
brotlipy 0.7.0
ca-certificates 2023.01.10
cachetools 5.2.0
captum 0.5.0
catboost 1.0.6
category-encoders 2.4.0
certifi 2022.12.7
cffi 1.15.0
chardet 4.0.0
charset-normalizer 2.1.1
click 7
cliff 4.1.0
cloudpickle 2.2.0
cmaes 0.9.0
cmd2 2.4.2
colorama 0.4.6
colorlog 6.7.0
configargparse 1.5.3
cryptography 3.4.8
cycler 0.11.0
decorator 5.1.1
defusedxml 0.7.1
docker-pycreds 0.4.0
einops 0.6.0
entrypoints 0.4
executing 1.2.0
filelock 3.9.0
flatbuffers 1.12
flit-core 3.8.0
gast 0.4.0
gitdb 4.0.10
gitpython 3.1.29
google-auth 2.15.0
google-auth-oauthlib 0.4.6
google-pasta 0.2.0
greenlet 2.0.1
grpcio 1.51.1
h5py 3.7.0
huggingface-hub 0.13.1
icecream 2.1.1
idna 2.1
importlib-metadata 5.1.0
importlib_resources 5.10.1
iniconfig 1.1.1
ipykernel 5.5.5
ipython 8.7.0
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
joblib 1.2.0
json5 0.9.5
jsonschema 4.17.3
jupyter_client 7.0.6
jupyter_core 5.1.0
jupyter_server 1.23.3
jupyterlab 3.5.1
jupyterlab_pygments 0.2.2
jupyterlab_server 2.16.5
keras 2.9.0
keras-preprocessing 1.1.2
kiwisolver 1.4.4
ld_impl_linux-64 2.38
liac-arff 2.5.0
libclang 14.0.6
libffi 3.4.2
libgcc-ng 11.2.0
libgomp 11.2.0
libsodium 1.0.18
libstdcxx-ng 11.2.0
lightgbm 3.2.1
llvmlite 0.38.1
mako 1.2.4
markdown 3.4.1
markupsafe 2.1.1
matplotlib 3.4.2
matplotlib-inline 0.1.6
minio 7.1.12
mistune 2.0.4
nbclassic 0.4.8
nbclient 0.7.2
nbconvert 7.2.6
nbconvert-core 7.2.6
nbconvert-pandoc 7.2.6
nbformat 5.7.0
ncurses 6.3
nest-asyncio 1.5.6
notebook 6.5.2
notebook-shim 0.2.2
numba 0.55.1
numpy 1.20.3
nvidia-ml-py3 7.352.0
oauthlib 3.2.2
openml 0.12.2
openssl 1.1.1s
opt-einsum 3.3.0
optuna 2.10.0
packaging 22
pandas 1.5.3
pandoc 2.19.2
pandocfilters 1.5.0
parso 0.8.3
pathtools 0.1.2
patsy 0.5.3
pbr 5.11.0
pexpect 4.8.0
pickleshare 0.7.5
pillow 9.3.0
pip 22.3.1
pkgutil-resolve-name 1.3.10
platformdirs 2.6.0
plotly 5.10.0
pluggy 1.0.0
ply 3.11
prettytable 3.5.0
progressbar2 4.2.0
prometheus_client 0.15.0
promise 2.3
prompt-toolkit 3.0.36
protobuf 3.19.6
psutil 5.9.4
ptyprocess 0.7.0
pure_eval 0.2.2
py 1.11.0
pyarrow 10.0.1
pyasn1 0.4.8
pyasn1-modules 0.2.8
pycparser 2.21
pygments 2.13.0
pynvml 11.4.1
pyopenssl 20.0.1
pyparsing 3.0.9
pyperclip 1.8.2
pyrsistent 0.18.0
pysocks 1.7.1
pytest 7.1.2
python 3.9.15
python-dateutil 2.8.2
python-fastjsonschema 2.16.2
python-graphviz 0.20.1
python-utils 3.4.5
python_abi 3.9
pytomlpp 1.0.10
pytz 2022.6
pyyaml 6
pyzmq 19.0.2
quantiphy 2.18.0
readline 8.2
regex 2022.10.31
requests 2.25.1
requests-oauthlib 1.3.1
rotation-forest 1
rsa 4.9
scikit-learn 1.2.2
scipy 1.6.2
send2trash 1.8.0
sentry-sdk 1.11.1
setproctitle 1.3.2
setuptools 65.5.0
setuptools-scm 7.0.5
shap 0.39.0
shortuuid 1.0.11
six 1.16.0
skorch 0.12.1
slicer 0.0.7
smmap 5.0.0
sniffio 1.3.0
soupsieve 2.3.2.post1
sqlalchemy 1.4.45
sqlite 3.40.0
stack_data 0.6.2
statsmodels 0.13.5
stevedore 4.1.1
tabulardl 0.1.0
tabulate 0.9.0
tenacity 8.1.0
tensorboard 2.9.1
tensorboard-data-server 0.6.1
tensorboard-plugin-wit 1.8.1
tensorboardx 2.6
tensorflow 2.9.1
tensorflow-estimator 2.9.0
tensorflow-io-gcs-filesystem 0.28.0
termcolor 2.1.1
termcolor-whl 1.1.2
terminado 0.17.1
threadpoolctl 3.1.0
tinycss2 1.2.1
tk 8.6.12
tokenizers 0.13.2
tomli 2.0.1
torch 1.10.1
torch-summary 1.4.5
tornado 6.1
tqdm 4.62.3
traitlets 5.7.1
transformers 4.26.1
typing_extensions 4.4.0
tzdata 2022g
urllib3 1.26.13
wandb 0.12.11
wcwidth 0.2.5
webencodings 0.5.1
websocket-client 1.4.2
werkzeug 2.2.2
wheel 0.37.1
wrapt 1.14.1
xgboost 1.7.4
xmltodict 0.13.0
xz 5.2.8
yaspin 2.2.0
zero 0.9.1
zeromq 4.3.4
zipp 3.11.0
zlib 1.2.13

Many thanks in advance.

@BenjaminBossan
Collaborator

Great, I'm asking colleagues, let's see if anything comes up.

Meanwhile, two more things to test:

  1. Scoring function

Probably you already tested that, but using, say, scoring="accuracy", still gives the same error, right?

  2. Manual device placement

The issue is almost certainly related to the two GPUs, since the same code runs fine with 1 GPU. Also, we have 10000 samples, so with cv=2 each test fold has 5000 samples, which is the length the error says is expected. But for some reason we get 2500 or 2560 predictions, which is (roughly) half of 5000, as if the predictions were split between the GPUs. Still, to validate this, could you please check that specifying a single GPU makes the error disappear?

accelerator = Accelerator(device_placement=False)
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    device='cuda:0',  # or 'cuda:1'
)
cross_validate(
    model,
    X,
    y, 
    cv=2,
    error_score="raise"
)

@BenjaminBossan
Collaborator

No progress yet but something more to test:

import copy
import torch
import torch.nn as nn
from accelerate import Accelerator
from sklearn.model_selection import KFold

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.LogSoftmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X

X = torch.rand((10000, 100))
y = torch.randint(0, 2, size=(10000,))

model = MyModule()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
accelerator = Accelerator()

def accuracy(y_true, y_pred):
    assert len(y_true) == len(y_pred)
    return (y_true.cpu() == y_pred.cpu()).float().mean().item()

def _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test, max_epochs=10):
    model = copy.deepcopy(model)
    accelerator = copy.deepcopy(accelerator)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    dataset_train = torch.utils.data.TensorDataset(X_train, y_train)
    dataloader_train = torch.utils.data.DataLoader(dataset_train, batch_size=10)
    dataset_test = torch.utils.data.TensorDataset(X_test, y_test)
    dataloader_test = torch.utils.data.DataLoader(dataset_test, batch_size=10)
    
    model, optimizer = accelerator.prepare(model, optimizer)
    dataloader_train, dataloader_test = accelerator.prepare(dataloader_train, dataloader_test)

    # training
    model.train()
    for epoch in range(max_epochs):
        for source, targets in dataloader_train:
            optimizer.zero_grad()
            output = model(source)
            loss = nn.functional.nll_loss(output, targets)
            accelerator.backward(loss)
            optimizer.step()
            
    # validation
    model.eval()
    y_proba = []
    losses = []
    for source, targets in dataloader_test:
        output = model(source)
        loss = nn.functional.nll_loss(output, targets)
        y_proba.append(output)
        losses.append(loss)

    print(len(y_proba), {len(batch) for batch in y_proba})
    y_proba = torch.vstack(y_proba)
    y_pred = y_proba.argmax(1)
    print("test loss", (sum(losses) / len(losses)).item())
    print("accuracy:", accuracy(y_test, y_pred))

# training without joblib
for idx_train, idx_test in KFold(2).split(X, y):
    X_train, y_train = X[idx_train], y[idx_train]
    X_test, y_test = X[idx_test], y[idx_test]
    _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)

# training with joblib
from joblib import Parallel, delayed
parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)

# training with sklearn joblib
from sklearn.utils.parallel import Parallel, delayed
parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)

The idea here is to try to remove as much "fluff" as possible in order to isolate the problem. So skorch is completely removed, and from cross_validate, I tried to only take the essential parts.
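
If the predictions are indeed being split between processes, as suspected above, the validation loop in this script would show it directly, because `y_proba` is built only from the batches the local process iterates over. The following is a pure-Python sketch of that situation, not Accelerate's actual implementation: `make_batches` and `shard_batches` are hypothetical helpers, and the rule "process `rank` takes every `num_processes`-th batch" is an assumption made for illustration.

```python
def make_batches(n_samples, batch_size):
    """Split sample indices 0..n_samples-1 into consecutive batches."""
    indices = list(range(n_samples))
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]


def shard_batches(batches, rank, num_processes):
    """Assumed sharding rule: process `rank` takes every num_processes-th batch."""
    return batches[rank::num_processes]


n_test, batch_size, num_processes = 5000, 10, 2
batches = make_batches(n_test, batch_size)

# What each process collects when it loops over its own shard only.
local = [shard_batches(batches, rank, num_processes) for rank in range(num_processes)]
local_counts = [sum(len(batch) for batch in shard) for shard in local]
print(len(local[0]), {len(batch) for batch in local[0]})  # 250 {10}
print(local_counts)  # [2500, 2500]

# Gathering: interleave the shards step by step, then cut to the true length.
gathered = []
for step in zip(*local):
    for batch in step:
        gathered.extend(batch)
gathered = gathered[:n_test]
print(len(gathered))  # 5000
```

In this sketch, each process on its own holds half of the predictions, and only the gathered list lines up with `y_test` again. In a real multi-process run the gathering step would have to go through the distributed backend rather than a Python loop.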

@Raphaaal
Author

Raphaaal commented Mar 29, 2023

Thanks for your reply.

> using, say, scoring="accuracy", still gives the same error, right?

Yes.

ValueError: Found input variables with inconsistent numbers of samples: [5000, 2500]

> could you please check that indicating a single GPU makes the error disappear?

  • When keeping Accelerate configured to use two GPUs and using your code:
accelerator = Accelerator(device_placement=False)
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    device='cuda:0',
)
cross_validate(
    model,
    X,
    y, 
    cv=2,
    error_score="raise"
)

I get a new error:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 70, in <module>
    cross_validate(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/classifier.py", line 141, in fit
    return super(NeuralNetClassifier, self).fit(X, y, **fit_params)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/net.py", line 1228, in fit
    self.initialize()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/net.py", line 815, in initialize
    self._initialize_module()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/skorch/hf.py", line 948, in _initialize_module
    setattr(self, name + '_', self.accelerator.prepare(module))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1094, in prepare
    result = tuple(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1095, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 949, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/accelerator.py", line 1166, in prepare_model
    model = torch.nn.parallel.DistributedDataParallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 578, in __init__
    dist._verify_model_across_ranks(self.process_group, parameters)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24510) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_16:42:35
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 24511)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_16:42:35
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 24510)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Also, this launches two processes on the same GPU, as if constraining device='cuda:0' messes with what Accelerate was configured to do (i.e., train on two GPUs).
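
One possible reading of this observation, offered as an assumption rather than a confirmed diagnosis: launchers built on `torch.distributed.run` give each worker its own `LOCAL_RANK` environment variable, and each worker normally uses the GPU with that index. A hard-coded `device='cuda:0'` would override that choice, so both workers would land on GPU 0. `pick_device` below is a hypothetical helper written only to illustrate the idea; it is not part of skorch or Accelerate.

```python
import os


def pick_device(forced_device=None, environ=None):
    """Return the device string a worker process would end up using."""
    environ = os.environ if environ is None else environ
    local_rank = int(environ.get("LOCAL_RANK", "0"))
    if forced_device is not None:
        # A hard-coded device is the same for every worker.
        return forced_device
    return f"cuda:{local_rank}"


# Simulated environments of the two workers started by the launcher.
workers = [{"LOCAL_RANK": "0"}, {"LOCAL_RANK": "1"}]
print([pick_device(environ=env) for env in workers])            # ['cuda:0', 'cuda:1']
print([pick_device("cuda:0", environ=env) for env in workers])  # ['cuda:0', 'cuda:0']
```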

  • When configuring Accelerate to use a single GPU and using your code:
accelerator = Accelerator(device_placement=False)
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
    device='cuda:0',
)
cross_validate(
    model,
    X,
    y, 
    cv=2,
    error_score="raise"
)

Training OK. Note that this launches a single process on the GPU.

  • FYI, when configuring Accelerate to use three GPUs instead of two and keeping the original code, the error becomes:
ValueError: Found input variables with inconsistent numbers of samples: [5000, 1670]

> something more to test

I reply in the next comment.

@Raphaaal
Author

Raphaaal commented Mar 29, 2023

> something more to test

  • Using the following portion of your code:
# training without joblib
for idx_train, idx_test in KFold(2).split(X, y):
    X_train, y_train = X[idx_train], y[idx_train]
    X_test, y_test = X[idx_test], y[idx_test]
    _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)

Error is:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {10}
250 {10}
test loss 0.6986234784126282
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 80, in <module>
    _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
    print("accuracy:", accuracy(y_test, y_pred))
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
    assert len(y_true) == len(y_pred)
AssertionError
test loss 0.6999748349189758
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 80, in <module>
    _fit_and_score(model, accelerator, X_train, y_train, X_test, y_test)
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
    print("accuracy:", accuracy(y_test, y_pred))
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
    assert len(y_true) == len(y_pred)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 43418) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue_wofluff.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_17:03:54
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 43419)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_17:03:54
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 43418)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

FYI len(y_true)=5000 and len(y_pred)=2500
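
The counts seen so far fit a simple model. This is a sketch under an assumption that is not confirmed in this thread: the test batches are dealt out to the processes as whole batches, and every process ends up with the same number of full-size batches (short rounds being padded). `predictions_per_process` is a hypothetical helper, not an Accelerate or skorch function.

```python
import math


def predictions_per_process(n_samples, batch_size, num_processes):
    """Rough model of how many predictions a single process produces.

    Assumes whole batches are split evenly across processes and that
    every process is padded up to the same number of full batches.
    """
    n_batches = math.ceil(n_samples / batch_size)
    batches_each = math.ceil(n_batches / num_processes)
    return batches_each * batch_size


# Stripped-down script: 5000 test samples, batch_size=10, 2 processes.
print(predictions_per_process(5000, 10, 2))   # 2500 (250 batches of 10)
# A batch size of 128 (skorch's default) with 2 processes.
print(predictions_per_process(5000, 128, 2))  # 2560
# batch_size=10 with 3 processes.
print(predictions_per_process(5000, 10, 3))   # 1670
```

The first value agrees with the `250 {10}` output and `len(y_pred)=2500` above, and 2560 agrees with the other figure mentioned earlier. 1670 coincides with the count reported for three GPUs, although the batch size used in that run is not shown here.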

  • Using the following portion of your code:
# training with joblib
from joblib import Parallel, delayed
parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)

Error is the same:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {10}
test loss 0.6977217793464661
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 85, in <module>
    parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
    print("accuracy:", accuracy(y_test, y_pred))
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
    assert len(y_true) == len(y_pred)
AssertionError
250 {10}
test loss 0.6986113786697388
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 85, in <module>
    parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
    print("accuracy:", accuracy(y_test, y_pred))
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
    assert len(y_true) == len(y_pred)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 45374) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue_wofluff.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_17:06:27
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 45375)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_17:06:27
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 45374)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

FYI len(y_true)=5000 and len(y_pred)=2500

  • Using the following portion of your code:
# training with sklearn joblib
from sklearn.utils.parallel import Parallel, delayed
parallel = Parallel(n_jobs=None, verbose=0, pre_dispatch='2*n_jobs')
parallel(
    delayed(_fit_and_score)(
        model,
        accelerator,
        X[idx_train], y[idx_train],
        X[idx_test], y[idx_test],
    )
    for idx_train, idx_test in KFold(2).split(X, y)
)

Error is the same:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {10}
test loss 0.6967058181762695
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 98, in <module>
    parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
250 {10}
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
    print("accuracy:", accuracy(y_test, y_pred))
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
    assert len(y_true) == len(y_pred)
AssertionError
test loss 0.697378396987915
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 98, in <module>
    parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 74, in _fit_and_score
    print("accuracy:", accuracy(y_test, y_pred))
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue_wofluff.py", line 27, in accuracy
    assert len(y_true) == len(y_pred)
AssertionError
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 46860) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue_wofluff.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-29_17:08:10
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 46861)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-29_17:08:10
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 46860)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

FYI len(y_true)=5000 and len(y_pred)=2500

Finally, when I change your assert to

assert len(y_true.cpu()) == len(y_pred.cpu())

I still get the same AssertionError in all three cases.

Many thanks in advance for your feedback.

@BenjaminBossan
Collaborator

Thanks again, this is really helpful. In particular, since the first example without joblib already fails, joblib can't be the cause. This prompted me to look a bit more into the accelerate docs, and I would like to test one more thing (sorry for the back and forth), namely calling gather explicitly, as described here:

https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation

So IIUC, that means that in the evaluation part of _fit_and_score, you need to add output = accelerator.gather_for_metrics(output) after output = model(source).

In case this solves the issue, I would consider it a skorch bug. To quickly try a fix, you would need to subclass AccelerateMixin and add the following method:

class MyAccelerateMixin(AccelerateMixin):
    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)

(or add this method to AcceleratedNeuralNetClassifier)

This is more of a quick and dirty hack; I would need to investigate further how to do this most efficiently. So if it works, do check that your code actually runs faster with accelerate than without.
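To illustrate why the number of predictions comes out short without a gather step, here is a minimal pure-Python sketch. It does not use accelerate at all: `batches` and `evaluate` are hypothetical helpers, the "model" is the identity, and the round-robin batch assignment is an assumption made for illustration, so this only mimics the bookkeeping and not the library's actual implementation.

```python
def batches(samples, batch_size):
    """Split a list of samples into consecutive batches."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


def evaluate(samples, batch_size, num_processes, rank, gather):
    """Return the 'predictions' that process `rank` ends up holding.

    Assumption: at each step, every process receives one batch. With
    gather=True, the per-step batches of all processes are concatenated
    (a stand-in for a cross-process gather); otherwise the process keeps
    only its own batch.
    """
    all_batches = batches(samples, batch_size)
    collected = []
    for step in range(0, len(all_batches), num_processes):
        step_batches = all_batches[step:step + num_processes]
        if gather:
            for batch in step_batches:
                collected.extend(batch)
        elif rank < len(step_batches):
            collected.extend(step_batches[rank])
    return collected


samples = list(range(5000))
without_gather = evaluate(samples, 125, 2, rank=0, gather=False)
with_gather = evaluate(samples, 125, 2, rank=0, gather=True)
print(len(without_gather))  # 2500
print(len(with_gather))     # 5000
```

With 2 processes, each one only holds half of the predictions unless they are gathered, which matches the reported len(y_true)=5000 and len(y_pred)=2500.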

@Raphaaal
Author

Raphaaal commented Mar 30, 2023

Good news!

With a slight adaptation, gather_for_metrics can indeed solve the issue when using the _fit_and_score without any fluff.

for source, targets in dataloader_test:
    outputs = model(source)
    # outputs = accelerator.gather_for_metrics(outputs) <= initial suggestion
    all_outputs, all_targets = accelerator.gather_for_metrics((outputs, targets)) # <= corrected
    loss = nn.functional.nll_loss(all_outputs, all_targets)
    y_proba.append(all_outputs)
    losses.append(loss)

Output:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
250 {20}
250 {20}
test loss 0.6963825225830078
test loss 0.6963825225830078
accuracy: 0.5130000114440918
accuracy: 0.5016000270843506
250 {20}
250 {20}
test loss 0.6963870525360107
accuracy: 0.4991999864578247
test loss 0.6963870525360107
accuracy: 0.5005999803543091
250 {20}
250 {20}
test loss 0.6963825225830078
accuracy: 0.5130000114440918
test loss 0.6963825225830078
accuracy: 0.5016000270843506
250 {20}
250 {20}
test loss 0.6963870525360107
test loss 0.6963870525360107
accuracy: 0.4991999864578247
accuracy: 0.5005999803543091
250 {20}
250 {20}
test loss 0.6963825225830078
accuracy: 0.5130000114440918
test loss 0.6963825225830078
accuracy: 0.5016000270843506
250 {20}
250 {20}
test loss 0.6963870525360107
test loss 0.6963870525360107
accuracy: 0.4991999864578247
accuracy: 0.5005999803543091

Note that gather_for_metrics is only needed in the eval phase and not in the training phase.

I also tried the Skorch adaptation you mentioned, but I think I am incorrectly implementing it. Full code:

import torch
import numpy as np
from skorch import NeuralNetClassifier
from skorch.hf import AccelerateMixin
from accelerate import Accelerator 
from sklearn.datasets import make_classification
import torch.nn as nn
import random
from sklearn.model_selection import cross_validate
from skorch.dataset import unpack_data


# Reproducibility
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)
random.seed(SEED)

# Generate data
X, y = make_classification(10_000, 100, n_informative=5, random_state=SEED)
X = X.astype(np.float32)
y = y.astype(np.int64)


# PyTorch module
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()

        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.Softmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X


# Skorch wrapper
class AcceleratedNeuralNetClassifier(
    AccelerateMixin, 
    NeuralNetClassifier
):
    """NeuralNetClassifier with HuggingFace Accelerate support"""

    
    # First attempt
    # def evaluation_step(self, batch, training=False):
    #     output = super().evaluation_step(batch, training=training)
    #     return self.accelerator.gather_for_metrics(output)

    # Second attempt
    def evaluation_step(self, batch, training=False):
        """Perform a forward step to produce the output used for
        prediction and scoring.
        Preds and targets are gathered by the accelerator before return
        """
        self.check_is_fitted()
        Xi, targets = unpack_data(batch)
        with torch.set_grad_enabled(training):
            self._set_training(training)
            y_infer = self.infer(Xi)
            all_y_infer, all_targets = self.accelerator.gather_for_metrics((
                y_infer,
                targets
            ))
            return all_y_infer

accelerator = Accelerator()
model = AcceleratedNeuralNetClassifier(
    MyModule,
    accelerator=accelerator,
)

cross_validate(
    model, X, y, 
    cv=2, scoring="average_precision", error_score="raise"
)

Both attempts produce the same error:

The following values were not passed to `accelerate launch` and had defaults used instead:
        `--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.6753       0.6113        0.6674  0.0511
      2        0.6300       0.6426        0.6356  0.0413
      3        0.6005       0.6523        0.6136  0.0367
      4        0.5796       0.6816        0.5972  0.0362
      5        0.5640       0.6953        0.5845  0.0417
      6        0.5520       0.7129        0.5742  0.0406
      7        0.5423       0.7266        0.5658  0.0368
      8        0.5344       0.7422        0.5587  0.0367
      9        0.5278       0.7422        0.5527  0.0418
     10        0.5222       0.7441        0.5476  0.0373
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 104, in <module>
    cross_validate(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 5120]
Traceback (most recent call last):
  File "/home/razorin/TabularDL/_trash/tmp/skorch_accelerate_issue.py", line 104, in <module>
    cross_validate(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 266, in cross_validate
    results = parallel(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 63, in __call__
    return super().__call__(iterable_with_config)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 1085, in __call__
    if self.dispatch_one_batch(iterator):
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 901, in dispatch_one_batch
    self._dispatch(tasks)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 819, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 208, in apply_async
    result = ImmediateResult(func)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/_parallel_backends.py", line 597, in __init__
    self.results = batch()
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in __call__
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/joblib/parallel.py", line 288, in <listcomp>
    return [func(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/parallel.py", line 123, in __call__
    return self.function(*args, **kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 708, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, error_score)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 767, in _score
    scores = scorer(estimator, X_test, y_test)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 234, in __call__
    return self._score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_scorer.py", line 399, in _score
    return self._sign * self._score_func(y, y_pred, **self._kwargs)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 234, in average_precision_score
    return _average_binary_score(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_base.py", line 75, in _average_binary_score
    return binary_metric(y_true, y_score, sample_weight=sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 207, in _binary_uninterpolated_average_precision
    precision, recall, _ = precision_recall_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 878, in precision_recall_curve
    fps, tps, thresholds = _binary_clf_curve(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/metrics/_ranking.py", line 751, in _binary_clf_curve
    check_consistent_length(y_true, y_score, sample_weight)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/sklearn/utils/validation.py", line 397, in check_consistent_length
    raise ValueError(
ValueError: Found input variables with inconsistent numbers of samples: [5000, 5120]
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 46386) of binary: /home/razorin/conda_envs/backup/bin/python
Traceback (most recent call last):
  File "/home/razorin/conda_envs/backup/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 906, in launch_command
    multi_gpu_launcher(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/accelerate/commands/launch.py", line 599, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/razorin/conda_envs/backup/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
skorch_accelerate_issue.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2023-03-30_10:47:36
  host      : ML
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 46387)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-03-30_10:47:36
  host      : ML
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 46386)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Thanks a lot in advance for your ideas.

@BenjaminBossan
Collaborator

BenjaminBossan commented Mar 30, 2023

Good progress, I think we're getting close. Maybe I'll be able to get a multi-GPU setup to test soon.

The error now seems to be:

Found input variables with inconsistent numbers of samples: [5000, 5120]

I believe the reason is that accelerate tries to equalize the batch sizes across GPUs. Since skorch uses a batch size of 128 by default, it pads the batches with an additional 120 dummy samples so that each GPU gets 20*128 = 2560 samples, for a total of 5120 samples. In theory, gather_for_metrics should remove those dummy samples again, so I'm not sure what is going wrong here.
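The arithmetic behind the 5120 can be checked with a small sketch. `padded_total` is a hypothetical helper that encodes a simplified model (every process gets the same number of full batches); it is an assumption for illustration, not accelerate's actual code.

```python
import math


def padded_total(n_samples, batch_size, num_processes):
    """Total number of predictions if every process receives the same
    number of full batches (simplified model of the padding)."""
    n_batches = math.ceil(n_samples / batch_size)
    batches_per_process = math.ceil(n_batches / num_processes)
    return batches_per_process * batch_size * num_processes


total = padded_total(5000, batch_size=128, num_processes=2)
print(total)         # 5120
print(total - 5000)  # 120 dummy samples
```

Under this model, 5000 test samples with a batch size of 128 on 2 GPUs give 40 batches, 20 per GPU, hence 2 * 20 * 128 = 5120 predictions.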

Btw. the reason why in my code snippet, I only gathered the predictions, not the target, is that the target should not come from skorch. sklearn splits the data and the y_test that it uses never goes to torch, so it should be the correct size. We only need to gather the predictions. In the "no fluff" example, this was not the case, which is why gathering the target there, as you did, was correct.

I'll think more about it or hopefully get to test it, but meanwhile, here are some suggested solutions:

  1. Call gather inside another method

Instead of overriding evaluation_step, we could try calling it even earlier, in infer:

    def infer(self, x, **fit_params):
        y_infer = super().infer(x, **fit_params)
        return self.accelerator.gather_for_metrics(y_infer)

So try adding this method instead of overriding evaluation_step in the custom net class.

  2. Use a batch size that divides without remainder

That way, accelerate should not need to create dummy samples. E.g. for 10000 samples, batch size of 100 should work. However, this is quite annoying, especially if data is split into train/valid etc. (by default, skorch uses an 80/20 split). Depending on the size of the dataset, batching without remainder might not be possible (except for batch size of 1).

This might also require passing split_batches=True to Accelerator, not completely sure.
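As a rough aid for picking such a batch size, here is a small sketch. `even_batch_sizes` is a hypothetical helper, and its condition (the sample count splits into full batches with the same number of batches per process) is an assumption about what "without remainder" needs to mean here; it does not model split_batches.

```python
def even_batch_sizes(n_samples, num_processes, max_batch_size=256):
    """Batch sizes for which n_samples splits into full batches, with the
    same number of batches for every process (assumed requirement)."""
    return [
        bs
        for bs in range(1, max_batch_size + 1)
        if n_samples % (bs * num_processes) == 0
    ]


print(100 in even_batch_sizes(10_000, num_processes=2))  # True
print(128 in even_batch_sizes(10_000, num_processes=2))  # False
print(even_batch_sizes(4_999, num_processes=2))          # []
```

The check has to be repeated for every split that gets evaluated, e.g. each cross-validation fold and the internal validation split.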

  3. Manually truncate dummy samples

This could be unsafe, i.e. it could mean that the wrong samples are truncated, but maybe it works. Add this method to the custom neural net class:

    def forward(self, X, *args, **kwargs):
        y_infer = super().forward(X, *args, **kwargs)
        n = len(X)

        is_multioutput = len(y_infer) > 0 and isinstance(y_infer[0], tuple)
        if is_multioutput:
            return tuple(yi[:n] for yi in y_infer)
        return y_infer[:n]

The gather_for_metrics call might not be necessary with this fix. An issue here is that if the method is incorrect, it probably only affects the last batch, so it might look correct because only a few samples are affected.

  4. Don't use multiple GPUs for evaluation

This is of course not nice because you want to make use of those GPUs, but at least training still seems to work fine. For this, it should be sufficient to not prepare the validation data loader:

    def get_iterator(self, dataset, training=False):
        iterator = super().get_iterator(dataset, training=training)
        if not training:
            return iterator
        iterator = self.accelerator.prepare(iterator)
        return iterator

@BenjaminBossan
Collaborator

I have a multi GPU instance now and can reproduce the error. Unfortunately, the solution does not work and it appears that the issue is that for some reason, accelerate does not detect that it should truncate excess samples. I'm investigating.

@Raphaaal
Author

Raphaaal commented Mar 30, 2023

Great to hear that you can try it for yourself. Thanks a lot for your time.
Please let me know if I can be of any help.

@BenjaminBossan
Collaborator

Okay, so I managed to kinda track down the problem. To keep it quick, the gradient_state of the accelerator somehow diverges from the gradient_state of the data loader, which should not happen. The latter correctly detects that the batch is finished, so the hacky solution is to override the gradient state of the accelerator with the one from the data loader.

Of course, it is still necessary to add the gather_for_metrics call. In sum, these two methods should be added:

    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)

    def get_iterator(self, dataset, training=False):
        iterator = super().get_iterator(dataset, training=training)
        self.accelerator.gradient_state = iterator.gradient_state
        return iterator

Could you please check that this solves your problem?

@BenjaminBossan
Collaborator

Update: I spoke to an accelerate dev and the issue is most likely that sklearn sometimes creates a copy.deepcopy of the estimator. In particular, this happens when calling any kind of hyper-parameter search and also cross_validate. However, accelerate relies on some references that may be broken when deepcopied, so there is no guarantee that anything will still work after deepcopying the accelerator instance. This would explain why you don't see any issues when using skorch without hyper-parameter search.
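The general mechanism can be shown with plain Python. The classes below (`SharedState`, `ToyLoader`, `ToyAccelerator`) are hypothetical stand-ins, not accelerate classes; the sketch only demonstrates how copy.deepcopy breaks a reference that two objects are supposed to share.

```python
import copy


class SharedState:
    """Hypothetical stand-in for state that two objects are meant to share."""

    def __init__(self):
        self.end_of_dataloader = False


class ToyLoader:
    def __init__(self, state):
        self.state = state


class ToyAccelerator:
    def __init__(self):
        self.state = SharedState()

    def prepare(self):
        # The loader keeps a reference to the accelerator's state object.
        return ToyLoader(self.state)


original = ToyAccelerator()
loader = original.prepare()

# deepcopy creates a new, independent SharedState for the copy.
copied = copy.deepcopy(original)

# The loader signals the end of the data through the shared state object.
loader.state.end_of_dataloader = True

print(original.state.end_of_dataloader)  # True
print(copied.state.end_of_dataloader)    # False
```

The copy never sees the update, which is the same kind of divergence as an accelerator that does not "know" the last batch was reached.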

In the "no fluff" example I posted, I did add a deepcopy call, but there it doesn't seem to cause any issue. However, this stuff is tricky to replicate, so it is probably not exactly the same as what happens when using RandomizedSearchCV.

So what does it mean for this specific issue? Unfortunately, there is no guarantee that you will get correct results, even if the hack I posted above removes the error. I would recommend not using accelerate in this context.

Still, if you have 2 GPUs and the model is small enough that it can fit on each of them, it is possible to use grid search with skorch while leveraging both GPUs. This is documented here. Maybe that's a solution that can work for you.

@Raphaaal
Author

Raphaaal commented Mar 31, 2023

Many thanks for the analysis and suggested alternative. It is a pity.

Do you think it is also unsafe to use Skorch + Accelerate + RandomizedSearchCV on a single GPU (e.g., to benefit from DeepSpeed)?

Finally, do you think this deserves opening an issue on Accelerate? I am not sure whether it can be considered a bug.

Anyway, thanks a lot for your help getting to the bottom of this and keep up the great work with this tool :)

@BenjaminBossan
Collaborator

BenjaminBossan commented Mar 31, 2023

Do you think it is also unsafe to use Skorch + Accelerate + RandomizedSearchCV on a single GPU (e.g., to benefit from DeepSpeed)?

Potentially it's the same issue because of the copy being created. Whether this can still cause reference issues when only one GPU is involved, I don't know. The answer is probably "it depends".

Interestingly, I did manage to find a potential solution by simply adding a __deepcopy__ method to Accelerator. First the code:

class MyAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        cls = type(self)
        instance = cls()  # <= add more arguments here if needed
        return instance

# calling gather_for_metrics is still required
class MyNet(AccelerateMixin, NeuralNetClassifier):
    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)

accelerator = MyAccelerator()
net = MyNet(..., accelerator=accelerator)
cross_validate(net, ...)

For my example, it worked. Maybe you can give it a spin for your real use case and report if the results look correct. I'll consult with the accelerate devs if this could be a viable solution.

EDIT
Creating multiple instances of Accelerator per script is a bad idea according to accelerate devs. A different solution could be:

class MyAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        return self

Not sure if this can lead to trouble elsewhere down the line, but it works in my tests.
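To see what returning self from __deepcopy__ does in isolation, here is a small sketch in plain Python. `ToyAccelerator` and `ToyEstimator` are hypothetical stand-ins (no accelerate, skorch or sklearn involved); the point is only that copy.deepcopy still copies the surrounding object while keeping the very same accelerator instance.

```python
import copy


class ToyAccelerator:
    """Hypothetical stand-in for an object that must not be duplicated."""

    def __deepcopy__(self, memo):
        # Opt out of deep copying: every "copy" is the same instance.
        return self


class ToyEstimator:
    def __init__(self, accelerator, lr):
        self.accelerator = accelerator
        self.params = {"lr": lr}


est = ToyEstimator(ToyAccelerator(), lr=0.1)
est_copy = copy.deepcopy(est)

print(est_copy is est)                          # False
print(est_copy.params is est.params)            # False
print(est_copy.accelerator is est.accelerator)  # True
```

The flip side is that all copies of the estimator now share one accelerator, so any state stored on it is shared as well.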

BenjaminBossan added a commit that referenced this issue Mar 31, 2023
Partly resolves #944

There is an issue with using skorch in a multi-GPU setting with
accelerate. After some searching, it turns out there were two problems:

1. skorch did not call `accelerator.gather_for_metrics`, which resulted
   in `y_pred` not having the correct size. For more on this, consult the
   [accelerate
   docs](https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation).

2. accelerate has an issue with being deepcopied, which happens for
   instance when using GridSearchCV. The problem is that some references
   get messed up, causing the GradientState of the accelerator
   instance and of the dataloader to diverge. Therefore, the
   accelerator did not "know" when the last batch was encountered and was
   thus unable to remove the dummy samples added for multi-GPU inference.

The fix for 1. is provided in this PR. For 2., there is no solution in
skorch, but a possible (maybe hacky) fix is suggested in the docs. The
fix consists of writing a custom Accelerator class that overrides
__deepcopy__ to just return self. I don't know enough about accelerate
internals to determine if this is a safe solution or if it can cause
more issues down the line, but it resolves the issue.

Since reproducing this bug requires a multi-GPU setup and running the
scripts with the accelerate launcher, it cannot be covered by normal
unit tests. Instead, this PR adds two scripts to reproduce the issue.
With the appropriate hardware, they can be used to check the solution.
@Raphaaal
Author

Raphaaal commented Apr 1, 2023

EDIT: changed nn.LogSoftmax to nn.Softmax


Thanks for your reply.

I conducted some tests that seem conclusive.

I compared running the following script on:

  • 1 GPU (Skorch without Accelerate)
  • 1 GPU (Skorch with Accelerate)
  • 3 GPUs (Skorch with Accelerate)
import torch
import torch.nn as nn
import numpy as np
import random
from skorch import NeuralNetClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from accelerate import Accelerator
from skorch.hf import AccelerateMixin
from sklearn.metrics import average_precision_score


# Reproducibility

SEED = 42
def seed_everything(seed=42):
    torch.manual_seed(seed)
    random.seed(seed)
    np.random.seed(seed)
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.dense0 = nn.Linear(100, 2)
        self.nonlin = nn.Softmax(dim=-1)

    def forward(self, X):
        X = self.dense0(X)
        X = self.nonlin(X)
        return X


class AcceleratedNeuralNetClassifier(AccelerateMixin, NeuralNetClassifier):
    def evaluation_step(self, batch, training=False):
        output = super().evaluation_step(batch, training=training)
        return self.accelerator.gather_for_metrics(output)


class SkorchAccelerator(Accelerator):
    def __deepcopy__(self, memo):
        return self


seed_everything()
X, y = make_classification(
    1_000, 100, 
    n_informative=5, random_state=SEED, flip_y=0.1
)
X = X.astype(np.float32)
y = y.astype(np.int64)

accelerator = SkorchAccelerator()

for i in range(3):
    seed_everything()
    model_skorch = AcceleratedNeuralNetClassifier(
        accelerator=accelerator, module=MyModule, 
        max_epochs=1, verbose=False, batch_size=10, callbacks="disable"
    )
    gs = GridSearchCV(
        estimator=model_skorch,
        param_grid={
            "lr": [0.1, 0.001],
        },
        scoring="average_precision",
        n_jobs=1,
        cv=2,
        verbose=0,
        refit=False,
    )
    gs.fit(X, y)

    if accelerator.is_local_main_process:
        print(f"{gs.cv_results_['params']=}")
        print(f"{gs.cv_results_['mean_test_score']=}")

    # Manual refit
    best_model_skorch = AcceleratedNeuralNetClassifier(
        accelerator=accelerator, module=MyModule, 
        max_epochs=1, verbose=False, batch_size=10, callbacks="disable",
        **gs.best_params_
    )
    best_model_skorch.fit(X, y)
    preds = best_model_skorch.predict_proba(X)[:, 1]
    score = average_precision_score(y, preds)

    if accelerator.is_local_main_process:
        print(f"{score=}")
        print("-"*10)
  • 1 GPU (Skorch without Accelerate)

Running CUBLAS_WORKSPACE_CONFIG=':4096:8' python script.py
NB: for this run, the script needs a few edits to remove all Accelerate usage, namely:

...

# accelerator = SkorchAccelerator()

...

# model_skorch = AcceleratedNeuralNetClassifier(
model_skorch = NeuralNetClassifier(
    # accelerator=accelerator, 
    module=MyModule, 
    max_epochs=1, verbose=False, batch_size=10, callbacks="disable"
)

...

# if accelerator.is_local_main_process:
print(f"{gs.cv_results_['params']=}")
print(f"{gs.cv_results_['mean_test_score']=}")

...

# best_model_skorch = AcceleratedNeuralNetClassifier(
best_model_skorch = NeuralNetClassifier(
    # accelerator=accelerator, 
    module=MyModule, 
    max_epochs=1, verbose=False, batch_size=10, callbacks="disable",
    **gs.best_params_
)

...

# if accelerator.is_local_main_process:
print(f"{score=}")
print("-"*10)

Output:

gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
  • 1 GPU (Skorch with Accelerate)

Running CUBLAS_WORKSPACE_CONFIG=':4096:8' accelerate launch script.py.
Output:

gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.74681354, 0.54288419])
score=0.8161235660884453
----------
  • 3 GPUs (Skorch with Accelerate)

Running CUBLAS_WORKSPACE_CONFIG=':4096:8' accelerate launch script.py
Output:

gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.7486652, 0.5338035])
score=0.8295460156186707
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.7486652, 0.5338035])
score=0.8295460156186707
----------
gs.cv_results_['params']=[{'lr': 0.1}, {'lr': 0.001}]
gs.cv_results_['mean_test_score']=array([0.7486652, 0.5338035])
score=0.8295460156186707
----------

Seems reasonable to me. Thanks a lot.

@BenjaminBossan
Collaborator

Thanks a lot for testing, the results look very reasonable. They're not 100% the same for 3 GPUs, but I think that's to be expected.

I will update this thread if I get more feedback from accelerate devs. For now, I think we can close the issue but if you encounter new problems, feel free to re-open.

BenjaminBossan added a commit that referenced this issue Apr 28, 2023
Partly resolves #944

There is an issue with using skorch in a multi-GPU setting with
accelerate. After some searching, it turns out there were two problems:

1. skorch did not call `accelerator.gather_for_metrics`, which resulted
   in `y_pred` not having the correct size. For more on this, consult the
   [accelerate
   docs](https://huggingface.co/docs/accelerate/quicktour#distributed-evaluation).

2. accelerate has an issue with being deepcopied, which happens for
   instance when using GridSearchCV. The problem is that some references
   get messed up, causing the GradientState of the accelerator
   instance and that of the dataloader to diverge. Therefore, the
   accelerator did not "know" when the last batch was encountered and was
   thus unable to remove the dummy samples added for multi-GPU inference.
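
To make the size mismatch from point 1 and the dummy samples from point 2 concrete, here is a minimal pure-Python sketch. It only simulates the idea: the helper names are hypothetical, and this is not how accelerate actually shards data or implements `gather_for_metrics` internally.

```python
def shard_with_padding(n_samples, n_procs):
    """Split sample indices evenly across processes.

    If n_samples is not divisible by n_procs, indices from the start are
    repeated as dummy samples so that every process gets the same count.
    (Assumption: a simplified stand-in for multi-GPU data sharding.)
    """
    indices = list(range(n_samples))
    pad = (-n_samples) % n_procs
    padded = indices + indices[:pad]
    per_proc = len(padded) // n_procs
    return [padded[i * per_proc:(i + 1) * per_proc] for i in range(n_procs)]


def naive_gather(shards):
    """Concatenate per-process outputs, keeping the dummy samples."""
    return [idx for shard in shards for idx in shard]


def gather_dropping_dummies(shards, n_samples):
    """Concatenate per-process outputs, then drop the trailing dummies.

    Only possible if the true number of samples is known, which mirrors
    why the accelerator has to "know" when the last batch was reached.
    """
    return naive_gather(shards)[:n_samples]


# 500 samples (one fold of a 2-fold split of 1,000 rows) on 3 processes
shards = shard_with_padding(n_samples=500, n_procs=3)
naive = naive_gather(shards)
fixed = gather_dropping_dummies(shards, n_samples=500)

print(len(naive))  # 501: one dummy sample too many for y_true of length 500
print(len(fixed))  # 500: matches y_true again
```

With the naive gather, `y_pred` is longer than `y_true`, which is the kind of mismatch a scikit-learn scorer would reject.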

The fix for 1. is provided in this PR. For 2., there is no solution in
skorch, but a possible (maybe hacky) fix is suggested in the docs. The
fix consists of writing a custom Accelerator class that overrides
__deepcopy__ to just return self. I don't know enough about accelerate
internals to determine if this is a safe solution or if it can cause
more issues down the line, but it resolves the issue.
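
As a rough illustration of why returning `self` from `__deepcopy__` helps, consider the toy sketch below. All classes here are hypothetical stand-ins, not accelerate's real classes, and the sharing mechanism is deliberately simplified: it only shows how a deep copy can leave the copied object and an already-created dataloader looking at different state objects.

```python
import copy


class ToyState:
    """Toy stand-in for state shared by an accelerator and its dataloader."""

    def __init__(self):
        self.end_of_dataloader = False


class ToyLoader:
    """Toy dataloader that flags the shared state when it is exhausted."""

    def __init__(self, state):
        self.state = state

    def finish(self):
        self.state.end_of_dataloader = True


class ToyAccelerator:
    """Toy accelerator holding a reference to the shared state."""

    def __init__(self):
        self.state = ToyState()

    def make_loader(self):
        return ToyLoader(self.state)


class CopySafeToyAccelerator(ToyAccelerator):
    """Same as above, but deep copies return the very same instance."""

    def __deepcopy__(self, memo):
        return self


# Default deepcopy: the copy gets its own state and misses the update.
acc = ToyAccelerator()
loader = acc.make_loader()
acc_copy = copy.deepcopy(acc)  # e.g. what cloning an estimator may trigger
loader.finish()
print(acc.state.end_of_dataloader)       # True
print(acc_copy.state.end_of_dataloader)  # False: the states have diverged

# Overridden __deepcopy__: the "copy" is the same object, so nothing diverges.
safe = CopySafeToyAccelerator()
safe_loader = safe.make_loader()
safe_copy = copy.deepcopy(safe)
safe_loader.finish()
print(safe_copy.state.end_of_dataloader)  # True
```

The trade-off is that every "copy" now shares one accelerator instance, which is exactly the caveat raised above about possible side effects.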

Since reproducing this bug requires a multi-GPU setup and running the
scripts with the accelerate launcher, it cannot be covered by normal
unit tests. Instead, this PR adds two scripts to reproduce the issue.
With the appropriate hardware, they can be used to check the solution.