{ai}[foss/2022b] PyTorch v1.13.1 #18421

Merged

Conversation

@Flamefire (Contributor) commented Jul 31, 2023

(created using eb --new-pr)

This is a bit of a struggle, as PyTorch 1.13 is compatible with neither GCC 12 nor Python 3.11. However, I think I was able to backport enough of the changes from PyTorch 2 to make this work. I'm a bit wary of jumping straight to PyTorch 2, and users might want to have both PyTorch 1.x and 2.x anyway.

IMPORTANT: Failures related to Abseil can be fixed by reinstalling it with the updated ECs from #18413.

…-1.13.1_fix-gcc-12-compilation.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_increase-tolerance-test_ops.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch
@Flamefire marked this pull request as draft July 31, 2023 11:39
@Flamefire (Contributor Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusml30 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/de869fd0841ac2b1bd88ce693e4dcfb3 for a full test report.

@Flamefire (Contributor Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
taurusi8018 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/e3f066f904d0f4f0b6af026ba0e52d31 for a full test report.

@Flamefire marked this pull request as ready for review August 9, 2023 14:56
@casparvl (Contributor) commented Aug 9, 2023

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=18421 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_18421 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11418

Test results coming soon (I hope)...

- notification for comment with ID 1671683522 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
cnx3 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/360443966dec05983156f07d7214e9d4 for a full test report.

@casparvl (Contributor)

@boegelbot please test @ jsc-zen2

@boegelbot (Collaborator)

@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=18421 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_18421 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3118

Test results coming soon (I hope)...

- notification for comment with ID 1672663950 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@Flamefire (Contributor Author)

@casparvl Again it is test_quantization that failed. I see the same in #18424 and am wondering whether it is always the same test. I created an update for the easyblock to help with that in the future.

Can you check the log for "FAIL: " or "ERROR: " lines? For me it was test_add_scalar_relu that failed.

@casparvl (Contributor)

Test report by @casparvl
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/711a09587dee94466a4fff6082a4f58d for a full test report.

@casparvl (Contributor)

Different test for me...

======================================================================
FAIL: test_sigmoid (quantization.core.test_quantized_op.TestQuantizedOps)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-nkhrtryq/tmpancn1h4i/lib/python3.10/site-packages/torch/testing/_internal/common_quantization.py", line 283, in wrapper
    fn(*args, **kwargs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/quantization/core/test_quantized_op.py", line 311, in test_sigmoid
    @given(X=hu.tensor(shapes=hu.array_shapes(1, 5, 1, 5),
  File "/scratch-nvme/1/casparl/generic/software/hypothesis/6.68.2-GCCcore-12.2.0/lib/python3.10/site-packages/hypothesis/core.py", line 1396, in wrapped_test
    raise the_error_hypothesis_found
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/quantization/core/test_quantized_op.py", line 325, in test_sigmoid
    self._test_activation_function(X, 'sigmoid', sigmoid_test_configs)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/quantization/core/test_quantized_op.py", line 228, in _test_activation_function
    self.assertEqual(qY, qY_hat, msg='{} - {} failed: ({} vs. {})'.format(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-nkhrtryq/tmpancn1h4i/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2470, in assertEqual
    assert_equal(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-nkhrtryq/tmpancn1h4i/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Quantized tensor-likes are not close!

Mismatched elements: 63 / 75 (84.0%)
Greatest absolute difference: 0.00390625 at index (0, 0, 1) (up to 1e-05 allowed)
Greatest relative difference: 0.0078125 at index (0, 0, 1) (up to 1.3e-06 allowed) : sigmoid - quantized.sigmoid failed: (tensor([[[0.0000, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039]],

        [[0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039]],

        [[0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5039],
         [0.5039, 0.5039, 0.5039, 0.5039, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]]], size=(3, 5, 5),
       dtype=torch.quint8, quantization_scheme=torch.per_tensor_affine,
       scale=0.00390625, zero_point=0) vs. tensor([[[0.0000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]],

        [[0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]],

        [[0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000],
         [0.5000, 0.5000, 0.5000, 0.5000, 0.5000]]], size=(3, 5, 5),
       dtype=torch.quint8, quantization_scheme=torch.per_tensor_affine,
       scale=0.00390625, zero_point=0))
Falsifying example: test_sigmoid(
    X=(array([[[-261630.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.]],

            [[    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.]],

            [[    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.],
             [    514.,     514.,     514.,     514.,     514.]]],
           dtype=float32), (1028.0156862745098, 255, torch.quint8)),
    self=<quantization.core.test_quantized_op.TestQuantizedOps testMethod=test_sigmoid>,
)

----------------------------------------------------------------------
Ran 942 tests in 656.469s

FAILED (failures=2, errors=1, skipped=72)

@Flamefire (Contributor Author) commented Aug 10, 2023

Thanks for that, this really helps: when run manually, that example reproduces the failure reliably on my machine too.

And it makes me think we have a real bug here: the input can be further reduced to all 514., and the function is element-wise. Of the output, only the first 64 elements are wrong: they differ by exactly one "scale" step (scale=0.00390625, with 0.5 = 128*scale and 0.5039... = 129*scale).
That makes me think there is an off-by-one error in the vectorized part, because only the last 11 elements are correct.

This is either a bug in PyTorch or in the compiler, and as it affects more than this EC, I think we should investigate further.

For reference the reduced test case is attached: reproduce_pytorch_quantization_fail.py

Reported upstream: pytorch/pytorch#107030
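
For illustration, a minimal sketch along the lines of that reproducer (not the attached file itself): the shape and input quantization parameters are taken from the falsifying example above, and the output scale/zero-point of 1/256 and 0 are assumptions based on the printed tensors (0.00390625 == 1/256).

import numpy as np
import torch

# Input from the falsifying example, further reduced: every element 514.0,
# quantized with the reported input parameters.
X = torch.from_numpy(np.full((3, 5, 5), 514.0, dtype=np.float32))
qX = torch.quantize_per_tensor(X, scale=1028.0156862745098, zero_point=255,
                               dtype=torch.quint8)

# Reference: dequantize, apply float sigmoid, re-quantize with the output
# parameters seen in the failure (scale=0.00390625 == 1/256, zero_point=0).
out_scale, out_zp = 1.0 / 256.0, 0
qY_ref = torch.quantize_per_tensor(torch.sigmoid(qX.dequantize()),
                                   scale=out_scale, zero_point=out_zp,
                                   dtype=torch.quint8)

# Quantized kernel under test.
qY = torch.ops.quantized.sigmoid(qX, out_scale, out_zp)

# On an affected build, most elements are off by exactly one quantization
# step (129 * scale instead of 128 * scale).
print((qY.int_repr() != qY_ref.int_repr()).sum().item())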

@branfosj (Member)

Test report by @branfosj
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#2983
FAILED
Build succeeded for 6 out of 7 (3 easyconfigs in total)
bear-pg0105u03b.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/5df29b00c85e9c346cc0c79d797b858d for a full test report.

@branfosj (Member)

`test_sigmoid_non_observed`
======================================================================
FAIL: test_sigmoid_non_observed (quantization.core.test_quantized_op.TestQuantizedOps)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/testing/_internal/common_quantized.py", line 172, in test_fn
    qfunction(*args, **kwargs)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/quantization/core/test_quantized_op.py", line 292, in test_sigmoid_non_observed
    @given(X=hu.tensor(shapes=hu.array_shapes(1, 5, 1, 5),
  File "/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/hypothesis/6.68.2-GCCcore-12.2.0/lib/python3.10/site-packages/hypothesis/core.py", line 1396, in wrapped_test
    raise the_error_hypothesis_found
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/quantization/core/test_quantized_op.py", line 305, in test_sigmoid_non_observed
    self._test_activation_function(X, 'sigmoid', sigmoid_test_configs)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/quantization/core/test_quantized_op.py", line 228, in _test_activation_function
    self.assertEqual(qY, qY_hat, msg='{} - {} failed: ({} vs. {})'.format(
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 2470, in assertEqual
    assert_equal(
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/testing/_comparison.py", line 1093, in assert_equal
    raise error_metas[0].to_error(msg)
AssertionError: Quantized tensor-likes are not close!

Mismatched elements: 63 / 81 (77.8%)
Greatest absolute difference: 0.00390625 at index (0, 0, 0, 1) (up to 1e-05 allowed)
Greatest relative difference: 0.0078125 at index (0, 0, 0, 1) (up to 1.3e-06 allowed) : sigmoid - <built-in method sigmoid of type object at 0x7fa9ed562640> failed: (tensor([[[[0.0000, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039]],

         [[0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039]],

         [[0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039]]],


        [[[0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039]],
         [[0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039]],

         [[0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039],
          [0.5039, 0.5039, 0.5039]]],
  
    
        [[[0.5039, 0.5039, 0.5039],
          [0.5039, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],
    
         [[0.5039, 0.5039, 0.5039],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],
    
         [[0.5039, 0.5039, 0.5039],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]]]], size=(3, 3, 3, 3), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.00390625,
       zero_point=0) vs. tensor([[[[0.0000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],

         [[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],

         [[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]]],


        [[[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],

         [[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],

         [[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]]],
         
          
        [[[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],
        
         [[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]],
         
         [[0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000],
          [0.5000, 0.5000, 0.5000]]]], size=(3, 3, 3, 3), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.00390625,
       zero_point=0))
Falsifying example: test_sigmoid_non_observed(
    X=(array([[[[-261630.,     514.,     514.],
              [    514.,     514.,     514.],
              [    514.,     514.,     514.]],
          
             [[    514.,     514.,     514.],
              [    514.,     514.,     514.],
              [    514.,     514.,     514.]],
     
             [[    514.,     514.,     514.],
              [    514.,     514.,     514.],
              [    514.,     514.,     514.]]],
     
     
            [[[    514.,     514.,     514.],
              [    514.,     514.,     514.],
              [    514.,     514.,     514.]],
     
             [[    514.,     514.,     514.],
              [    514.,     514.,     514.],
              [    514.,     514.,     514.]],
`TestGradientsCPU.test_forward_mode_AD_linalg_det_singular_cpu_complex128`
______________________________________________________ TestGradientsCPU.test_forward_mode_AD_linalg_det_singular_cpu_complex128 _______________________________________________________
Traceback (most recent call last):
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test_ops_gradients.py", line 251, in test_forward_mode_AD
    self._forward_grad_helper(device, dtype, op, op.get_op(), is_inplace=False)
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test_ops_gradients.py", line 239, in _forward_grad_helper
    call_grad_test_helper()
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test_ops_gradients.py", line 236, in call_grad_test_helper
    self._grad_test_helper(device, dtype, op, variant, check_forward_ad=True, check_backward_ad=False,
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test_ops_gradients.py", line 139, in _grad_test_helper
    return self._check_helper(device, dtype, op, variant, 'gradcheck', check_forward_ad=check_forward_ad,
  File "/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test_ops_gradients.py", line 108, in _check_helper
    self.assertTrue(gradcheck(fn, gradcheck_args,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3327, in gradcheck
    return torch.autograd.gradcheck(fn, inputs, **kwargs)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1418, in gradcheck
    return _gradcheck_helper(**args)
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1432, in _gradcheck_helper
    _gradcheck_real_imag(gradcheck_fn, func, func_out, tupled_inputs, outputs, eps,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1086, in _gradcheck_real_imag
    gradcheck_fn(imag_fn, imag_func_out, imag_inputs, diff_imag_func_out, eps,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1311, in _fast_gradcheck
    _check_analytical_numerical_equal(analytical_vJu, numerical_vJu, complex_indices,
  File "/dev/shm/branfosj/tmp-up-EL8/eb-ovi_2jjn/tmpfm2kior8/lib/python3.10/site-packages/torch/autograd/gradcheck.py", line 1283, in _check_analytical_numerical_equal
    raise GradcheckError(_get_notallclose_msg(a, n, j, i, complex_indices, test_imag, is_forward_ad) + jacobians_str)
torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex inputs only, Jacobian computed with forward mode mismatch for output 0 with respect to input 0,
numerical:tensor([-0.1722-0.0510j,  0.0925-0.0695j,  0.1424+0.0206j,  0.0174-0.0618j],
       dtype=torch.complex128)
analytical:tensor([-0.0000+0.0000j, 0.0925-0.0695j, 0.1424+0.0206j, 0.0174-0.0618j],
       dtype=torch.complex128, grad_fn=<CopyBackwards>)

The above quantities relating the numerical and analytical jacobians are computed 
in fast mode. See: https://github.com/pytorch/pytorch/issues/53876 for more background 
about fast mode. Below, we recompute numerical and analytical jacobians in slow mode:

Numerical:
 tensor([[-0.0112-0.4397j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.1399+0.1964j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.2135+1.1259j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.2874-0.1255j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.4532+0.1881j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.3703-0.1188j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1231+0.1741j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 1.0020+0.1501j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.1885+0.2033j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0249+0.4331j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0104+0.4476j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.1520-0.1927j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.1611-1.1551j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.2859+0.1418j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.4700-0.1687j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.6285+0.2119j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.2056-0.2999j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-1.7049-0.2809j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.3266-0.3420j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0316-0.7392j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.3182-0.3713j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0171+0.2675j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.9740+0.8209j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.3359+0.0935j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.2587+0.4802j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.6985+0.0645j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.2969+0.6312j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.1866-0.3573j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0442-0.1377j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0535+0.1101j,  0.0000+0.0000j,  0.0000+0.0000j], 
        [ 0.0000+0.0000j,  0.4200+0.7239j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.8191+0.1475j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.4764-0.0655j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.1644-0.0522j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0701+0.1281j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.5101+0.0255j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.1967+0.4683j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.1248-0.2656j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0279-0.1015j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0423+0.0784j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.5237+0.1324j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0536-0.5345j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0438+0.3072j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0032+0.1113j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0671-0.0662j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.6072-0.2544j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0322+0.6539j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0029-0.3782j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j, -0.0241-0.1335j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0928+0.0676j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1755+0.2521j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1530+0.0077j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0345-0.1766j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0285-0.1272j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0042-0.0547j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.4295+0.0305j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1191+0.1787j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1879+0.1682j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1736+0.0569j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0688+0.0342j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.2955+0.5644j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.3128+0.0557j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1170-0.3543j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0255-0.2691j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0056-0.1136j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.2919+0.4545j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.2684+0.0236j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0723-0.3080j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0417-0.2254j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0038-0.0963j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.5213+0.0991j,  0.0000+0.0000j], 
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.1982-0.1754j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1708-0.2597j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.1889-0.1226j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j, -0.0713-0.0624j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0720+0.2371j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0247-0.2584j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.1370-0.2756j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0175-0.1092j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.1252+0.0815j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.2908+0.1945j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.3592-0.0726j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.3160-0.2980j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.1546-0.0211j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.1615-0.1357j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0195-0.1104j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0251+0.1147j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0466+0.1312j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0138+0.0481j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0606-0.0298j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.1967-0.0940j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.1531+0.1694j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.2602+0.0746j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0605+0.0761j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.0214-0.1297j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j, -0.1634+0.1431j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.1013-0.2037j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.2296-0.1416j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0375-0.0893j],
        [ 0.0000+0.0000j,  0.0000+0.0000j,  0.0000+0.0000j,  0.0554+0.1187j]],
       dtype=torch.complex128)
Analytical:
tensor([[ 0.0000-0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j], 
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000-0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000-0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000-0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [ 0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.6985+0.0645j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.2969+0.6312j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.1866-0.3573j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0442-0.1377j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0535+0.1101j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.4200+0.7239j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.8191+0.1475j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.4764-0.0655j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.1644-0.0522j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.0701+0.1281j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.5101+0.0255j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.1967+0.4683j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.1248-0.2656j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0279-0.1015j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0423+0.0784j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.5237+0.1324j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0536-0.5345j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.0438+0.3072j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0032+0.1113j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.0671-0.0662j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.6072-0.2544j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0322+0.6539j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.0029-0.3782j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j, -0.0241-0.1335j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0928+0.0676j,  0.0000+0.0000j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.1755+0.2521j,  0.0000+0.0000j], 
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.1530+0.0077j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0345-0.1766j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.0285-0.1272j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.0042-0.0547j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.4295+0.0305j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.1191+0.1787j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.1879+0.1682j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.1736+0.0569j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0688+0.0342j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.2955+0.5644j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.3128+0.0557j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.1170-0.3543j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.0255-0.2691j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0056-0.1136j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.2919+0.4545j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.2684+0.0236j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0723-0.3080j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.0417-0.2254j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.0038-0.0963j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.5213+0.0991j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.1982-0.1754j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.1708-0.2597j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.1889-0.1226j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j, -0.0713-0.0624j,  0.0000+0.0000j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0720+0.2371j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0247-0.2584j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.1370-0.2756j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0175-0.1092j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.1252+0.0815j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.2908+0.1945j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.3592-0.0726j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.3160-0.2980j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.1546-0.0211j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.1615-0.1357j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0195-0.1104j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0251+0.1147j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0466+0.1312j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0138+0.0481j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0606-0.0298j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.1967-0.0940j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.1531+0.1694j], 
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.2602+0.0746j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0605+0.0761j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.0214-0.1297j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j, -0.1634+0.1431j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.1013-0.2037j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.2296-0.1416j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0375-0.0893j],
        [-0.0000+0.0000j,  0.0000-0.0000j,  0.0000+0.0000j,  0.0554+0.1187j]],
       dtype=torch.complex128, grad_fn=<CopySlices>)

The max per-element difference (slow mode) is: 1.7278800558183407.

@easybuilders deleted a comment from boegelbot Aug 12, 2023
@boegel added the update label Aug 12, 2023
@boegel added this to the 4.x milestone Aug 12, 2023
@easybuilders deleted a comment from boegelbot Aug 12, 2023
@easybuilders deleted a comment from boegelbot Aug 12, 2023
@boegel (Member) commented Aug 12, 2023

Test report by @boegel
FAILED
Build succeeded for 3 out of 4 (3 easyconfigs in total)
node3112.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/4c4bccd1300f4b0dbda5680a27923331 for a full test report.

@Flamefire (Contributor Author)

@boegel The failure you see is fixed by the updated ECs in #18413

I've now mentioned this in the PR description.

@branfosj The quantization test is pytorch/pytorch#107030, which fails randomly due to the random test input. I'll allow a couple of tests to fail to compensate, as there is no fix I can see.
However the log says: "1 test failure, 0 test errors (out of 87220):" which is incomplete. Looks like we need to update our regex in the easyblock as only my fallback catches test_ops_gradients as failing. Can you look for the lines containing test_ops_gradients in the log or attach it?
As for the failure, I'll look into forward-porting a patch we have in 1.12: PyTorch-1.12.1_skip-failing-grad-test.patch. That looks like it may work around the failure.
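
As a sketch of the "allow a couple of tests to fail" part: assuming the PyTorch easyblock's max_failed_tests parameter is the mechanism used here, the easyconfig change would look roughly like this (the value is illustrative):

# In the easyconfig: tolerate a small number of flaky test failures;
# the easyblock still reports which tests failed.
max_failed_tests = 2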

…w 2 failing test (test_quantization may randomly fail)
@branfosj (Member)

However the log says: "1 test failure, 0 test errors (out of 87220):" which is incomplete. Looks like we need to update our regex in the easyblock as only my fallback catches test_ops_gradients as failing. Can you look for the lines containing test_ops_gradients in the log or attach it?

The relevant lines are (from a different build than the test report):

FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_grad_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex outputs only, Ja...
FAILED test_ops_gradients.py::TestGradientsCPU::test_forward_mode_AD_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex inputs o...

regex = r"^[=-]+\n(FAIL|ERROR): (test_.*?)\s\(.*\n[=-]+\n" matches the information in the failure summary that includes

Ran 942 tests in 552.066s

FAILED (failures=2, skipped=73)

It is not matching for the other type:

========================================================================================== short test summary info ===========================================================================================
FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_grad_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex outputs only, Ja...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 1 failed, 468 passed, 618 skipped, 14 deselected, 18 xfailed, 17 warnings, 2 rerun in 106.23s (0:01:46) ===================================================
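
For illustration, a minimal standalone check of the two output formats against that pattern (inputs abbreviated from the logs above):

import re

pattern = re.compile(r"^[=-]+\n(FAIL|ERROR): (test_.*?)\s\(.*\n[=-]+\n", re.M)

# unittest-style failure header, as in the test_quantization log above
unittest_style = (
    "======================================================================\n"
    "FAIL: test_sigmoid (quantization.core.test_quantized_op.TestQuantizedOps)\n"
    "----------------------------------------------------------------------\n"
)

# pytest-style short summary, as in the sharded test_ops_gradients run
pytest_style = (
    "=============== short test summary info ===============\n"
    "FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_grad_linalg_det_singular_cpu_complex128 - ...\n"
)

print(pattern.findall(unittest_style))  # [('FAIL', 'test_sigmoid')]
print(pattern.findall(pytest_style))    # [] - the FAILED <file>::<class>::<test> form is not matched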

@Flamefire (Contributor Author)

@branfosj I was referring to the old(er) RegExes that are meant to match the counts of failures and tests for each test suite; those seem to be faulty, as only 1 failure is counted while there are at least 2.

FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_grad_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex outputs only, Ja...
FAILED test_ops_gradients.py::TestGradientsCPU::test_forward_mode_AD_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex inputs o...

This part would be for the new RegEx, unless it is already matched elsewhere, i.e. by the one you mention below:

regex = r"^[=-]+\n(FAIL|ERROR): (test_.*?)\s\(.*\n[=-]+\n" matches the information in the failure summary that includes

Ran 942 tests in 552.066s

FAILED (failures=2, skipped=73)

Was this really what you meant here? Because that (new) RegEx doesn't match what you quoted, so I'm confused...
This should however be matched by https://github.com/easybuilders/easybuild-easyblocks/blob/3f95af4acb2d8c86728027ec0688ca357e6e1808/easybuild/easyblocks/p/pytorch.py#L304-L312 but seemingly isn't, which we need to fix.

It is not matching for the other type:

========================================================================================== short test summary info ===========================================================================================
FAILED test_ops_gradients.py::TestGradientsCPU::test_fn_grad_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex outputs only, Ja...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 1 failed, 468 passed, 618 skipped, 14 deselected, 18 xfailed, 17 warnings, 2 rerun in 106.23s (0:01:46) ===================================================

This should be matched by https://github.com/easybuilders/easybuild-easyblocks/blob/3f95af4acb2d8c86728027ec0688ca357e6e1808/easybuild/easyblocks/p/pytorch.py#L325-L327 but likely isn't, due to the newly added time format.

So from a first check, the 2nd RegEx needs to be updated, but somehow the results (1 failure counted) seem to be the other way round.

Can you attach the log? I think it makes sense to turn parts of that into a new test-case and fix them.

@branfosj (Member)

@branfosj I was referring to the old(er) RegExes that are meant to match the counts of failures and tests for each test suite; those seem to be faulty, as only 1 failure is counted while there are at least 2.

Sorry. I'd got confused between that and the changes in easybuilders/easybuild-easyblocks#2983

Log is at https://gist.github.com/branfosj/39d6c72617b71589101fd6bb5870d8ad

I think the issue is related to the sharding - we are running the test in two parts:

Running test_ops_gradients ... [2023-08-15 11:55:52.433788]
Ignoring disabled issues:  []
Ignoring disabled issues:  []
Executing ['/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.8-GCCcore-12.2.0/bin/python', '-bb', 'test_ops_gradients.py', '-v', '--use-pytest', '-vv', '-x', '--reruns=2', '-rfEX', '--shard-id=0', '--num-shards=2', '-k=not _linalg_cholesky_'] ... [2023-08-15 11:55:53.696146]
Executing ['/rds/projects/2017/branfosj-rse/easybuild/EL8-ice/software/Python/3.10.8-GCCcore-12.2.0/bin/python', '-bb', 'test_ops_gradients.py', '-v', '--use-pytest', '-vv', '-x', '--reruns=2', '-rfEX', '--shard-id=1', '--num-shards=2', '-k=not _linalg_cholesky_'] ... [2023-08-15 11:55:53.696500]

So, there is one failure in each of the two shards. Also, looking at that log, we are stopping at the first failure in each shard.

@Flamefire (Contributor Author) commented Aug 16, 2023

@branfosj Thanks. That log is from a different run than #18421 (comment) though, isn't it? I see 4 valid failures in the log and 2 reported by the old code.

But you are right that one issue is with the sharding. With my latest change just now, I get 3 of the 4 errors counted (but all 4 reported by the newly introduced RegEx collecting single failures), and I don't see how we can count the last one without counting anything twice with the current approach, which also attributes the failures to test suites.
The output of the 2 shards is:

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 1 failed, 468 passed, 618 skipped, 14 deselected, 18 xfailed, 17 warnings, 2 rerun in 106.23s (0:01:46) ===================================================
If in CI, skip info is located in the xml test reports, please either go to s3 or the hud to download them

FINISHED PRINTING LOG FILE of test_ops_gradients (/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test-reports/test_ops_gradients_rgzvx4s2)


PRINTING LOG FILE of test_ops_gradients (/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test-reports/test_ops_gradients_bvqok09h)
============================================================================================ test session starts =============================================================================================
...
========================================================================================== short test summary info ===========================================================================================
FAILED test_ops_gradients.py::TestGradientsCPU::test_forward_mode_AD_linalg_det_singular_cpu_complex128 - torch.autograd.gradcheck.GradcheckError: While considering the imaginary part of complex inputs o...
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
================================================== 1 failed, 1252 passed, 757 skipped, 18 deselected, 30 xfailed, 81 warnings, 2 rerun in 228.37s (0:03:48) ==================================================
If in CI, skip info is located in the xml test reports, please either go to s3 or the hud to download them

FINISHED PRINTING LOG FILE of test_ops_gradients (/dev/shm/branfosj/build-up-EL8/PyTorch/1.13.1/foss-2022b/pytorch-v1.13.1/test/test-reports/test_ops_gradients_bvqok09h)

test_ops_gradients failed!
Running test_ops_jit ... [2023-08-15 11:59:44.335642]

I.e. it prints the logs of all shards and only then the "test_ops_gradients failed!" message, so our regex matches only the 2nd/last summary.

I see 4 solutions:

  1. Patch run_test.py to not shard at all (and maybe disable the exit-on-first-failure)
  2. export BUILD_ENVIRONMENT=slow-gradcheck as a hack to disable the parallelization: https://github.com/pytorch/pytorch/blob/v1.13.1/test/run_test.py#L721 although that might get forgotten when the code changes (and it seems to have changed in PyTorch 2 already and was only introduced in 1.13)
  3. Make matching the test-suite name optional for this pattern; that seems to work for this example at least
  4. Look into parsing the XML report (enabled via --save-xml) which might be the best option but requires quite some work.

Edit: As for 4.:
It needs 2 Python packages, lxml and unittest-xml-reporting, and a patch for PyTorch to propagate --save-xml.
But then it produces folders named after the tests, with 1 or more XML files containing e.g. <testsuites><testsuite name="pytest" errors="0" failures="0" skipped="127" tests="476" time="408.891" timestamp="2023-08-16T13:49:53.750990" hostname="taurusi8002"><testcase classname="TestJitCPU" name="test_jit_alias_remapping_abs_cpu_float32" time="0.063" file="test_ops_jit.py" />, which does look helpful.
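
A minimal sketch of what parsing those reports could look like, assuming the folder layout described above (the report directory path is illustrative):

from pathlib import Path
from lxml import etree

def count_test_results(report_dir):
    # Sum the counters from every <testsuite> element found in the
    # per-test report folders written via --save-xml.
    totals = {"tests": 0, "failures": 0, "errors": 0, "skipped": 0}
    for xml_file in Path(report_dir).rglob("*.xml"):
        for suite in etree.parse(str(xml_file)).iter("testsuite"):
            for key in totals:
                totals[key] += int(suite.get(key, 0))
    return totals

print(count_test_results("test/test-reports"))  # illustrative path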

@Flamefire (Contributor Author)

@branfosj I patched the sharding out, so our error counting should work, and with the allowed failures I'd say this should now pass for you as well.

@branfosj (Member)

Test report by @branfosj
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/7d87a85ff2d3b84654ac7666f8f6a1d4 for a full test report.

@branfosj (Member)

@boegelbot please test @ generoso
EB_ARGS="--include-easyblocks-from-pr 3003"
CORE_CNT=16

@boegelbot (Collaborator)

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=18421 EB_ARGS="--include-easyblocks-from-pr 3003" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_18421 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11725

Test results coming soon (I hope)...

- notification for comment with ID 1722239174 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj (Member)

@boegelbot please test @ jsc-zen2
EB_ARGS="--include-easyblocks-from-pr 3003"
CORE_CNT=16

@boegelbot (Collaborator)

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=18421 EB_ARGS="--include-easyblocks-from-pr 3003" EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_18421 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3377

Test results coming soon (I hope)...

- notification for comment with ID 1722242004 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot (Collaborator)

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/278b045827b019896a1b57a8e6093ea6 for a full test report.

@boegelbot (Collaborator)

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3003
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/731c8157702e4ba96fd3a3546cdf4fe6 for a full test report.

@branfosj branfosj modified the milestones: 4.x, next release (4.8.2?) Sep 17, 2023
@branfosj (Member)

Going in, thanks @Flamefire!

@branfosj merged commit 99bc374 into easybuilders:develop Sep 17, 2023
5 checks passed
@Flamefire deleted the 20230731133908_new_pr_PyTorch1131 branch September 18, 2023 07:26