{ai}[foss/2022b] PyTorch v1.13.1 #18421
Conversation
…-1.13.1_fix-gcc-12-compilation.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_increase-tolerance-test_ops.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch
Test report by @Flamefire
Test report by @Flamefire
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)...
Test report by @boegelbot
…asyconfigs into 20230731133908_new_pr_PyTorch1131
@boegelbot please test @ jsc-zen2
@casparvl: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)...
@casparvl Again it is Can you check the log for
Test report by @casparvl
Different test for me...
Thanks for that, this really helps: when used manually, that example reproduces the failure reliably on my machine too. And it makes me think we have a real bug here: the input can be further reduced to all. This is either a bug in PyTorch or in the compiler, and since it affects more than this EC, I think we/I should investigate further. For reference, the reduced test case is attached: reproduce_pytorch_quantization_fail.py. Reported upstream: pytorch/pytorch#107030
Test report by @branfosj
`test_sigmoid_non_observed`
`TestGradientsCPU.test_forward_mode_AD_linalg_det_singular_cpu_complex128`
Test report by @boegel
@boegel The failure you see is fixed by the updated ECs in #18413; I have now mentioned this in the PR description. @branfosj The quantization test failure is pytorch/pytorch#107030, which happens randomly due to the random test input. As there is no fix that I can see, I'll allow a couple of tests to fail to compensate.
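As a rough illustration of the "allow a couple of tests to fail" approach, the tolerance logic might look like the sketch below. Note that `KNOWN_FLAKY`, `MAX_ALLOWED_FAILURES`, and `check_failures` are hypothetical names invented for this example, not the actual EasyBuild easyblock API:

```python
# Hypothetical sketch: tolerate a small number of known-flaky test failures
# during an installation's test step. Names and thresholds are illustrative.
KNOWN_FLAKY = {"test_quantization"}  # fails randomly due to random test input
MAX_ALLOWED_FAILURES = 2

def check_failures(failed_suites):
    """Raise unless all failures are known-flaky and within the allowance."""
    unexpected = [s for s in failed_suites if s not in KNOWN_FLAKY]
    if unexpected:
        raise RuntimeError(f"Unexpected test failures: {unexpected}")
    if len(failed_suites) > MAX_ALLOWED_FAILURES:
        raise RuntimeError(f"Too many test failures: {failed_suites}")

check_failures(["test_quantization"])  # tolerated: known-flaky, within limit
print("ok")
```

The key design point is that the allowance is scoped to named suites, so an unrelated failure still aborts the build.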
…w 2 failing test (test_quantization may randomly fail)
The relevant lines are (from a different build to the test report):
It is not matching for the other type:
@branfosj I was referring to the old(er) regexes that are meant to match the counts of failures and tests for each test suite; they seem to be faulty, as only 1 failure is counted while there are at least 2.
This part would be for the new regex (i.e. the one you mention below), unless it is already matched elsewhere:
Was this really what you meant here? That (new) regex doesn't match what you quoted, so I'm confused...
This should be matched by https://github.com/easybuilders/easybuild-easyblocks/blob/3f95af4acb2d8c86728027ec0688ca357e6e1808/easybuild/easyblocks/p/pytorch.py#L325-L327 but likely isn't, due to the newly added time format. So from a first check, the 2nd regexp needs to be updated, but somehow the results (1 failure counted) seem to be the other way round. Can you attach the log? I think it makes sense to turn parts of that into a new test case and fix them.
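To illustrate how an added time format can break a summary-line regex: the lines and the `[TIME]` suffix below are fabricated for this sketch (the real format comes from PyTorch's test runner), but the mechanism is the same: a pattern anchored at end-of-line silently stops matching once a duration suffix is appended.

```python
import re

# Hypothetical summary lines; the "[TIME]" suffix is illustrative only.
old_line = "test_ops_gradients failed!"
new_line = "test_ops_gradients failed! [TIME] 1234.56 seconds"

# A pattern anchored at end-of-line stops matching once a suffix appears:
strict = re.compile(r"^(?P<suite>\S+) failed!$", re.M)

# Making the trailing time segment optional keeps both formats matching:
relaxed = re.compile(r"^(?P<suite>\S+) failed!(?: \[TIME\] [\d.]+ seconds)?$", re.M)

print(bool(strict.search(old_line)))            # True
print(bool(strict.search(new_line)))            # False
print(relaxed.search(new_line).group("suite"))  # test_ops_gradients
```

This is also why turning such log excerpts into unit-test cases (as suggested above) pays off: the format drift is invisible until a count quietly goes wrong.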
Sorry, I'd got confused between that and the changes in easybuilders/easybuild-easyblocks#2983. The log is at https://gist.github.com/branfosj/39d6c72617b71589101fd6bb5870d8ad. I think the issue is related to the sharding: we are running the test in two parts:
So there is one failure in each shard. Also, looking at that log, we are stopping at the first failure in each shard.
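The sharding problem can be sketched as follows, assuming a unittest-style `FAILED (failures=N)` summary per shard (the log excerpt is fabricated for illustration): a counter that only looks at the last summary undercounts, while summing across all per-shard summaries does not.

```python
import re

# Hypothetical log excerpt: each shard prints its own summary, so a regex
# that keeps only the last match undercounts the failures.
log = """\
Running test_ops_gradients (shard 1/2)
FAILED (failures=1)
Running test_ops_gradients (shard 2/2)
FAILED (failures=1)
test_ops_gradients failed!
"""

counts = re.findall(r"FAILED \(failures=(\d+)\)", log)

# Taking only the last summary sees a single failure:
last = int(counts[-1])

# Summing over every per-shard summary counts both:
total = sum(int(n) for n in counts)

print(last, total)  # 1 2
```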
@branfosj Thanks. That log is from a different run than #18421 (comment) though, isn't it? I see 4 valid failures in the log but only 2 reported by the old code. You are right, though, that one issue is the sharding. With my latest change I get 3 of the 4 errors counted (and all 4 reported by the newly introduced regex collecting single failures), but I don't see how we can count the last one without counting something twice under the current approach, which also attributes the failures to test suites.
I.e. it prints the logs of all shards and only then the "test_ops_gradients failed!" message, so our regex matches only the 2nd/last summary. I see 4 solutions:
Edit: As for 4.:
@branfosj I patched out the sharding, so our error counting should work; with the allowed failures, I'd say this should now also pass for you.
Test report by @branfosj
@boegelbot please test @ generoso
@branfosj: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)...
@boegelbot please test @ jsc-zen2
@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster PR test command '
Test results coming soon (I hope)...
Test report by @boegelbot
Test report by @boegelbot
Going in, thanks @Flamefire!
(created using `eb --new-pr`)

This was a bit of a struggle, as PyTorch 1.13 is compatible with neither GCC 12 nor Python 3.11. However, I think I was able to backport enough of the relevant changes from PyTorch 2 for this to work. I'm also a bit wary of jumping straight to PyTorch 2, and users might want to have both PyTorch 1.x and 2.x anyway.
IMPORTANT: Failures related to Abseil can be fixed by reinstalling it with the updated ECs from #18413.