AD global indexing sparse container is slower #17674

Open
hugary1995 opened this issue Apr 24, 2021 · 13 comments
Labels: C: Automatic Differentiation (tickets pertaining to the MetaPhysicL-based forward mode AD system), C: Modules/Solid Mechanics (tickets pertaining to the solid_mechanics module), T: task (an enhancement to the software)

Comments

@hugary1995
Contributor

Reason

See discussion #17671. For a specific problem, the global-indexing sparse AD container is slower than the local nonsparse one.
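Roughly speaking, the two containers being compared store derivatives like this (simplified stand-ins I am using for illustration only, not the actual MetaPhysicL/MOOSE types):

#include <array>
#include <vector>

// "Local nonsparse": a fixed-size array indexed by local (element) dof number.
// Every arithmetic operation loops over all N slots, coupled or not, but there is
// no index bookkeeping.
template <std::size_t N>
struct DenseDerivatives
{
  double value = 0.0;
  std::array<double, N> d{};
};

// "Global sparse": parallel index/value vectors keyed on global dof ids.  Operations
// only touch the entries actually present, but combining two numbers first requires
// a union of their sorted index sets, which is the extra bookkeeping cost.
struct SparseDerivatives
{
  double value = 0.0;
  std::vector<unsigned int> indices; // sorted global dof ids actually coupled
  std::vector<double> d;             // matching derivative values
};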
For the record, I am attaching the input file here:

[Mesh]
  type = FileMesh
  file = Specimen_QtrModel.e
[]

[GlobalParams]
  displacements = 'disp_x disp_y disp_z'
  volumetric_locking_correction = true
[]

[Modules/TensorMechanics/Master]
  [all]
    strain = FINITE
    incremental = true
    add_variables = true
    use_automatic_differentiation = true
  []
[]

[BCs]
  [Load_RE]
    type = FunctionDirichletBC
    variable = disp_x
    function = '0.0226*t'
    boundary = "Right_end"
    preset = false
  []
  [Symm_LE]
    type = DirichletBC
    variable = disp_x
    value = 0
    boundary = "Left_end"
  []
  [Symm_BY]
    type = DirichletBC
    variable = disp_y
    value = 0.0
    boundary = "Bottom"
  []
  [Symm_LZ]
    type = DirichletBC
    variable = disp_z
    value = 0.0
    boundary = "Lateral_face"
  []
  [Const_RY]
    type = DirichletBC
    variable = disp_y
    value = 0.0
    boundary = "Right_end"
  []
  [Const_RZ]
    type = DirichletBC
    variable = disp_z
    value = 0.0
    boundary = "Right_end"
  []
[]

[Functions]
  [hf]
    type = PiecewiseLinear
    data_file = PW_last_iter.csv
    scale_factor = 1
    format = columns
  []
[]

[Materials]
  [elasticity_tensor]
    type = ADComputeIsotropicElasticityTensor
    youngs_modulus = 2.02E+11
    poissons_ratio = 0.315
  []
  [isotropic_plasticity]
    type = ADIsotropicPlasticityStressUpdate
    yield_stress = 554146341.5
    hardening_function = hf
  []
  [radial_return_stress]
    type = ADComputeMultipleInelasticStress
    inelastic_models = 'isotropic_plasticity'
  []
[]

[Postprocessors]
  [nl_its]
    type = NumNonlinearIterations
  []
  [time]
    type = PerfGraphData
    section_name = Root
    data_type = TOTAL
  []
[]

[Executioner]
  type = Transient
  solve_type = NEWTON
  petsc_options_iname = '-pc_type -pc_factor_mat_solver_package'
  petsc_options_value = 'lu       superlu_dist'

  dt = 0.1
  end_time = 25

  nl_rel_tol = 1e-6
  nl_max_its = 20

  automatic_scaling = true
[]

[Outputs]
  exodus = true
  csv = true
  print_linear_residuals = false
[]

The csv file and the mesh file are in this archive:
Archive.zip

Design

Have @lindsayad figure out what's going on.

Impact

Speed!

@hugary1995 added the T: task (an enhancement to the software) label on Apr 24, 2021
@lindsayad self-assigned this on Apr 24, 2021
@lindsayad
Member

@hugary1995 what size container(s) were you using for these?

@hugary1995
Contributor Author

I believe it's 50, since I don't recall changing the default. Unless the default AD container size changed between those commits.

@lindsayad
Member

👍 (it hasn't changed)

@lindsayad
Member

With size 75 containers, taking 5 time steps, I see 28 seconds for global sparse vs. 32 seconds for local nonsparse. I will try with size 50.

@lindsayad
Member

Hmm, going down to size 50 the nonsparse container definitely improves significantly (as it should), but it doesn't appear to be significantly faster than sparse. Children of NonlinearSystem::solve take 19.78 s for nonsparse and 21.24 s for sparse; FEProblemBase::computeJacobianInternal takes 15.65 s for nonsparse and 17.11 s for sparse. So that's a 9% difference. I don't think that's enough for me to devote a significant amount of time to, nor to consider changing the default, given the tremendous flexibility gains and cleaner code with global-sparse. One of our tasks for next year is a general improvement of AD speed, with a focus on things like vectorization. Hopefully that work will improve both the nonsparse and sparse options.

@lindsayad
Member

I will try running the full end_time = 25, since you reported much larger % differences than I experienced.

@hugary1995
Contributor Author

> I will try running the full end_time = 25, since you reported much larger % differences than I experienced.

Yeah, I was about to say that. I believe later on the plasticity model starts to play a role, since there is a Newton loop per quadrature point while evaluating the material model.
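To make that concrete, here is a rough sketch of the shape of that inner loop (my own simplification with a linear hardening law, not the actual ADIsotropicPlasticityStressUpdate code); every arithmetic operation inside it also drags the derivative container along, so extra inner iterations multiply the AD cost:

#include <cmath>
#include "metaphysicl/raw_type.h" // for MetaPhysicL::raw_value()

// Sketch only: per-quadrature-point return-mapping Newton loop with a linear
// hardening law standing in for the PiecewiseLinear hardening function in the
// input file above.
template <typename ADReal>
ADReal
radialReturnIncrement(const ADReal & effective_trial_stress,
                      double shear_modulus,
                      double yield_stress,
                      double hardening_slope,
                      double rel_tolerance = 1e-10,
                      unsigned int max_its = 50)
{
  ADReal dp = 0.0; // scalar (effective) plastic strain increment at this qp

  for (unsigned int it = 0; it < max_its; ++it)
  {
    // Flow stress from the (placeholder) linear hardening law.
    const ADReal flow_stress = yield_stress + hardening_slope * dp;

    // Residual of the radial return: trial stress relaxed by 3*G*dp minus the
    // flow stress.  Every +, -, * here also updates the derivative container.
    const ADReal residual =
        effective_trial_stress - 3.0 * shear_modulus * dp - flow_stress;

    if (std::abs(MetaPhysicL::raw_value(residual)) < rel_tolerance * yield_stress)
      break;

    // d(residual)/d(dp) is a plain number for linear hardening; the Newton update
    // on dp is another full pass over the derivative container.
    const double jacobian = -3.0 * shear_modulus - hardening_slope;
    dp -= residual / jacobian;
  }
  return dp;
}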

@lindsayad
Member

OK, with the full run on my machine I see 311 s for the Jacobian with sparse and 276 s with nonsparse, so that's up to a 13% difference.

@lindsayad
Member

Sadly, I just don't see much room for optimization in our most expensive function:

(pprof) list sparsity_union
Total: 9.10mins
ROUTINE ======================== MetaPhysicL::DynamicSparseNumberBase::sparsity_union in /home/lindad/projects/moose/libmesh/installed/include/metaphysicl/dynamicsparsenumberbase.h
  1.81mins   2.05mins (flat, cum) 22.51% of Total
         .          .    275:  // increase it as needed to support e.g. operator+=
         .          .    276:template <typename Data, typename Indices, template <class...> class SubType, class... SubTypeArgs>
         .          .    277:template <typename Indices2>
         .          .    278:inline
         .          .    279:void
     1.51s      1.51s    280:DynamicSparseNumberBase<Data, Indices, SubType, SubTypeArgs...>::sparsity_union (const Indices2 & new_indices)
         .          .    281:{
         .          .    282:  metaphysicl_assert
         .          .    283:    (std::adjacent_find(_indices.begin(), _indices.end()) ==
         .          .    284:     _indices.end());
         .          .    285:  metaphysicl_assert
         .          .    286:    (std::adjacent_find(new_indices.begin(), new_indices.end()) ==
         .          .    287:     new_indices.end());
         .          .    288:#ifdef METAPHYSICL_HAVE_CXX11
         .          .    289:  metaphysicl_assert(std::is_sorted(_indices.begin(), _indices.end()));
         .          .    290:  metaphysicl_assert(std::is_sorted(new_indices.begin(), new_indices.end()));
         .          .    291:#endif
         .          .    292:
         .          .    293:  auto index_it = _indices.begin();
         .          .    294:  auto index2_it = new_indices.begin();
         .          .    295:
         .          .    296:  typedef typename Indices2::value_type I2;
         .          .    297:  typedef typename CompareTypes<I,I2>::supertype max_index_type;
      10ms       10ms    298:  max_index_type unseen_indices = 0;
         .          .    299:
         .          .    300:  const I maxI = std::numeric_limits<I>::max();
         .          .    301:
     2.95s      4.32s    302:  while (index2_it != new_indices.end()) {
     1.79s      4.12s    303:    I idx1 = (index_it == _indices.end()) ? maxI : *index_it;
     250ms      250ms    304:    I2 idx2 = *index2_it;
         .          .    305:
     1.72s      1.72s    306:    while (idx1 < idx2) {
     1.09s      1.09s    307:      ++index_it;
     490ms      490ms    308:      idx1 = (index_it == _indices.end()) ? maxI : *index_it;
         .          .    309:    }
         .          .    310:
    26.32s     26.32s    311:    while ((idx1 == idx2) &&
    18.84s     18.84s    312:           (idx1 != maxI)) {
         .          .    313:      ++index_it;
     2.68s      2.68s    314:      idx1 = (index_it == _indices.end()) ? maxI : *index_it;
     1.59s      1.59s    315:      ++index2_it;
     6.27s      6.27s    316:      idx2 = (index2_it == new_indices.end()) ? maxI : *index2_it;
         .          .    317:    }
         .          .    318:
     8.15s      8.15s    319:    while (idx2 < idx1) {
     460ms      460ms    320:      ++unseen_indices;
     530ms      530ms    321:        ++index2_it;
     3.38s      3.38s    322:      if (index2_it == new_indices.end())
         .          .    323:        break;
         .          .    324:      idx2 = *index2_it;
         .          .    325:    }
         .          .    326:  }
         .          .    327:
         .          .    328:  // The common case is cheap
     1.96s      1.96s    329:  if (!unseen_indices)
         .          .    330:    return;
         .          .    331:
         .          .    332:  std::size_t old_size = this->size();
         .          .    333:
         .      460ms    334:  this->resize(old_size + unseen_indices);
         .          .    335:
         .       70ms    336:  auto md_it = _data.rbegin();
         .      660ms    337:  auto mi_it = _indices.rbegin();
         .          .    338:
         .      270ms    339:  auto d_it =
         .          .    340:    _data.rbegin() + unseen_indices;
         .      250ms    341:  auto i_it =
         .          .    342:    _indices.rbegin() + unseen_indices;
         .      720ms    343:  auto i2_it = new_indices.rbegin();
         .          .    344:
         .          .    345:  // Duplicate copies of rend() to work around
         .          .    346:  // http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#179
         .          .    347:  auto      mirend  = _indices.rend();
         .          .    348:  auto  rend  = mirend;
         .          .    349:  auto rend2 = new_indices.rend();
         .          .    350:#ifndef NDEBUG
         .          .    351:  auto      mdrend = _data.rend();
         .          .    352:  auto drend = mdrend;
         .          .    353:#endif
         .          .    354:
        2s      2.38s    355:  for (; mi_it != mirend; ++md_it, ++mi_it) {
     7.51s      7.51s    356:    if ((i_it == rend) ||
     940ms      940ms    357:        ((i2_it != rend2) &&
     7.21s      7.21s    358:         (*i2_it > *i_it))) {
     410ms      410ms    359:      *md_it = 0;
      60ms       60ms    360:      *mi_it = *i2_it;
         .      6.91s    361:      ++i2_it;
         .          .    362:    } else {
         .          .    363:      if ((i2_it != rend2) &&
         .          .    364:          (*i2_it == *i_it))
         .      150ms    365:        ++i2_it;
         .          .    366:      metaphysicl_assert(d_it < drend);
         .          .    367:      metaphysicl_assert(md_it < mdrend);
     100ms      100ms    368:      *md_it = *d_it;
     680ms      680ms    369:      *mi_it = *i_it;
         .       10ms    370:      ++d_it;
         .      780ms    371:      ++i_it;
         .          .    372:    }
         .          .    373:  }
         .          .    374:
         .          .    375:  metaphysicl_assert(i_it  == rend);
         .          .    376:  metaphysicl_assert(i2_it == rend2);
         .          .    377:  metaphysicl_assert(d_it  == drend);
         .          .    378:  metaphysicl_assert(md_it == mdrend);
     9.61s      9.61s    379:}

@roystgnr do you see anything you would do differently?

@lindsayad
Member

Here is the graph:
[attached image: full-global-sparse-50]

@hugary1995
Contributor Author

Yeah, I think 13% isn't enough of a motivation. Could the performance of the AD container be machine-dependent? I remember seeing a bigger difference. Probably I messed something up.

@lindsayad
Member

No, I doubt you messed anything up. It could be that there was more chatter on my machine during the nonsparse profiling than there was on yours, or than when I was doing the sparse profiling.

@roystgnr
Contributor

We could try to special-case the "sparsity patterns are identical" use case; the potential speedup there might make up for the extra overhead in real unions (a rough sketch of that check is below).

We could try to turn some of these manual loops into STL algorithm invocations, and hope that we're on systems that can do better vectorization or assembly or whatever with those.

I'm not optimistic about either idea.
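For concreteness, a rough sketch of what the first special case could look like, written as a standalone helper for illustration rather than a tested MetaPhysicL patch:

#include <algorithm>

// "Sparsity patterns are identical" fast path.  Both index containers are kept
// sorted and duplicate-free (see the asserts in the listing above), so a size check
// plus element-wise comparison is a complete equality test; std::equal
// short-circuits on the first mismatch, so a failed check stays cheap.
template <typename IndicesA, typename IndicesB>
inline bool
same_sparsity(const IndicesA & a, const IndicesB & b)
{
  return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Inside sparsity_union() the check would sit before the merge loops:
//
//   if (same_sparsity(_indices, new_indices))
//     return;  // identical patterns: nothing to union
//
// The second idea (STL algorithms) would replace the hand-rolled counting/merge
// passes with something like std::set_union over the sorted index ranges.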

@GiudGiud added the C: Automatic Differentiation (tickets pertaining to the MetaPhysicL-based forward mode AD system) and C: Modules/Solid Mechanics (tickets pertaining to the solid_mechanics module) labels on Apr 21, 2022