AD global indexing sparse container is slower #17674

Open
hugary1995 opened this issue Apr 24, 2021 · 13 comments
Labels: C: Automatic Differentiation (tickets pertaining to the MetaPhysicL-based forward mode AD system), C: Modules/Solid Mechanics (tickets pertaining to the solid_mechanics module), T: task (an enhancement to the software)

Comments

@hugary1995
Contributor

Reason

See discussion #17671. For a specific problem, the global-indexing sparse AD container is slower than the local nonsparse one.
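Roughly speaking, the two containers being compared store derivatives like this (simplified stand-ins I am using for illustration only, not the actual MetaPhysicL/MOOSE types):

#include <array>
#include <vector>

// "Local nonsparse": a fixed-size array indexed by local (element) dof number.
// Every arithmetic operation loops over all N slots, coupled or not, but there is
// no index bookkeeping.
template <std::size_t N>
struct DenseDerivatives
{
  double value = 0.0;
  std::array<double, N> d{};
};

// "Global sparse": parallel index/value vectors keyed on global dof ids.  Operations
// only touch the entries actually present, but combining two numbers first requires
// a union of their sorted index sets, which is the extra bookkeeping cost.
struct SparseDerivatives
{
  double value = 0.0;
  std::vector<unsigned int> indices; // sorted global dof ids actually coupled
  std::vector<double> d;             // matching derivative values
};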
For the record, I am attaching the input file here:

[Mesh]
  type = FileMesh
  file = Specimen_QtrModel.e
[]

[GlobalParams]
  displacements = 'disp_x disp_y disp_z'
  volumetric_locking_correction = true
[]

[Modules/TensorMechanics/Master]
  [all]
    strain = FINITE
    incremental = true
    add_variables = true
    use_automatic_differentiation = true
  []
[]

[BCs]
  [Load_RE]
    type = FunctionDirichletBC
    variable = disp_x
    function = '0.0226*t'
    boundary = "Right_end"
    preset = false
  []
  [Symm_LE]
    type = DirichletBC
    variable = disp_x
    value = 0
    boundary = "Left_end"
  []
  [Symm_BY]
    type = DirichletBC
    variable = disp_y
    value = 0.0
    boundary = "Bottom"
  []
  [Symm_LZ]
    type = DirichletBC
    variable = disp_z
    value = 0.0
    boundary = "Lateral_face"
  []
  [Const_RY]
    type = DirichletBC
    variable = disp_y
    value = 0.0
    boundary = "Right_end"
  []
  [Const_RZ]
    type = DirichletBC
    variable = disp_z
    value = 0.0
    boundary = "Right_end"
  []
[]

[Functions]
  [hf]
    type = PiecewiseLinear
    data_file = PW_last_iter.csv
    scale_factor = 1
    format = columns
  []
[]

[Materials]
  [elasticity_tensor]
    type = ADComputeIsotropicElasticityTensor
    youngs_modulus = 2.02E+11
    poissons_ratio = 0.315
  []
  [isotropic_plasticity]
    type = ADIsotropicPlasticityStressUpdate
    yield_stress = 554146341.5
    hardening_function = hf
  []
  [radial_return_stress]
    type = ADComputeMultipleInelasticStress
    inelastic_models = 'isotropic_plasticity'
  []
[]

[Postprocessors]
  [nl_its]
    type = NumNonlinearIterations
  []
  [time]
    type = PerfGraphData
    section_name = Root
    data_type = TOTAL
  []
[]

[Executioner]
  type = Transient
  solve_type = NEWTON
  petsc_options_iname = '-pc_type -pc_factor_mat_solver_package'
  petsc_options_value = 'lu       superlu_dist'

  dt = 0.1
  end_time = 25

  nl_rel_tol = 1e-6
  nl_max_its = 20

  automatic_scaling = true
[]

[Outputs]
  exodus = true
  csv = true
  print_linear_residuals = false
[]

The csv file and the mesh file are in this archive:
Archive.zip

Design

Have @lindsayad figure out what's going on.

Impact

Speed!

@hugary1995 added the T: task (an enhancement to the software) label on Apr 24, 2021
@lindsayad self-assigned this on Apr 24, 2021
@lindsayad
Member

@hugary1995 what size container(s) were you using for these?

@hugary1995
Contributor Author

I believe it's 50, since I don't recall changing the default. Unless the default AD container size changed between those commits.

@lindsayad
Member

👍 (it hasn't changed)

@lindsayad
Member

With size 75 containers, taking 5 time steps, I see 28 seconds for global sparse vs. 32 seconds for local nonsparse. I will try with size 50.

@lindsayad
Member

Hmm, going down to size 50 the nonsparse container definitely improves significantly (as it should), but it doesn't appear to be significantly faster than sparse. Children of NonlinearSystem::solve take 19.78 s for nonsparse and 21.24 s for sparse; FEProblemBase::computeJacobianInternal takes 15.65 s for nonsparse and 17.11 s for sparse. So that's a 9% difference. I don't think that's enough for me to devote a significant amount of time to, nor to consider changing the default, given the tremendous flexibility gains and cleaner code with global-sparse. One of our tasks for next year is a general improvement of AD speed, with a focus on things like vectorization. Hopefully that work will improve both the nonsparse and sparse options.

@lindsayad
Member

I will try running the full end_time = 25, since you reported much larger % differences than I experienced.

@hugary1995
Contributor Author

> I will try running the full end_time = 25, since you reported much larger % differences than I experienced.

Yeah, I was about to say that. I believe later on the plasticity model starts to play a role, since there is a Newton loop per quadrature point while evaluating the material model.
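To make that concrete, here is a rough sketch of the shape of that inner loop (my own simplification with a linear hardening law, not the actual ADIsotropicPlasticityStressUpdate code); every arithmetic operation inside it also drags the derivative container along, so extra inner iterations multiply the AD cost:

#include <cmath>
#include "metaphysicl/raw_type.h" // for MetaPhysicL::raw_value()

// Sketch only: per-quadrature-point return-mapping Newton loop with a linear
// hardening law standing in for the PiecewiseLinear hardening function in the
// input file above.
template <typename ADReal>
ADReal
radialReturnIncrement(const ADReal & effective_trial_stress,
                      double shear_modulus,
                      double yield_stress,
                      double hardening_slope,
                      double rel_tolerance = 1e-10,
                      unsigned int max_its = 50)
{
  ADReal dp = 0.0; // scalar (effective) plastic strain increment at this qp

  for (unsigned int it = 0; it < max_its; ++it)
  {
    // Flow stress from the (placeholder) linear hardening law.
    const ADReal flow_stress = yield_stress + hardening_slope * dp;

    // Residual of the radial return: trial stress relaxed by 3*G*dp minus the
    // flow stress.  Every +, -, * here also updates the derivative container.
    const ADReal residual =
        effective_trial_stress - 3.0 * shear_modulus * dp - flow_stress;

    if (std::abs(MetaPhysicL::raw_value(residual)) < rel_tolerance * yield_stress)
      break;

    // d(residual)/d(dp) is a plain number for linear hardening; the Newton update
    // on dp is another full pass over the derivative container.
    const double jacobian = -3.0 * shear_modulus - hardening_slope;
    dp -= residual / jacobian;
  }
  return dp;
}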

@lindsayad
Member

OK, with the full run on my machine I see 311 s for the Jacobian with sparse and 276 s with nonsparse, so that's up to a 13% difference.

@lindsayad
Member

Sadly, I just don't see much room for optimization in our most expensive function:

(pprof) list sparsity_union
Total: 9.10mins
ROUTINE ======================== MetaPhysicL::DynamicSparseNumberBase::sparsity_union in /home/lindad/projects/moose/libmesh/installed/include/metaphysicl/dynamicsparsenumberbase.h
  1.81mins   2.05mins (flat, cum) 22.51% of Total
         .          .    275:  // increase it as needed to support e.g. operator+=
         .          .    276:template <typename Data, typename Indices, template <class...> class SubType, class... SubTypeArgs>
         .          .    277:template <typename Indices2>
         .          .    278:inline
         .          .    279:void
     1.51s      1.51s    280:DynamicSparseNumberBase<Data, Indices, SubType, SubTypeArgs...>::sparsity_union (const Indices2 & new_indices)
         .          .    281:{
         .          .    282:  metaphysicl_assert
         .          .    283:    (std::adjacent_find(_indices.begin(), _indices.end()) ==
         .          .    284:     _indices.end());
         .          .    285:  metaphysicl_assert
         .          .    286:    (std::adjacent_find(new_indices.begin(), new_indices.end()) ==
         .          .    287:     new_indices.end());
         .          .    288:#ifdef METAPHYSICL_HAVE_CXX11
         .          .    289:  metaphysicl_assert(std::is_sorted(_indices.begin(), _indices.end()));
         .          .    290:  metaphysicl_assert(std::is_sorted(new_indices.begin(), new_indices.end()));
         .          .    291:#endif
         .          .    292:
         .          .    293:  auto index_it = _indices.begin();
         .          .    294:  auto index2_it = new_indices.begin();
         .          .    295:
         .          .    296:  typedef typename Indices2::value_type I2;
         .          .    297:  typedef typename CompareTypes<I,I2>::supertype max_index_type;
      10ms       10ms    298:  max_index_type unseen_indices = 0;
         .          .    299:
         .          .    300:  const I maxI = std::numeric_limits<I>::max();
         .          .    301:
     2.95s      4.32s    302:  while (index2_it != new_indices.end()) {
     1.79s      4.12s    303:    I idx1 = (index_it == _indices.end()) ? maxI : *index_it;
     250ms      250ms    304:    I2 idx2 = *index2_it;
         .          .    305:
     1.72s      1.72s    306:    while (idx1 < idx2) {
     1.09s      1.09s    307:      ++index_it;
     490ms      490ms    308:      idx1 = (index_it == _indices.end()) ? maxI : *index_it;
         .          .    309:    }
         .          .    310:
    26.32s     26.32s    311:    while ((idx1 == idx2) &&
    18.84s     18.84s    312:           (idx1 != maxI)) {
         .          .    313:      ++index_it;
     2.68s      2.68s    314:      idx1 = (index_it == _indices.end()) ? maxI : *index_it;
     1.59s      1.59s    315:      ++index2_it;
     6.27s      6.27s    316:      idx2 = (index2_it == new_indices.end()) ? maxI : *index2_it;
         .          .    317:    }
         .          .    318:
     8.15s      8.15s    319:    while (idx2 < idx1) {
     460ms      460ms    320:      ++unseen_indices;
     530ms      530ms    321:        ++index2_it;
     3.38s      3.38s    322:      if (index2_it == new_indices.end())
         .          .    323:        break;
         .          .    324:      idx2 = *index2_it;
         .          .    325:    }
         .          .    326:  }
         .          .    327:
         .          .    328:  // The common case is cheap
     1.96s      1.96s    329:  if (!unseen_indices)
         .          .    330:    return;
         .          .    331:
         .          .    332:  std::size_t old_size = this->size();
         .          .    333:
         .      460ms    334:  this->resize(old_size + unseen_indices);
         .          .    335:
         .       70ms    336:  auto md_it = _data.rbegin();
         .      660ms    337:  auto mi_it = _indices.rbegin();
         .          .    338:
         .      270ms    339:  auto d_it =
         .          .    340:    _data.rbegin() + unseen_indices;
         .      250ms    341:  auto i_it =
         .          .    342:    _indices.rbegin() + unseen_indices;
         .      720ms    343:  auto i2_it = new_indices.rbegin();
         .          .    344:
         .          .    345:  // Duplicate copies of rend() to work around
         .          .    346:  // http://www.open-std.org/jtc1/sc22/wg21/docs/lwg-defects.html#179
         .          .    347:  auto      mirend  = _indices.rend();
         .          .    348:  auto  rend  = mirend;
         .          .    349:  auto rend2 = new_indices.rend();
         .          .    350:#ifndef NDEBUG
         .          .    351:  auto      mdrend = _data.rend();
         .          .    352:  auto drend = mdrend;
         .          .    353:#endif
         .          .    354:
        2s      2.38s    355:  for (; mi_it != mirend; ++md_it, ++mi_it) {
     7.51s      7.51s    356:    if ((i_it == rend) ||
     940ms      940ms    357:        ((i2_it != rend2) &&
     7.21s      7.21s    358:         (*i2_it > *i_it))) {
     410ms      410ms    359:      *md_it = 0;
      60ms       60ms    360:      *mi_it = *i2_it;
         .      6.91s    361:      ++i2_it;
         .          .    362:    } else {
         .          .    363:      if ((i2_it != rend2) &&
         .          .    364:          (*i2_it == *i_it))
         .      150ms    365:        ++i2_it;
         .          .    366:      metaphysicl_assert(d_it < drend);
         .          .    367:      metaphysicl_assert(md_it < mdrend);
     100ms      100ms    368:      *md_it = *d_it;
     680ms      680ms    369:      *mi_it = *i_it;
         .       10ms    370:      ++d_it;
         .      780ms    371:      ++i_it;
         .          .    372:    }
         .          .    373:  }
         .          .    374:
         .          .    375:  metaphysicl_assert(i_it  == rend);
         .          .    376:  metaphysicl_assert(i2_it == rend2);
         .          .    377:  metaphysicl_assert(d_it  == drend);
         .          .    378:  metaphysicl_assert(md_it == mdrend);
     9.61s      9.61s    379:}

@roystgnr do you see anything you would do differently?

@lindsayad
Member

Here is the graph:
[attached image: full-global-sparse-50]

@hugary1995
Contributor Author

Yeah, I think 13% isn't enough of a motivation. Could the performance of the AD container be machine-dependent? I remember seeing a bigger difference. Probably I messed something up.

@lindsayad
Member

No, I doubt you messed anything up. It could be that there was more chatter on my machine during the nonsparse profiling than there was on yours, or than when I was doing the sparse profiling.

@roystgnr
Contributor

We could try to special-case the "sparsity patterns are identical" use case; the potential speedup there might make up for the extra overhead in real unions (a rough sketch of that check is below).

We could try to turn some of these manual loops into STL algorithm invocations, and hope that we're on systems that can do better vectorization or assembly or whatever with those.

I'm not optimistic about either idea.
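For concreteness, a rough sketch of what the first special case could look like, written as a standalone helper for illustration rather than a tested MetaPhysicL patch:

#include <algorithm>

// "Sparsity patterns are identical" fast path.  Both index containers are kept
// sorted and duplicate-free (see the asserts in the listing above), so a size check
// plus element-wise comparison is a complete equality test; std::equal
// short-circuits on the first mismatch, so a failed check stays cheap.
template <typename IndicesA, typename IndicesB>
inline bool
same_sparsity(const IndicesA & a, const IndicesB & b)
{
  return a.size() == b.size() && std::equal(a.begin(), a.end(), b.begin());
}

// Inside sparsity_union() the check would sit before the merge loops:
//
//   if (same_sparsity(_indices, new_indices))
//     return;  // identical patterns: nothing to union
//
// The second idea (STL algorithms) would replace the hand-rolled counting/merge
// passes with something like std::set_union over the sorted index ranges.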

@GiudGiud added the C: Automatic Differentiation (tickets pertaining to the MetaPhysicL-based forward mode AD system) and C: Modules/Solid Mechanics (tickets pertaining to the solid_mechanics module) labels on Apr 21, 2022