
Convert OpenMP parallelization to OneAPI::TBB #6626

Draft
Wants to merge 23 commits into base: main
Conversation

@dbs4261 (Contributor) commented Jan 27, 2024

OpenMP acceleration has been migrated to use oneapi::TBB.

Type

  • Bug fix (non-breaking change which fixes an issue): Fixes #
  • New feature (non-breaking change which adds functionality). Resolves #
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) Resolves #N/A

Motivation and Context

Many components of Open3D already imply an eventual shift away from OpenMP to TBB. This includes sections where TBB is used on only one platform because OpenMP 2D loop unrolling isn't supported on Win32. Lastly, using multiple parallelization paradigms makes nested parallelism problematic: when some Open3D methods are used from a TBB context, an OpenMP thread pool is created for each TBB thread.
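To illustrate the nested-parallelism issue, here is a minimal hypothetical sketch (the helper function is a stand-in, not actual Open3D code): calling an OpenMP-parallelized routine from inside tbb::parallel_for lets each TBB worker spawn its own OpenMP thread team, oversubscribing the machine.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#include <tbb/parallel_for.h>

// Hypothetical stand-in for an Open3D routine that is still parallelized
// with OpenMP internally.
void OmpAcceleratedWork(std::vector<double>& data) {
#pragma omp parallel for
    for (int64_t i = 0; i < static_cast<int64_t>(data.size()); ++i) {
        data[i] *= 2.0;
    }
}

void ProcessBatches(std::vector<std::vector<double>>& batches) {
    // Each TBB worker that executes this body creates its own OpenMP thread
    // pool, so the total thread count can approach
    // (TBB workers) x (OpenMP threads) instead of a single shared pool.
    tbb::parallel_for(std::size_t(0), batches.size(), [&](std::size_t i) {
        OmpAcceleratedWork(batches[i]);
    });
}
```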

Checklist:

  • I have run python util/check_style.py --apply to apply Open3D code style
    to my code.
  • This PR changes Open3D behavior or adds new functionality.
    • Both C++ (Doxygen) and Python (Sphinx / Google style) documentation is
      updated accordingly.
    • I have added or updated C++ and / or Python unit tests OR included test
      results
      (e.g. screenshots or numbers) here.
  • I will follow up and update the code if CI fails.
  • For fork PRs, I have selected Allow edits from maintainers.

Description

Updated parallel for sections to use tbb::parallel_for. Adapted most loops that performed reductions, whether with omp reduction clauses or with critical sections, to tbb::parallel_reduce implementations; some of these required custom reduction objects instead of lambdas. Added an atomic version of the ProgressBar for use with TBB.
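As a sketch of the kind of conversion described here (simplified, not the literal Open3D code), an OpenMP sum reduction maps onto tbb::parallel_reduce roughly as follows:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

// OpenMP version being replaced (for comparison):
//   double sum = 0.0;
//   #pragma omp parallel for reduction(+ : sum)
//   for (std::size_t i = 0; i < v.size(); ++i) sum += v[i] * v[i];
double SumOfSquares(const std::vector<double>& v) {
    return tbb::parallel_reduce(
            tbb::blocked_range<std::size_t>(0, v.size()),
            0.0,  // identity value of the reduction
            [&v](const tbb::blocked_range<std::size_t>& range, double local) {
                for (std::size_t i = range.begin(); i != range.end(); ++i) {
                    local += v[i] * v[i];
                }
                return local;
            },
            std::plus<double>());  // combines per-chunk partial sums
}
```

Reductions that carry more state than a single accumulator (for example an index alongside a minimum) are the cases where a small functor with join semantics, i.e. the custom reduction objects mentioned above, is clearer than a pair of lambdas.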

There is still work to be done on documentation. This will break any user code that directly uses ParallelForCPU, as OpenMP critical sections will no longer work. Additionally, TBB has no mechanism for setting the maximum number of threads the way OpenMP does with OMP_NUM_THREADS. In C++ code a tbb::global_control object could be used, but it is unclear how to provide that sort of functionality for Python users.
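For the C++ side, the tbb::global_control approach mentioned above would look roughly like this (an illustrative sketch, not code from this PR):

```cpp
#include <tbb/global_control.h>
#include <tbb/parallel_for.h>

void RunWithFourThreads() {
    // Roughly the scoped analogue of OMP_NUM_THREADS=4: the limit applies
    // only while this object is alive.
    tbb::global_control limit(tbb::global_control::max_allowed_parallelism, 4);

    tbb::parallel_for(0, 1 << 20, [](int i) {
        (void)i;  // ... per-element work would go here ...
    });
}  // The limit ends with the scope; default parallelism is restored.
```

Exposing an equivalent to Python would need some wrapper around such a scoped object, e.g. a context manager, which is the open question noted above.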

update-docs bot commented Jan 27, 2024

Thanks for submitting this pull request! The maintainers of this repository would appreciate if you could update the CHANGELOG.md based on your changes.

@dbs4261 (Contributor, Author) commented Jan 27, 2024

OK, I ran my tests in my development environment. I guess I should use the Docker containers to replicate the CI environment and figure out those tests.

@dbs4261 changed the title to Convert OpenMP parallelization to OneAPI::TBB Jan 27, 2024
@ssheorey (Member) commented:

Hi @dbs4261 thanks for picking this up!

Possibly fixes #6544

@errissa (Collaborator) commented Feb 6, 2024

@dbs4261 Thanks for working on this! I just tested this PR on my Mac and got numerous TBB related compilation errors. I tried using the Homebrew version of TBB as well as the "build from source" configuration. There appear to be functions that this PR uses that are missing from the Homebrew and "build from source" versions of TBB on Mac.

I know this PR is still draft but wanted to report what I had found. Please let me know if you need any help testing/diagnosing issues on Mac.

@dbs4261 (Contributor, Author) commented Feb 6, 2024

Hi @ssheorey, this PR likely won't fix that issue, as I haven't yet changed how the TBB dependency is being accessed. This is likely also why @errissa is facing issues building on Mac.

@errissa is Homebrew pulling the oneAPI version of TBB? If you can provide me with the version of TBB you tried and the compiler errors, I can take a look, figure out which version is required, and work that into the PR.

@ssheorey (Member) commented Feb 6, 2024

@dbs4261 yes, you are right about not fixing #6544. We should update to the latest oneTBB as part of this PR to fix that though.

This is the latest version of oneTBB and is available for all platforms on github:

https://github.com/oneapi-src/oneTBB/releases/tag/v2021.11.0

The naming is off - this was released in Nov 2023.

I think this should also resolve @errissa 's issues on macOS.

@dbs4261 (Contributor, Author) commented Feb 7, 2024

I agree that setting the version requirement for TBB should be part of this PR. Based on the Ubuntu failure in CI, it's the collaborative_call_once header that is missing. The TBB repo says that header hasn't been modified in 3 years, so I would think any version that reports 2021+ should be fine. What does Open3D CI currently use for TBB?

@errissa (Collaborator) commented Feb 7, 2024

@dbs4261 @ssheorey is correct about the oneTBB version. Homebrew's most recent version is 2021.11.0, so if this PR builds successfully against it, that would solve the macOS issue I experienced.

@dbs4261 (Contributor, Author) commented Feb 7, 2024

Looks like the minimum version requirement for collaborative_call_once.h is v2021.4.0. Right now we aren't putting version requirements in the find-package calls in 3rdparty/find_dependencies.cmake, which means errors like @errissa's can still happen when using the system library. This raises the question of whether I should set the system version requirement to the same version that I am providing in the ExternalProject_Add call, or add the newest version but set the system requirement to the minimal version.

@ssheorey (Member) commented:

Hi @dbs4261, our usual policy is to upgrade to the latest version available, but set the minimum version to whatever is required to make everything work. This helps "future-proof" the updated code as much as possible by incorporating the latest bugfixes. Official binaries will be built with the latest version, while users can still build the library against older versions.

@ssheorey (Member) left a comment:

[Initial look]

@@ -26,13 +26,10 @@ find_package(Git QUIET REQUIRED)
 ExternalProject_Add(
     ext_tbb
     PREFIX tbb
-    URL https://github.com/wjakob/tbb/archive/141b0e310e1fb552bdca887542c9c1a8544d6503.tar.gz # Sept 2020
-    URL_HASH SHA256=bb29b76eabf7549660e3dba2feb86ab501469432a15fb0bf2c21e24d6fbc4c72
+    URL https://github.com/oneapi-src/oneTBB/archive/refs/tags/v2021.4.0.tar.gz
Member:

Can we upgrade to the latest? v2021.11.0

Contributor Author (@dbs4261):

No reason why not. I just put in the older version that had all the features I used.

Member:

There's a merge conflict here. The CI can run only after it's fixed.

func(i);
}
tbb::parallel_for(tbb::blocked_range<int64_t>(0, n, 32),
[&func](const tbb::blocked_range<int64_t>& range) {
Member:

How many threads will be used here? Currently, it's estimated with utility::EstimateMaxThreads() which gives us one thread per core (excluding hyperthreading).

Member:

Also, avoid using "magic numbers" (32). I think you have a GetDefaultChunkSize() function.

Contributor Author (@dbs4261):

It will use up to the number of threads in the task arena that called it. As for the chunk size, see my other comment.
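A hypothetical sketch of that point (not code from this PR): the same parallel loop inherits whatever concurrency the enclosing tbb::task_arena offers, so the caller controls the thread budget.

```cpp
#include <cstdint>

#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

void Example(int64_t n) {
    tbb::task_arena io_arena(2);       // small arena, e.g. for IO-bound work
    tbb::task_arena compute_arena(8);  // larger arena for heavy processing

    // Identical loop body; only the arena it runs in changes how many
    // threads can participate.
    auto loop = [n] {
        tbb::parallel_for(int64_t(0), n, [](int64_t i) { (void)i; });
    };

    io_arena.execute(loop);       // uses at most 2 threads
    compute_arena.execute(loop);  // uses at most 8 threads
}
```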

return "";
}
}
int EstimateMaxThreads() { return tbb::this_task_arena::max_concurrency(); }
Member:

Can we use the number of cores (not number of HW threads)?

Contributor Author (@dbs4261) commented Feb 27, 2024:

No, the number of tasks is determined by the caller. A caller could be using a small task arena to deal with IO, while a larger arena deals with processing something else. This actually brings up an issue that I don't yet know how to solve: TBB sets the maximum concurrency with a C++ object that follows scope rules but doesn't need to be passed to functions, so I don't know how a Python user would set the concurrency limit yet. I think it might need to be done with some sort of context manager. But I guess this changes behavior in an environment where the number of threads was limited with the OpenMP environment variable.

return 1;
#endif
std::size_t& DefaultGrainSizeTBB() noexcept {
static std::size_t GrainSize = 256;
Member:

Can you comment on how this value was selected? Did you see any performance differences for this value versus other values?

Contributor Author (@dbs4261):

Honestly, I was guessing at the grain size; it really should be picked based on profiling. My understanding is that the grain size provides loose guidance to TBB's automatic chunking mechanism, which works similarly to omp schedule(guided). Overall the goal is to give each thread plenty of work so the overhead of chunking is minimized, while keeping the chunks small enough that the scheduler can go back and steal some if one of the threads gets held up. It might be worth taking another pass through the grain sizes I put in and setting them as a multiplier times DefaultGrainSizeTBB (which is mutable). That way the chunk size could be larger for doing a single operation with tensors, and smaller when looping through complex sections like in RANSAC.
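A sketch of how such a grain size feeds TBB's chunking; DefaultGrainSizeTBB mirrors the function added in this PR, while the surrounding loop is hypothetical:

```cpp
#include <cstddef>
#include <cstdint>

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Mutable default grain size, as introduced in this PR (simplified).
std::size_t& DefaultGrainSizeTBB() noexcept {
    static std::size_t grain_size = 256;
    return grain_size;
}

template <typename Func>
void ProcessRange(int64_t n, const Func& func) {
    // The grain size is only a lower bound on chunk size: with the default
    // auto_partitioner, TBB starts with coarse chunks and splits further
    // when work stealing indicates load imbalance, loosely comparable to
    // omp schedule(guided).
    tbb::parallel_for(
            tbb::blocked_range<int64_t>(0, n, DefaultGrainSizeTBB()),
            [&func](const tbb::blocked_range<int64_t>& range) {
                for (int64_t i = range.begin(); i != range.end(); ++i) {
                    func(i);
                }
            });
}
```

A caller with very cheap per-element work could then pass a multiple, e.g. tbb::blocked_range<int64_t>(0, n, 4 * DefaultGrainSizeTBB()).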

@ssheorey (Member) commented Mar 7, 2024

[Notes about linking and binary distribution]

For linking TBB, the recommendation is to link dynamically. For C++ binaries and applications, we will distribute the TBB DLL along with the Open3D DLL.
oneapi-src/oneTBB#646

For Python, TBB libraries are available through PyPI, so we can add these as dependencies to requirements.txt
https://community.intel.com/t5/Intel-oneAPI-Threading-Building/How-to-ship-a-package-using-TBB-on-PyPI-manylinux/m-p/1227574

@@ -15,30 +18,57 @@ namespace utility {
 class ProgressBar {
 public:
     ProgressBar(size_t expected_count,
-                const std::string &progress_info,
+                std::string progress_info,
Contributor:

Why has the const been removed here?

Contributor Author (@dbs4261):

It has to be copied into the object, so it is passed by value into the constructor and then moved into the member variable.
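The pass-by-value-then-move idiom being referred to, shown in isolation (a generic sketch; only the progress_info parameter change comes from the diff above):

```cpp
#include <cstddef>
#include <string>
#include <utility>

class ProgressBarLike {
public:
    // Taking the string by value lets the caller choose the cost: an lvalue
    // argument is copied once, an rvalue is moved. The constructor then
    // moves the parameter into the member. A const std::string& parameter
    // would force a copy inside the constructor instead.
    ProgressBarLike(std::size_t expected_count, std::string progress_info)
        : expected_count_(expected_count),
          progress_info_(std::move(progress_info)) {}

private:
    std::size_t expected_count_;
    std::string progress_info_;
};
```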

@ssheorey added this to the v0.20 milestone Apr 29, 2024
@ssheorey added the build/install (Build or installation issue) label Apr 30, 2024
dbs4261 and others added 16 commits September 16, 2024 21:23
…he progress bar into its own function and a bulk inplace add function operator+=. Also added TBBProgressBar. It does not inherit from ProgressBar as it uses an atomic for counting and has slightly different internals to use that atomicity.
…versions of format wont automatically convert it to its underlying type.
…e done to prevent assignment to the output pointer.
@benjaminum mentioned this pull request Sep 28, 2024
@PKizzle commented Oct 24, 2024

Is there anything that can be done to fix the two failing tests?

_______________________ test_get_surface_area[device0] ________________________

device = CPU:0

    @pytest.mark.parametrize("device", list_devices())
    def test_get_surface_area(device):
        # Test with custom parameters.
        cube = o3d.t.geometry.TriangleMesh.create_box(float_dtype=o3c.float64,
                                                      int_dtype=o3c.int32,
                                                      device=device)
        np.testing.assert_equal(cube.get_surface_area(), 6)
    
        empty = o3d.t.geometry.TriangleMesh(device=device)
        empty.get_surface_area()
        np.testing.assert_equal(empty.get_surface_area(), 0)
    
        # test noncontiguous
        sphere = o3d.t.geometry.TriangleMesh.create_sphere(device=device)
        area1 = sphere.get_surface_area()
        sphere.vertex.positions = sphere.vertex.positions.T().contiguous().T()
        sphere.triangle.indices = sphere.triangle.indices.T().contiguous().T()
        area2 = sphere.get_surface_area()
>       np.testing.assert_almost_equal(area1, area2)

python\test\t\geometry\test_trianglemesh.py:859: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (12.501888275146484, 12.501884460449219), kwds = {}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Arrays are not almost equal to 7 decimals
E            ACTUAL: 12.501888275146484
E            DESIRED: 12.501884460449219

C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\contextlib.py:81: AssertionError
_______________________________ test_color_map ________________________________

    def test_color_map():
        """
        Hard-coded values are from the 0.12 release. We expect the values to match
        exactly when OMP_NUM_THREADS=1. If more threads are used, there could be
        some small numerical differences.
        """
        o3d.utility.set_verbosity_level(o3d.utility.VerbosityLevel.Debug)
    
        # Load dataset
        mesh, rgbd_images, camera_trajectory = load_fountain_dataset()
    
        # Computes averaged color without optimization, for debugging
        mesh, camera_trajectory = o3d.pipelines.color_map.run_rigid_optimizer(
            mesh, rgbd_images, camera_trajectory,
            o3d.pipelines.color_map.RigidOptimizerOption(maximum_iteration=0))
        vertex_mean = np.mean(np.asarray(mesh.vertex_colors), axis=0)
        extrinsic_mean = np.array(
            [c.extrinsic for c in camera_trajectory.parameters]).mean(axis=0)
>       np.testing.assert_allclose(vertex_mean,
                                   np.array([0.40322907, 0.37276872, 0.54375919]),
                                   rtol=1e-5)

python\test\test_color_map_optimization.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (<function assert_allclose.<locals>.compare at 0x0000027AB5A5F240>, array([0.42966498, 0.39627099, 0.57662147]), array([0.40322907, 0.37276872, 0.54375919]))
kwds = {'equal_nan': True, 'err_msg': '', 'header': 'Not equal to tolerance rtol=1e-05, atol=0', 'verbose': True}

    @wraps(func)
    def inner(*args, **kwds):
        with self._recreate_cm():
>           return func(*args, **kwds)
E           AssertionError: 
E           Not equal to tolerance rtol=1e-05, atol=0
E           
E           Mismatched elements: 3 / 3 (100%)
E           Max absolute difference: 0.03286228
E           Max relative difference: 0.06556053
E            x: array([0.429665, 0.396271, 0.576621])
E            y: array([0.403229, 0.372769, 0.543759])

C:\hostedtoolcache\windows\Python\3.11.9\x64\Lib\contextlib.py:81: AssertionError

@dbs4261 (Contributor, Author) commented Oct 30, 2024

I can take a look; they are both in the Python test suite, correct?

@PKizzle commented Oct 31, 2024

Yes, I guess so. You can find them here:

  • python\test\t\geometry\test_trianglemesh.py:859
  • python\test\test_color_map_optimization.py:49

Labels: build/install (Build or installation issue)
Projects: Status: In progress
Development: Successfully merging this pull request may close these issues:

  • Error when installing open3d for conda environment, missing libomp, seg fault when installed

5 participants