
Avoid deprecated method to get Azure subscription ID #378

Merged: 2 commits merged into dask:main on Sep 12, 2022

Conversation

@TomAugspurger (Member) commented on Sep 9, 2022:

This removes the use of a deprecated method from azure-cli-core to get the default subscription ID. It also removes the dependency on azure-cli-core.

For the common case, where the user has the Azure CLI installed and configured, we can get the subscription ID by parsing its output.

This PR also enables users to configure the subscription ID either in code or through the Dask config system.

A few related documentation updates for good measure.

FYI, I'm running the azure tests now. I'll let you know when they pass.

Closes #376
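
To illustrate the lookup order this enables (explicit value, then Dask config, then the Azure CLI), here's a minimal sketch. It is not the exact code in this PR: the helper name is hypothetical, and the config key is inferred from the DASK_CLOUDPROVIDER__AZURE__SUBSCRIPTION_ID environment variable used further down.

import json
import subprocess

import dask.config


def guess_subscription_id(explicit_id=None):
    # Hypothetical helper: resolve an Azure subscription ID.
    # Preference order: explicit argument, Dask config, Azure CLI default account.
    if explicit_id is not None:
        return explicit_id

    # Dask config, populated from YAML or DASK_CLOUDPROVIDER__AZURE__SUBSCRIPTION_ID.
    configured = dask.config.get("cloudprovider.azure.subscription_id", default=None)
    if configured:
        return configured

    # Fall back to the Azure CLI's default account, if the CLI is installed and logged in.
    out = subprocess.check_output(["az", "account", "show", "--output", "json"])
    return json.loads(out)["id"]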

Tom Augspurger added 2 commits September 9, 2022 10:29
@TomAugspurger (Member, Author) commented on Sep 9, 2022:

For reference, here's my test setup:

FROM daskdev/dask:latest
COPY . /code/dask-cloudprovider
RUN python3 -m pip install /code/dask-cloudprovider[azure]
RUN python3 -m pip install --no-cache -r /code/dask-cloudprovider/requirements_test.txt

I build that image and run it:

docker run -it --rm -v (pwd):/dask-cloudprovider \
     -e DASK_CLOUDPROVIDER__AZURE__LOCATION="..." \
     -e DASK_CLOUDPROVIDER__AZURE__RESOURCE_GROUP="..." \
     -e DASK_CLOUDPROVIDER__AZURE__SUBSCRIPTION_ID="..." \
     -e DASK_CLOUDPROVIDER__AZURE__AZUREVM__VNET="..." \
     -e DASK_CLOUDPROVIDER__AZURE__AZUREVM__SECURITY_GROUP="..." \
     -e AZURE_TENANT_ID=$PC_ARM_TENANT_ID -e AZURE_CLIENT_ID=$PC_ARM_CLIENT_ID -e AZURE_CLIENT_SECRET=$PC_ARM_CLIENT_SECRET \
      tomaugspurger/dask-cloudprovider-azure pytest -vs /dask-cloudprovider/dask_cloudprovider/azure/tests --create-external-resources

That container image doesn't have the Azure CLI set up or configured. I'll also test it locally once I get a Python environment that matches closely enough.
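
For completeness, configuring the subscription ID in code rather than via the CLI or environment variables should look roughly like this. This is a sketch only; it assumes the new keyword argument is named subscription_id and that the remaining settings come from the Dask config set above.

from dask_cloudprovider.azure import AzureVMCluster

# Sketch: pass the subscription ID explicitly instead of having it discovered
# from the Azure CLI; location, resource group, vnet and security group are
# assumed to be picked up from the Dask config / environment variables above.
cluster = AzureVMCluster(subscription_id="<subscription id>")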

@TomAugspurger (Member, Author) commented:

There were some failures. I'll look into those later.

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of ThreadPoolExecutor-2_0 (140540411098880) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/concurrent/futures/thread.py", line 78, in _worker
    work_item = work_queue.get(block=True)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Stack of IO loop (140540401657600) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
  File "/opt/conda/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/opt/conda/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.8/site-packages/distributed/utils.py", line 499, in run_loop
    loop.start()
  File "/opt/conda/lib/python3.8/site-packages/tornado/platform/asyncio.py", line 199, in start
    self.asyncio_loop.run_forever()
  File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 570, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.8/asyncio/base_events.py", line 1859, in _run_once
    handle._run()
  File "/opt/conda/lib/python3.8/asyncio/events.py", line 81, in _run
    self._context.run(self._callback, *self._args)
  File "/dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py", line 339, in _start
    await super()._start()
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 309, in _start
    self.scheduler = await self.scheduler
  File "/opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py", line 64, in _
    await self.start()
  File "/dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py", line 90, in start
    await self.wait_for_scheduler()
  File "/dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py", line 50, in wait_for_scheduler
    while not is_socket_open(ip, port):
  File "/dask-cloudprovider/dask_cloudprovider/utils/socket.py", line 7, in is_socket_open
    connection.connect((ip, int(port)))

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Timeout +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
FAILED
dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_create_rapids_cluster_sync ERROR
dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init FAILED
dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init ERROR

=================================================================================================================================== ERRORS ===================================================================================================================================
____________________________________________________________________________________________________________ ERROR at teardown of test_create_rapids_cluster_sync ____________________________________________________________________________________________________________

fixturedef = <FixtureDef argname='event_loop' scope='function' baseid=''>, request = <SubRequest 'event_loop' for <Function test_create_rapids_cluster_sync>>

    @pytest.hookimpl(trylast=True)
    def pytest_fixture_post_finalizer(fixturedef: FixtureDef, request: SubRequest) -> None:
        """Called after fixture teardown"""
        if fixturedef.argname == "event_loop":
            policy = asyncio.get_event_loop_policy()
            try:
                loop = policy.get_event_loop()
            except RuntimeError:
                loop = None
            if loop is not None:
                # Clean up existing loop to avoid ResourceWarnings
>               loop.close()

opt/conda/lib/python3.8/site-packages/pytest_asyncio/plugin.py:364:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
opt/conda/lib/python3.8/asyncio/unix_events.py:58: in close
    super().close()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_UnixSelectorEventLoop running=True closed=False debug=False>

    def close(self):
        if self.is_running():
>           raise RuntimeError("Cannot close a running event loop")
E           RuntimeError: Cannot close a running event loop

opt/conda/lib/python3.8/asyncio/selector_events.py:89: RuntimeError
________________________________________________________________________________________________________________ ERROR at teardown of test_render_cloud_init _________________________________________________________________________________________________________________

fixturedef = <FixtureDef argname='event_loop' scope='function' baseid=''>, request = <SubRequest 'event_loop' for <Function test_render_cloud_init>>

    @pytest.hookimpl(trylast=True)
    def pytest_fixture_post_finalizer(fixturedef: FixtureDef, request: SubRequest) -> None:
        """Called after fixture teardown"""
        if fixturedef.argname == "event_loop":
            policy = asyncio.get_event_loop_policy()
            try:
                loop = policy.get_event_loop()
            except RuntimeError:
                loop = None
            if loop is not None:
                # Clean up existing loop to avoid ResourceWarnings
>               loop.close()

opt/conda/lib/python3.8/site-packages/pytest_asyncio/plugin.py:364:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
opt/conda/lib/python3.8/asyncio/unix_events.py:58: in close
    super().close()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_UnixSelectorEventLoop running=True closed=False debug=False>

    def close(self):
        if self.is_running():
>           raise RuntimeError("Cannot close a running event loop")
E           RuntimeError: Cannot close a running event loop

opt/conda/lib/python3.8/asyncio/selector_events.py:89: RuntimeError
================================================================================================================================== FAILURES ==================================================================================================================================
______________________________________________________________________________________________________________________ test_create_rapids_cluster_sync _______________________________________________________________________________________________________________________

    @pytest.mark.asyncio
    @pytest.mark.timeout(1200)
    @skip_without_credentials
    @pytest.mark.external
    async def test_create_rapids_cluster_sync():

>       with AzureVMCluster(
            vm_size="Standard_NC12s_v3",
            docker_image="rapidsai/rapidsai:cuda11.0-runtime-ubuntu18.04-py3.8",
            worker_class="dask_cuda.CUDAWorker",
            worker_options={"rmm_pool_size": "15GB"},
        ) as cluster:

dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py:88:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
dask-cloudprovider/dask_cloudprovider/azure/azurevm.py:570: in __init__
    super().__init__(debug=debug, **kwargs)
dask-cloudprovider/dask_cloudprovider/generic/vmcluster.py:297: in __init__
    super().__init__(**kwargs, security=self.security)
opt/conda/lib/python3.8/site-packages/distributed/deploy/spec.py:275: in __init__
    self.sync(self._start)
opt/conda/lib/python3.8/site-packages/distributed/utils.py:338: in sync
    return sync(
opt/conda/lib/python3.8/site-packages/distributed/utils.py:401: in sync
    wait(10)
opt/conda/lib/python3.8/site-packages/distributed/utils.py:390: in wait
    return e.wait(timeout)
opt/conda/lib/python3.8/threading.py:558: in wait
    signaled = self._cond.wait(timeout)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <Condition(<unlocked _thread.lock object at 0x7fd20dd94420>, 0)>, timeout = 10

    def wait(self, timeout=None):
        """Wait until notified or until a timeout occurs.

        If the calling thread has not acquired the lock when this method is
        called, a RuntimeError is raised.

        This method releases the underlying lock, and then blocks until it is
        awakened by a notify() or notify_all() call for the same condition
        variable in another thread, or until the optional timeout occurs. Once
        awakened or timed out, it re-acquires the lock and returns.

        When the timeout argument is present and not None, it should be a
        floating point number specifying a timeout for the operation in seconds
        (or fractions thereof).

        When the underlying lock is an RLock, it is not released using its
        release() method, since this may not actually unlock the lock when it
        was acquired multiple times recursively. Instead, an internal interface
        of the RLock class is used, which really unlocks it even when it has
        been recursively acquired several times. Another internal interface is
        then used to restore the recursion level when the lock is reacquired.

        """
        if not self._is_owned():
            raise RuntimeError("cannot wait on un-acquired lock")
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        saved_state = self._release_save()
        gotit = False
        try:    # restore state no matter what (e.g., KeyboardInterrupt)
            if timeout is None:
                waiter.acquire()
                gotit = True
            else:
                if timeout > 0:
>                   gotit = waiter.acquire(True, timeout)
E                   Failed: Timeout >1200.0s

opt/conda/lib/python3.8/threading.py:306: Failed
___________________________________________________________________________________________________________________________ test_render_cloud_init ___________________________________________________________________________________________________________________________

args = (), kwargs = {}, coro = <coroutine object test_render_cloud_init at 0x7fd20da17640>

    @functools.wraps(func)
    def inner(*args, **kwargs):
        coro = func(*args, **kwargs)
        if not inspect.isawaitable(coro):
            pyfuncitem.warn(
                pytest.PytestWarning(
                    f"The test {pyfuncitem} is marked with '@pytest.mark.asyncio' "
                    "but it is not an async function. "
                    "Please remove asyncio marker. "
                    "If the test is not marked explicitly, "
                    "check for global markers applied via 'pytestmark'."
                )
            )
            return
>       task = asyncio.ensure_future(coro, loop=_loop)

opt/conda/lib/python3.8/site-packages/pytest_asyncio/plugin.py:452:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
opt/conda/lib/python3.8/asyncio/tasks.py:672: in ensure_future
    task = loop.create_task(coro_or_future)
opt/conda/lib/python3.8/asyncio/base_events.py:429: in create_task
    self._check_closed()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <_UnixSelectorEventLoop running=False closed=True debug=False>

    def _check_closed(self):
        if self._closed:
>           raise RuntimeError('Event loop is closed')
E           RuntimeError: Event loop is closed

opt/conda/lib/python3.8/asyncio/base_events.py:508: RuntimeError
============================================================================================================================== warnings summary ==============================================================================================================================
dask_cloudprovider/azure/tests/test_azurevm.py::test_create_cluster
dask_cloudprovider/azure/tests/test_azurevm.py::test_create_cluster_sync
  /opt/conda/lib/python3.8/contextlib.py:120: UserWarning: Creating your cluster is taking a surprisingly long time. This is likely due to pending resources. Hang tight!
    next(self.gen)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================================== short test summary info ===========================================================================================================================
FAILED dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_create_rapids_cluster_sync - Failed: Timeout >1200.0s
FAILED dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init - RuntimeError: Event loop is closed
ERROR dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_create_rapids_cluster_sync - RuntimeError: Cannot close a running event loop
ERROR dask-cloudprovider/dask_cloudprovider/azure/tests/test_azurevm.py::test_render_cloud_init - RuntimeError: Cannot close a running event loop
======================================================================================================= 2 failed, 3 passed, 2 warnings, 2 errors in 2228.83s (0:37:08) =======================================================================================================
sys:1: RuntimeWarning: coroutine 'test_render_cloud_init' was never awaited

@jacobtomlinson (Member) left a comment:


This looks great thanks @TomAugspurger.

The tests here are not in a good place, so I'm not surprised you're seeing failures. The CI only runs a subset, because the rest either need credentials or are mocked so heavily that they aren't really testing anything. This means the tests drift a little, since it's not always possible to get everything running locally for every contribution.

There have also been a bunch of asyncio changes in distributed that we likely haven't kept up with here.

"azure-mgmt-network>=16.0.0,<17",
"azure-cli-core>=2.15.1,<2.21.0",
"msrestazure",
"azure-mgmt-compute>=18.0.0",
@jacobtomlinson (Member) commented:

Could we keep the upper-bound pins here and just move them up to a future major version? The Azure packages don't seem afraid of large breaking changes and major version bumps.

@TomAugspurger (Member, Author) commented on Sep 12, 2022:

I don't actually know whether these libraries use semver, so I don't know what "upper" would be here (the one change that actually did break things was a change to azure-cli-core in a "minor" release, if it is indeed following semver).

Either way, I'd probably prefer not to add an upper bound: if things aren't broken, the user has no easy way to override the pin, and if things do break, the user can set an upper bound in their own requirements while we (I) investigate (https://iscinumpy.dev/post/bound-version-constraints/ goes through this in detail).

LMK what you think: happy to add them if you feel strongly.

@jacobtomlinson (Member) commented:

This makes sense. I've definitely been bitten by breaking changes in the Azure tools, and the major version seems to bump a lot, so I assumed that was the reason. But if it isn't, let's go without the pins for now.

@TomAugspurger (Member, Author) commented:

test_render_cloud_init passes when run individually, so I think it's OK.

The test_create_rapids_cluster_sync test fails while the process is waiting to connect to the scheduler. I'll see if I can enable some debug diagnostics / SSH access on the VM so I can look at what's going on. The CPU / network are active for a while (downloading the docker image, probably?) but the test never gets past waiting for the scheduler. This happens on main too, so I think it's unrelated to this PR. I can investigate it before or after this is merged.

@jacobtomlinson jacobtomlinson merged commit 9f5e730 into dask:main Sep 12, 2022
@jacobtomlinson (Member) commented:

The RAPIDS docker image here is also pretty old, so it might need updating. I appreciate you exploring this, but I'm definitely happy to merge as-is.


Merging this pull request closes: AzureVMCluster uses deprecated method to get the subscription ID (#376)