[ci] bump CUDA version and pull dockers from NVIDIA NGC #4112

Merged (2 commits) on Mar 26, 2021

StrikerRUS
Collaborator

Why switch to NVIDIA NGC? Due to #3518.

@jameslamb
Collaborator

ugh I'm surprised to see this error from #4057 again (error message was introduced in #4059)

https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=9565&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf

machines = '127.0.0.1:53229,127.0.0.1:53229'
worker_addresses = dict_keys(['tcp://127.0.0.1:43843', 'tcp://127.0.0.1:42213'])

    def _machines_to_worker_map(machines: str, worker_addresses: List[str]) -> Dict[str, int]:
        """Create a worker_map from machines list.
    
        Given ``machines`` and a list of Dask worker addresses, return a mapping where the keys are
        ``worker_addresses`` and the values are ports from ``machines``.
    
        Parameters
        ----------
        machines : str
            A comma-delimited list of workers, of the form ``ip1:port,ip2:port``.
        worker_addresses : list of str
            A list of Dask worker addresses, of the form ``{protocol}{hostname}:{port}``, where ``port`` is the port Dask's scheduler uses to talk to that worker.
    
        Returns
        -------
        result : Dict[str, int]
            Dictionary where keys are work addresses in the form expected by Dask and values are a port for LightGBM to use.
        """
        machine_addresses = machines.split(",")
    
        if len(set(machine_addresses)) != len(machine_addresses):
>           raise ValueError(f"Found duplicates in 'machines' ({machines}). Each entry in 'machines' must be a unique IP-port combination.")
E           ValueError: Found duplicates in 'machines' (127.0.0.1:53229,127.0.0.1:53229). Each entry in 'machines' must be a unique IP-port combination.
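
For reference, the check that fails in the traceback above can be reproduced in isolation. This is only a minimal sketch: `check_unique_machines` is a hypothetical helper name, not a LightGBM function, and it mirrors just the duplicate test visible in the traceback, not the rest of `_machines_to_worker_map`.

```python
from typing import List


def check_unique_machines(machines: str) -> List[str]:
    """Split a comma-delimited ``ip:port`` list and reject duplicates.

    Hypothetical helper: mirrors only the duplicate check shown in the traceback.
    """
    machine_addresses = machines.split(",")
    # A set drops repeated entries, so a size mismatch means at least one duplicate.
    if len(set(machine_addresses)) != len(machine_addresses):
        raise ValueError(
            f"Found duplicates in 'machines' ({machines}). "
            "Each entry in 'machines' must be a unique IP-port combination."
        )
    return machine_addresses


# The value from the failing CI run: both workers were given port 53229.
try:
    check_unique_machines("127.0.0.1:53229,127.0.0.1:53229")
except ValueError as err:
    print(err)
```

Because both workers in the CI run share the IP 127.0.0.1, any repeat of the chosen port is enough to trip this check.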

@StrikerRUS
Collaborator Author

I have already seen this error once since #4059. I guess it is somehow related to the initial seed of the pseudo-random generator.
I believe a fairer version of the test from #4057 (comment) would be to run lgb.dask._find_random_open_port() with a Python restart before each call.
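
A restart-based test along these lines could look roughly like the sketch below. It rests on assumptions: the function names are hypothetical, and the child process uses a stand-in that binds a socket to port 0 and lets the OS choose, rather than importing `lgb.dask._find_random_open_port()`, so that it runs without LightGBM installed.

```python
import subprocess
import sys
from collections import Counter

# Stand-in for the port finder (assumption): bind to port 0 and let the OS pick.
CHILD_CODE = (
    "import socket\n"
    "with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:\n"
    "    s.bind(('', 0))\n"
    "    print(s.getsockname()[1])\n"
)


def find_open_port_in_fresh_interpreter() -> int:
    """Start a new Python process, find one open port there, and return it."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


def count_repeated_ports(n_runs: int = 20) -> Counter:
    """Return the ports seen more than once across ``n_runs`` fresh interpreters."""
    seen = Counter(find_open_port_in_fresh_interpreter() for _ in range(n_runs))
    return Counter({port: n for port, n in seen.items() if n > 1})


if __name__ == "__main__":
    repeats = count_repeated_ports()
    print(f"ports returned more than once: {dict(repeats)}")
```

Each call pays the cost of a full interpreter start, so `n_runs` is kept small here. To exercise the real function, the child code would import LightGBM and call it instead of the stand-in.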

@StrikerRUS marked this pull request as ready for review on March 26, 2021 12:45
Collaborator

@jameslamb left a comment

oh really good idea to switch to NGC, thanks

@github-actions

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators on Aug 23, 2023