Fix tensor devices for DARTS Trial #2273
Conversation
@sifa1024 You need to sign off your commits with the email you used to sign the CLA.
Thank you for the fix! I left a small comment.
# Check device use cuda or cpu
use_cuda = list(range(torch.cuda.device_count()))
if use_cuda:
    print("Using CUDA")
device = torch.device("cuda" if use_cuda else "cpu")
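As an aside, a more common PyTorch idiom for this check queries availability directly instead of building a list of device indices; a minimal sketch, not part of the PR:

import torch

# torch.cuda.is_available() already tells us whether a GPU can be used,
# so there is no need to materialize the list of device indices.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")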
We identify the device here:
katib/examples/v1beta1/trial-images/darts-cnn-cifar10/run_trial.py
Lines 86 to 100 in 4d3ea0c
if len(all_gpus) > 0:
    device = torch.device("cuda")
    torch.cuda.set_device(all_gpus[0])
    np.random.seed(2)
    torch.manual_seed(2)
    torch.cuda.manual_seed_all(2)
    torch.backends.cudnn.benchmark = True
    print(">>> Use GPU for Training <<<")
    print("Device ID: {}".format(torch.cuda.current_device()))
    print("Device name: {}".format(torch.cuda.get_device_name(0)))
    print("Device availability: {}\n".format(torch.cuda.is_available()))
else:
    device = torch.device("cpu")
    print(">>> Use CPU for Training <<<")
Can we just pass the device to the Architect class?
Yes, we can. But is it a good idea to send the device name?
I think it's fine, since we don't need to invoke the torch API again to figure out whether a GPU is available.
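For illustration, a rough sketch of what passing the device into the class could look like; the constructor signature below is simplified and is not the actual Architect API in this example, which takes additional arguments:

import torch
import torch.nn as nn

class Architect:
    """Simplified stand-in for the example's Architect class."""

    def __init__(self, model: nn.Module, device: torch.device):
        self.model = model
        # Reuse the device chosen in run_trial.py instead of probing
        # torch.cuda again inside the class.
        self.device = device

    def zero_like_params(self):
        # Tensors created here land on the same device as the model,
        # so CPU and GPU runs share one code path.
        return [torch.zeros_like(p, device=self.device)
                for p in self.model.parameters()]

# Usage sketch: the device comes from the check in run_trial.py.
# architect = Architect(model, device)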
OK I will change it.
@andreyvelich Please check it and thank you for your help.
Thanks, I restarted tests.
Signed-off-by: Chen Pin-Han <72907153+sifa1024@users.noreply.github.com>
/lgtm
/approve
/hold
for restarting failed Go Test / Unit Test (1.26.1) (pull_request)
@kubeflow/wg-automl-leads Could you restart CI?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: sifa1024, tenzen-y
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@kubeflow/wg-automl-leads Could you approve CI again?
@sifa1024
@tenzen-y OK, I'm sorry. I found that a commit in my branch was not updated, so I just updated it.
@kubeflow/wg-automl-leads Could you restart /lgtm?
Thank you for your contribution @sifa1024!
What this PR does / why we need it:
If I use the original program, I get an error when running darts-gpu.
Which issue(s) this PR fixes:
None. I created this pull request directly.
Checklist: