Skip to content
This repository has been archived by the owner on Sep 18, 2024. It is now read-only.

[WIP] Enable optional Pod Spec for FrameworkController platform #3379

Merged
merged 11 commits into from
Apr 1, 2021

Conversation

mbu93
Copy link
Contributor

@mbu93 mbu93 commented Feb 10, 2021

As discussed in #3350, this PR introduces:

  • The option to specify a custom framework controller config for framework controller mode
  • A minimum working setup for the mnist-pytorch example (nni config and framework controller config)
  • Required changes to the related python parser / webserver logic

In addition, the following changes have been applied:

  • Option to specify a k8s namespace for the trials to run

Currently, the following components have not yet been considered (thus the WIP state):

  • documentation for the code and examples
  • unit tests covering the changes

@ghost
Copy link

ghost commented Feb 10, 2021

CLA assistant check
All CLA requirements met.

}, {Optional('configPath'): setType('configPath', str),
Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str),
Optional('configPath'): setType('configPath', str),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

duplicated configPath

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted it.

Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str),
Optional('configPath'): setType('configPath', str),
'pvc': {'path': setType('server', str)},
Copy link
Contributor

@SparkSnail SparkSnail Feb 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is configPath setting only for pvc storage? When set pvc storage, is it a optional field?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it so configPath is optional for NFS and Azure and mandatory for PVC, as it is required there to attach a PVC at all (via custom template)

@@ -364,7 +370,7 @@ def validate(self, data):
frameworkcontroller_trial_schema = {
'trial': {
'codeDir': setPathCheck('codeDir'),
'taskRoles': [{
Optional('taskRoles'): [{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If set taskRoles as optional field, better to add validation, refer https://github.com/microsoft/nni/blob/master/nni/tools/nnictl/config_schema.py#L536

Copy link
Contributor Author

@mbu93 mbu93 Mar 12, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally missed that validation part of the code! Super cool. I added a (even a little more extended) validation for the frameworkcontroller job config.

minSucceededTaskCount: -1
}
const trialConfig = <FrameworkControllerTrialConfigTemplate>{
name: x.name,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will x.name be undefined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the new validation mechanism, it is ensured, that the name field will never be empty. Is this sufficient or should
I handle the undefined case at that point anyways?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is validated in other place, it's ok here.

configTaskRoles = this.parseCustomTaskRoles(this.fcTemplate.spec.taskRoles)
}
const namespace = this.fcClusterConfig.namespace ? this.fcClusterConfig.namespace : "default";
this.genericK8sClient.setNamespace = namespace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

miss ;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

@@ -156,7 +253,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple

// Validate to make sure codeDir doesn't have too many files
try {
await validateCodeDir(this.fcTrialConfig.codeDir);
await validateCodeDir(this.fcTrialConfig ? this.fcTrialConfig.codeDir : './');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this.fcTrialConfig.codeDir is undefined, what does ./ contains?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this again. I actually didn't realize, the path is modified within the python logic and felt like setting it to the "current" folder per default (which wouldn't turn out to work as I realized).

@@ -202,6 +301,10 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
const fcClusterConfigNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
const nfsConfig: NFSConfig = fcClusterConfigNFS.nfs;
return `nfs://${nfsConfig.server}:${destDirectory}`;
} else if (this.fcClusterConfig.storage === 'pvc') {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/${destDirectory}`);
Copy link
Contributor

@SparkSnail SparkSnail Feb 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trialLocalNFSTempFolder is used for NFS storage, rename trialLocalNFSTempFolder to trialLocalPVCTempFolder here, or unify them to trialLocalTempFolder.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I unified it :)

@SparkSnail
Copy link
Contributor

please add doc for the configuration change.

@SparkSnail SparkSnail closed this Feb 22, 2021
@SparkSnail SparkSnail reopened this Feb 22, 2021
emptyDir: {}
- name: data-volume
persistentVolumeClaim:
claimName: nni-storage
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does users need to create PVC manually before create NNI experiments?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I added some documentation for the new custom capability and explicitly mentioned this prerequisite. I'm aware of the
fact, that the k8s API offers and endpoint to create a PVC algorithmically, but this can easily mess up a cluster's storage management, so this seems like a bad idea. If a person is willing to use a PVC, then I guess it should be somewhat enforced this person has sufficient rights to do so and is aware what's about to happen, which makes me think manual creation beforehand is a good idea. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree to let users manually create PVC, as long as we declare it in doc and examples.

@mbu93
Copy link
Contributor Author

mbu93 commented Feb 23, 2021

Hey @SparkSnail,
Thanks for the review! I'll try to take care of the required changes asap! :)

@J-shang
Copy link
Contributor

J-shang commented Feb 25, 2021

Hi @mbu93 Thanks for your contribution, we are about to code freeze, if it's not urgent, this pr will defer to the next version.

@mbu93
Copy link
Contributor Author

mbu93 commented Mar 4, 2021

Hey @SparkSnail, I was busy with a deadline and didn't find the time to take care of the PR. Next Version is fine, it's not urgent :)

- a check for empty taskRoles
- a check for empty or duplicate taskRole names
- configPath validation for frameworkcontroller jobs (necessity / availability)
- an optional configPath field for NFS and Azure configs
- custom templates for NFS and azure
- revoked path defaults ("./") as this is already handled in config
parsing
- unified temp paths
@mbu93
Copy link
Contributor Author

mbu93 commented Mar 12, 2021

Note: I just realized there is a little bug pending, that will cause the 2nd+ trial to use the wrong command, leading to a failed reporting of the results. I must have overseen that when testing. I'll take care of it asap.

@mbu93
Copy link
Contributor Author

mbu93 commented Mar 19, 2021

@SparkSnail @J-shang I applied the changes as proposed in the review and solved the remaining issue. Currently, however, python unit tests fail due to Yann LeCun's page being unreachable when downloading the MNIST dataset for some test scenarios :/ Speaking about tests: Is there any work to be done in this PR?

@SparkSnail SparkSnail closed this Mar 23, 2021
@SparkSnail SparkSnail reopened this Mar 23, 2021
@SparkSnail
Copy link
Contributor

@SparkSnail @J-shang I applied the changes as proposed in the review and solved the remaining issue. Currently, however, python unit tests fail due to Yann LeCun's page being unreachable when downloading the MNIST dataset for some test scenarios :/ Speaking about tests: Is there any work to be done in this PR?

Looks good to me.

@SparkSnail SparkSnail merged commit fbffbc7 into microsoft:master Apr 1, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants