[WIP] Enable optional Pod Spec for FrameworkController platform #3379
nni/tools/nnictl/config_schema.py
Outdated
}, {Optional('configPath'): setType('configPath', str),
Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str),
Optional('configPath'): setType('configPath', str),
duplicated configPath
Deleted it.
Optional('storage'): setChoice('storage', 'nfs', 'azureStorage', 'pvc'),
Optional('serviceAccountName'): setType('serviceAccountName', str),
Optional('configPath'): setType('configPath', str),
'pvc': {'path': setType('server', str)},
Is the configPath setting only for pvc storage? When pvc storage is set, is it an optional field?
I changed it so configPath is optional for NFS and Azure and mandatory for PVC, as it is required there to attach a PVC at all (via custom template)
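A minimal sketch of the rule described in this reply, assuming the cluster config is available as a plain dict. `check_config_path` is a hypothetical helper written for illustration and is not the function in config_schema.py; only the key names `storage` and `configPath` come from the schema discussed here.

```python
def check_config_path(cluster_config: dict) -> None:
    """Raise ValueError if the configPath rule is violated.

    Hypothetical helper: configPath is optional for nfs and azureStorage,
    but mandatory when storage is 'pvc'.
    """
    storage = cluster_config.get('storage')
    config_path = cluster_config.get('configPath')

    # A PVC can only be attached via a custom template, so the path is required.
    if storage == 'pvc' and not config_path:
        raise ValueError("configPath is required when storage is 'pvc'")

    # When present, the path must be a string regardless of storage type.
    if config_path is not None and not isinstance(config_path, str):
        raise ValueError('configPath must be a string')
```

With this sketch, `{'storage': 'nfs'}` passes, while `{'storage': 'pvc'}` without a `configPath` raises a `ValueError`.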
@@ -364,7 +370,7 @@ def validate(self, data):
frameworkcontroller_trial_schema = {
'trial': {
'codeDir': setPathCheck('codeDir'),
'taskRoles': [{
Optional('taskRoles'): [{
If taskRoles becomes an optional field, it would be better to add validation; see https://github.com/microsoft/nni/blob/master/nni/tools/nnictl/config_schema.py#L536
I totally missed that validation part of the code! Super cool. I added a (slightly more extensive) validation for the frameworkcontroller job config.
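As a rough illustration of the kind of check mentioned here (empty taskRoles, empty or duplicate taskRole names), the following is a sketch only. `check_task_roles` is a hypothetical name, and the actual validation added in this PR may be structured differently.

```python
from collections import Counter


def check_task_roles(task_roles: list) -> None:
    """Hypothetical check: reject empty taskRoles and empty or duplicate names."""
    if not task_roles:
        raise ValueError('taskRoles must not be empty')

    names = [role.get('name') for role in task_roles]

    # Every role needs a non-empty name so it can be referenced later.
    if any(not name for name in names):
        raise ValueError('every taskRole needs a non-empty name')

    # Names must be unique across all roles.
    duplicates = sorted(n for n, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError('duplicate taskRole names: ' + ', '.join(duplicates))
```

Running such a check before the trial config is built would let the later code rely on each role's name being set.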
minSucceededTaskCount: -1
}
const trialConfig = <FrameworkControllerTrialConfigTemplate>{
name: x.name,
Will x.name be undefined?
The new validation mechanism ensures that the name field will never be empty. Is this sufficient, or should I handle the undefined case at that point anyway?
If it is validated elsewhere, it's OK here.
configTaskRoles = this.parseCustomTaskRoles(this.fcTemplate.spec.taskRoles)
}
const namespace = this.fcClusterConfig.namespace ? this.fcClusterConfig.namespace : "default";
this.genericK8sClient.setNamespace = namespace
Missing ;
added
@@ -156,7 +253,7 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
// Validate to make sure codeDir doesn't have too many files
try {
await validateCodeDir(this.fcTrialConfig.codeDir);
await validateCodeDir(this.fcTrialConfig ? this.fcTrialConfig.codeDir : './');
If this.fcTrialConfig.codeDir is undefined, what does ./ contain?
I removed this again. I didn't realize that the path is modified within the Python logic, and I felt like defaulting it to the "current" folder (which, as I realized, wouldn't work).
@@ -202,6 +301,10 @@ class FrameworkControllerTrainingService extends KubernetesTrainingService imple
const fcClusterConfigNFS: FrameworkControllerClusterConfigNFS = <FrameworkControllerClusterConfigNFS>this.fcClusterConfig;
const nfsConfig: NFSConfig = fcClusterConfigNFS.nfs;
return `nfs://${nfsConfig.server}:${destDirectory}`;
} else if (this.fcClusterConfig.storage === 'pvc') {
await cpp.exec(`mkdir -p ${this.trialLocalNFSTempFolder}/${destDirectory}`);
trialLocalNFSTempFolder is used for NFS storage. Rename it to trialLocalPVCTempFolder here, or unify both as trialLocalTempFolder.
I unified it :)
Please add documentation for the configuration change.
  emptyDir: {}
- name: data-volume
  persistentVolumeClaim:
    claimName: nni-storage
Do users need to create the PVC manually before creating NNI experiments?
Yes. I added some documentation for the new custom capability and explicitly mentioned this prerequisite. I'm aware that the k8s API offers an endpoint to create a PVC programmatically, but this can easily mess up a cluster's storage management, so it seems like a bad idea. If a person wants to use a PVC, it should be somewhat enforced that this person has sufficient rights to do so and is aware of what's about to happen, which makes me think manual creation beforehand is a good idea. What do you think?
Agreed: let users create the PVC manually, as long as we state this in the docs and examples.
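Since the PVC has to exist before the experiment starts, it can help to list which claims a custom template refers to. The sketch below assumes the pod spec has already been loaded into a dict. `referenced_claims` is a hypothetical helper and not part of NNI, and the `scratch` volume name is made up for the example.

```python
def referenced_claims(pod_spec: dict) -> list:
    """Return the PVC claim names referenced by a pod spec's volumes.

    Hypothetical helper for illustration; it only inspects the dict and
    does not contact the cluster.
    """
    claims = []
    for volume in pod_spec.get('volumes', []):
        pvc = volume.get('persistentVolumeClaim')
        if pvc and pvc.get('claimName'):
            claims.append(pvc['claimName'])
    return claims


# Example spec mirroring the volumes fragment shown in this thread.
spec = {
    'volumes': [
        {'name': 'scratch', 'emptyDir': {}},
        {'name': 'data-volume',
         'persistentVolumeClaim': {'claimName': 'nni-storage'}},
    ]
}
print(referenced_claims(spec))  # ['nni-storage']
```

Each name returned this way is a claim the user would need to create in the target namespace beforehand.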
Hey @SparkSnail,
Hi @mbu93, thanks for your contribution. We are about to enter code freeze; if it's not urgent, this PR will be deferred to the next version.
Hey @SparkSnail, I was busy with a deadline and didn't find the time to take care of the PR. The next version is fine, it's not urgent :)
- a check for empty taskRoles
- a check for empty or duplicate taskRole names
- configPath validation for frameworkcontroller jobs (necessity / availability)
- an optional configPath field for NFS and Azure configs
- custom templates for NFS and Azure
- revoked path defaults ("./") as this is already handled in config parsing
- unified temp paths
Note: I just realized there is a small bug pending that will cause the second and later trials to use the wrong command, so reporting of the results fails. I must have overlooked that when testing. I'll take care of it asap.
@SparkSnail @J-shang I applied the changes as proposed in the review and solved the remaining issue. Currently, however, the Python unit tests fail because Yann LeCun's page is unreachable when downloading the MNIST dataset for some test scenarios :/ Speaking of tests: is there any work to be done in this PR?
Looks good to me.
As discussed in #3350, this PR introduces:
In addition, the following changes have been applied:
Currently, the following components have not yet been considered (thus the WIP state):