Bootstrap fixes #250

tcare · 2020-04-02T21:25:51Z

Project name is now additionally restricted to letters and underscores
Fixed an issue where we tried to load a nonexistent dataset from sklearn after bootstrapping
Added some general error handling
Standardized arguments to short and long forms & updated README
General code cleanup
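
For illustration, the letters-and-underscores restriction could be enforced with a check along these lines. This is a minimal sketch, not the code from this PR: the function name `validate_project_name` and the exact pattern are assumptions, and the real bootstrap script may apply further restrictions.

```python
import re

# Assumed pattern: one or more ASCII letters or underscores, nothing else.
_PROJECT_NAME_RE = re.compile(r"[A-Za-z_]+")


def validate_project_name(name: str) -> str:
    """Return the name unchanged if it is valid, otherwise raise ValueError.

    Hypothetical helper for illustration; not part of the repository.
    """
    if not _PROJECT_NAME_RE.fullmatch(name):
        raise ValueError(
            f"Invalid project name {name!r}: use letters and underscores only."
        )
    return name
```

Using `fullmatch` rather than `match` means a name such as `my_project-1` is rejected instead of being accepted on its valid prefix.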

eedorenko

Added a comment

eedorenko · 2020-04-02T22:39:54Z

ml_service/pipelines/diabetes_regression_build_train_pipeline.py

+        try:
+            from sklearn.datasets import load_diabetes
+        except ImportError as e:
+            print("Project has already been bootstrapped, you must provide your own data.")  # NOQA: E501


I wouldn't be so strong in this message. We don't know the reason for the error for sure; we're just guessing. I would go with something like "Failed to load diabetes dataset, perhaps the project has already ..."

The thing is that it will still rename load_diabetes to load_we_dont_call_his_name, which introduces buggy code (that we then handle with this try-except). Perhaps it would make sense to move all of this load_diabetes dataset creation out to a separate module (imported in this file) and exclude that module/file from "files" in replace_project_name.

sudivate · Apr 2, 2020

Yes, separating dataset creation is the right way to go. A quick hotfix is to call replace_project_name on the training script to rename the specific import.

tcare · Apr 2, 2020

I thought ImportError would be narrow enough in scope to avoid any weirdness. However, you made me realize that this would actually be the first time that we encounter sklearn in the flow, so this could hide a dependency error. At the very least, we need to check that sklearn exists and the loading function doesn't.

Re: moving it out, if we're going to do the non-hacky fix, I feel that this should be generic enough that you don't need to rely on having a predefined dataset to export and rather can use a csv.
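
The check described above (the package is installed, but the loading function is missing) could be sketched as follows. This is an illustration under assumptions, not what was merged: `classify_import` is a hypothetical helper, written generically so it can be tried without sklearn installed.

```python
import importlib
import importlib.util


def classify_import(package: str, module: str, attr: str) -> str:
    """Classify why `from module import attr` might fail.

    Returns "package_missing", "attribute_missing", or "ok".
    Hypothetical helper for illustration only.
    """
    # find_spec returns None when the top-level package is not installed,
    # which would point to a dependency problem rather than a renamed loader.
    if importlib.util.find_spec(package) is None:
        return "package_missing"
    mod = importlib.import_module(module)
    # The package is present; a missing attribute suggests the name was
    # rewritten (for example by the bootstrap rename).
    if not hasattr(mod, attr):
        return "attribute_missing"
    return "ok"
```

For the case in this thread, the call would be `classify_import("sklearn", "sklearn.datasets", name)` with whatever loader name the script ends up importing. Only the `"attribute_missing"` result would justify the "already bootstrapped" message; `"package_missing"` should surface as a dependency error.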

tcare · 2020-04-02T23:47:02Z

I decided to take the middle ground. CSV creation from the diabetes data has been factored out, but once the project has been bootstrapped, CSV loading will fail with a message that the user needs to provide a CSV. This way, we avoid a situation where we silently use diabetes data (happened to me :)) and still make it easy to bring a CSV to use.
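
A fail-loudly CSV load of the kind described here might look roughly like this. It is a sketch only: the function name `load_training_csv` and the message wording are assumptions, and it uses the standard-library `csv` module rather than whatever the pipeline actually uses.

```python
import csv
import os


def load_training_csv(path: str) -> list:
    """Load rows from a CSV file, failing with an actionable message.

    Hypothetical helper for illustration; not the code from this PR.
    """
    if not os.path.isfile(path):
        # Fail explicitly instead of silently falling back to sample data.
        raise FileNotFoundError(
            f"Training data not found at {path!r}. "
            "Provide your own CSV file."
        )
    with open(path, newline="") as f:
        # Each row becomes a dict keyed by the header line.
        return list(csv.DictReader(f))
```

Raising instead of falling back keeps the failure visible at the point where the data is missing, which is the behaviour this comment argues for.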

eedorenko · 2020-04-03T01:27:46Z

lgtm

tcare added 5 commits April 2, 2020 12:24

Manual style fix pass

86a0286

Bootstrap script: enforce letters and underscores only

f113ec2

Improve error handling and argument validation

944f764

Avoid sklearn import error after bootstrap script runs

96d4ad1

Update bootstrap README with standardized args

f8f88ce

tcare requested review from dtzar, eedorenko and sudivate April 2, 2020 21:25

Linting fixes

6c9cfa4

tcare force-pushed the tcare/bootstrap-fixes branch from d8a4eee to 6c9cfa4 Compare April 2, 2020 21:32

eedorenko approved these changes Apr 2, 2020

Factor out diabetes CSV creation

9fdf97c

eedorenko merged commit 3ed9a90 into master Apr 3, 2020

dtzar deleted the tcare/bootstrap-fixes branch April 9, 2020 21:12
