
[KED-1081] Make the folder /data/ as abstract data folder #105

Closed
arita37 opened this issue Sep 27, 2019 · 13 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

arita37 commented Sep 27, 2019

Description

We often have to deal with very large datasets (i.e. >100 GB) of text data, and storing them in the /data/ folder is not possible.

arita37 added the Issue: Feature Request label on Sep 27, 2019
tsanikgr (Contributor) commented Sep 27, 2019

Hi @arita37

Absolute paths are already supported by all filepath-based AbstractDataSets.

Please have a look here to see how to define such datasets in the DataCatalog: make sure to change filepath from a relative path (data/...) to an absolute one (/data/...).

Working with big datasets typically involves cloud solutions, and for that case Kedro also supports many cloud-based datasets (see here and here).
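
For illustration, here is a minimal sketch using the programmatic DataCatalog API; the dataset name and path are made up:

```python
from kedro.io import CSVLocalDataSet, DataCatalog

# Equivalent of a catalog.yml entry, but with an absolute filepath
# instead of the default relative data/... path.
catalog = DataCatalog(
    {"big_dataset": CSVLocalDataSet(filepath="/mnt/storage/big_dataset.csv")}
)
df = catalog.load("big_dataset")  # loads a pandas DataFrame
```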

arita37 (Author) commented Sep 28, 2019

Hello,
Thanks for the reply and the details, this is useful.

My point is more about the generic design (not just the path).

Let me clarify:

  1. Separation of source code (git-versionable) from data (not git-versionable).
     Having the dataset mixed in with the code is not always good practice for software development.
     That's why I propose to treat data/ as a folder of "abstract datasets" (i.e. represented by .yml).

Flid (Contributor) commented Oct 4, 2019

Hello @arita37

Thank you for the feedback! Let me address the comments.

  1. We don't encourage anyone to store data files in git. The data directory is in the .gitignore file, so by default the files are simply ignored.
  2. You can already split your catalog into multiple files (see the sketch after this list): https://kedro.readthedocs.io/en/latest/04_user_guide/03_configuration.html#loading
  3. What's the purpose of the suggested CLI folder? Currently we assume that all additional CLI commands should go into the kedro_cli.py file, which is part of the created project and is easily editable. All commands added there become available via kedro COMMAND ....
  4. We'll check it out later, thank you.
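
As a rough sketch of how the split catalog files are merged, assuming the default conf/ layout and the patterns from the linked docs:

```python
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Every file matching catalog*.yml under these folders is read and
# merged into a single configuration dictionary.
conf_loader = ConfigLoader(["conf/base", "conf/local"])
catalog_conf = conf_loader.get("catalog*", "catalog*/**")

catalog = DataCatalog.from_config(catalog_conf)
```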

But none of the points above seems to relate to the original question: an abstract /data/ directory.

If you have massive files and complex ways of accessing them, you have a few options:

  1. Kedro DataSets, which can encapsulate arbitrarily complex logic. We have a lot of standard and contrib data sets available, and you can easily implement your own, since the interface is simple (see the sketch after this list). Any contribution back to Kedro is welcome!
  2. If you want to make it completely transparent to the code, like reading from a file which does all the magic inside, you can use FUSE. Basically, write a script (even in Python) to handle file-system events on the directory.
  3. All sorts of SSHFS/GFS/FTPFS virtual file systems, which don't store the data locally but serve as a transparent proxy. These are usually based on FUSE.
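
To show how simple the interface is, here is a minimal sketch of a custom data set for a huge text file; the class name and the chunked-reading strategy are made up:

```python
from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataSet


class ChunkedTextDataSet(AbstractDataSet):
    """Hypothetical data set that reads a very large CSV lazily."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self):
        # Return an iterator of chunks instead of loading >100 GB at once.
        return pd.read_csv(self._filepath, chunksize=1_000_000)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> dict:
        return dict(filepath=str(self._filepath))
```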

Does this help? It would be nice to know more about your use cases and requirements.

Flid changed the title from "Make the folder /data/ as abstract data folder" to "[KED-1081] Make the folder /data/ as abstract data folder" on Oct 4, 2019
arita37 (Author) commented Oct 4, 2019 via email

yetudada (Contributor) commented Oct 4, 2019

Hi @arita37, I'll leave @Flid to handle the bulk of this conversation, but I'll make a few comments about MLflow.

  • They solve orthogonal problems and can be used together: Kedro focuses on development experience, code organisation and data abstraction, while MLflow provides tracking and better support for versioning.
  • We have a Medium post going out soon about how our teams use both together, leveraging MLflow's tracking ability.
  • In the meantime, I definitely recommend checking out "Use mlflow for better versioning and collaboration" #113 to see how that team has used Kedro & MLflow together.

Flid (Contributor) commented Oct 4, 2019

  1. https://github.com/quantumblacklabs/kedro-docker/blob/53eb98201048fd4e2eed74bfb1738ab97ac5ad7a/kedro_docker/template/.dockerignore#L13 - data is not copied into a Docker image.
  2. data/06_models is supposed to store the models, but we don't enforce that, we just recommend it. How you serialise the model is completely up to you; Kedro has a few useful data sets like PickleLocalDataSet that cover the general case (see the sketch after this list).
  3. Not sure I understand the question. If you are interested in the concept of a node itself, [here's a description of what it is and how it works](https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes.html). If by "data split" you mean splitting into training and test parts, that's not something Kedro is responsible for. Kedro helps you organise the pipeline code and gives it structure with added benefits; the actual logic is out of scope, and other frameworks (including the MLflow you mentioned) should be used for it.
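
For instance, a minimal sketch of persisting a model with PickleLocalDataSet; the filepath just follows the recommended layout, and the stand-in object could be any picklable model:

```python
from kedro.io import PickleLocalDataSet

model_ds = PickleLocalDataSet(filepath="data/06_models/model.pkl")
model_ds.save({"weights": [0.1, 0.2, 0.3]})  # any picklable object
model = model_ds.load()
```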

arita37 (Author) commented Oct 4, 2019 via email

Flid (Contributor) commented Oct 4, 2019

  1. Well, I agree it doesn't play nicely with Docker in all cases. Still, you can run the container with an additional volume, mounting your host directory under the data directory inside.
     I'm not sure what you mean by "virtual"; none of the definitions I can find seems relevant. But in any case you have the full power of *nix: soft and hard links, FUSE, Docker volumes, virtual file systems. Kedro doesn't try to cover everything. That's actually the Unix way of building a tool: do one small thing, do it well, and integrate easily with other tools, isn't it? :)
  2. Versioning is also supported by Kedro! It's not perfect yet and we're working on it, but I think it can already do what you need.
  3. Re-reading the initial question with this information: Kedro is not a platform for organising multiple machines into a cluster; it's a framework for organising code into pipelines, and you can then run the pipeline nodes in many ways (see the sketch after this list). For example, check out kedro-airflow. The same can be done for other platforms, because a Kedro pipeline is essentially a Python script in the end: you can run nodes in isolation, even on different machines, as long as you pass the data between them and connect it all together, which is exactly what we do for Airflow.
  4. ⬆️
  5. I didn't, sorry, can't help.
  6. Thank you :)
  7. Again, there are lots of possible pipeline structures for many different use cases. Kedro is a general-purpose framework for creating pipelines. It doesn't enforce the internal node structure and doesn't provide data-science tools or anything to use from inside the nodes. Kedro is the logical glue between the nodes. That's why it can usually be used easily with any DS framework, anything you can call from Python.
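
A minimal sketch of the "logical glue" idea; the node and dataset names are made up:

```python
from kedro.pipeline import Pipeline, node


def clean(raw):
    return raw.dropna()


def featurise(clean_df):
    return clean_df.assign(ratio=clean_df["a"] / clean_df["b"])


# Kedro only wires plain Python functions together; each node can be
# run in isolation, e.g. as one Airflow task per node.
pipeline = Pipeline(
    [
        node(clean, inputs="raw_data", outputs="clean_data"),
        node(featurise, inputs="clean_data", outputs="features"),
    ]
)
```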

arita37 (Author) commented Oct 5, 2019

What is missing for production use:

  1. A model folder (as an abstract folder) and a model lifecycle (versioning, tags).
     Data and models are completely different concepts (it's like saying code is the same as data...).

  2. A clear separation of training and inference by the framework (a sketch follows below):
     Train: dataset --> Process --> Train --> Model Storage + Statistics
     Inference: Model Load, Dataset --> Inference --> Results + Statistics
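
For concreteness, a sketch of the separation I mean, expressed as two pipelines; all names and function bodies are hypothetical stand-ins:

```python
from kedro.pipeline import Pipeline, node


def process(dataset):
    return dataset  # feature engineering would go here


def train(features):
    return {"weights": [0.0]}  # stand-in for a real training step


def infer(model, features):
    return [model["weights"][0] for _ in features]  # stand-in predictions


training_pipeline = Pipeline(
    [
        node(process, inputs="dataset", outputs="features"),
        node(train, inputs="features", outputs="model"),
    ]
)

inference_pipeline = Pipeline(
    [
        node(infer, inputs=["model", "new_features"], outputs="predictions"),
    ]
)
```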

Flid (Contributor) commented Oct 7, 2019

  1. I still don't think it's a big deal: you are not forced to use the provided structure and keep data in one place; you can use any virtual FS or soft/hard links. It seems to be a sensible setup for the most common use cases, without over-complicating things for everyone.
  2. Again, Kedro does not tell you how to process your data and train your models. It's completely up to you how you split your pipeline into data cleaning, training, inference, validation, whatever. With the feature we are about to release in a few days, you'll even be able to make them separate pipelines and run them individually.
  3. Do you mean that a lengthy model training produces some results periodically, and you wish to see the intermediate results? Either way, it's the particular framework that produces these; it seems to be out of scope for Kedro, but we need to think about how to connect them nicely.
  4. Indeed, that's a good idea. Even better, it's already supported 😄 Check out the kedro jupyter convert command. It creates nodes from the tagged code. It doesn't integrate these nodes into pipelines, of course, but still.

arita37 (Author) commented Oct 8, 2019

  1. The view was: since AbstractDataSet and its connectors were developed,
     why not develop an AbstractModel to manage the model lifecycle (see the sketch after this list)?
     An AbstractModel would be common to all kinds of machine-learning cycles (esp. for handling model drift).

  2. There are automated ML tools which already normalise the code.
     Since the end goal seems to be converting Jupyter notebooks into runnable code in Docker,
     why not add more pragma tags to allow better conversion:
     #PIPELINE: mypipeline_name
     <python_code>
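
For concreteness, a sketch of what such a hypothetical AbstractModel interface could look like; this is not part of Kedro, and all names are made up:

```python
from abc import ABC, abstractmethod
from typing import Any


class AbstractModel(ABC):
    """Hypothetical counterpart to AbstractDataSet for the model lifecycle."""

    @abstractmethod
    def fit(self, data: Any) -> None:
        """Train or re-train the model."""

    @abstractmethod
    def predict(self, data: Any) -> Any:
        """Run inference with the current model."""

    @abstractmethod
    def save(self, version: str) -> None:
        """Persist the model under a version tag."""

    @abstractmethod
    def load(self, version: str) -> None:
        """Restore a previously saved version."""
```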

yetudada (Contributor) commented

Hi @arita37, I want to see if I can create some actionable items from this issue and then close it with the appropriate tasks.

So I have a query about your 3rd point:
You've spoken about an AbstractModel. What would you like this class to do for you? And which frameworks would be best to work with it?

And on your 4th point:
We support a workflow that allows users to use Jupyter Notebooks for what they're good at, exploratory data analysis and initial pipeline development, but we do encourage users to move from Jupyter Notebooks to Python scripts with node-tagged cells and the kedro jupyter convert command. When you refer to better conversion, what do you mean?

yetudada (Contributor) commented

@arita37 It would be great to get more input from you when you have time. For now, I'll close this issue but I'll be happy to re-open it when you're ready.
