
[KED-1081] Make the folder /data/ as abstract data folder #105

Closed
arita37 opened this issue Sep 27, 2019 · 13 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

arita37 commented Sep 27, 2019

Description

We often have to deal with very large datasets (i.e. >100 GB) of text data, and storing them in the /data/ folder is not possible.

arita37 added the Issue: Feature Request label on Sep 27, 2019
tsanikgr (Contributor) commented Sep 27, 2019

Hi @arita37

Absolute paths are already supported by all filepath-based AbstractDataSets.

Please have a look here to see how to define such datasets in the DataCatalog: make sure to change filepath from a relative path (data/...) to an absolute one (/data/...).

Working with big datasets typically involves cloud solutions, and for that case Kedro also supports many cloud-based datasets (see here and here).
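
For illustration, here is a minimal sketch using the programmatic DataCatalog API; the dataset name and path are made up:

```python
from kedro.io import CSVLocalDataSet, DataCatalog

# Equivalent of a catalog.yml entry, but with an absolute filepath
# instead of the default relative data/... path.
catalog = DataCatalog(
    {"big_dataset": CSVLocalDataSet(filepath="/mnt/storage/big_dataset.csv")}
)
df = catalog.load("big_dataset")  # loads a pandas DataFrame
```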

arita37 (Author) commented Sep 28, 2019

Hello,
Thanks for the reply and the details, this is useful.

My point is more about the generic design (not just the path).

Let me clarify:

  1. Separation of source code (git-versionable) from data (not git-versionable).
     Having the dataset mixed in with the code is not always good practice for software development.
     That's why I propose to treat data/ as a folder of "abstract datasets" (i.e. represented by .yml).

Flid (Contributor) commented Oct 4, 2019

Hello @arita37

Thank you for the feedback! Let me address the comments.

  1. We don't encourage anyone to store data files in git. The data directory is in the .gitignore file, so by default the files are simply ignored.
  2. You can already split your catalog into multiple files (see the sketch after this list): https://kedro.readthedocs.io/en/latest/04_user_guide/03_configuration.html#loading
  3. What's the purpose of the suggested CLI folder? Currently we assume that all additional CLI commands should go into the kedro_cli.py file, which is part of the created project and is easily editable. All commands added there become available via kedro COMMAND ....
  4. We'll check it out later, thank you.
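
As a rough sketch of how the split catalog files are merged, assuming the default conf/ layout and the patterns from the linked docs:

```python
from kedro.config import ConfigLoader
from kedro.io import DataCatalog

# Every file matching catalog*.yml under these folders is read and
# merged into a single configuration dictionary.
conf_loader = ConfigLoader(["conf/base", "conf/local"])
catalog_conf = conf_loader.get("catalog*", "catalog*/**")

catalog = DataCatalog.from_config(catalog_conf)
```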

But none of the points above seems to relate to the original question: an abstract /data/ directory.

If you have massive files and complex ways of accessing them, you have a few options:

  1. Kedro DataSets, which can encapsulate arbitrarily complex logic. We have a lot of standard and contrib data sets available, and you can easily implement your own, since the interface is simple (see the sketch after this list). Any contribution back to Kedro is welcome!
  2. If you want to make it completely transparent to the code, like reading from a file which does all the magic inside, you can use FUSE. Basically, write a script (even in Python) to handle file-system events on the directory.
  3. All sorts of SSHFS/GFS/FTPFS virtual file systems, which don't store the data locally but serve as a transparent proxy. These are usually based on FUSE.
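
To show how simple the interface is, here is a minimal sketch of a custom data set for a huge text file; the class name and the chunked-reading strategy are made up:

```python
from pathlib import Path

import pandas as pd
from kedro.io import AbstractDataSet


class ChunkedTextDataSet(AbstractDataSet):
    """Hypothetical data set that reads a very large CSV lazily."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def _load(self):
        # Return an iterator of chunks instead of loading >100 GB at once.
        return pd.read_csv(self._filepath, chunksize=1_000_000)

    def _save(self, data: pd.DataFrame) -> None:
        data.to_csv(self._filepath, index=False)

    def _describe(self) -> dict:
        return dict(filepath=str(self._filepath))
```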

Does this help? It would be nice to know more about your use cases and requirements.

Flid changed the title from "Make the folder /data/ as abstract data folder" to "[KED-1081] Make the folder /data/ as abstract data folder" on Oct 4, 2019
arita37 (Author) commented Oct 4, 2019 via email

yetudada (Contributor) commented Oct 4, 2019

Hi @arita37, I'll leave @Flid to handle the bulk of this conversation, but I'll make a few comments about MLflow.

  • They solve orthogonal problems and can be used together: Kedro focuses on development experience, code organisation and data abstraction, while MLflow provides tracking and better support for versioning.
  • We have a Medium post going out soon about how our teams use both together, leveraging MLflow's tracking ability.
  • In the meantime, I definitely recommend checking out "Use mlflow for better versioning and collaboration" #113 to see how that team has used Kedro & MLflow together.

Flid (Contributor) commented Oct 4, 2019

  1. https://github.com/quantumblacklabs/kedro-docker/blob/53eb98201048fd4e2eed74bfb1738ab97ac5ad7a/kedro_docker/template/.dockerignore#L13 - data is not copied into a Docker image.
  2. data/06_models is supposed to store the models, but we don't enforce that, we just recommend it. How you serialise the model is completely up to you; Kedro has a few useful data sets like PickleLocalDataSet that cover the general case (see the sketch after this list).
  3. Not sure I understand the question. If you are interested in the concept of a node itself, [here's a description of what it is and how it works](https://kedro.readthedocs.io/en/latest/04_user_guide/05_nodes.html). If by "data split" you mean splitting into training and test parts, that's not something Kedro is responsible for. Kedro helps you organise the pipeline code and gives it structure with added benefits; the actual logic is out of scope, and other frameworks (including the MLflow you mentioned) should be used for it.
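
For instance, a minimal sketch of persisting a model with PickleLocalDataSet; the filepath just follows the recommended layout, and the stand-in object could be any picklable model:

```python
from kedro.io import PickleLocalDataSet

model_ds = PickleLocalDataSet(filepath="data/06_models/model.pkl")
model_ds.save({"weights": [0.1, 0.2, 0.3]})  # any picklable object
model = model_ds.load()
```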

arita37 (Author) commented Oct 4, 2019 via email

Flid (Contributor) commented Oct 4, 2019

  1. Well, I agree it doesn't play nicely with Docker in all cases. Still, you can run the container with an additional volume, mounting your host directory under the data directory inside.
     I'm not sure what you mean by "virtual"; none of the definitions I can find seems relevant. But in any case you have the full power of *nix: soft and hard links, FUSE, Docker volumes, virtual file systems. Kedro doesn't try to cover everything. That's actually the Unix way of building a tool: do one small thing, do it well, and integrate easily with other tools, isn't it? :)
  2. Versioning is also supported by Kedro! It's not perfect yet and we're working on it, but I think it can already do what you need.
  3. Re-reading the initial question with this information: Kedro is not a platform for organising multiple machines into a cluster; it's a framework for organising code into pipelines, and you can then run the pipeline nodes in many ways (see the sketch after this list). For example, check out kedro-airflow. The same can be done for other platforms, because a Kedro pipeline is essentially a Python script in the end: you can run nodes in isolation, even on different machines, as long as you pass the data between them and connect it all together, which is exactly what we do for Airflow.
  4. ⬆️
  5. I didn't, sorry, can't help.
  6. Thank you :)
  7. Again, there are lots of possible pipeline structures for many different use cases. Kedro is a general-purpose framework for creating pipelines. It doesn't enforce the internal node structure and doesn't provide data-science tools or anything to use from inside the nodes. Kedro is the logical glue between the nodes. That's why it can usually be used easily with any DS framework, anything you can call from Python.
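
A minimal sketch of the "logical glue" idea; the node and dataset names are made up:

```python
from kedro.pipeline import Pipeline, node


def clean(raw):
    return raw.dropna()


def featurise(clean_df):
    return clean_df.assign(ratio=clean_df["a"] / clean_df["b"])


# Kedro only wires plain Python functions together; each node can be
# run in isolation, e.g. as one Airflow task per node.
pipeline = Pipeline(
    [
        node(clean, inputs="raw_data", outputs="clean_data"),
        node(featurise, inputs="clean_data", outputs="features"),
    ]
)
```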

arita37 (Author) commented Oct 5, 2019

What is missing for production use:

  1. A model folder (as an abstract folder) and a model lifecycle (versioning, tags).
     Data and models are completely different concepts (it's like saying code is the same as data...).

  2. A clear separation of training and inference by the framework (a sketch follows below):
     Train: dataset --> Process --> Train --> Model Storage + Statistics
     Inference: Model Load, Dataset --> Inference --> Results + Statistics
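
For concreteness, a sketch of the separation I mean, expressed as two pipelines; all names and function bodies are hypothetical stand-ins:

```python
from kedro.pipeline import Pipeline, node


def process(dataset):
    return dataset  # feature engineering would go here


def train(features):
    return {"weights": [0.0]}  # stand-in for a real training step


def infer(model, features):
    return [model["weights"][0] for _ in features]  # stand-in predictions


training_pipeline = Pipeline(
    [
        node(process, inputs="dataset", outputs="features"),
        node(train, inputs="features", outputs="model"),
    ]
)

inference_pipeline = Pipeline(
    [
        node(infer, inputs=["model", "new_features"], outputs="predictions"),
    ]
)
```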

Flid (Contributor) commented Oct 7, 2019

  1. I still don't think it's a big deal: you are not forced to use the provided structure and keep data in one place; you can use any virtual FS or soft/hard links. It seems to be a sensible setup for the most common use cases, without over-complicating things for everyone.
  2. Again, Kedro does not tell you how to process your data and train your models. It's completely up to you how you split your pipeline into data cleaning, training, inference, validation, whatever. With the feature we are about to release in a few days, you'll even be able to make them separate pipelines and run them individually.
  3. Do you mean that a lengthy model training produces some results periodically, and you wish to see the intermediate results? Either way, it's the particular framework that produces these; it seems to be out of scope for Kedro, but we need to think about how to connect them nicely.
  4. Indeed, that's a good idea. Even better, it's already supported 😄 Check out the kedro jupyter convert command. It creates nodes from the tagged code. It doesn't integrate these nodes into pipelines, of course, but still.

arita37 (Author) commented Oct 8, 2019

  1. The view was: since AbstractDataSet and its connectors were developed,
     why not develop an AbstractModel to manage the model lifecycle (see the sketch after this list)?
     An AbstractModel would be common to all kinds of machine-learning cycles (esp. for handling model drift).

  2. There are automated ML tools which already normalise the code.
     Since the end goal seems to be converting Jupyter notebooks into runnable code in Docker,
     why not add more pragma tags to allow better conversion:
     #PIPELINE: mypipeline_name
     <python_code>
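
For concreteness, a sketch of what such a hypothetical AbstractModel interface could look like; this is not part of Kedro, and all names are made up:

```python
from abc import ABC, abstractmethod
from typing import Any


class AbstractModel(ABC):
    """Hypothetical counterpart to AbstractDataSet for the model lifecycle."""

    @abstractmethod
    def fit(self, data: Any) -> None:
        """Train or re-train the model."""

    @abstractmethod
    def predict(self, data: Any) -> Any:
        """Run inference with the current model."""

    @abstractmethod
    def save(self, version: str) -> None:
        """Persist the model under a version tag."""

    @abstractmethod
    def load(self, version: str) -> None:
        """Restore a previously saved version."""
```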

yetudada (Contributor) commented

Hi @arita37, I want to see if I can create some actionable items from this issue and then close it with the appropriate tasks.

So I have a query about your 3rd point:
You've spoken about an AbstractModel. What would you like this class to do for you? And which frameworks would be best to work with it?

And on your 4th point:
We support a workflow that allows users to use Jupyter Notebooks for what they're good at, exploratory data analysis and initial pipeline development, but we do encourage users to move from Jupyter Notebooks to Python scripts with node-tagged cells and the kedro jupyter convert command. When you refer to better conversion, what do you mean?

yetudada (Contributor) commented

@arita37 It would be great to get more input from you when you have time. For now, I'll close this issue but I'll be happy to re-open it when you're ready.
