
Verify new pyspark add-ons flow supports Alloy projects #3220

Closed
merelcht opened this issue Oct 24, 2023 · 8 comments
@merelcht (Member)

Description

Projects created with Alloy rely on the pyspark starter. This starter is being archived in favour of the new add-ons flow. Once #3073 is completed, we need to double-check that projects using Alloy can still be created as expected without the pyspark starter.

@merelcht (Member, Author) commented Oct 31, 2023

I tested this using: https://github.com/McK-Internal/InsureX-insurex/blob/a98305850c94996eed91ab715de7ab98b987908e/src/apps/home_fraud.yml and the instructions here https://brix.quantumblack.com/products/alloy/docs/03_tutorial/02_running_app/#building-an-app.

I updated home_fraud.yml and tested with a local setup to verify the add-ons flow:

home_fraud:
  metadata:
    name: home_fraud
    description: Fraud detection for household claims.
    version:
      attr: src.home_fraud.__version__
    readme:
      file:
        - README.md
  template:
    directory: /Users/merel_theisen/Projects/kedro-starters/spaceflights-pyspark
    root: "{{ cookiecutter.repo_name }}"
    requirements: "{{ cookiecutter.repo_name }}/requirements.txt"
    cookiecutter_json:
      _copy_without_render:
        - src/insurex/hooks/studio_hooks.py
      project_name: home_fraud
      repo_name: home_fraud
      python_package: home_fraud
      kedro_version: 0.18.15
      add_ons: "['Testing', 'Custom Logging', 'Data Structure', 'Pyspark']"
  docs:
    type: sphinx
  build:
    app_class: kedro
    packages:
      - insurex
      - insurex-datasets
      - customerone-lib-ab-testing
      - feature-helpers

Differences found between the old and new flow:

  • An empty pipelines folder exists; this should be removed, so update https://github.com/kedro-org/kedro/blob/develop/kedro/templates/project/hooks/utils.py#L122-L125 -> This empty pipelines folder exists in the base template as well.
  • test_run.py exists in the old flow but is removed in the new one.
    • Double-check whether it does show up when Testing is selected as an add-on. -> Yes, it works.
    • Tests in the old flow are under src.
  • The rendered template is different. In the new flow it contains the pipelines and tests as well. Verify whether this is an issue.
  • Placeholder credentials.yml files are missing from the new starters and should be added: Add credentials.yml placeholder files to all spaceflights projects kedro-starters#174
  • Quotes missing from add_ons in pyproject.toml. Not quite missing quotes, but it was unclear how cookiecutter input is parsed when using the app .yml in Alloy. -> Fixed; the syntax can be any of the following (see the parsing sketch after this list):
    • "Data Structure, Pyspark"
    • "[Data Structure, Pyspark, Testing, Custom Logging]"
    • "['Testing', 'Custom Logging', 'Data Structure', 'Pyspark']"

Notes for the Alloy team/verticals:

  • The new flow will use the new pyproject.toml and requirements.txt structure
  • logging.yml has been moved outside of conf/base to conf/
  • tests have been moved outside of src
  • With the new flow you'll get an empty pipelines directory, which is in line with our base template. The pyspark starter was actually inconsistent in this aspect.
  • To get the same result they must select add-ons: Data Structure, Pyspark, Testing, Custom Logging, Documentation

@marc-solomon

thanks for testing this Merel! 🙏

QQ:

To get the same result they must select add-ons: Data Structure, Pyspark, Testing, Custom Logging, Documentation

Does "same result" here mean the same rendered project structure users would get in the pre-add-ons workflow, when using the pyspark starter?

@merelcht (Member, Author) commented Nov 7, 2023

To get the same result they must select add-ons: Data Structure, Pyspark, Testing, Custom Logging, Documentation

Does "same result" here mean the same rendered project structure users would get in the pre-add-ons workflow, when using the pyspark starter?

Yes, but with the deliberate changes noted in my comment:

  • The new flow will use the new pyproject.toml and requirements.txt structure
  • logging.yml has been moved outside of conf/base to conf/
  • tests have been moved outside of src
  • With the new flow you'll get an empty pipelines directory, which is in line with our base template. The pyspark starter was actually inconsistent in this aspect.

I also noticed that running alloy app build ... will generate a rendered and a template directory. The rendered directory works fully as expected with the add-ons workflow, but the template one is essentially just a copy of our full template, so it does include the example pipelines as well. Is that an issue? @marc-solomon

@nmorcotiloqb

Hi Merel! 👋

Thank you for the detailed description of the new changes and steps.

I've tested the new starter for our default demo project, which previously used kedro-starters/pyspark.

Using the following template:

template:  # Clone or reference a cookiecutter template
    repo: https://github.com/kedro-org/kedro-starters
    directory: kedro-starters/spaceflights-pyspark
    checkout: main
    cookiecutter_json: # Provide placeholders to cookiecutter.json
      ...
      add_ons: "['Testing', 'Custom Logging', 'Data Structure', 'Pyspark']"

and having kedro@develop installed locally, the demo project builds successfully.

I'm having some small issues with kedro run. Our spaceflights package comes with some data (data/01_raw/companies.csv) stored under data/.

After we render the template, I observed that kedro's post_gen_project.py calls setup_template_add_ons, which under the hood executes the following block:

if "Pyspark" in selected_add_ons_list:
        _handle_starter_setup(selected_add_ons_list, python_package_name)

and all the data is removed from data/01_raw (https://github.com/kedro-org/kedro/blob/develop/kedro/templates/project/hooks/utils.py#L116), so kedro run fails.
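
For context, the linked removal step amounts to something like the sketch below (illustrative only; clear_example_data is a hypothetical name, not the actual kedro function):

import shutil
from pathlib import Path

def clear_example_data(project_root: Path) -> None:
    # When the Pyspark add-on is selected, the post-gen hook clears any
    # example data shipped with the template, leaving data/01_raw empty.
    raw_dir = project_root / "data" / "01_raw"
    for item in raw_dir.iterdir():
        if item.is_dir():
            shutil.rmtree(item)
        else:
            item.unlink()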

Is this expected? How can we overcome this on the Alloy side and disable this overwrite? The data is present in the template and only disappears when we render the template with cookiecutter.

Another small issue: for the template, we already generate a requirements.txt with all package requirements:

	# via spaceflightstools
boltons
	# via spaceflights

When the template is rendered, sort_requirements(requirements_file_path) is executed from post_gen_project.py, which generates the following requirements.txt:

# via spaceflights
# via spaceflightsde
# ...
# via spaceflightstools
# via spaceflightstools
PyYAML>=6.0.1
boltons
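
A minimal sketch of why this happens, assuming sort_requirements sorts the file line by line, which matches the output above (not necessarily the exact kedro implementation):

def sort_requirements(requirements_file_path: str) -> None:
    # Sorting the file line by line groups all "# via ..." comment lines
    # together, away from the package lines they used to annotate.
    with open(requirements_file_path) as f:
        lines = f.read().splitlines()
    with open(requirements_file_path, "w") as f:
        f.write("\n".join(sorted(lines)) + "\n")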

Is it possible to disable sorting from the Alloy side?

Thank you very much!
Let me know if you have any questions.

@merelcht (Member, Author)

Thanks for testing @nmorcotiloqb! To answer your questions:

I'm having some small issues with kedro run. Our spaceflights package comes with some data (data/01_raw/companies.csv) stored under data/.

Is this expected? How can we overcome this on the Alloy side and disable this overwrite? The data is present in the template and only disappears when we render the template with cookiecutter.

We're currently implementing the last step of the new project creation flow which allows you to include an example. With that enabled you will get the spaceflights data as well as the example pipelines. When @datajoely reached out he said you were using the pyspark starter which doesn't include any data, so I only looked at replicating that scenario.

Another small issue: for the template, we already generate a requirements.txt with all package requirements:

	# via spaceflightstools
boltons
	# via spaceflights

When the template is rendered, sort_requirements(requirements_file_path) is executed from post_gen_project.py.
Is it possible to disable sorting from the Alloy side?

At the moment we don't have any settings to enable/disable the sorting of requirements. We could just not sort them if this causes big problems for the verticals. cc: @astrojuanlu @marc-solomon

@nmorcotiloqb

Thanks for the detailed answer @merelcht !

you were using [the pyspark starter](https://github.com/kedro-org/kedro-starters/tree/main/pyspark) which doesn't include any data, so I only looked at replicating that scenario.

Yes, that's true, we are using the pyspark starter, which doesn't come with any data by default.

At build time, a package can have some data, which Alloy can move in using a merge step, so it is present in the template phase.
Unfortunately, this happens before the cookiecutter render, so Alloy loses control at this stage.

Would it be possible to have an additional add-on, similar to Data Structure (something like Default Data), that tells the post-gen script not to remove files under data/ during the rendering phase?
Or maybe to pass some optional flag into cookiecutter.json (for example keep_data) that changes the behaviour of the post-gen script; it could be controlled by Alloy and at the same time not be exposed to the ordinary user (see the sketch below).
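
A rough sketch of that second idea (entirely hypothetical; neither a keep_data flag nor this guard exists in kedro) could look like this in post_gen_project.py, which cookiecutter renders with the template context before executing:

# Hypothetical: Alloy injects "keep_data": "True" into cookiecutter.json;
# ordinary users never see it because it is not prompted for.
keep_data = "{{ cookiecutter.keep_data | default('False') }}" == "True"

# selected_add_ons_list and _handle_starter_setup are the names from the
# hook snippet quoted earlier in this thread.
if "Pyspark" in selected_add_ons_list and not keep_data:
    # Only clear the example data when the caller has not opted out.
    _handle_starter_setup(selected_add_ons_list, python_package_name)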

What do you think?
Thank you!

@merelcht (Member, Author)

I think that would be too Alloy-specific an add-on, whereas this flow is meant to be generic for any Kedro user. We also don't want to bloat the number of add-ons, because that could make Kedro project creation too tedious for users. IMO this goes beyond the scope of this immediate ticket, but it would be helpful to have a meeting with the Alloy team to discuss the product vision for this piece on the Kedro side, so we're all more aligned on how this workflow is supposed to work and be used.

@merelcht (Member, Author)

Closing this specific issue and will move forward by scheduling a meeting to align product visions.
