
Verify new pyspark add-ons flow supports Alloy projects #3220

Closed
merelcht opened this issue Oct 24, 2023 · 8 comments
@merelcht (Member)

Description

Projects created with Alloy rely on the pyspark starter. This starter is being archived in favour of the new add-ons flow. Once #3073 is completed, we need to double-check that projects using Alloy can still be created as expected without the pyspark starter.

@merelcht (Member, Author) commented Oct 31, 2023

I tested this using: https://github.com/McK-Internal/InsureX-insurex/blob/a98305850c94996eed91ab715de7ab98b987908e/src/apps/home_fraud.yml and the instructions here https://brix.quantumblack.com/products/alloy/docs/03_tutorial/02_running_app/#building-an-app.

I updated home_fraud.yml and tested with a local setup to verify the add-ons flow:

home_fraud:
  metadata:
    name: home_fraud
    description: Fraud detection for household claims.
    version:
      attr: src.home_fraud.__version__
    readme:
      file:
        - README.md
  template:
    directory: /Users/merel_theisen/Projects/kedro-starters/spaceflights-pyspark
    root: "{{ cookiecutter.repo_name }}"
    requirements: "{{ cookiecutter.repo_name }}/requirements.txt"
    cookiecutter_json:
      _copy_without_render:
        - src/insurex/hooks/studio_hooks.py
      project_name: home_fraud
      repo_name: home_fraud
      python_package: home_fraud
      kedro_version: 0.18.15
      add_ons: "['Testing', 'Custom Logging', 'Data Structure', 'Pyspark']"
  docs:
    type: sphinx
  build:
    app_class: kedro
    packages:
      - insurex
      - insurex-datasets
      - customerone-lib-ab-testing
      - feature-helpers

Differences found between the old and new flow:

  • An empty pipelines folder exists; this should be removed, so update https://github.com/kedro-org/kedro/blob/develop/kedro/templates/project/hooks/utils.py#L122-L125 -> This empty pipelines folder exists in the base template as well.
  • test_run.py exists in the old flow but is removed in the new one.
    • Double-check whether it does show up when Testing is selected as an add-on. -> Yes, it works.
    • Tests in the old flow are under src.
  • The rendered template is different. In the new flow it contains the pipelines and tests as well. Verify whether this is an issue.
  • Placeholder credentials.yml files are missing from the new starters and should be added: Add credentials.yml placeholder files to all spaceflights projects kedro-starters#174
  • Quotes missing from add_ons in pyproject.toml. Not quite missing quotes, but it was unclear how cookiecutter input is parsed when using the app .yml in Alloy. -> Fixed; the syntax can be any of the following (see the parsing sketch after this list):
    • "Data Structure, Pyspark"
    • "[Data Structure, Pyspark, Testing, Custom Logging]"
    • "['Testing', 'Custom Logging', 'Data Structure', 'Pyspark']"

Notes for the Alloy team/verticals:

  • The new flow will use the new pyproject.toml and requirements.txt structure
  • logging.yml has been moved outside of conf/base to conf/
  • tests have been moved outside of src
  • With the new flow you'll get an empty pipelines directory, which is in line with our base template. The pyspark starter was actually inconsistent in this aspect.
  • To get the same result they must select add-ons: Data Structure, Pyspark, Testing, Custom Logging, Documentation

@marc-solomon

thanks for testing this Merel! 🙏

QQ:

To get the same result they must select add-ons: Data Structure, Pyspark, Testing, Custom Logging, Documentation

Does "same result" here mean the same rendered project structure users would get in the pre-add-ons workflow, when using the pyspark starter?

@merelcht (Member, Author) commented Nov 7, 2023

To get the same result they must select add-ons: Data Structure, Pyspark, Testing, Custom Logging, Documentation

Does "same result" here mean the same rendered project structure users would get in the pre-add-ons workflow, when using the pyspark starter?

Yes, but with the deliberate changes noted in my comment:

  • The new flow will use the new pyproject.toml and requirements.txt structure
  • logging.yml has been moved outside of conf/base to conf/
  • tests have been moved outside of src
  • With the new flow you'll get an empty pipelines directory, which is in line with our base template. The pyspark starter was actually inconsistent in this aspect.

I also noticed that running alloy app build ... will generate a rendered and a template directory. The rendered directory works fully as expected with the add-ons workflow, but the template one is essentially just a copy of our full template, so it does include the example pipelines as well. Is that an issue? @marc-solomon

@nmorcotiloqb

Hi Merel! 👋

Thank you for the detailed description of the new changes and steps.

I've tested the new starter for our default demo project, which previously used kedro-starters/pyspark.

Using the following template:

template:  # Clone or reference a cookiecutter template
    repo: https://github.com/kedro-org/kedro-starters
    directory: kedro-starters/spaceflights-pyspark
    checkout: main
    cookiecutter_json: # Provide placeholders to cookiecutter.json
      ...
      add_ons: "['Testing', 'Custom Logging', 'Data Structure', 'Pyspark']"

and having kedro@develop installed locally, the demo project builds successfully.

I'm having some small issues with kedro run. Our spaceflights package comes with some data (data/01_raw/companies.csv) stored under data/.

After we render the template, I observed that kedro's post_gen_project.py calls setup_template_add_ons, which under the hood executes the following block:

if "Pyspark" in selected_add_ons_list:
        _handle_starter_setup(selected_add_ons_list, python_package_name)

and all the data is removed from data/01_raw (https://github.com/kedro-org/kedro/blob/develop/kedro/templates/project/hooks/utils.py#L116), so kedro run fails.
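
For context, the linked removal step amounts to something like the sketch below (illustrative only; clear_example_data is a hypothetical name, not the actual kedro function):

import shutil
from pathlib import Path

def clear_example_data(project_root: Path) -> None:
    # When the Pyspark add-on is selected, the post-gen hook clears any
    # example data shipped with the template, leaving data/01_raw empty.
    raw_dir = project_root / "data" / "01_raw"
    for item in raw_dir.iterdir():
        if item.is_dir():
            shutil.rmtree(item)
        else:
            item.unlink()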

Is this expected? How can we overcome this on the Alloy side and disable this overwrite? The data is present in the template and only disappears when we render the template with cookiecutter.

Another small issue: for the template, we already generate a requirements.txt with all package requirements:

	# via spaceflightstools
boltons
	# via spaceflights

When the template is rendered, sort_requirements(requirements_file_path) is executed from post_gen_project.py, which generates the following requirements.txt:

# via spaceflights
# via spaceflightsde
# ...
# via spaceflightstools
# via spaceflightstools
PyYAML>=6.0.1
boltons
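
A minimal sketch of why this happens, assuming sort_requirements sorts the file line by line, which matches the output above (not necessarily the exact kedro implementation):

def sort_requirements(requirements_file_path: str) -> None:
    # Sorting the file line by line groups all "# via ..." comment lines
    # together, away from the package lines they used to annotate.
    with open(requirements_file_path) as f:
        lines = f.read().splitlines()
    with open(requirements_file_path, "w") as f:
        f.write("\n".join(sorted(lines)) + "\n")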

Is it possible to disable sorting from the Alloy side?

Thank you very much!
Let me know if you have any questions.

@merelcht (Member, Author)

Thanks for testing @nmorcotiloqb! To answer your questions:

I'm having some small issues with kedro run. Our spaceflights package comes with some data (data/01_raw/companies.csv) stored under data/.

Is this expected? How can we overcome this on the Alloy side and disable this overwrite? The data is present in the template and only disappears when we render the template with cookiecutter.

We're currently implementing the last step of the new project creation flow which allows you to include an example. With that enabled you will get the spaceflights data as well as the example pipelines. When @datajoely reached out he said you were using the pyspark starter which doesn't include any data, so I only looked at replicating that scenario.

Another small issue: for the template, we already generate a requirements.txt with all package requirements:

	# via spaceflightstools
boltons
	# via spaceflights

When the template is rendered, sort_requirements(requirements_file_path) is executed from post_gen_project.py.
Is it possible to disable sorting from the Alloy side?

At the moment we don't have any settings to enable/disable the sorting of requirements. We could just not sort them if this causes big problems for the verticals. cc: @astrojuanlu @marc-solomon

@nmorcotiloqb

Thanks for the detailed answer @merelcht !

you were using [the pyspark starter](https://github.com/kedro-org/kedro-starters/tree/main/pyspark) which doesn't include any data, so I only looked at replicating that scenario.

Yes, that's true, we are using the pyspark starter, which doesn't come with any data by default.

At build time, a package can have some data, which Alloy can move in using a merge step, so it is present in the template phase.
Unfortunately, this happens before the cookiecutter render, so Alloy loses control at this stage.

Would it be possible to have an additional add-on, similar to Data Structure (something like Default Data), that tells the post-gen script not to remove files under data/ during the rendering phase?
Or maybe to pass some optional flag into cookiecutter.json (for example keep_data) that changes the behaviour of the post-gen script; it could be controlled by Alloy and at the same time not be exposed to the ordinary user (see the sketch below).
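
A rough sketch of that second idea (entirely hypothetical; neither a keep_data flag nor this guard exists in kedro) could look like this in post_gen_project.py, which cookiecutter renders with the template context before executing:

# Hypothetical: Alloy injects "keep_data": "True" into cookiecutter.json;
# ordinary users never see it because it is not prompted for.
keep_data = "{{ cookiecutter.keep_data | default('False') }}" == "True"

# selected_add_ons_list and _handle_starter_setup are the names from the
# hook snippet quoted earlier in this thread.
if "Pyspark" in selected_add_ons_list and not keep_data:
    # Only clear the example data when the caller has not opted out.
    _handle_starter_setup(selected_add_ons_list, python_package_name)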

What do you think?
Thank you!

@merelcht (Member, Author)

I think that would be too Alloy-specific an add-on, whereas this flow is meant to be generic for any Kedro user. We also don't want to bloat the number of add-ons, because that could make Kedro project creation too tedious for users. IMO this goes beyond the scope of this immediate ticket, but it would be helpful to have a meeting with the Alloy team to discuss the product vision for this piece on the Kedro side, so we're all more aligned on how this workflow is supposed to work and be used.

@merelcht (Member, Author)

Closing this specific issue and will move forward by scheduling a meeting to align product visions.
