Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MAINTENANCE] Add force_reuse_spark_context to DatasourceConfigSchema #2968

Closed

Conversation

pasmavie
Copy link
Contributor

@pasmavie pasmavie commented Jun 29, 2021

With PR 2733 a parameter was added to SparkDFDatasource, to make GE's context reuse an existing Spark Context.

This was extremely useful.

However, when using a dynamic Data Context configuration (e.g. in EMR) like

data_context_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_data_source": {
            "class_name": "SparkDFDatasource",
            "spark_config": dict(spark.sparkContext.getConf().getAll()),
            "force_reuse_spark_context": True,
            "module_name": "great_expectations.datasource",
            "batch_kwargs_generators": {},
        }
    },
    ...

I've found that the spark_config and force_reuse_spark_context parameters weren't actually passed to geat_expectations.core.util.get_or_create_spark_application().

In fact, the parameters are lost when the DataContextConfig object is dumped into a dictionary, because the DatasourceConfigSchema (elem of the DataContextConfigSchema) doesn't list these parameters.

Changes proposed in this pull request:

  • Add spark_config and force_reuse_spark_context to the DatasourceConfigSchema

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

@netlify
Copy link

netlify bot commented Jun 29, 2021

✔️ Deploy Preview for knoxpod ready!

🔨 Explore the source changes: 227511e

🔍 Inspect the deploy log: https://app.netlify.com/sites/knoxpod/deploys/60ed73cafe66250007930236

😎 Browse the preview: https://deploy-preview-2968--knoxpod.netlify.app

@gx-cla-bot
Copy link

gx-cla-bot bot commented Jun 29, 2021

A new contributor, HUZZAH! Welcome and thanks for joining our community. In order to accept a pull request we require that all contributors sign our Contributor License Agreement. We have two different CLAs, depending on whether you are contributing to GE in a personal or professional capacity. Please sign the one that is applicable to your situation so that we may accept your contribution:

Individual Contributor License Agreement v1.0
Software Grant and Corporate Contributor License Agreement v1.0

Once you have signed the CLA, you can add a comment with the text @cla-bot check and the bot will update the PR status!

Please reach out to @kyleaton, on our Slack if you have any questions or if you have already signed the CLA and are receiving this message in error.

Users missing a CLA: gio@gio

@pasmavie
Copy link
Contributor Author

@cla-bot check

@talagluck talagluck self-requested a review July 12, 2021 14:39
@talagluck
Copy link
Contributor

Hi @gipaetusb, thank you so much for the contribution - this is awesome! Just a few things:

  • It looks like we have a CLA record for the username gipaetusb, but there were a few commits made as gio or gio@gio and we are missing a CLA for that username. Are you able to fix this? Please let me know if you have any questions.
  • It looks like a number of unit tests are failing as a result, and they will need to be updated, but I would expect the changes will be pretty small.
  • Do you think this PR would benefit from a unit test for the additional functionality?
    Please let me know if you have any questions or need additional help. Thanks again!

gipaetusb and others added 21 commits July 13, 2021 11:07
…n citations (great-expectations#2964)

* refactor(expectationsuite): Add profiler_config to citations
* test(fixture): Add Bobster config to improve tests
* chore(misc): Move global YAML obj into fixture
The team was having issues getting the mssql docker image to initialize
during Azure CI/CD.

Some research into the matter revealed that without explictly stating a
version for the image, the default of 'latest' is applied. Docker was
unable to find this image (I believe it is to be deprecated on the MS
side).

Per Microsoft docs, pulling the image from mcr.microsoft.com and
specifying the 2019-latest version is the recommended approach. All
files that used this outdated image have been updated appropriately.

Microsoft docs: https://docs.microsoft.com/en-us/sql/linux/quickstart-install-connect-docker?view=sql-server-ver15&pivots=cs1-bash
* feat(profiler): Add citations to Profiler.profile()
* docs(profiler): Update citation msg and docstring
* test(alice): Update alice fixture to include citations
* test(bobby): Update bobby fixture to include citations
…ctations#2963)

* Adding in GDOC-127

Moving over files from previous non checked in work

* Updating the PR comments

Updating for PR comments

* Remove an obsolete info note

Remove an obsolete info note

* Addin stubs and sidebar change

Addin stubs and sidebar change
* [DOCS] How to connect to a MySQL database
* fixed broken link in next steps
…#2933)

* Initial docs

* Update documentation for rule based profiler

* Linting

* Updated tests in script runner

* Updated to reflect develop

* Updated per feedback

* Addressed Alex's feedback

* Some updates to structure and text

* Updated core concepts

* New tutorial and core-concepts format for docs

* Incorporated feedback from PRs

* Fix broken link

* More text updates

* Fixed typo

* Fixed integration test
* [FEATURE] NumericMetricRangeMultiBatchParameterBuilder -- First MultiBatch Parameter Builder (great-expectations#2860)
* Bobby use case: Expectation Suite for number of rows in active batch to be similar to that in previous batches.
* Validator will no longer be passed to Rule Based Profiler.
* Fixtures for the entire active table multi-batch use-case.
* Elegant refactoring of metric computations.
* Bootstrapped Estimator and Range Parameter Builder.
* Robustness and Efficiency.
* Update requirements for scipy (need new bootstrap API).
* DocStrings.
* Feature Maturity clause.
* ChangeLog update.
…hout additional requirements* files (great-expectations#2982)


* Fix requirements.

* get documentation integration tests to pass
…xl for pandas >= 1.3.0 (great-expectations#2989)

- [BUGFIX] Modify read_excel() to handle new optional-dependency openpyxl for pandas >= 1.3.0
* Updated diagram image

* Updated profiler docs
…rsion 1.3.0 (great-expectations#2995)

* Update old CLI tests in response to Pandas upgrade to version 1.3.0 on 7/2/2021.
…tations#2996)

* Remove mostly from bobster test fixture and minor cleanup.
petermoyer and others added 18 commits July 13, 2021 11:07
…t-expectations#2972)

* Add in Contributing fields and updates

Solves GDOC-140m GDOC-129, GDOC-123, GDOC-122

* Updating per comments

Fixing some links and style

* Update

Updates

* Fix the root and the glossary link

Fix the root and the glossary link

* Fix typo and heading

Fix typo and heading

* Update contributing-github.md

remove a line

Co-authored-by: James Campbell <james.p.campbell@gmail.com>
)

The documentation page for instantiating a Data Context without a yaml file was missing an import, without which the code snippet wouldn't work. Adding the missing import in this change. 

This is the page concerned: https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_contexts/how_to_instantiate_a_data_context_without_a_yml_file.html

Co-authored-by: talagluck <tal@superconductive.com>
Instrument test_yaml_config() and update Anonymizers
…to_contain_set (great-expectations#2997)

* dates for expect_column_distinct_values_to_contain_set

* support date in panda schema

* add date to internal schema

* add missing support for DATE types in SQL backends

* lint util.py

* fix typos
…or read_parquet (great-expectations#2506)

* [BUGFIX] Fix issue where compression key was added to reader_method for read_parquet

Signed-off-by: James Campbell <james.p.campbell@gmail.com>

* Update tests/execution_engine/test_pandas_execution_engine.py

Co-authored-by: Luke Dyer <luke321321@gmail.com>

* Update tests for "None" when param should be inferred

Signed-off-by: James Campbell <james.p.campbell@gmail.com>

* Update test path->bytes

Signed-off-by: James Campbell <james.p.campbell@gmail.com>

Co-authored-by: Sam Bail <sam@superconductive.com>
Co-authored-by: Luke Dyer <luke321321@gmail.com>
* [BUGFIX] fix check for null value

* use-isinstance

* fix

* precommit

* syntax tweak

Co-authored-by: James Campbell <james.p.campbell@gmail.com>

Co-authored-by: James Campbell <james.p.campbell@gmail.com>
* update header to match io

* adjust cloud link

* mobile header update

* adjust community desktop modal width and position

* adjust padding of li items in mobile menu and width

Co-authored-by: Abe Gong <abegong@users.noreply.github.com>
…erBuilder (great-expectations#3001)

* Small fix for batch_filter (the tests already cover it).

* Estimator of mostly using the precision measure.

* Estimator of mostly using the precision measure.

* Estimator of mostly using the precision measure.

* docstring

* Bootstrap estimator for NumericMetricRangeMultiBatchParameterBuilder needs to include both mean and std statistics.

* merge

* Use np.quantile for bootstrap statistic.

* typo

* optimization

* pandas upgrade

* Implement custom bootstrap and test.

* cleanup

* use quantiles and false_positive_rate uniformly

* clean up

* only recent numpy versions allowed for rule_based_profiler tests

* Improve code readability

Signed-off-by: James Campbell <james.p.campbell@gmail.com>

Co-authored-by: James Campbell <james.p.campbell@gmail.com>
* Adding in url links and style

Adding in url links and style

* Fix broken link

Fix a broken link

* Update docs/contributing/contributing-test.md

Co-authored-by: William Shin <will@superconductive.com>

* Update docs/contributing/contributing-setup.md

Co-authored-by: William Shin <will@superconductive.com>

* Update docs/contributing/contributing.md

Co-authored-by: William Shin <will@superconductive.com>

* Update docs/contributing/contributing-checklist.md

Co-authored-by: William Shin <will@superconductive.com>

* Update correct links

Co-authored-by: William Shin <will@superconductive.com>
* validating data without a checkpoint

* corrections for links and standardization of TODOs
… from RTD to Docusaurus (great-expectations#3014)

* docs: Port .rst guide to .md
* fix: Update broken link
…TD to Docusaurus (great-expectations#3011)

* docs: Port over guide from RTD to Docusaurus
@pasmavie
Copy link
Contributor Author

@cla-bot check

@pasmavie
Copy link
Contributor Author

Hi @talagluck, thanka for taking a look:

  • I've amended the commits so that the username is gipaetusb for all. I hope it's good now.
  • Regarding the failing unit tests, I don't have time to take a look now but I can do in the future :)
  • I'll also add a unit test for this feat

@talagluck
Copy link
Contributor

Hi @gipaetusb - thanks for the follow-up! It looks like the CLA verification is still failing, and I still see the user gio when I look at the commit history. Any ideas?

Sounds good about the tests - I'll check in in in a week or so to see how things are going. Thanks again!

@talagluck
Copy link
Contributor

Hi @gipaetusb ! I wanted to check in here. It looks like we are still having CLA issues. I also wanted to check in to see if you had time to fix the failing tests, or if you needed some help?

@pasmavie
Copy link
Contributor Author

Hi @talagluck thank you for the follow up.
I'm pretty busy lately so I didn't try to fix the CLA nor the unit test. Dedicated 30min this morning but amending the old commit and rebasing this branch with all the changes that happened after has become too cumbersome.

Much easier to open a new, cleaner PR right? Here you go: #3126

If someone can help with the unit tests, that would be great :) !

@talagluck
Copy link
Contributor

Thanks for the follow-up, @gipaetusb ! Someone from our team will jump in next week to review and work on the unit tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.