
Add force_reuse_spark_context to SparkDFExecutionEngine #3330

Closed
fep2 opened this issue Aug 31, 2021 · 3 comments
Labels
community devrel This item is being addressed by the Developer Relations Team

Comments

@fep2
Contributor

fep2 commented Aug 31, 2021

Describe the bug
Currently we are not able to reuse a Spark session when using a runtime data source.

To Reproduce
I've looked into this fix in my current setup. Unfortunately, if I change my config from Datasource to SparkDFDatasource, I get the following error:

    return datasource.get_batch_list_from_batch_request(
    AttributeError: 'SparkDFDatasource' object has no attribute 'get_batch_list_from_batch_request'

Next, if I change it back to Datasource with the fix, I get the following error:

    datasource: Datasource = cast(Datasource, self.datasources[datasource_name])
    KeyError: 'my_spark_datasource'

This is due to the fact that the Datasource class, when instantiated, doesn't know what to do with the force_reuse_spark_context flag. The resulting error gets hidden (this needs to be fixed), so my_spark_datasource is never registered, which is why the lookup throws this KeyError exception.
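The failure mode can be sketched in a few lines (a hedged sketch with illustrative names, not the actual great_expectations internals): the unexpected keyword makes construction fail, the exception is swallowed, and the later registry lookup is what surfaces as the KeyError.

```python
# Hedged sketch of the failure mode described above; names are
# illustrative, not the actual great_expectations internals.
datasources = {}


def add_datasource(name, config):
    try:
        if "force_reuse_spark_context" in config:
            # The Datasource class doesn't accept the flag...
            raise TypeError(
                "__init__() got an unexpected keyword argument "
                "'force_reuse_spark_context'"
            )
        datasources[name] = config
    except TypeError:
        pass  # ...and the error is swallowed, so nothing is registered


add_datasource("my_spark_datasource", {"force_reuse_spark_context": True})
# A later lookup such as datasources["my_spark_datasource"] now raises
# KeyError: 'my_spark_datasource'
```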

Here is a reference of what my data_source config looks like:

    {
        "my_spark_datasource": {
            "class_name": "Datasource",
            "force_reuse_spark_context": True,
            "execution_engine": {
                "class_name": "SparkDFExecutionEngine"
            },
            "data_connectors": {
                "my_runtime_data_connector": {
                    "module_name": "great_expectations.datasource.data_connector",
                    "class_name": "RuntimeDataConnector",
                    "batch_identifiers": [
                        "some_key"
                    ]
                }
            }
        }
    }

In this case I want a runtime batch, following the directions laid out here: https://discuss.greatexpectations.io/t/how-to-validate-spark-dataframes-in-0-13/582

I think the solution is to not only pass force_reuse_spark_context to SparkDFDatasource but also to SparkDFExecutionEngine. I was able to get a working solution by adding the following to ExecutionEngineConfigSchema:

from marshmallow import INCLUDE, Schema, fields


class ExecutionEngineConfigSchema(Schema):
    class Meta:
        unknown = INCLUDE

    class_name = fields.String(required=True)
    module_name = fields.String(missing="great_expectations.execution_engine")
    connection_string = fields.String(required=False, allow_none=True)
    credentials = fields.Raw(required=False, allow_none=True)
    spark_config = fields.Raw(required=False, allow_none=True)
    boto3_options = fields.Dict(
        keys=fields.Str(), values=fields.Str(), required=False, allow_none=True
    )
    caching = fields.Boolean(required=False, allow_none=True)
    batch_spec_defaults = fields.Dict(required=False, allow_none=True)
    force_reuse_spark_context = fields.Bool(required=False, missing=False)
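The effect of the added field can be sketched without marshmallow (a simplified stand-in for `ExecutionEngineConfigSchema().load`, assuming unknown keys pass through per `Meta.unknown = INCLUDE` and `missing=False` supplies the default):

```python
# Simplified stand-in for ExecutionEngineConfigSchema().load(...).
# Real validation is done by marshmallow; this only mimics the two
# behaviors relevant here: the missing= defaults and unknown-key passthrough.
def load_execution_engine_config(raw: dict) -> dict:
    if "class_name" not in raw:
        raise ValueError("class_name is required")
    config = dict(raw)  # Meta.unknown = INCLUDE: unknown keys are kept
    # missing="great_expectations.execution_engine" default for module_name
    config.setdefault("module_name", "great_expectations.execution_engine")
    # missing=False: the flag is always present after loading
    config.setdefault("force_reuse_spark_context", False)
    return config


cfg = load_execution_engine_config(
    {"class_name": "SparkDFExecutionEngine", "force_reuse_spark_context": True}
)
# cfg["force_reuse_spark_context"] is True; if the key is omitted from the
# raw config, it defaults to False instead of being rejected.
```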

With that change, this is what my data_source config looks like:

{
        "my_spark_datasource": {
            "class_name": "Datasource",
            "execution_engine": {
                "class_name": "SparkDFExecutionEngine",
                "force_reuse_spark_context": True,
            },
            "data_connectors": {
                "my_runtime_data_connector": {
                    "module_name": "great_expectations.datasource.data_connector",
                    "class_name": "RuntimeDataConnector",
                    "batch_identifiers": [
                        "some_key"
                    ]
                }
            }
        }
    }

Expected behavior
Simply the ability to use SparkDFExecutionEngine with the Spark session reuse flag (force_reuse_spark_context).

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: 0.13.23

Additional context
I've pretty much laid out what needs to be fixed above; unfortunately I have prior commitments and can't do the work myself.
I also think the solution proposed in #3126 will help as well.

@talagluck talagluck added the devrel This item is being addressed by the Developer Relations Team label Sep 1, 2021
@talagluck
Contributor

Hi @fep2 - thanks for raising this issue! We'll take a look and figure out a timeline for when we can get this done, and be in touch.

@fep2
Contributor Author

fep2 commented Oct 14, 2021

PR #3541 addresses this issue.

@talagluck
Contributor

This is resolved by #3541 - thanks for raising and solving this issue, @fep2!
