fix(ingestion/glue): Add support for missing config options for profiling in Glue #10858

Contributor

@sagar-salvi-apptware sagar-salvi-apptware commented Jul 5, 2024

Summary:

  1. Glue Source Profiling Configuration:

    • Configuration parameters include:
      • enabled: Flag to enable or disable profiling (default: false).
      • profile_table_level_only: Flag to enable profiling at the table level only, excluding column-level profiling (default: false).
      • max_workers: Number of worker threads to use for profiling, defaulting to 5 times the CPU count.
  2. Test Cases:

    • Added new test cases for Glue source to verify the new profiling configuration.
    • Ensured that existing Glue source test cases are compatible with the new configuration.
  3. Documentation Update:

    • Updated the updating-datahub.md file to document the breaking change related to the profiling configuration for Glue source under the "Breaking Changes" section.
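The three options listed under "Glue Source Profiling Configuration" can be pictured with a small stand-in model. This is only a sketch under stated assumptions: `ProfilingOptionsSketch` and `default_max_workers` are hypothetical names, the class is a plain dataclass rather than the actual `GlueProfilingConfig`, and the fallback to one core when the CPU count is unknown is an assumption, not something the PR states.

```python
import os
from dataclasses import dataclass, field


def default_max_workers() -> int:
    # "5 times the CPU count" per the PR summary; os.cpu_count() may return
    # None, so this sketch assumes a fallback of 1 core.
    return 5 * (os.cpu_count() or 1)


@dataclass
class ProfilingOptionsSketch:
    """Hypothetical stand-in for the Glue profiling options (not the DataHub class)."""

    enabled: bool = False                    # profiling is off unless requested
    profile_table_level_only: bool = False   # True skips column-level profiling
    max_workers: int = field(default_factory=default_max_workers)


# Enable profiling but restrict it to table-level statistics.
opts = ProfilingOptionsSketch(enabled=True, profile_table_level_only=True)
```

Because `enabled` defaults to false in this sketch, a recipe that relies on profiling would have to switch it on explicitly, which is presumably why the PR lists the change under "Breaking Changes".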

QA:

  • Validated the change locally to ensure it's working as expected.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Summary by CodeRabbit

  • New Features

    • Introduced profiling support for AWS Glue data ingestion, including options to enable profiling and restrict it to table-level only.
  • Improvements

    • Simplified profiling logic for enhanced readability and efficiency.
  • Tests

    • Added comprehensive test cases to validate profiling functionalities within the AWS Glue data ingestion module.

Contributor

coderabbitai bot commented Jul 5, 2024

Walkthrough

The recent changes significantly enhance the AWS Glue source for metadata ingestion by introducing comprehensive support for data profiling. Key updates include refined configuration options for profiling, enabling management at both table and partition levels, and improved handling through streamlined methods. Additionally, new test cases and profiling-related data stubs have been created to validate these enhancements effectively, ensuring robust performance and functionality.

Changes

| Files/Modules | Change Summary |
| --- | --- |
| metadata-ingestion/src/datahub/ingestion/source/aws/glue.py, metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py | Updated profiling handling in GlueSourceConfig and introduced new boolean fields enabled and profile_table_level_only in GlueProfilingConfig. Improved methods for conditional profiling. |
| metadata-ingestion/tests/unit/test_glue_source.py, metadata-ingestion/tests/unit/test_glue_source_stubs.py | Added import for GlueProfilingConfig, new functions and mock definitions for profiling configurations and related data. Enhanced test cases to validate profiling functionality. |
| metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json | Introduced a JSON file detailing metadata related to profiling data, including dataset attributes and profiling metrics. |

Poem

In fields of data, Glue now spies,
With threading paths and profiling wise.
Tables and partitions, all in line,
Data’s story, clear and fine.
With tests in tow, it’s robust, no fears,
Hop along, in profiling cheers! 🐇✨


@github-actions github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) Jul 5, 2024
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ebe7b2d and 3c389ea.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
Additional comments not posted (8)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (3)

23-26: Add tests for the new max_workers field.

The max_workers field is correctly defined with a default value and description. Ensure that tests are added to cover this new configuration option.

Verification successful

Tests for the max_workers field are present.

The following test files contain references to the max_workers field, confirming that it is covered by tests:

  • metadata-ingestion/tests/integration/trino/test_trino.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake.py
  • metadata-ingestion/tests/unit/utilities/test_partition_executor.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake_classification.py
  • metadata-ingestion/tests/unit/utilities/test_backpressure_aware_executor.py
  • metadata-ingestion/tests/integration/bigquery_v2/test_bigquery.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that tests for the `max_workers` field are present.
# Test: Search for the usage of the `max_workers` field in test files. Expect: Test cases covering the `max_workers` field.
rg --type python --word-regexp --glob '*test*.py' $'max_workers'

Length of output: 94


Script:

#!/bin/bash
# Description: Verify that tests for the `max_workers` field are present.
# Test: Search for the usage of the `max_workers` field in test files. Expect: Test cases covering the `max_workers` field.
rg --type py --word-regexp --glob '*test*.py' max_workers

Length of output: 1334
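For readers without ripgrep, the second (working) search can be approximated in Python. This is a rough sketch, not what the bot ran: `find_test_files_mentioning` is a hypothetical helper, and unlike `rg` it does not honour `.gitignore`. Note also that a hit only shows the name appears in a test file; it does not show that the Glue field itself is exercised.

```python
import re
from pathlib import Path
from typing import List


def find_test_files_mentioning(root: Path, word: str) -> List[Path]:
    """Return files matching *test*.py under root that contain `word` as a whole word."""
    # \b on both sides mirrors ripgrep's --word-regexp flag.
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    hits = []
    for path in sorted(root.rglob("*test*.py")):
        if path.is_file() and pattern.search(path.read_text(errors="ignore")):
            hits.append(path)
    return hits


# Example call (path is illustrative):
# find_test_files_mentioning(Path("metadata-ingestion"), "max_workers")
```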


15-18: Add tests for the new profile_table_level_only field.

The profile_table_level_only field is correctly defined with a default value and description. Ensure that tests are added to cover this new configuration option.

Verification successful

Tests for the new profile_table_level_only field are present.

The profile_table_level_only field is covered in the following test files:

  • smoke-test/tests/privileges/test_privileges.py
  • metadata-ingestion/tests/unit/test_ge_profiling_config.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that tests for the `profile_table_level_only` field are present.
# Test: Search for the usage of the `profile_table_level_only` field in test files. Expect: Test cases covering the `profile_table_level_only` field.
rg --type python --word-regexp --glob '*test*.py' $'profile_table_level_only'

Length of output: 107


Script:

#!/bin/bash
# Description: Verify that tests for the `profile_table_level_only` field are present.
# Test: Search for the usage of the `profile_table_level_only` field in test files. Expect: Test cases covering the `profile_table_level_only` field.
rg --word-regexp --glob '*test*.py' 'profile_table_level_only'

Length of output: 943


11-14: Add tests for the new enabled field.

The enabled field is correctly defined with a default value and description. Ensure that tests are added to cover this new configuration option.

Verification successful

Tests for the new enabled field are present.

The enabled field is covered in multiple test files, ensuring that this new configuration option is adequately tested.

  • smoke-test/tests/test_stateful_ingestion.py
  • smoke-test/test_e2e.py
  • smoke-test/tests/managed_ingestion/managed_ingestion_test.py
  • smoke-test/tests/privileges/test_privileges.py
  • metadata-ingestion/tests/integration/superset/test_superset.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake_stateful.py
  • metadata-ingestion/tests/integration/unity/test_unity_catalog_ingest.py
  • metadata-ingestion/tests/integration/salesforce/test_salesforce.py
  • metadata-ingestion/tests/integration/tableau/test_tableau_ingest.py
  • metadata-ingestion/tests/integration/s3/test_s3.py
  • metadata-ingestion/tests/integration/trino/test_trino.py
  • metadata-ingestion/tests/integration/okta/test_okta.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake_classification.py
  • metadata-ingestion/tests/integration/powerbi/test_profiling.py
  • metadata-ingestion/tests/integration/powerbi/test_stateful_ingestion.py
  • metadata-ingestion/tests/integration/ldap/test_ldap_stateful.py
  • metadata-ingestion/tests/integration/qlik_sense/test_qlik_sense.py
  • metadata-ingestion/tests/unit/test_unity_catalog_config.py
  • metadata-ingestion/tests/integration/metabase/test_metabase.py
  • metadata-ingestion/tests/integration/kafka/test_kafka_state.py
  • metadata-ingestion/tests/integration/lookml/test_lookml.py
  • metadata-ingestion/tests/integration/looker/test_looker.py
  • metadata-ingestion/tests/integration/kafka-connect/test_kafka_connect.py
  • metadata-ingestion/tests/unit/test_glue_source.py
  • metadata-ingestion/tests/integration/iceberg/test_iceberg.py
  • metadata-ingestion/tests/unit/test_redash_source.py
  • metadata-ingestion/tests/unit/test_ge_profiling_config.py
  • metadata-ingestion/tests/unit/test_classification.py
  • metadata-ingestion/tests/unit/test_bigqueryv2_usage_source.py
  • metadata-ingestion/tests/unit/test_datahub_source.py
  • metadata-ingestion/tests/integration/dynamodb/test_dynamodb.py
  • metadata-ingestion/tests/integration/dbt/test_dbt.py
  • metadata-ingestion/tests/unit/stateful_ingestion/state/test_stateful_ingestion.py
  • metadata-ingestion/tests/unit/stateful_ingestion/state/test_redundant_run_skip_handler.py
  • metadata-ingestion/tests/unit/test_elasticsearch_source.py
  • metadata-ingestion/tests/unit/stateful_ingestion/test_configs.py
  • metadata-ingestion/tests/unit/stateful_ingestion/provider/test_provider.py
  • metadata-ingestion/tests/integration/azure_ad/test_azure_ad.py
  • metadata-ingestion/tests/integration/bigquery_v2/test_bigquery.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that tests for the `enabled` field are present.
# Test: Search for the usage of the `enabled` field in test files. Expect: Test cases covering the `enabled` field.
rg --type python --word-regexp --glob '*test*.py' $'enabled'

Length of output: 90


Script:

#!/bin/bash
# Description: Verify that tests for the `enabled` field are present.
# Test: Search for the usage of the `enabled` field in test files. Expect: Test cases covering the `enabled` field.
rg --type py --word-regexp --glob '*test*.py' enabled

Length of output: 10014

metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (5)

162-165: Add tests for the updated profiling field.

The profiling field is updated to use a default factory. Ensure that tests are added to cover this change.

Verification successful

Tests for the profiling field are already present.

The search results indicate that multiple test files reference and test the profiling field. No additional tests are necessary.

  • smoke-test/tests/privileges/test_privileges.py
  • smoke-test/tests/managed_ingestion/managed_ingestion_test.py
  • metadata-ingestion/tests/unit/test_elasticsearch_source.py
  • metadata-ingestion/tests/unit/test_iceberg.py
  • metadata-ingestion/tests/unit/test_unity_catalog_config.py
  • metadata-ingestion/src/datahub/ingestion/source/unity/connection_test.py
  • metadata-ingestion/tests/integration/salesforce/test_salesforce.py
  • metadata-ingestion/tests/integration/s3/test_s3.py
  • metadata-ingestion/tests/integration/powerbi/test_profiling.py
  • metadata-ingestion/tests/integration/snowflake/test_snowflake.py
  • metadata-ingestion/tests/integration/unity/test_unity_catalog_ingest.py
  • metadata-ingestion/tests/integration/trino/test_trino.py
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that tests for the `profiling` field are present.
# Test: Search for the usage of the `profiling` field in test files. Expect: Test cases covering the `profiling` field.
rg --type python --word-regexp --glob '*test*.py' $'profiling'

Length of output: 92


Script:

#!/bin/bash
# Description: Verify that tests for the `profiling` field are present.
# Test: Search for the usage of the `profiling` field in test files. Expect: Test cases covering the `profiling` field.
rg --type py --word-regexp --glob '*test*.py' profiling

Length of output: 4010


176-179: Add tests for the new is_profiling_enabled method.

The is_profiling_enabled method is correctly defined to check if profiling is enabled. Ensure that tests are added to cover this new method.


5-5: Remove unused imports if not needed.

Ensure that the newly added imports for ThreadPoolExecutor and as_completed are used in the code. If not, remove the unused imports.


Line range hint 840-884: Add tests for the refactored get_profile_if_enabled method.

The get_profile_if_enabled method has been refactored to use a ThreadPoolExecutor for processing partitions concurrently and added error handling. Ensure that tests are added to cover these changes.
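The pattern described here (submit one task per partition, collect results as they finish, and record failures instead of aborting) might look roughly like the sketch below. It is not the code from glue.py: `profile_partitions` and `fake_profile` are hypothetical names, and the per-partition error dictionary is an assumed way of surfacing failures.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable, Dict, List, Tuple


def profile_partitions(
    partitions: List[str],
    profile_one: Callable[[str], Dict[str, int]],
    max_workers: int = 4,
) -> Tuple[Dict[str, Dict[str, int]], Dict[str, str]]:
    """Profile partitions concurrently; one failing partition does not stop the rest."""
    results: Dict[str, Dict[str, int]] = {}
    errors: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its partition so results can be attributed.
        future_to_partition = {executor.submit(profile_one, p): p for p in partitions}
        for future in as_completed(future_to_partition):
            partition = future_to_partition[future]
            try:
                results[partition] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[partition] = str(exc)
    return results, errors


def fake_profile(partition: str) -> Dict[str, int]:
    # Stand-in for a real per-partition profiling call.
    if partition == "year=2021":
        raise ValueError("missing statistics")
    return {"row_count": len(partition)}


results, errors = profile_partitions(["year=2020", "year=2021", "year=2022"], fake_profile)
```

A test for the refactored method could follow the same shape: feed in one partition that raises and assert that the remaining partitions still produce profiles.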


788-820: Add tests for the new profiling logic in _create_profile_mcp.

The _create_profile_mcp method has been updated to include logic for handling profiling settings. Ensure that tests are added to cover these changes.
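One way to picture the table-level-only switch is a builder that always emits table statistics and adds per-column entries only when the flag is off. The sketch below is an illustration under assumptions: `build_profile` and the dictionary keys are hypothetical and do not reproduce the structure `_create_profile_mcp` actually emits.

```python
from typing import Any, Dict


def build_profile(
    table_stats: Dict[str, Any],
    column_stats: Dict[str, Dict[str, Any]],
    profile_table_level_only: bool,
) -> Dict[str, Any]:
    """Assemble a profile dict; skip column-level entries when table-level only."""
    profile: Dict[str, Any] = {
        "rowCount": table_stats.get("row_count"),
        "columnCount": table_stats.get("column_count"),
    }
    if profile_table_level_only:
        return profile  # no per-column statistics
    profile["fieldProfiles"] = [
        {"fieldPath": name, **stats} for name, stats in column_stats.items()
    ]
    return profile


table_only = build_profile({"row_count": 10, "column_count": 2}, {"yr": {"nullCount": 0}}, True)
full = build_profile({"row_count": 10, "column_count": 2}, {"yr": {"nullCount": 0}}, False)
```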

@sagar-salvi-apptware sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from 3c389ea to f2cff22 Compare July 8, 2024 05:37
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Outside diff range and nitpick comments (1)
metadata-ingestion/tests/unit/test_glue_source.py (1)

90-123: Add docstring to the glue_source_with_profiling function.

To improve readability and maintainability, add a docstring describing the purpose and usage of the function.

def glue_source_with_profiling(
    platform_instance: Optional[str] = None,
    use_s3_bucket_tags: bool = False,
    use_s3_object_tags: bool = False,
    extract_delta_schema_from_parameters: bool = False,
) -> GlueSource:
    """
    Returns a GlueSource object configured with data profiling enabled (table- and column-level).

    Args:
        platform_instance (Optional[str]): The platform instance.
        use_s3_bucket_tags (bool): Whether to use S3 bucket tags.
        use_s3_object_tags (bool): Whether to use S3 object tags.
        extract_delta_schema_from_parameters (bool): Whether to extract delta schema from parameters.

    Returns:
        GlueSource: Configured GlueSource object.
    """
    profiling_config = GlueProfilingConfig(
        enabled=True,
        profile_table_level_only=False,
        row_count="row_count",
        column_count="column_count",
        unique_count="unique_count",
        unique_proportion="unique_proportion",
        null_count="null_count",
        null_proportion="null_proportion",
        min="min",
        max="max",
        mean="mean",
        median="median",
        stdev="stdev",
    )

    return GlueSource(
        ctx=PipelineContext(run_id="glue-source-test"),
        config=GlueSourceConfig(
            aws_region="us-west-2",
            extract_transforms=False,
            platform_instance=platform_instance,
            use_s3_bucket_tags=use_s3_bucket_tags,
            use_s3_object_tags=use_s3_object_tags,
            extract_delta_schema_from_parameters=extract_delta_schema_from_parameters,
            profiling=profiling_config,
        ),
    )
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 3c389ea and f2cff22.

Files selected for processing (5)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files not summarized due to errors (1)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden.json: Error: Message exceeds token limit
Files skipped from review as they are similar to previous changes (2)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
Additional context used
Biome
metadata-ingestion/tests/unit/glue/glue_mces_golden.json

[error] 17-17: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 20-20: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 22-22: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 29-29: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 36-36: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 43-43: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 57-57: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)


[error] 69-69: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)


[error] 73-73: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 77-77: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 88-88: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 108-108: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)


[error] 109-109: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 121-121: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)


[error] 122-122: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 134-134: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)


[error] 135-135: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 147-147: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)


[error] 148-148: Expected a property but instead found '}'.

Expected a property here.

(parse)


[error] 160-160: JSON standard does not allow single quoted strings

Use double quotes to escape the string.

(parse)

Additional comments not posted (5)
metadata-ingestion/tests/unit/test_glue_source.py (1)

473-518: Ensure proper cleanup in test_glue_ingest_with_profiling.

Add cleanup code to ensure that resources are properly released after the test execution.

@freeze_time(FROZEN_TIME)
def test_glue_ingest_with_profiling(
    tmp_path: Path,
    pytestconfig: PytestConfig,
    platform_instance: str,
    mce_file: str,
    mce_golden_file: str,
) -> None:
    glue_source_instance = glue_source_with_profiling(
        platform_instance=platform_instance
    )

    with Stubber(glue_source_instance.glue_client) as glue_stubber:
        glue_stubber.add_response("get_databases", get_databases_response_profiling, {})

        glue_stubber.add_response(
            "get_tables",
            get_tables_response_profiling_1,
            {"DatabaseName": "flights-database-profiling"},
        )

        glue_stubber.add_response(
            "get_table",
            {"Table": tables_profiling_1[0]},
            {"DatabaseName": "flights-database-profiling", "Name": "avro-profiling"},
        )

        mce_objects = [wu.metadata for wu in glue_source_instance.get_workunits()]

        glue_stubber.assert_no_pending_responses()

        write_metadata_file(tmp_path / mce_file, mce_objects)

    # Verify the output.
    test_resources_dir = pytestconfig.rootpath / "tests/unit/glue"
    mce_helpers.check_golden_file(
        pytestconfig,
        output_path=tmp_path / mce_file,
        golden_path=test_resources_dir / mce_golden_file,
    )
metadata-ingestion/tests/unit/test_glue_source_stubs.py (4)

883-902: LGTM!

The get_databases_response_profiling dictionary is correctly structured and consistent with existing database responses. Profiling-related parameters are appropriately included.


903-986: LGTM!

The tables_profiling_1 list and get_tables_response_profiling_1 dictionary are correctly structured and consistent with existing table responses. Profiling-related parameters for table columns are appropriately included.


Line range hint 987-1000: LGTM!

The mock_get_object_response function is correctly implemented to mock S3 client responses. It encodes the provided raw body and creates a StreamingBody object.


Line range hint 1001-1018: LGTM!

The get_object_response_1, get_object_response_2, get_bucket_tagging, and get_object_tagging functions are correctly implemented to return mock S3 responses. They utilize the mock_get_object_response helper function and provide appropriate content for testing.

@sagar-salvi-apptware sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from f2cff22 to bb3659e Compare July 8, 2024 10:38
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1)

19-26: Field max_workers is correctly added but consider updating the description.

The max_workers field specifies the number of worker threads for profiling. The default value and description are appropriate. However, the description could be clearer by mentioning that the default value is based on the number of CPU cores.

- description="Number of worker threads to use for profiling. Set to 1 to disable."
+ description="Number of worker threads to use for profiling. Default is 5 times the number of CPU cores. Set to 1 to disable."
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between f2cff22 and bb3659e.

Files selected for processing (5)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (3)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/tests/unit/test_glue_source.py
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py
Additional comments not posted (15)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2)

11-14: Field enabled is correctly added.

The enabled field allows toggling profiling on or off. The default value and description are appropriate.


15-18: Field profile_table_level_only is correctly added.

The profile_table_level_only field allows limiting profiling to table-level. The default value and description are appropriate.

metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (13)

1-21: Container properties are correctly added.

The container properties include custom properties, name, and qualified name. All fields are appropriately set.


22-33: Status aspect is correctly added.

The status aspect indicates that the container is not removed. The field is appropriately set.


34-44: Data platform instance is correctly added.

The data platform instance specifies that the platform is Glue. The field is appropriately set.


45-57: SubTypes aspect is correctly added.

The subTypes aspect indicates that the container is a Database. The field is appropriately set.


58-91: Dataset properties are correctly added.

The dataset properties include various custom properties, such as schema versions, average record size, classification, compression type, object count, record count, size, data type, location, input and output formats, compression status, number of buckets, and serde info. All fields are appropriately set.


92-95: Dataset name and qualified name are correctly added.

The dataset name and qualified name fields are appropriately set.


96-210: Schema metadata is correctly added.

The schema metadata includes schema name, platform, version, creation and modification times, hash, platform schema, and field details. All fields are appropriately set.


211-215: Data platform instance is correctly added.

The data platform instance specifies that the platform is Glue. The field is appropriately set.


216-230: Ownership aspect is correctly added.

The ownership aspect includes owner details and last modification time. All fields are appropriately set.


231-233: Snapshot aspect is correctly added.

The snapshot aspect is correctly formatted.


234-247: SubTypes aspect is correctly added.

The subTypes aspect indicates that the dataset is a Table. The field is appropriately set.


248-258: Container aspect is correctly added.

The container aspect specifies the container URN. The field is appropriately set.


259-287: Dataset profile is correctly added.

The dataset profile includes timestamp, partition specification, and field profiles. All fields are appropriately set.

@sagar-salvi-apptware sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from 91b113a to 467b489 Compare July 8, 2024 14:44
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between bb3659e and 467b489.

Files selected for processing (5)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json
  • metadata-ingestion/tests/unit/test_glue_source.py
Additional comments not posted (3)
metadata-ingestion/tests/unit/test_glue_source_stubs.py (3)

883-901: LGTM! The get_databases_response_profiling structure is consistent with existing database response structures.

The added data structure aligns well with the expected schema and usage.


903-986: LGTM! The tables_profiling_1 structure is consistent with existing table response structures.

The added data structure aligns well with the expected schema and usage.


986-987: LGTM! The get_tables_response_profiling_1 structure is consistent with existing table list response structures.

The added data structure aligns well with the expected schema and usage.

@sagar-salvi-apptware sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch 2 times, most recently from d4e767c to a13f2ad Compare July 17, 2024 12:58
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 467b489 and a13f2ad.

Files selected for processing (6)
  • docs/how/updating-datahub.md (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Additional comments not posted (12)
metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1)

11-18: New configuration options added to GlueProfilingConfig.

The fields enabled and profile_table_level_only have been added with appropriate default values and descriptions. This is a positive change as it enhances configurability and provides clear documentation for each option.

metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (6)

5-5: Addition of ThreadPoolExecutor and as_completed imports

The imports for ThreadPoolExecutor and as_completed have been added to support the new multi-threading features for profiling. This change is consistent with the PR objectives and summary.


162-163: Change in default value handling for profiling configuration

The profiling field in GlueSourceConfig now uses default_factory instead of a direct assignment. This is a Pythonic way to ensure that mutable default values are handled correctly, preventing potential bugs where the default value is shared across instances.
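
To illustrate the point about mutable defaults, here is a minimal sketch. It uses the standard-library `dataclasses` module rather than the actual config classes from this PR, and `ProfilingConfigSketch` and `SourceConfigSketch` are hypothetical stand-ins for `GlueProfilingConfig` and `GlueSourceConfig`. The default values follow the PR summary.

```python
from dataclasses import dataclass, field


@dataclass
class ProfilingConfigSketch:
    # Hypothetical stand-in for GlueProfilingConfig; defaults follow the PR summary.
    enabled: bool = False
    profile_table_level_only: bool = False


@dataclass
class SourceConfigSketch:
    # default_factory builds a fresh ProfilingConfigSketch for every instance,
    # so mutating one source's profiling config cannot leak into another.
    profiling: ProfilingConfigSketch = field(default_factory=ProfilingConfigSketch)


a = SourceConfigSketch()
b = SourceConfigSketch()
a.profiling.enabled = True  # only affects `a`
```

After this runs, `b.profiling.enabled` is still `False`, because `a` and `b` hold separate profiling objects.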


176-176: Modification in is_profiling_enabled method

The method now checks if profiling is enabled based on the new profiling configuration. This change aligns with the added profiling capabilities and ensures that profiling is only performed when configured.


788-820: Enhanced profiling logic in _create_profile_mcp

This section has been updated to include new profiling metrics such as unique count, unique proportion, null count, null proportion, min, max, mean, median, and standard deviation. These changes enhance the profiling capabilities of the system and align with the PR's objectives to improve profiling features.


Line range hint 840-882: Refactoring of get_profile_if_enabled to use ThreadPoolExecutor

The method has been refactored to use ThreadPoolExecutor for handling partition profiling in a multi-threaded manner. This optimization is crucial for performance improvement when dealing with large datasets and aligns with the PR's goal to optimize the profiling process.
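
The pattern described here can be sketched roughly as follows. This is not the code from `glue.py`: `profile_partition` and `profile_partitions` are hypothetical names, the per-partition work is a placeholder, and the worker count simply mirrors the "5 times the CPU count" default mentioned in the PR summary.

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Any, Dict, List


def profile_partition(partition: str) -> Dict[str, Any]:
    # Hypothetical placeholder for the real per-partition profiling call.
    return {"partition": partition, "name_length": len(partition)}


def profile_partitions(partitions: List[str], max_workers: int) -> List[Dict[str, Any]]:
    results: List[Dict[str, Any]] = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit one task per partition and remember which future maps to which partition.
        futures = {executor.submit(profile_partition, p): p for p in partitions}
        # as_completed yields futures as they finish, not in submission order.
        for future in as_completed(futures):
            results.append(future.result())
    return results


# Assumed default, taken from the PR summary: 5 times the CPU count.
default_workers = 5 * (os.cpu_count() or 1)
profiles = profile_partitions(["year=2023", "year=2024"], max_workers=default_workers)
```

Because `as_completed` returns results in completion order, callers that need a stable order should sort the results or key them by partition.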


892-905: Addition of _create_partition_profile_mcp method

This new method handles the creation of partition profiles. It is a direct response to the PR objectives to add missing configuration options and enhance profiling at the partition level. As previously noted in the outdated comments, tests for this method should be verified or added.

metadata-ingestion/tests/unit/test_glue_source_stubs.py (5)

883-901: Review: Added profiling database stub.

The added database stub for profiling (flights-database-profiling) appears correctly structured and includes comprehensive metadata. This aligns with the PR's objective to enhance profiling capabilities.


Line range hint 987-1002: Review: Utility function for mocking S3 responses.

The mock_get_object_response function is well-documented and serves its purpose of simulating S3 get_object responses for testing. This is a good practice for unit tests, ensuring that tests do not rely on actual S3 interactions.
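
For readers unfamiliar with this kind of stub, a minimal sketch of the idea follows. It is an assumption-laden illustration, not the helper from `test_glue_source_stubs.py`: `fake_get_object_response` is a hypothetical name, and it uses `io.BytesIO` as a stand-in for the streaming body that a real S3 `get_object` call returns under the `"Body"` key.

```python
import io
import json
from typing import Any, Dict


def fake_get_object_response(raw_body: str) -> Dict[str, Any]:
    # Hypothetical stub: mimics the shape of an S3 get_object response,
    # where "Body" is a file-like object exposing read().
    encoded = raw_body.encode("utf-8")
    return {"Body": io.BytesIO(encoded), "ContentLength": len(encoded)}


# Code under test can read the stub exactly as it would read a real response body.
response = fake_get_object_response(json.dumps({"jobName": "example"}))
payload = json.loads(response["Body"].read().decode("utf-8"))
```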


Line range hint 1003-1005: Review: Specific object response functions.

Functions get_object_response_1 and get_object_response_2 correctly utilize the mock_get_object_response to simulate specific S3 object responses. This modular approach enhances test readability and maintainability.

Also applies to: 1011-1013


Line range hint 1014-1016: Review: Tagging response functions.

The functions get_bucket_tagging and get_object_tagging provide mocked responses for S3 tagging. This is essential for testing any logic that depends on S3 tags, ensuring the system's robustness in handling tag-related features.

Also applies to: 1018-1020


903-986: Review: Added profiling table stub.

The table stub for avro-profiling is detailed, including extensive column metadata and storage configurations. This is crucial for accurate profiling and testing. However, ensure that the Parameters field for each column, such as unique_proportion, min, median, etc., is being utilized as expected in the profiling logic to avoid redundancy or misconfiguration.

@sagar-salvi-apptware changed the title from "fix(ingestion/glue): add missing config options in profiling" to "fix(ingestion/glue): Add support for missing config options for profiling in Glue" on Jul 18, 2024
@sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from a13f2ad to a2fbece on July 23, 2024 14:00
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a13f2ad and a2fbece.

Files selected for processing (6)
  • docs/how/updating-datahub.md (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (7 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (2 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (5)
  • docs/how/updating-datahub.md
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py
Additional comments not posted (2)
metadata-ingestion/tests/unit/test_glue_source.py (2)

103-136: LGTM! Ensure profiling feature is thoroughly tested.

The new function glue_source_with_profiling is well-structured and integrates the profiling configurations correctly.

However, ensure that all aspects of the profiling feature are thoroughly tested.


689-734: LGTM! Ensure comprehensive test coverage for profiling.

The new test function test_glue_ingest_with_profiling is well-structured and verifies the profiling feature against a golden file.

However, ensure that all profiling metrics and edge cases are covered in the tests.

@sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from a2fbece to 1b2f7a6 on July 26, 2024 16:37
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a2fbece and 1b2f7a6.

Files selected for processing (6)
  • docs/how/updating-datahub.md (1 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (5 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1 hunks)
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py (1 hunks)
Files skipped from review as they are similar to previous changes (4)
  • docs/how/updating-datahub.md
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
  • metadata-ingestion/tests/unit/glue/glue_mces_golden_profiling.json
  • metadata-ingestion/tests/unit/test_glue_source_stubs.py
Additional context used
Ruff
metadata-ingestion/src/datahub/ingestion/source/aws/glue.py

5-5: concurrent.futures.ThreadPoolExecutor imported but unused

Remove unused import

(F401)


5-5: concurrent.futures.as_completed imported but unused

Remove unused import

(F401)

Additional comments not posted (5)
metadata-ingestion/tests/unit/test_glue_source.py (2)

103-136: LGTM!

The glue_source_with_profiling function correctly sets up the profiling configuration and returns a GlueSource instance.


689-734: LGTM!

The test_glue_ingest_with_profiling function correctly tests the profiling functionality within the glue source ingestion process.

metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (3)

Line range hint 171-190: LGTM!

The changes to the GlueSourceConfig class, including the profiling configuration and the is_profiling_enabled method, are correct.


871-903: LGTM!

The changes to the _create_profile_mcp method, which conditionally handles profiling based on profile_table_level_only, are correct.
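
The conditional behavior can be pictured with a small sketch. This is not the `_create_profile_mcp` implementation: `build_profile` is a hypothetical function, and the dictionary keys are illustrative names chosen for this example only.

```python
from typing import Any, Dict, List, Optional


def build_profile(
    row_count: Optional[int],
    columns: List[Dict[str, Any]],
    table_level_only: bool,
) -> Dict[str, Any]:
    # Table-level statistics are always included.
    profile: Dict[str, Any] = {"rowCount": row_count, "columnCount": len(columns)}
    if table_level_only:
        # Skip per-column statistics when only table-level profiling is requested.
        return profile
    profile["fieldProfiles"] = [
        {"fieldPath": col["name"], "nullCount": col.get("null_count")}
        for col in columns
    ]
    return profile


columns = [{"name": "yr", "null_count": 0}, {"name": "flightdate"}]
table_only = build_profile(100, columns, table_level_only=True)
full = build_profile(100, columns, table_level_only=False)
```

With `table_level_only=True` the result carries only the row and column counts. Otherwise each column contributes an entry, with `None` for any statistic that is missing.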


923-923: LGTM!

The refactoring of the get_profile_if_enabled method simplifies the check for profiling status and improves readability.

@sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from 1b2f7a6 to 49290f7 on July 26, 2024 16:43
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1b2f7a6 and 49290f7.

Files selected for processing (2)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1 hunks)
Files skipped from review as they are similar to previous changes (2)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py

Collaborator

@mayurinehate mayurinehate left a comment

Minor comment on tests. Otherwise LGTM

@sagar-salvi-apptware force-pushed the fix/ING-586/glue-missing-config-options-in-profiling branch from 49290f7 to 5646c18 on July 29, 2024 09:26
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 49290f7 and 5646c18.

Files selected for processing (3)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py (4 hunks)
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py (1 hunks)
  • metadata-ingestion/tests/unit/test_glue_source.py (5 hunks)
Files skipped from review as they are similar to previous changes (2)
  • metadata-ingestion/src/datahub/ingestion/source/aws/glue.py
  • metadata-ingestion/src/datahub/ingestion/source/glue_profiling_config.py
Additional comments not posted (4)
metadata-ingestion/tests/unit/test_glue_source.py (4)

16-20: Imports look good!

The new import GlueProfilingConfig is necessary for the profiling configurations.


Line range hint 45-66: Imports for profiling stub functions look good!

The new imports for profiling stub functions are necessary for the added test cases.


103-136: Function glue_source_with_profiling looks good!

The function correctly sets up the profiling configurations and returns the GlueSource instance.


689-724: Function test_glue_ingest_with_profiling looks good!

The function correctly sets up the test environment, mocks necessary responses, and verifies the output against the golden file.
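
As a rough illustration of what a golden-file check does, here is a self-contained sketch. It does not use DataHub's test helpers: `matches_golden` is a hypothetical function, and the file contents are made up for the example.

```python
import json
import tempfile
from pathlib import Path


def matches_golden(output_path: Path, golden_path: Path) -> bool:
    # Compare parsed JSON rather than raw text, so key order and
    # whitespace differences do not cause false failures.
    return json.loads(output_path.read_text()) == json.loads(golden_path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    golden = Path(tmp) / "golden.json"
    output = Path(tmp) / "output.json"
    # Made-up records: same content, different key order.
    golden.write_text(json.dumps([{"entity": "flights", "rowCount": 10}]))
    output.write_text(json.dumps([{"rowCount": 10, "entity": "flights"}]))
    same = matches_golden(output, golden)
```

Comparing parsed structures keeps the check stable across serializer changes, although list order still matters in this simple version.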

@anshbansal
Collaborator

The Airflow failure is unrelated.

@anshbansal merged commit a09575f into datahub-project:master on Jul 29, 2024
57 of 58 checks passed
arosanda added a commit to infobip/datahub that referenced this pull request Sep 23, 2024
* feat(forms) Handle deleting forms references when hard deleting forms (datahub-project#10820)

* refactor(ui): Misc improvements to the setup ingestion flow (ingest uplift 1/2)  (datahub-project#10764)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* fix(ingestion/airflow-plugin): pipeline tasks discoverable in search (datahub-project#10819)

* feat(ingest/transformer): tags to terms transformer (datahub-project#10758)

Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>

* fix(ingestion/unity-catalog): fixed issue with profiling with GE turned on (datahub-project#10752)

Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>

* feat(forms) Add java SDK for form entity PATCH + CRUD examples (datahub-project#10822)

* feat(SDK) Add java SDK for structuredProperty entity PATCH + CRUD examples (datahub-project#10823)

* feat(SDK) Add StructuredPropertyPatchBuilder in python sdk and provide sample CRUD files (datahub-project#10824)

* feat(forms) Add CRUD endpoints to GraphQL for Form entities (datahub-project#10825)

* add flag for includeSoftDeleted in scroll entities API (datahub-project#10831)

* feat(deprecation) Return actor entity with deprecation aspect (datahub-project#10832)

* feat(structuredProperties) Add CRUD graphql APIs for structured property entities (datahub-project#10826)

* add scroll parameters to openapi v3 spec (datahub-project#10833)

* fix(ingest): correct profile_day_of_week implementation (datahub-project#10818)

* feat(ingest/glue): allow ingestion of empty databases from Glue (datahub-project#10666)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* feat(cli): add more details to get cli (datahub-project#10815)

* fix(ingestion/glue): ensure date formatting works on all platforms for aws glue (datahub-project#10836)

* fix(ingestion): fix datajob patcher (datahub-project#10827)

* fix(smoke-test): add suffix in temp file creation (datahub-project#10841)

* feat(ingest/glue): add helper method to permit user or group ownership (datahub-project#10784)

* feat(): Show data platform instances in policy modal if they are set on the policy (datahub-project#10645)

Co-authored-by: Hendrik Richert <hendrik.richert@swisscom.com>

* docs(patch): add patch documentation for how implementation works (datahub-project#10010)

Co-authored-by: John Joyce <john@acryl.io>

* fix(jar): add missing custom-plugin-jar task (datahub-project#10847)

* fix(): also check exceptions/stack trace when filtering log messages (datahub-project#10391)

Co-authored-by: John Joyce <john@acryl.io>

* docs(): Update posts.md (datahub-project#9893)

Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* chore(ingest): update acryl-datahub-classify version (datahub-project#10844)

* refactor(ingest): Refactor structured logging to support infos, warnings, and failures structured reporting to UI (datahub-project#10828)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(restli): log aspect-not-found as a warning rather than as an error (datahub-project#10834)

* fix(ingest/nifi): remove duplicate upstream jobs (datahub-project#10849)

* fix(smoke-test): test access to create/revoke personal access tokens (datahub-project#10848)

* fix(smoke-test): missing test for move domain (datahub-project#10837)

* ci: update usernames to not considered for community (datahub-project#10851)

* env: change defaults for data contract visibility (datahub-project#10854)

* fix(ingest/tableau): quote special characters in external URL (datahub-project#10842)

* fix(smoke-test): fix flakiness of auto complete test

* ci(ingest): pin dask dependency for feast (datahub-project#10865)

* fix(ingestion/lookml): liquid template resolution and view-to-view cll (datahub-project#10542)

* feat(ingest/audit): add client id and version in system metadata props (datahub-project#10829)

* chore(ingest): Mypy 1.10.1 pin (datahub-project#10867)

* docs: use acryl-datahub-actions as expected python package to install (datahub-project#10852)

* docs: add new js snippet (datahub-project#10846)

* refactor(ingestion): remove company domain for security reason (datahub-project#10839)

* fix(ingestion/spark): Platform instance and column level lineage fix (datahub-project#10843)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingestion/tableau): optionally ingest multiple sites and create site containers (datahub-project#10498)

Co-authored-by: Yanik Häni <Yanik.Haeni1@swisscom.com>

* fix(ingestion/looker): Add sqlglot dependency and remove unused sqlparser (datahub-project#10874)

* fix(manage-tokens): fix manage access token policy (datahub-project#10853)

* Batch get entity endpoints (datahub-project#10880)

* feat(system): support conditional write semantics (datahub-project#10868)

* fix(build): upgrade vercel builds to Node 20.x (datahub-project#10890)

* feat(ingest/lookml): shallow clone repos (datahub-project#10888)

* fix(ingest/looker): add missing dependency (datahub-project#10876)

* fix(ingest): only populate audit stamps where accurate (datahub-project#10604)

* fix(ingest/dbt): always encode tag urns (datahub-project#10799)

* fix(ingest/redshift): handle multiline alter table commands (datahub-project#10727)

* fix(ingestion/looker): column name missing in explore (datahub-project#10892)

* fix(lineage) Fix lineage source/dest filtering with explored per hop limit (datahub-project#10879)

* feat(conditional-writes): misc updates and fixes (datahub-project#10901)

* feat(ci): update outdated action (datahub-project#10899)

* feat(rest-emitter): adding async flag to rest emitter (datahub-project#10902)

Co-authored-by: Gabe Lyons <gabe.lyons@acryl.io>

* feat(ingest): add snowflake-queries source (datahub-project#10835)

* fix(ingest): improve `auto_materialize_referenced_tags_terms` error handling (datahub-project#10906)

* docs: add new company to adoption list (datahub-project#10909)

* refactor(redshift): Improve redshift error handling with new structured reporting system (datahub-project#10870)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* feat(ui) Finalize support for all entity types on forms (datahub-project#10915)

* Index ExecutionRequestResults status field (datahub-project#10811)

* feat(ingest): grafana connector (datahub-project#10891)

Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(gms) Add Form entity type to EntityTypeMapper (datahub-project#10916)

* feat(dataset): add support for external url in Dataset (datahub-project#10877)

* docs(saas-overview) added missing features to observe section (datahub-project#10913)

Co-authored-by: John Joyce <john@acryl.io>

* fix(ingest/spark): Fixing Micrometer warning (datahub-project#10882)

* fix(structured properties): allow application of structured properties without schema file (datahub-project#10918)

* fix(data-contracts-web) handle other schedule types (datahub-project#10919)

* fix(ingestion/tableau): human-readable message for PERMISSIONS_MODE_SWITCHED error (datahub-project#10866)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* Add feature flag for view defintions (datahub-project#10914)

Co-authored-by: Ethan Cartwright <ethan.cartwright@acryl.io>

* feat(ingest/BigQuery): refactor+parallelize dataset metadata extraction (datahub-project#10884)

* fix(airflow): add error handling around render_template() (datahub-project#10907)

* feat(ingestion/sqlglot): add optional `default_dialect` parameter to sqlglot lineage (datahub-project#10830)

* feat(mcp-mutator): new mcp mutator plugin (datahub-project#10904)

* fix(ingest/bigquery): changes helper function to decode unicode scape sequences (datahub-project#10845)

* feat(ingest/postgres): fetch table sizes for profile (datahub-project#10864)

* feat(ingest/abs): Adding azure blob storage ingestion source (datahub-project#10813)

* fix(ingest/redshift): reduce severity of SQL parsing issues (datahub-project#10924)

* fix(build): fix lint fix web react (datahub-project#10896)

* fix(ingest/bigquery): handle quota exceeded for project.list requests (datahub-project#10912)

* feat(ingest): report extractor failures more loudly (datahub-project#10908)

* feat(ingest/snowflake): integrate snowflake-queries into main source (datahub-project#10905)

* fix(ingest): fix docs build (datahub-project#10926)

* fix(ingest/snowflake): fix test connection (datahub-project#10927)

* fix(ingest/lookml): add view load failures to cache (datahub-project#10923)

* docs(slack) overhauled setup instructions and screenshots (datahub-project#10922)

Co-authored-by: John Joyce <john@acryl.io>

* fix(airflow): Add comma parsing of owners to DataJobs (datahub-project#10903)

* fix(entityservice): fix merging sideeffects (datahub-project#10937)

* feat(ingest): Support System Ingestion Sources, Show and hide system ingestion sources with Command-S (datahub-project#10938)

Co-authored-by: John Joyce <john@Johns-MBP.lan>

* chore() Set a default lineage filtering end time on backend when a start time is present (datahub-project#10925)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>
Co-authored-by: John Joyce <john@Johns-MBP.lan>

* Added relationships APIs to V3. Added these generic APIs to V3 swagger doc. (datahub-project#10939)

* docs: add learning center to docs (datahub-project#10921)

* doc: Update hubspot form id (datahub-project#10943)

* chore(airflow): add python 3.11 w/ Airflow 2.9 to CI (datahub-project#10941)

* fix(ingest/Glue): column upstream lineage between S3 and Glue (datahub-project#10895)

* fix(ingest/abs): split abs utils into multiple files (datahub-project#10945)

* doc(ingest/looker): fix doc for sql parsing documentation (datahub-project#10883)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(ingest/bigquery): Adding missing BigQuery types (datahub-project#10950)

* fix(ingest/setup): feast and abs source setup (datahub-project#10951)

* fix(connections) Harden adding /gms to connections in backend (datahub-project#10942)

* feat(siblings) Add flag to prevent combining siblings in the UI (datahub-project#10952)

* fix(docs): make graphql doc gen more automated (datahub-project#10953)

* feat(ingest/athena): Add option for Athena partitioned profiling (datahub-project#10723)

* fix(spark-lineage): default timeout for future responses (datahub-project#10947)

* feat(datajob/flow): add environment filter using info aspects (datahub-project#10814)

* fix(ui/ingest): correct privilege used to show tab (datahub-project#10483)

Co-authored-by: Kunal-kankriya <127090035+Kunal-kankriya@users.noreply.github.com>

* feat(ingest/looker): include dashboard urns in browse v2 (datahub-project#10955)

* add a structured type to batchGet in OpenAPI V3 spec (datahub-project#10956)

* fix(ui): scroll on the domain sidebar to show all domains (datahub-project#10966)

* fix(ingest/sagemaker): resolve incorrect variable assignment for SageMaker API call (datahub-project#10965)

* fix(airflow/build): Pinning mypy (datahub-project#10972)

* Fixed a bug where the OpenAPI V3 spec was incorrect. The bug was introduced in datahub-project#10939. (datahub-project#10974)

* fix(ingest/test): Fix for mssql integration tests (datahub-project#10978)

* fix(entity-service) exist check correctly extracts status (datahub-project#10973)

* fix(structuredProps) casing bug in StructuredPropertiesValidator (datahub-project#10982)

* bugfix: use anyOf instead of allOf when creating references in openapi v3 spec (datahub-project#10986)

* fix(ui): Remove ant less imports (datahub-project#10988)

* feat(ingest/graph): Add get_results_by_filter to DataHubGraph (datahub-project#10987)

* feat(ingest/cli): init does not actually support environment variables (datahub-project#10989)

* fix(ingest/graph): Update get_results_by_filter graphql query (datahub-project#10991)

* feat(ingest/spark): Promote beta plugin (datahub-project#10881)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(ingest): support domains in meta -> "datahub" section (datahub-project#10967)

* feat(ingest): add `check server-config` command (datahub-project#10990)

* feat(cli): Make consistent use of DataHubGraphClientConfig (datahub-project#10466)

Deprecates get_url_and_token() in favor of a more complete option: load_graph_config() that returns a full DatahubClientConfig.
This change was then propagated across previous usages of get_url_and_token so that connections to DataHub server from the client respect the full breadth of configuration specified by DatahubClientConfig.

I.e: You can now specify disable_ssl_verification: true in your ~/.datahubenv file so that all cli functions to the server work when ssl certification is disabled.

Fixes datahub-project#9705

* fix(ingest/s3): Fixing container creation when there is no folder in path (datahub-project#10993)

* fix(ingest/looker): support platform instance for dashboards & charts (datahub-project#10771)

* feat(ingest/bigquery): improve handling of information schema in sql parser (datahub-project#10985)

* feat(ingest): improve `ingest deploy` command (datahub-project#10944)

* fix(backend): allow excluding soft-deleted entities in relationship-queries; exclude soft-deleted members of groups (datahub-project#10920)

- allow excluding soft-deleted entities in relationship-queries
- exclude soft-deleted members of groups

* fix(ingest/looker): downgrade missing chart type log level (datahub-project#10996)

* doc(acryl-cloud): release docs for 0.3.4.x (datahub-project#10984)

Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Pedro Silva <pedro@acryl.io>

* fix(protobuf/build): Fix protobuf check jar script (datahub-project#11006)

* fix(ui/ingest): Support invalid cron jobs (datahub-project#10998)

* fix(ingest): fix graph config loading (datahub-project#11002)

Co-authored-by: Pedro Silva <pedro@acryl.io>

* feat(docs): Document __DATAHUB_TO_FILE_ directive (datahub-project#10968)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(graphql/upsertIngestionSource): Validate cron schedule; parse error in CLI (datahub-project#11011)

* feat(ece): support custom ownership type urns in ECE generation (datahub-project#10999)

* feat(assertion-v2): changed Validation tab to Quality and created new Governance tab (datahub-project#10935)

* fix(ingestion/glue): Add support for missing config options for profiling in Glue (datahub-project#10858)

* feat(propagation): Add models for schema field docs, tags, terms (datahub-project#2959) (datahub-project#11016)

Co-authored-by: Chris Collins <chriscollins3456@gmail.com>

* docs: standardize terminology to DataHub Cloud (datahub-project#11003)

* fix(ingestion/transformer): replace the externalUrl container (datahub-project#11013)

* docs(slack) troubleshoot docs (datahub-project#11014)

* feat(propagation): Add graphql API (datahub-project#11030)

Co-authored-by: Chris Collins <chriscollins3456@gmail.com>

* feat(propagation):  Add models for Action feature settings (datahub-project#11029)

* docs(custom properties): Remove duplicate from sidebar (datahub-project#11033)

* feat(models): Introducing Dataset Partitions Aspect (datahub-project#10997)

Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* feat(propagation): Add Documentation Propagation Settings (datahub-project#11038)

* fix(models): chart schema fields mapping, add dataHubAction entity, t… (datahub-project#11040)

* fix(ci): smoke test lint failures (datahub-project#11044)

* docs: fix learning center color scheme & typo (datahub-project#11043)

* feat: add cloud main page (datahub-project#11017)

Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>

* feat(restore-indices): add additional step to also clear system metadata service (datahub-project#10662)

Co-authored-by: John Joyce <john@acryl.io>

* docs: fix typo (datahub-project#11046)

* fix(lint): apply spotless (datahub-project#11050)

* docs(airflow): example query to get datajobs for a dataflow (datahub-project#11034)

* feat(cli): Add run-id option to put sub-command (datahub-project#11023)

Adds an option to assign run-id to a given put command execution. 
This is useful when transformers do not exist for a given ingestion payload, we can follow up with custom metadata and assign it to an ingestion pipeline.

* fix(ingest): improve sql error reporting calls (datahub-project#11025)

* fix(airflow): fix CI setup (datahub-project#11031)

* feat(ingest/dbt): add experimental `prefer_sql_parser_lineage` flag (datahub-project#11039)

* fix(ingestion/lookml): enable stack-trace in lookml logs (datahub-project#10971)

* (chore): Linting fix (datahub-project#11015)

* chore(ci): update deprecated github actions (datahub-project#10977)

* Fix ALB configuration example (datahub-project#10981)

* chore(ingestion-base): bump base image packages (datahub-project#11053)

* feat(cli): Trim report of dataHubExecutionRequestResult to max GMS size (datahub-project#11051)

* fix(ingestion/lookml): emit dummy sql condition for lookml custom condition tag (datahub-project#11008)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(ingestion/powerbi): fix issue with broken report lineage (datahub-project#10910)

* feat(ingest/tableau): add retry on timeout (datahub-project#10995)

* change generate kafka connect properties from env (datahub-project#10545)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>

* fix(ingest): fix oracle cronjob ingestion (datahub-project#11001)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>

* chore(ci): revert update deprecated github actions (datahub-project#10977) (datahub-project#11062)

* feat(ingest/dbt-cloud): update metadata_endpoint inference (datahub-project#11041)

* build: Reduce size of datahub-frontend-react image by 50-ish% (datahub-project#10878)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>

* fix(ci): Fix lint issue in datahub_ingestion_run_summary_provider.py (datahub-project#11063)

* docs(ingest): update developing-a-transformer.md (datahub-project#11019)

* feat(search-test): update search tests from datahub-project#10408 (datahub-project#11056)

* feat(cli): add aspects parameter to DataHubGraph.get_entity_semityped (datahub-project#11009)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* docs(airflow): update min version for plugin v2 (datahub-project#11065)

* doc(ingestion/tableau): doc update for derived permission (datahub-project#11054)

Co-authored-by: Pedro Silva <pedro.cls93@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix(py): remove dep on types-pkg_resources (datahub-project#11076)

* feat(ingest/mode): add option to exclude restricted (datahub-project#11081)

* fix(ingest): set lastObserved in sdk when unset (datahub-project#11071)

* doc(ingest): Update capabilities (datahub-project#11072)

* chore(vulnerability): Log Injection (datahub-project#11090)

* chore(vulnerability): Information exposure through a stack trace (datahub-project#11091)

* chore(vulnerability): Comparison of narrow type with wide type in loop condition (datahub-project#11089)

* chore(vulnerability): Insertion of sensitive information into log files (datahub-project#11088)

* chore(vulnerability): Risky Cryptographic Algorithm (datahub-project#11059)

* chore(vulnerability): Overly permissive regex range (datahub-project#11061)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* fix: update customer data (datahub-project#11075)

* fix(models): fixing the datasetPartition models (datahub-project#11085)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* fix(ui): Adding view, forms GraphQL query, remove showing a fallback error message on unhandled GraphQL error (datahub-project#11084)

Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>

* feat(docs-site): hiding learn more from cloud page (datahub-project#11097)

* fix(docs): Add correct usage of orFilters in search API docs (datahub-project#11082)

Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>

* fix(ingest/mode): Regexp in mode name matcher didn't allow underscore (datahub-project#11098)

* docs: Refactor customer stories section (datahub-project#10869)

Co-authored-by: Jeff Merrick <jeff@wireform.io>

* fix(release): fix full/slim suffix on tag (datahub-project#11087)

* feat(config): support alternate hashing algorithm for doc id (datahub-project#10423)

Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: John Joyce <john@acryl.io>

* fix(emitter): fix typo in get method of java kafka emitter (datahub-project#11007)

* fix(ingest): use correct native data type in all SQLAlchemy sources by compiling data type using dialect (datahub-project#10898)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* chore: Update contributors list in PR labeler (datahub-project#11105)

* feat(ingest): tweak stale entity removal messaging (datahub-project#11064)

* fix(ingestion): enforce lastObserved timestamps in SystemMetadata (datahub-project#11104)

* fix(ingest/powerbi): fix broken lineage between chart and dataset (datahub-project#11080)

* feat(ingest/lookml): CLL support for sql set in sql_table_name attribute of lookml view (datahub-project#11069)

* docs: update graphql docs on forms & structured properties (datahub-project#11100)

* test(search): search openAPI v3 test (datahub-project#11049)

* fix(ingest/tableau): prevent empty site content urls (datahub-project#11057)

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

* feat(entity-client): implement client batch interface (datahub-project#11106)

* fix(snowflake): avoid reporting warnings/info for sys tables (datahub-project#11114)

* fix(ingest): downgrade column type mapping warning to info (datahub-project#11115)

* feat(api): add AuditStamp to the V3 API entity/aspect response (datahub-project#11118)

* fix(ingest/redshift): replace r'\n' with '\n' to avoid token error redshift serverless… (datahub-project#11111)

* fix(entiy-client): handle null entityUrn case for restli (datahub-project#11122)

* fix(sql-parser): prevent bad urns from alter table lineage (datahub-project#11092)

* fix(ingest/bigquery): use small batch size if use_tables_list_query_v2 is set (datahub-project#11121)

* fix(graphql): add missing entities to EntityTypeMapper and EntityTypeUrnMapper (datahub-project#10366)

* feat(ui): Changes to allow editable dataset name (datahub-project#10608)

Co-authored-by: Jay Kadambi <jayasimhan_venkatadri@optum.com>

* fix: remove saxo (datahub-project#11127)

* feat(mcl-processor): Update mcl processor hooks (datahub-project#11134)

* fix(openapi): fix openapi v2 endpoints & v3 documentation update

* Revert "fix(openapi): fix openapi v2 endpoints & v3 documentation update"

This reverts commit 573c1cb.

* docs(policies): updates to policies documentation (datahub-project#11073)

* fix(openapi): fix openapi v2 and v3 docs update (datahub-project#11139)

* feat(auth): grant type and acr values custom oidc parameters support (datahub-project#11116)

* fix(mutator): mutator hook fixes (datahub-project#11140)

* feat(search): support sorting on multiple fields (datahub-project#10775)

* feat(ingest): various logging improvements (datahub-project#11126)

* fix(ingestion/lookml): fix for sql parsing error (datahub-project#11079)

Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

* feat(docs-site) cloud page spacing and content polishes (datahub-project#11141)

* feat(ui) Enable editing structured props on fields (datahub-project#11042)

* feat(tests): add md5 and last computed to testResult model (datahub-project#11117)

* test(openapi): openapi regression smoke tests (datahub-project#11143)

* fix(airflow): fix tox tests + update docs (datahub-project#11125)

* docs: add chime to adoption stories (datahub-project#11142)

* fix(ingest/databricks): Updating code to work with Databricks sdk 0.30 (datahub-project#11158)

* fix(kafka-setup): add missing script to image (datahub-project#11190)

* fix(config): fix hash algo config (datahub-project#11191)

* test(smoke-test): updates to smoke-tests (datahub-project#11152)

* fix(elasticsearch): refactor idHashAlgo setting (datahub-project#11193)

* chore(kafka): kafka version bump (datahub-project#11211)

* readd UsageStatsWorkUnit

* fix merge problems

* change logo

---------

Co-authored-by: Chris Collins <chriscollins3456@gmail.com>
Co-authored-by: John Joyce <john@acryl.io>
Co-authored-by: John Joyce <john@Johns-MBP.lan>
Co-authored-by: John Joyce <john@ip-192-168-1-200.us-west-2.compute.internal>
Co-authored-by: dushayntAW <158567391+dushayntAW@users.noreply.github.com>
Co-authored-by: sagar-salvi-apptware <159135491+sagar-salvi-apptware@users.noreply.github.com>
Co-authored-by: Aseem Bansal <asmbansal2@gmail.com>
Co-authored-by: Kevin Chun <kevin1chun@gmail.com>
Co-authored-by: jordanjeremy <72943478+jordanjeremy@users.noreply.github.com>
Co-authored-by: skrydal <piotr.skrydalewicz@gmail.com>
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
Co-authored-by: david-leifker <114954101+david-leifker@users.noreply.github.com>
Co-authored-by: sid-acryl <155424659+sid-acryl@users.noreply.github.com>
Co-authored-by: Julien Jehannet <80408664+aviv-julienjehannet@users.noreply.github.com>
Co-authored-by: Hendrik Richert <github@richert.li>
Co-authored-by: Hendrik Richert <hendrik.richert@swisscom.com>
Co-authored-by: RyanHolstien <RyanHolstien@users.noreply.github.com>
Co-authored-by: Felix Lüdin <13187726+Masterchen09@users.noreply.github.com>
Co-authored-by: Pirry <158024088+chardaway@users.noreply.github.com>
Co-authored-by: Hyejin Yoon <0327jane@gmail.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: cburroughs <chris.burroughs@gmail.com>
Co-authored-by: ksrinath <ksrinath@users.noreply.github.com>
Co-authored-by: Mayuri Nehate <33225191+mayurinehate@users.noreply.github.com>
Co-authored-by: Kunal-kankriya <127090035+Kunal-kankriya@users.noreply.github.com>
Co-authored-by: Shirshanka Das <shirshanka@apache.org>
Co-authored-by: ipolding-cais <155455744+ipolding-cais@users.noreply.github.com>
Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
Co-authored-by: Shubham Jagtap <132359390+shubhamjagtap639@users.noreply.github.com>
Co-authored-by: haeniya <yanik.haeni@gmail.com>
Co-authored-by: Yanik Häni <Yanik.Haeni1@swisscom.com>
Co-authored-by: Gabe Lyons <itsgabelyons@gmail.com>
Co-authored-by: Gabe Lyons <gabe.lyons@acryl.io>
Co-authored-by: 808OVADOZE <52988741+shtephlee@users.noreply.github.com>
Co-authored-by: noggi <anton.kuraev@acryl.io>
Co-authored-by: Nicholas Pena <npena@foursquare.com>
Co-authored-by: Jay <159848059+jayacryl@users.noreply.github.com>
Co-authored-by: ethan-cartwright <ethan.cartwright.m@gmail.com>
Co-authored-by: Ethan Cartwright <ethan.cartwright@acryl.io>
Co-authored-by: Nadav Gross <33874964+nadavgross@users.noreply.github.com>
Co-authored-by: Patrick Franco Braz <patrickfbraz@poli.ufrj.br>
Co-authored-by: pie1nthesky <39328908+pie1nthesky@users.noreply.github.com>
Co-authored-by: Joel Pinto Mata (KPN-DSH-DEX team) <130968841+joelmataKPN@users.noreply.github.com>
Co-authored-by: Ellie O'Neil <110510035+eboneil@users.noreply.github.com>
Co-authored-by: Ajoy Majumdar <ajoymajumdar@hotmail.com>
Co-authored-by: deepgarg-visa <149145061+deepgarg-visa@users.noreply.github.com>
Co-authored-by: Tristan Heisler <tristankheisler@gmail.com>
Co-authored-by: Andrew Sikowitz <andrew.sikowitz@acryl.io>
Co-authored-by: Davi Arnaut <davi.arnaut@acryl.io>
Co-authored-by: Pedro Silva <pedro@acryl.io>
Co-authored-by: amit-apptware <132869468+amit-apptware@users.noreply.github.com>
Co-authored-by: Sam Black <sam.black@acryl.io>
Co-authored-by: Raj Tekal <varadaraj_tekal@optum.com>
Co-authored-by: Steffen Grohsschmiedt <gitbhub@steffeng.eu>
Co-authored-by: jaegwon.seo <162448493+wornjs@users.noreply.github.com>
Co-authored-by: Renan F. Lima <51028757+lima-renan@users.noreply.github.com>
Co-authored-by: Matt Exchange <xkollar@users.noreply.github.com>
Co-authored-by: Jonny Dixon <45681293+acrylJonny@users.noreply.github.com>
Co-authored-by: Pedro Silva <pedro.cls93@gmail.com>
Co-authored-by: Pinaki Bhattacharjee <pinakipb2@gmail.com>
Co-authored-by: Jeff Merrick <jeff@wireform.io>
Co-authored-by: skrydal <piotr.skrydalewicz@acryl.io>
Co-authored-by: AndreasHegerNuritas <163423418+AndreasHegerNuritas@users.noreply.github.com>
Co-authored-by: jayasimhankv <145704974+jayasimhankv@users.noreply.github.com>
Co-authored-by: Jay Kadambi <jayasimhan_venkatadri@optum.com>
Co-authored-by: David Leifker <david.leifker@acryl.io>
Labels
ingestion PR or Issue related to the ingestion of metadata

3 participants