Refactor view sequencing and return sequenced views if recursion is found #2499
Conversation
I did the refactor to improve code readability and to look for the possible origin of the problem.
✅ 78/78 passed, 3 skipped, 1h40m5s total (running from acceptance #5374)
CI failure is unrelated. Fixed in #2500
Compare 87a3370 to f8d2458
    2. The (growing) set of views from already sequenced previous batches
    For each remaining view, we check if all its dependencies are covered for. If that is the case, then we
    add that view to the new batch of views.
    1. We repeat point from point 1. until all views are sequenced.
Should be 2?
lgtm
    if all(self._index.is_migrated(table_view.schema, table_view.name) for table_view in not_processed_yet):
        result.append(view)
    if len(result) == 0 and len(views) > 0:  # prevent infinite loop
        raise RecursionError(f"Unresolved dependencies prevent batch sequencing: {views}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we'd not rely on exceptions for control flow, but let's continue with this one.
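Taken together, the docstring and the hunk above describe a batch-wise topological sort. Below is a minimal, self-contained sketch of that approach; the `View` dataclass and function names are illustrative stand-ins, not the actual UCX `ViewsMigrationSequencer`, whose dependency checks go through the migration index:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class View:
    """Illustrative stand-in for a view with named dependencies."""
    name: str
    dependencies: frozenset = field(default_factory=frozenset)


def sequence_views(views: list) -> list:
    """Group views into batches so that every dependency of a view sits in an
    earlier batch; raise RecursionError when a cycle blocks all progress."""
    remaining = list(views)
    sequenced = set()  # names of views from already sequenced previous batches
    batches = []
    while remaining:
        # A view joins the next batch once all its dependencies are covered.
        batch = [view for view in remaining if view.dependencies <= sequenced]
        if not batch:  # no view could be added: circular dependency
            raise RecursionError(f"Unresolved dependencies prevent batch sequencing: {remaining}")
        sequenced.update(view.name for view in batch)
        remaining = [view for view in remaining if view.name not in sequenced]
        batches.append(batch)
    return batches


# b depends on a, c depends on b: three batches, one view each.
a, b, c = View("a"), View("b", frozenset({"a"})), View("c", frozenset({"b"}))
assert [[v.name for v in batch] for batch in sequence_views([c, b, a])] == [["a"], ["b"], ["c"]]
```

A cycle such as `a -> b -> a` makes the first pass produce an empty batch, which triggers the `RecursionError` instead of an infinite loop, matching the guard in the hunk above.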
* Added a check for No isolation shared clusters and MLR ([#2484](#2484)). This change introduces a check for `No Isolation shared clusters` using Machine Learning Runtime (MLR) in the assessment workflow and cluster crawler, resolving issue [#846](#846). A new function has been added to verify if a Spark version is an MLR and another function checks for No Isolation shared clusters with MLR. The `_check_cluster_failures` method now includes an additional condition to verify if the data security mode of the cluster is set to NONE and if the cluster has MLR enabled. New unit tests have been added to ensure the proper functioning of the new feature. The Assessment workflow and cluster crawler have been modified to include the new checks. No new documentation, CLI command, or tables have been added as part of this change.
* Added a section in migration dashboard to list the failed tables, etc ([#2406](#2406)). In this enhancement, we have added a `failed-to-migrate` warning message to notify users when specific table migration and ACL migration operations in `table_migrate.py` fail. This change is part of the resolution of issue [#1754](#1754) and includes modifications to existing workflows. We have also introduced a new SQL file, `05_1_failed_table_migration.sql`, located in `src/databricks/labs/ucx/queries/migration/main/`, which lists the failed tables during a migration process. The SQL file contains a query that retrieves messages from the `inventory.logs` table indicating a failed migration and displays the relevant message. The changes have been manually tested on a staging environment, ensuring reliable and consistent performance. No new commands, tables, or user documentation have been added as part of this update.
* Added clean up activities when `migrate-credentials` cmd fails intermittently ([#2479](#2479)). This pull request includes enhancements to the `databricks/labs/ucx` module, specifically for Azure management functionalities. The primary changes involve improving error handling during the creation of access connectors and storage credentials for storage accounts. Two new methods, `delete_storage_credentials` and `delete_access_connectors`, have been introduced to handle deletion of resources when errors occur. Additionally, the `migrate-credentials` command has been improved to ensure that any created but unvalidated access connectors and storage credentials are deleted when intermittent failures happen. A new `delete_access_connector` method has been added to the `resources.py` file to improve cleanup activities. Moreover, the test suite has been updated with new test cases and fixtures to enhance Azure-specific migration functionality test coverage. These changes aim to increase the overall robustness and consistency of the system when handling failures and to avoid leaving orphaned resources in the Azure environment.
* Added standalone migrate ACLs ([#2284](#2284)). In this commit, the team has added a new `migrate-acls` command to the `labs.yml` file, which allows for the migration of Access Control Lists (ACLs) from a legacy metastore to a Unity Catalog (UC) metastore. The command includes optional flags for specifying the target catalog and handling HMS-FED ACLs. Additionally, the `migrate-dbsql-dashboards` command has been updated to include a new flag for specifying the target workspace ID. A new `ACLMigrator` class has been introduced to manage the migration of ACLs for tables and databases, and a new `migrate_acls` command has been added to the `cli.py` file for migrating ACLs in table migration scenarios involving HMS federation. The commit also includes a new test file, `test_migrate_acls.py`, with unit tests for the `migrate_acls` function in the `hive_metastore` package, as well as a new test function, `test_migrate_acls_should_produce_proper_queries`, for testing the behavior of the `migrate_acls` function and ensuring that it produces the correct SQL queries.
* Appends metastore_id or location_name to roles for uniqueness ([#2471](#2471)). In this release, we have introduced a new method `_generate_role_name` in the `access.py` module of the AWS package to generate a unique role name for the `create-missing-principals` functionality. This addresses the issue [#2336](#2336).
* Cache workspace content ([#2497](#2497)). This commit introduces a caching mechanism for workspace content to improve load times and bypass rate limits, implemented through the new `WorkspaceCache` class which stores cached instances of various objects. The `_CachedPath` class, a subclass of `WorkspacePath`, is used to cache the content of the workspace path using an LRU cache. New classes and methods, such as `_CachedIO`, `_PathLruCache`, and `WorkspaceCache.get_path()`, have been added for handling caching of input/output operations. The `TaskRunner` class has been updated to use `WorkspaceCache` to retrieve workspace paths. Unit tests have been added to verify the functionality of the new caching mechanism. This change is expected to significantly improve the user experience by reducing the time taken to load workspace content.
* Changes the security mode for assessment cluster ([#2472](#2472)). In this update, we have enhanced the security of the `main` cluster for assessment jobs in the open-source library. We have modified the `_job_clusters` function in the `workflows.py` file, changing the `data_security_mode` parameter in the `compute.ClusterSpec` constructor from LEGACY_SINGLE_USER to LEGACY_SINGLE_USER_STANDARD using the `databricks labs ucx` command line interface. This change resolves issue [#1717](#1717) and disables passthrough in the LEGACY_SINGLE_USER_STANDARD mode. The modification has been manually tested to ensure correct implementation. This improvement is backward-compatible, as it modifies existing functionality without introducing new methods. Software engineers familiar with the project can conveniently adopt this change, which strengthens security settings for the assessment `main` cluster.
* Do not normalize cases when reformatting SQL queries in CI check ([#2495](#2495)). In this release, we have made a modification to the `Reformat SQL queries` job in our continuous integration (CI) workflow to address case normalization issues that were causing blocks. The `databricks labs lsql fmt` command has been updated to include the `--normalize-case false` flag, which prevents case normalization during query reformatting. This change ensures that case-sensitive columns are not altered during the reformatting process, thus avoiding CI blockages. A specific example of this change can be seen in the SQL query used for cluster summary assessment in the interactive mode, where the case sensitivity of the `cluster_name` and `cluster_id` columns has been preserved. This modification enhances the adoption experience for software engineers working with the project by ensuring that SQL queries are formatted without altering case-sensitive elements, enabling a smoother CI check process. No new methods have been added, and existing functionality has only been changed to exclude case normalization.
* Drop source table after successful table move not before ([#2430](#2430)). A fix has been implemented to address an issue where the source table was unintentionally dropped before a new table could be successfully created during a table move operation in the Hive metastore. This resulted in the process failing with a [DELTA_CREATE_TABLE_WITH_DIFFERENT_PROPERTY] error. To resolve this, the source table is now dropped after the new table creation, preserving the source table in case of any issues. This change includes an updated order of dropping and creating the table within the `_recreate_table` method, as well as an added integration test (test_move_tables_table_properties_mismatch_preserves_original) to simulate table move with mismatched table properties and verify if the original table is preserved. The test uses the TableMove class and creates catalogs, schemas, tables, and access groups to perform the test. The import section has been updated to include pytest and BadRequest from the sdk.errors module.
* Enabled `principal-prefix-access` command to run as collection ([#2450](#2450)). In this release, we have introduced several enhancements to our open-source library aimed at improving functionality and maintainability for software engineers. The `principal-prefix-access` command can now be run as a collection, allowing for more flexible and efficient execution. We have also introduced a new `get_workspace_context` function that consolidates common functionalities, simplifying the codebase and improving maintainability. Additionally, we have added a `run-as-collection` flag to the `aws-subscription-scan` command, allowing users to specify whether to run the command as a collection or not. The `create-missing-principals` command for AWS has also been improved to identify all S3 locations missing a UC-compatible role more effectively. Overall, these changes enable the `principal-prefix-access` command to run as a collection, enhance the functionality of the codebase, and make it easier for users to manage their AWS subscriptions and identities.
* Fixed Driver OOM error by increasing the min memory requirement for node from 16GB to 32 GB ([#2473](#2473)). In this update, the `policy.py` file in the `databricks/labs/ucx/installer` directory has been modified to increase the minimum memory requirement for the node type from 16 GB to 32 GB. This change is intended to prevent driver crashes during assessment runs by providing additional memory for the workflow job. The function `_definition` has been updated to incorporate the new minimum memory requirement in the `node_type_id` configuration. No new methods have been added, and the existing functionality remains unchanged beyond the updated memory requirement.
* Fixed issue when running create-missing-credential cmd tries to create the role again if already created ([#2456](#2456)). A modification has been implemented in the `list_uc_roles` method within the `access.py` file of the `databricks/labs/ucx/aws` directory to address an issue with the `create-missing-credential` command. Previously, the command would try to recreate a role even if it had already been created due to a mismatch in the comparison of the `location` attribute of an `external_location` object with the `resource_path` attribute of a `role` object. This comparison has now been updated to use the `startswith` method instead of the `match` method (a simplified sketch of this prefix check follows this list). An early return has also been added if there are no missing paths to avoid unnecessary processing. These changes resolve issue [#2413](#2413) and have been tested through unit tests. However, there is no mention of integration tests or staging environment verifications, and more information about the testing performed would be helpful to ensure the changes are functioning as intended. No user documentation, CLI commands, workflows, or tables have been added, modified, or removed as part of this change.
* Fixed issue with Interactive Dashboard not showing output ([#2476](#2476)). In this release, we have resolved an issue in the Interactive Dashboard feature where the output was not being displayed due to a case sensitivity bug in an SQL query. The query was incorrectly using `request_params.clusterid` instead of `request_params.clusterId` in the join, select, and having clauses, causing no output to be displayed. We have fixed this issue by changing the lowercase `i` to an uppercase `I` in the affected clauses. This change is limited to the SQL file used in the Interactive Dashboard feature and does not affect other parts of the system. The changes have been manually tested, but no unit or integration tests have been mentioned. Additionally, there is a modification to the test case for selecting node type in the `test_job_cluster_policy` function, which now selects a node type with min_memory_gb as 32 instead of 16. No new documentation, CLI commands, workflows, tables, or existing functionalities have been changed in this release.
* Fixed support for table/schema scope for the revert table cli command ([#2428](#2428)). The recent change to the open-source library includes an update to the `revert table` CLI command to support table/schema scoping. This modification allows users to specify a schema and table for reverting migrations, with a default value of None. The `revert_migrated_tables` function and `print_revert_report` method have been updated to include schema and table as keyword arguments with a default value of None, enabling more precise control over the revert operation and preventing unintended actions. Additionally, a new dictionary, `reverse_seen`, has been implemented to store the mapping from original table keys to new keys after migration, improving support for reverting table migrations in specific schemas. The `print_revert_report` method now accepts optional `schema` and `table` parameters for filtering the report based on a specific schema and/or table, enhancing the overall user experience.
* Make lint log references clickable ([#2474](#2474)). This commit introduces a significant enhancement to the lint log references, making them clickable for easier navigation and reference. By updating the format of the `message_relative_to` method, a leading `.` has been added to the path to make it relative to the current directory, addressing issue [#2474](#2474) and also closing issue [#2408](#2408). The `message_relative_to` method has been modified to include the path, start line, start column, code, and message of the lint advice in the format `<path>:<line>:<column>: [<code>] <message>`. The Advisory class remains unchanged. This improvement simplifies the process of navigating and referencing lint logs, ultimately providing a better user experience.
* Refactor view sequencing and return sequenced views if recursion is found ([#2499](#2499)). In this release, the `_migrate_views` function in the `table_migrate.py` file has been refactored to improve view sequencing during table migrations, resolving issue [#2494](#2494). The `ViewsMigrationSequencer` object now takes a `migration_index` parameter and returns sequenced views if recursion is found during the migration process. The `table-migration` workflow has been modified to sequence views based on their dependencies, and a new `sql_migrate_view` method has been added to the `ViewToMigrate` class. Unit tests have been updated to reflect these changes and include a variety of scenarios, such as empty sequences, direct and indirect views, deep indirect views, and invalid view queries. Additionally, several fixtures have been added to simplify view migration testing. This refactoring enhances view migration functionality, making it more robust and flexible.
* Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)). In this pull request, we update the requirement on the `databricks-labs-lsql` package from a version greater than or equal to 0.5 and less than 0.9, to a version greater than or equal to 0.5 and less than 0.10. This update allows for the use of the latest version of the package and avoids any potential compatibility issues. The change includes updates to the requirements.txt file, and adds the `normalize-case` option to the `databricks labs lsql fmt` command, allowing users to control the normalization of query text to lowercase. The `deploy_dashboard` method has been removed and replaced with the `create` method of the `lakeview` attribute of the WorkspaceClient object. A new test function, `test_dashboards_creates_dashboard_with_replace_database`, has been added, which is currently marked to be skipped due to missing permissions to create a schema. Additionally, the project has been updated to use Databricks Python SDK version 0.30.0, with changes to the `execute` and `fetch_value` functions to use the new `StatementResponse` type instead of `ExecuteStatementResponse`. Please refer to the release notes and changelog for the `databricks-labs-lsql` package version 0.9.0 for more information on the changes.
* Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)). In this update, the requirement for the `databricks-sdk` package has been changed from `~=0.29.0` to `>=0.29,<0.31`, allowing for the use of the latest version of the package while ensuring compatibility. The new version includes features such as DataPlane support, partner support in the SDK, and various bug fixes and improvements. The update also includes changes to the `redash.py` file, modifying import statements and updating the `Query` class to use the new `LegacyQuery` class. Additionally, there have been changes to the unit tests in the `test_access.py` file to ensure compatibility with the updated package. The commit includes release notes, a changelog, and a list of commits for the updated version.
* Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)). In this pull request, we have updated the version range of the sqlglot dependency in the pyproject.toml file from >=25.5.0,<25.12 to >=25.5.0,<25.13. This change allows us to use the latest version of the sqlglot library, which provides parsing and analysis functionality for SQL code, while also specifying a maximum version to avoid any potential breaking changes that may be introduced in future releases. By keeping our dependencies up-to-date, we can ensure that our project is making use of the latest features and bug fixes, and is compatible with the most recent versions of other libraries and tools.
* Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)). In this pull request, we have updated the version range constraint for the sqlglot requirement in the pyproject.toml file, from v25.5.0 to v25.14.9999, to allow for the latest version of sqlglot (v25.14.0). This update includes several bug fixes and new features related to ClickHouse support and optimizer functionality. However, it also includes a couple of breaking changes related to schema and database substitution, and nullable comparison in is_type. Therefore, it is crucial to thoroughly test your codebase to ensure compatibility with the new version of sqlglot. The previous constraint allowed for versions between 25.5.0 and 25.12.9999, but we have updated it to 25.14.9999 to accommodate the latest version of sqlglot and its new features.
* Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)). In this update, we are upgrading the `sqlglot` dependency in our `pyproject.toml` file from version `>=25.5.0,<25.15` to `>=25.5.0,<25.17`. This change resolves issues [#2452](#2452) and [#2451](#2451), which were caused by bugs in the previous version of `sqlglot`. `sqlglot` is a library used for parsing, analyzing, and rewriting SQL queries. The new version of `sqlglot` includes bug fixes, new features, and some breaking changes. The specific details of these changes can be found in the commit history. Once this pull request is merged and the new version of `sqlglot` is installed, the aforementioned issues should be resolved.
* Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)). In this update, we have modified the requirements for the sqlglot library to allow the most recent version. The previous requirement of `<25.17,>=25.5.0` has been changed to `>=25.5.0,<25.18`. This change permits the adoption of the latest improvements and bug fixes made to the sqlglot library. The commit message includes a reference to the sqlglot changelog and a list of commits since the last permitted version. As a software engineer implementing this update, you can be assured of the compatibility of the project with the newest version of sqlglot, including all the added enhancements and bug fixes.
* Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)). In the latest update, the sqlglot dependency has been upgraded to version 25.18.1, introducing several new features and addressing a variety of issues. The new version includes support for the IS JSON predicate in PostgreSQL, the GLOB table function in DuckDB, and the table statement in INSERT for Spark. Several bug fixes are also included, such as improvements to the SQLite IS parser, proper handling of LTRIM/RTRIM usage in Oracle, and fixes for DIV0 case handling in Snowflake. Additionally, there are changes to the default naming of STRUCT fields in Spark and a fix in the binding of TABLESAMPLE to exp.Subquery instead of the top-level exp.Select. This release aims to improve the overall functionality and reliability of the library for software engineers working with various SQL databases.
* [chore] make `GRANT` migration logic isolated to `MigrateGrants` component ([#2492](#2492)). In this release, the `MigrateGrants` component has been introduced to handle the migration logic related to grants, improving code organization and maintainability. This component is responsible for applying grants to a Unity Catalog (UC) table based on a given source table, using instances of `SqlBackend`, `GroupManager`, and a list of grant loaders. The `ACLMigrator` class has also been introduced, which takes instances of `TablesCrawler`, `WorkspaceInfo`, `MigrationStatusRefresher`, and `MigrateGrants`, and applies ACLs to the migrated tables in the given catalog. The `Mapping` class has been updated to include return type annotations of `str` for the `as_uc_table_key` and `as_hms_table_key` methods. Additionally, the `migrate_tables` method in the `TableMigration` class has been modified to remove the `acl_strategy` parameter in several methods and instead interacts with the new `MigrateGrants` component, reducing code duplication and simplifying the implementation of the `migrate_tables` method. The `PrincipalACL` class has been removed, and the `MigrateGrants` class has been introduced, which handles the remapping of group names and the migration of grants.

Dependency updates:
 * Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)).
 * Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)).
 * Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)).
 * Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)).
 * Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)).
 * Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)).
 * Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)).
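As referenced in the [#2456](#2456) entry above, the fix replaces a regex-style `match` with a plain prefix comparison between each external location and the resource paths of UC-compatible roles. Here is a minimal sketch under illustrative names; it is not the actual UCX code, which lives in `access.py` of the AWS package:

```python
def identify_missing_paths(external_locations: list, role_resource_paths: list) -> set:
    """Collect external locations not covered by any UC-compatible role.

    `startswith` is used instead of `match` so that a role scoped to
    s3://bucket/ also covers child paths such as s3://bucket/folder.
    """
    missing_paths = set()
    for location in external_locations:
        if not any(location.startswith(path) for path in role_resource_paths):
            missing_paths.add(location)
    return missing_paths


# A role on s3://bucket-a/ covers everything under it; s3://bucket-b/data is missing.
locations = ["s3://bucket-a/data", "s3://bucket-a/logs", "s3://bucket-b/data"]
assert identify_missing_paths(locations, ["s3://bucket-a/"]) == {"s3://bucket-b/data"}
```

An early return on an empty result then skips the role-creation step entirely, which is the unnecessary-processing fix the entry describes.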
* Added a check for No isolation shared clusters and MLR ([#2484](#2484)). This commit introduces a check for `No isolation shared clusters` utilizing MLR as part of the assessment workflow and cluster crawler, addressing issue [#846](#846). A new function, `is_mlr`, has been implemented to determine if the Spark version corresponds to an MLR cluster (a hedged sketch of this check follows this list). If the cluster has no isolation and uses MLR, the assessment failure list is appended with an appropriate error message. Thorough testing, including unit tests and manual verification, has been conducted. However, user documentation and new CLI commands, workflows, or tables have not been added. Additionally, a new test has been added to verify the behavior of MLR clusters without isolation, enhancing the assessment workflow's accuracy in identifying unsupported configurations.
* Added a section in migration dashboard to list the failed tables, etc ([#2406](#2406)). In this release, we have introduced a new logging message format for failed table migrations in the `TableMigrate` class, specifically impacting the `_migrate_external_table`, `_migrate_external_table_hiveserde_in_place`, `_migrate_dbfs_root_table`, `_migrate_table_create_ctas`, `_migrate_table_in_mount`, and `_migrate_acl` methods within the `table_migrate.py` file. This update employs the `failed-to-migrate` prefix in log messages for improved failure reason identification during table migrations, enhancing debugging capabilities. As part of this release, we have also developed a new SQL file, `05_1_failed_table_migration.sql`, which retrieves a list of failed table migrations by extracting messages with the 'failed-to-migrate:' prefix from the inventory.logs table and returning the corresponding message text. While this release does not include new methods or user documentation, it resolves issue [#1754](#1754) and has been manually tested with positive results in the staging environment, demonstrating its functionality.
* Added clean up activities when `migrate-credentials` cmd fails intermittently ([#2479](#2479)). This pull request enhances the robustness of the `migrate-credentials` command for Azure in the event of intermittent failures during the creation of access connectors and storage credentials. It introduces new methods, `delete_storage_credential` and `delete_access_connectors`, which are responsible for removing incomplete resources when errors occur. The `_migrate_service_principals` and `_create_storage_credentials_for_storage_accounts` methods now handle `PermissionDenied`, `NotFound`, and `BadRequest` exceptions, deleting created storage credentials and access connectors if exceptions occur. Additionally, error messages have been updated to guide users in resolving issues before attempting the operation again. The PR also modifies the `sp_migration` fixture in the `tests/unit/azure/test_credentials.py` file, simplifying the deletion process for access connectors and improving the testing of the `ServicePrincipalMigration` class. These changes address issue [#2362](#2362), ensuring clean-up activities in case of intermittent failures and improving the overall reliability of the system.
* Added standalone migrate ACLs ([#2284](#2284)). A new `migrate-acls` command has been introduced to facilitate the migration of Access Control Lists (ACLs) from a legacy metastore to a Unity Catalog (UC) metastore. The command, designed to work with HMS federation and other table migration scenarios, can be executed with optional flags `target-catalog` and `hms-fed` to specify the target catalog and migrate HMS-FED ACLs, respectively. The release also includes modifications to the `labs.yml` file, adding the new command and its details to the `commands` section. In addition, a new `ACLMigrator` class has been added to the `databricks.labs.ucx.contexts.application` module to handle ACL migration for tables in a standalone manner. A new test file, `test_migrate_acls.py`, contains unit tests for ACL migration in a Hive metastore, covering various scenarios and ensuring proper query generation. These features streamline and improve the functionality of ACL migration, offering better access control management for users.
* Appends metastore_id or location_name to roles for uniqueness ([#2471](#2471)). A new method, `_generate_role_name`, has been added to the `Access` class in the `aws/access.py` file of the `databricks/labs/ucx` module to generate unique names for AWS roles using a consistent naming convention. The `list_uc_roles` method has been updated to utilize this new method for creating role names. In response to issue [#2336](#2336), the `create_missing_principals` change enforces role uniqueness on AWS by modifying the `ExternalLocation` table to include `metastore_id` or `location_name` for uniqueness. To ensure proper cleanup, the `create_uber_principal` method has been updated to delete the instance profile if creating the cluster policy fails due to a `PermissionError`. Unit tests have been added to verify these changes, including tests for the new role name generation method and the updated `ExternalLocation` table. The `MetastoreAssignment` class is also imported in this diff, although its usage is not immediately clear. These changes aim to improve the creation of unique AWS roles for Databricks Labs UCX and enforce role uniqueness on AWS.
* Cache workspace content ([#2497](#2497)). In this release, we have implemented a caching mechanism for workspace content to improve load times and bypass rate limits. The `WorkspaceCache` class handles caching of workspace content, with the `_CachedIO` and `_PathLruCache` classes managing IO operation caching and LRU caching, respectively. The `_CachedPath` class, a subclass of `WorkspacePath`, handles caching of workspace paths. The `open` and `unlink` methods of `_CachedPath` have been overridden to cache results and remove corresponding cache entries. The `guess_encoding` function is used to determine the encoding of downloaded content. Unit tests have been added to ensure the proper functioning of the caching mechanism, including tests for cache reuse, invalidation, and encoding determination. This feature aims to enhance the performance of file operations, making the overall system more efficient for users.
* Changes the security mode for assessment cluster ([#2472](#2472)). In this release, the security mode of the `main` assessment cluster has been updated from LEGACY_SINGLE_USER to LEGACY_SINGLE_USER_STANDARD in the workflows.py file. This change disables passthrough and addresses issue [#1717](#1717). The new data security mode is defined in the compute.ClusterSpec object for the `main` job cluster by modifying the data_security_mode attribute. While no new methods have been introduced, existing functionality related to the cluster's security mode has been modified. Software engineers adopting this project should be aware of the security implications of this change, ensuring the appropriate data protection measures are in place. Manual testing has been conducted to verify the functionality of this update.
* Do not normalize cases when reformatting SQL queries in CI check ([#2495](#2495)). In this release, the CI workflow for pushing changes to the repository has been updated to improve the behavior of the SQL query reformatting step. Previously, case normalization of SQL queries was causing issues with case-sensitive columns, resulting in blocked CI checks. This release addresses the issue by adding the `--normalize-case false` flag to the `databricks labs lsql fmt` command, which disables case normalization. This modification allows the CI workflow to pass and ensures correct SQL query formatting, regardless of case sensitivity. The change impacts the assessment/interactive directory, specifically a cluster summary query for interactive assessments. This query involves a change in the ORDER BY clause, replacing a normalized case with the original case. Despite these changes, no new methods have been added, and existing functionality has been modified solely to improve CI efficiency and SQL query compatibility.
* Drop source table after successful table move not before ([#2430](#2430)). In this release, we have addressed an issue where the source table was being dropped before a new table was created, which could cause the creation process to fail and leave the source table unavailable. This problem has been resolved by modifying the `_recreate_table` method of the `TableMove` class in the `hive_metastore` package to drop the source table after the new table creation. The updated implementation ensures that the source table remains intact during the creation process, even in case of any issues. This change comes with integration tests and does not involve any modifications to user documentation, CLI commands, workflows, tables, or existing functionality. Additionally, a new test function `test_move_tables_table_properties_mismatch_preserves_original` has been added to `test_table_move.py`, which checks if the original table is preserved when there is a mismatch in table properties during the move operation. The changes also include adding the `pytest` library and the `BadRequest` exception from the `databricks.sdk.errors` package for the new test function. The imports section has been updated accordingly with the removal of `databricks.sdk.errors.NotFound` and the addition of `pytest` and `databricks.sdk.errors.BadRequest`.
* Enabled `principal-prefix-access` command to run as collection ([#2450](#2450)). This commit introduces several improvements to the `principal-prefix-access` command in our open-source library. A new flag `run-as-collection` has been added, allowing the command to run as a collection across multiple AWS accounts. A new `get_workspace_context` function has also been implemented, which encapsulates common functionalities and enhances code reusability. Additionally, the `get_workspace_contexts` method has been developed to retrieve a list of `WorkspaceContext` objects, making the command more efficient when handling collections of workspaces. Furthermore, the `install_on_account` method has been updated to use the new `get_workspace_contexts` method. The `principal-prefix-access` command has been enhanced to accept an optional `acc_client` argument, which is used to retrieve information about the assessment run. These changes improve the functionality and organization of the codebase, making it more efficient, flexible, and easier to maintain for users working with multiple AWS accounts and workspaces.
* Fixed Driver OOM error by increasing the min memory requirement for node from 16GB to 32 GB ([#2473](#2473)). A modification has been implemented in the `policy.py` file located in the `databricks/labs/ucx/installer` directory, which raises the minimum memory requirement for the node type from 16GB to 32GB. This adjustment is intended to prevent driver out-of-memory (OOM) errors during assessments. The `_definition` function in the `policy` class has been updated to incorporate the new memory requirement, which will be employed for selecting a suitable node type. The rest of the code remains unchanged. This modification addresses issue [#2398](#2398). While the code has been tested, specific testing details are not provided in the commit message.
* Fixed issue when running create-missing-credential cmd tries to create the role again if already created ([#2456](#2456)). In this release, we have implemented a fix to address an issue in the `_identify_missing_paths` function within the `access.py` file of the `databricks/labs/ucx/aws` directory, where the `create-missing-credential` command was attempting to create a role again even if it had already been created. This issue was due to a mismatch in path comparison using the `match` function, which has now been updated to use the `startswith` function instead. This change ensures that the code checks if the path starts with the resource path, thereby resolving issue [#2413](#2413). The `_identify_missing_paths` function identifies missing paths by loading UC-compatible roles and iterating through each external location: if a location matches one of the roles' resource paths, the code moves on to the next external location; if no match is found for any of the UC-compatible roles, the location is added to the `missing_paths` set. The diff also includes a conditional check to return an empty list if the `missing_paths` set is empty. Additionally, tests have been added or modified to ensure the proper functioning of the updated code, including unit tests and integration tests. However, there is no mention of manual testing or verification on a staging environment. Overall, this update fixes a specific issue with the `create-missing-credential` command and includes updated tests to ensure proper functionality.
* Fixed issue with Interactive Dashboard not showing output ([#2476](#2476)). In this release, we have resolved an issue with the Interactive Dashboard not displaying output by fixing a bug in the query used for the dashboard. Previously, the query was joining on "request_params.clusterid" and selecting "request_params.clusterid" in the SELECT clause, but the correct field name is "request_params.clusterId". The query has been updated to use "request_params.clusterId" instead, both in the JOIN and SELECT clauses. These changes ensure that the Interactive Dashboard displays the correct output, improving the overall functionality and usability of the product. No new methods were added, and existing functionality was changed within the scope of the Interactive Dashboard query. Manual testing is recommended to ensure that the output is now displayed correctly. Additionally, a change has been made to the `test_installation.py` integration test file to improve the performance of clusters by updating the `min_memory_gb` argument from 16 GB to 32 GB in the `test_job_cluster_policy` function.
* Fixed support for table/schema scope for the revert table cli command ([#2428](#2428)). In this release, we have enhanced the `revert table` CLI command to support table and schema scopes in the open-source library. The `revert_migrated_tables` function now accepts optional parameters `schema` and `table` of types str or None, which were previously required parameters. Similarly, the `print_revert_report` function in the `tables_migrator` object within `WorkspaceContext` has been updated to accept the same optional parameters. The `revert_migrated_tables` function now uses these optional parameters when calling the `revert_migrated_tables` method of `tables_migrator` within `ctx`. Additionally, we have introduced a new dictionary called `reverse_seen` and modified the `_get_tables_to_revert` and `print_revert_report` functions to utilize this dictionary, providing more fine-grained control when reverting table migrations. The `delete_managed` parameter is used to determine if managed tables should be deleted. These changes allow users to specify a specific schema and table to revert, rather than reverting all migrated tables within a workspace.
* Refactor view sequencing and return sequenced views if recursion is found ([#2499](#2499)). In this refactored code, the view sequencing for table migration has been improved and now returns sequenced views if recursion is found, addressing issue [#2494](#2494).
* Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)). In this release, we have updated the version requirements for the `databricks-labs-lsql` package, changing it from greater than or equal to 0.5 and less than 0.9 to greater than or equal to 0.5 and less than 0.10. This update enables the use of newer versions of the package while maintaining compatibility with existing systems. The `databricks-labs-lsql` package is used for creating dashboards and managing SQL queries in Databricks. The pull request also includes detailed release notes, a comprehensive changelog, and a list of commits for the updated package. We recommend that all users of this package review the release notes and update to the new version to take advantage of the latest features and improvements.
* Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)). In this pull request, the `databricks-sdk` dependency has been updated from version `~=0.29.0` to `>=0.29,<0.31` to allow for the latest version of the package, which includes new features, bug fixes, internal changes, and other updates. This update is in response to the release of version `0.30.0` of the `databricks-sdk` library, which includes new features such as DataPlane support and partner support. In addition to the updated dependency, there have been changes to several files, including `access.py`, `fixtures.py`, `test_access.py`, and `test_workflows.py`. These changes include updates to method calls, import statements, and test data to reflect the new version of the `databricks-sdk` library. The `pyproject.toml` file has also been updated to reflect the new dependency version. This pull request does not include any other changes.
* Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)). In this pull request, we are updating the `sqlglot` dependency from version `>=25.5.0,<25.12` to `>=25.5.0,<25.13`. This update allows us to use the latest version of the `sqlglot` library, which includes several new features and bug fixes. Specifically, the new version includes support for `TryCast` generation and improvements to the `clickhouse` dialect. It is important to note that the previous version had a breaking change related to treating `DATABASE` as `SCHEMA` in `exp.Create`. Therefore, it is crucial to thoroughly test the changes before merging, as breaking changes may affect existing functionality.
* Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)). In this pull request, we have updated the required version range of the `sqlglot` package from `>=25.5.0,<25.13` to `>=25.5.0,<25.15`. This change allows us to install the latest version of the package, which includes several bug fixes and new features. These include improved transpilation of nullable/non-nullable data types and support for TryCast generation in ClickHouse. The changelog for `sqlglot` provides a detailed list of changes in each release, and a list of commits made in the latest release is also included in the pull request. This update will improve the functionality and reliability of our software, as we will now be able to take advantage of the latest features and fixes provided by `sqlglot`.
* Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)). In this release, we have updated the requirement range for the `sqlglot` dependency to `>=25.5.0,<25.17` from `<25.15,>=25.5.0`. This change resolves issues [#2452](#2452) and [#2451](#2451) and includes several bug fixes and new features in the `sqlglot` library version 25.16.1. The updated version includes support for timezone in exp.TimeStrToTime, transpiling from_iso8601_timestamp from presto/trino to duckdb, and mapping %e to %-d in BigQuery. Additionally, there are changes to the parser and optimizer, as well as other bug fixes and refactors. This update does not introduce any major breaking changes and should not affect the functionality of the project. The `sqlglot` library is used for parsing, analyzing, and rewriting SQL queries, and the new version range provides improved functionality and reliability.
* Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)). This pull request updates the sqlglot library requirement to version 25.5.0 or greater, but less than 25.18. By doing so, it enables the use of the latest version of sqlglot, while still maintaining compatibility with the current implementation. The changelog and commits for each release from v25.16.1 to v25.17.0 are provided for reference, detailing bug fixes, new features, and breaking changes. As a software engineer, it's important to review this pull request and ensure it aligns with the project's requirements before merging, to take advantage of the latest improvements and fixes in sqlglot.
* Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)). In this release, we have updated the required version of the `sqlglot` package in our project's dependencies. Previously, we required a version greater than or equal to 25.5.0 and less than 25.18, which has now been updated to require a version greater than or equal to 25.5.0 and less than 25.19. This change was made automatically by Dependabot, a service that helps to keep dependencies up to date, in order to permit the latest version of the `sqlglot` package. The pull request contains a detailed list of the changes made in the `sqlglot` package between versions 25.5.0 and 25.18.0, as well as a list of the commits that were made during this time. These details can be helpful for understanding the potential impact of the update on the project.
* [chore] make `GRANT` migration logic isolated to `MigrateGrants` component ([#2492](#2492)). In this release, the grant migration logic has been isolated to a separate `MigrateGrants` component, enhancing code modularity and maintainability. This new component, along with the `ACLMigrator`, is now responsible for handling grants and Access Control Lists (ACLs) migration. The `MigrateGrants` class takes grant loaders as input, applies grants to a Unity Catalog (UC) table based on a given source table, and is utilized in the `acl_migrator` method. The `ACLMigrator` class manages ACL migration for the migrated tables, taking instances of necessary classes as arguments and setting ACLs for the migrated tables based on the migration status. These changes bring better separation of concerns, making the code easier to understand, test, and maintain.

Dependency updates:
 * Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)).
 * Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)).
 * Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)).
 * Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)).
 * Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)).
 * Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)).
 * Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)).
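To make the [#2484](#2484) entry concrete, here is a hedged sketch of the two checks it describes. The `-ml-` marker and the failure message are assumptions based on Databricks runtime version strings such as `13.3.x-ml-scala2.12`; the actual `is_mlr` and `_check_cluster_failures` code may differ:

```python
def is_mlr(spark_version: str) -> bool:
    # Assumption: ML Runtime versions carry an "-ml-" marker,
    # e.g. "13.3.x-ml-scala2.12" or "13.3.x-gpu-ml-scala2.12".
    return "-ml-" in spark_version


def check_no_isolation_mlr(data_security_mode: str, spark_version: str) -> list:
    """Flag a No Isolation shared cluster that runs the ML Runtime."""
    failures = []
    # "NONE" corresponds to the no-isolation (shared) data security mode.
    if data_security_mode == "NONE" and is_mlr(spark_version):
        failures.append("Unsupported cluster type: No isolation shared cluster with MLR")
    return failures


assert is_mlr("13.3.x-ml-scala2.12") and not is_mlr("13.3.x-scala2.12")
assert check_no_isolation_mlr("NONE", "13.3.x-gpu-ml-scala2.12")
assert not check_no_isolation_mlr("SINGLE_USER", "13.3.x-ml-scala2.12")
```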
* Added a check for No isolation shared clusters and MLR ([#2484](#2484)). This commit introduces a check for `No isolation shared clusters` utilizing MLR as part of the assessment workflow and cluster crawler, addressing issue [#846](#846). A new function, `is_mlr`, has been implemented to determine if the Spark version corresponds to an MLR cluster. If the cluster has no isolation and uses MLR, the assessment failure list is appended with an appropriate error message. Thorough testing, including unit tests and manual verification, has been conducted. However, user documentation and new CLI commands, workflows, tables, or unit/integration tests have not been added. Additionally, a new test has been added to verify the behavior of MLR clusters without isolation, enhancing the assessment workflow's accuracy in identifying unsupported configurations. * Added a section in migration dashboard to list the failed tables, etc ([#2406](#2406)). In this release, we have introduced a new logging message format for failed table migrations in the `TableMigrate` class, specifically impacting the `_migrate_external_table`, `_migrate_external_table_hiveserde_in_place`, `_migrate_dbfs_root_table`, `_migrate_table_create_ctas`, `_migrate_table_in_mount`, and `_migrate_acl` methods within the `table_migrate.py` file. This update employs the `failed-to-migrate` prefix in log messages for improved failure reason identification during table migrations, enhancing debugging capabilities. As part of this release, we have also developed a new SQL file, `05_1_failed_table_migration.sql`, which retrieves a list of failed table migrations by extracting messages with the 'failed-to-migrate:' prefix from the inventory.logs table and returning the corresponding message text. While this release does not include new methods or user documentation, it resolves issue [#1754](#1754) and has been manually tested with positive results in the staging environment, demonstrating its functionality. * Added clean up activities when `migrate-credentials` cmd fails intermittently ([#2479](#2479)). This pull request enhances the robustness of the `migrate-credentials` command for Azure in the event of intermittent failures during the creation of access connectors and storage credentials. It introduces new methods, `delete_storage_credential` and `delete_access_connectors`, which are responsible for removing incomplete resources when errors occur. The `_migrate_service_principals` and `_create_storage_credentials_for_storage_accounts` methods now handle `PermissionDenied`, `NotFound`, and `BadRequest` exceptions, deleting created storage credentials and access connectors if exceptions occur. Additionally, error messages have been updated to guide users in resolving issues before attempting the operation again. The PR also modifies the `sp_migration` fixture in the `tests/unit/azure/test_credentials.py` file, simplifying the deletion process for access connectors and improving the testing of the `ServicePrincipalMigration` class. These changes address issue [#2362](#2362), ensuring clean-up activities in case of intermittent failures and improving the overall reliability of the system. * Added standalone migrate ACLs ([#2284](#2284)). A new `migrate-acls` command has been introduced to facilitate the migration of Access Control Lists (ACLs) from a legacy metastore to a Unity Catalog (UC) metastore. 
The command, designed to work with HMS federation and other table migration scenarios, can be executed with optional flags `target-catalog` and `hms-fed` to specify the target catalog and migrate HMS-FED ACLs, respectively. The release also includes modifications to the `labs.yml` file, adding the new command and its details to the `commands` section. In addition, a new `ACLMigrator` class has been added to the `databricks.labs.ucx.contexts.application` module to handle ACL migration for tables in a standalone manner. A new test file, `test_migrate_acls.py`, contains unit tests for ACL migration in a Hive metastore, covering various scenarios and ensuring proper query generation. These features streamline and improve the functionality of ACL migration, offering better access control management for users. * Appends metastore_id or location_name to roles for uniqueness ([#2471](#2471)). A new method, `_generate_role_name`, has been added to the `Access` class in the `aws/access.py` file of the `databricks/labs/ucx` module to generate unique names for AWS roles using a consistent naming convention. The `list_uc_roles` method has been updated to utilize this new method for creating role names. In response to issue [#2336](#2336), the `create_missing_principals` change enforces role uniqueness on AWS by modifying the `ExternalLocation` table to include `metastore_id` or `location_name` for uniqueness. To ensure proper cleanup, the `create_uber_principal` method has been updated to delete the instance profile if creating the cluster policy fails due to a `PermissionError`. Unit tests have been added to verify these changes, including tests for the new role name generation method and the updated `ExternalLocation` table. The `MetastoreAssignment` class is also imported in this diff, although its usage is not immediately clear. These changes aim to improve the creation of unique AWS roles for Databricks Labs UCX and enforce role uniqueness on AWS. * Cache workspace content ([#2497](#2497)). In this release, we have implemented a caching mechanism for workspace content to improve load times and bypass rate limits. The `WorkspaceCache` class handles caching of workspace content, with the `_CachedIO` and `_PathLruCache` classes managing IO operation caching and LRU caching, respectively. The `_CachedPath` class, a subclass of `WorkspacePath`, handles caching of workspace paths. The `open` and `unlink` methods of `_CachedPath` have been overridden to cache results and remove corresponding cache entries. The `guess_encoding` function is used to determine the encoding of downloaded content. Unit tests have been added to ensure the proper functioning of the caching mechanism, including tests for cache reuse, invalidation, and encoding determination. This feature aims to enhance the performance of file operations, making the overall system more efficient for users. * Changes the security mode for assessment cluster ([#2472](#2472)). In this release, the security mode of the `main` cluster assessment has been updated from LEGACY_SINGLE_USER to LEGACY_SINGLE_USER_STANDARD in the workflows.py file. This change disables passthrough and addresses issue [#1717](#1717). The new data security mode is defined in the compute.ClusterSpec object for the `main` job cluster by modifying the data_security_mode attribute. While no new methods have been introduced, existing functionality related to the cluster's security mode has been modified. 
Software engineers adopting this project should be aware of the security implications of this change, ensuring the appropriate data protection measures are in place. Manual testing has been conducted to verify the functionality of this update. * Do not normalize cases when reformatting SQL queries in CI check ([#2495](#2495)). In this release, the CI workflow for pushing changes to the repository has been updated to improve the behavior of the SQL query reformatting step. Previously, case normalization of SQL queries was causing issues with case-sensitive columns, resulting in blocked CI checks. This release addresses the issue by adding the `--normalize-case false` flag to the `databricks labs lsql fmt` command, which disables case normalization. This modification allows the CI workflow to pass and ensures correct SQL query formatting, regardless of case sensitivity. The change impacts the assessment/interactive directory, specifically a cluster summary query for interactive assessments. This query involves a change in the ORDER BY clause, replacing a normalized case with the original case. Despite these changes, no new methods have been added, and existing functionality has been modified solely to improve CI efficiency and SQL query compatibility. * Drop source table after successful table move not before ([#2430](#2430)). In this release, we have addressed an issue where the source table was being dropped before a new table was created, which could cause the creation process to fail and leave the source table unavailable. This problem has been resolved by modifying the `_recreate_table` method of the `TableMove` class in the `hive_metastore` package to drop the source table after the new table creation. The updated implementation ensures that the source table remains intact during the creation process, even in case of any issues. This change comes with integration tests and does not involve any modifications to user documentation, CLI commands, workflows, tables, or existing functionality. Additionally, a new test function `test_move_tables_table_properties_mismatch_preserves_original` has been added to `test_table_move.py`, which checks if the original table is preserved when there is a mismatch in table properties during the move operation. The changes also include adding the `pytest` library and the `BadRequest` exception from the `databricks.sdk.errors` package for the new test function. The imports section has been updated accordingly with the removal of `databricks.sdk.errors.NotFound` and the addition of `pytest` and `databricks.sdk.errors.BadRequest`. * Enabled `principal-prefix-access` command to run as collection ([#2450](#2450)). This commit introduces several improvements to the `principal-prefix-access` command in our open-source library. A new flag `run-as-collection` has been added, allowing the command to run as a collection across multiple AWS accounts. A new `get_workspace_context` function has also been implemented, which encapsulates common functionalities and enhances code reusability. Additionally, the `get_workspace_contexts` method has been developed to retrieve a list of `WorkspaceContext` objects, making the command more efficient when handling collections of workspaces. Furthermore, the `install_on_account` method has been updated to use the new `get_workspace_contexts` method. The `principal-prefix-access` command has been enhanced to accept an optional `acc_client` argument, which is used to retrieve information about the assessment run. 
These changes improve the organization of the codebase, making it more efficient, flexible, and easier to maintain for users working with multiple AWS accounts and workspaces.
* Fixed Driver OOM error by increasing the min memory requirement for node from 16 GB to 32 GB ([#2473](#2473)). A modification has been implemented in the `policy.py` file located in the `databricks/labs/ucx/installer` directory, raising the minimum memory requirement for the node type from 16 GB to 32 GB. This adjustment is intended to prevent driver out-of-memory (OOM) errors during assessments. The `_definition` function in the policy class has been updated to incorporate the new memory requirement, which is used when selecting a suitable node type; the rest of the code remains unchanged. This modification addresses issue [#2398](#2398). While the code has been tested, specific testing details are not provided in the commit message.
* Fixed issue when running create-missing-credential cmd tries to create the role again if already created ([#2456](#2456)). In this release, we have fixed an issue in the `_identify_missing_paths` function within the `access.py` file of the `databricks/labs/ucx/aws` directory, where the `create-missing-credential` command attempted to create a role even when it had already been created. The cause was a path-comparison mismatch in the `match` function, which has been replaced with `startswith`; the code now checks whether a path starts with the resource path, resolving issue [#2413](#2413). The function loads the UC-compatible roles and iterates through the external locations: a location whose path starts with a resource path of any UC-compatible role is considered covered, while a location that matches no role is added to the `missing_paths` set (a prefix-matching sketch follows the dependency list below). The diff also adds a conditional check that returns an empty list when the `missing_paths` set is empty. Unit and integration tests have been added or updated to verify the new behavior, although there is no mention of manual testing or verification on a staging environment. Overall, this update fixes a specific issue with the `create-missing-credential` command and includes updated tests to ensure proper functionality.
* Fixed issue with Interactive Dashboard not showing output ([#2476](#2476)). In this release, we have resolved an issue with the Interactive Dashboard not displaying output by fixing a bug in the query used for the dashboard. Previously, the query joined on "request_params.clusterid" and selected "request_params.clusterid" in the SELECT clause, but the correct field name is "request_params.clusterId". The query has been updated to use "request_params.clusterId" in both the JOIN and SELECT clauses. These changes ensure that the Interactive Dashboard displays the correct output, improving the overall functionality and usability of the product. No new methods were added, and existing functionality was changed only within the scope of the Interactive Dashboard query.
Manual testing is recommended to ensure that the output is now displayed correctly. Additionally, the `test_installation.py` integration test file has been changed to update the `min_memory_gb` argument from 16 GB to 32 GB in the `test_job_cluster_policy` function, in line with the increased minimum memory requirement.
* Fixed support for table/schema scope for the revert table cli command ([#2428](#2428)). In this release, we have enhanced the `revert table` CLI command to support table and schema scopes. The `revert_migrated_tables` function now accepts optional parameters `schema` and `table` of type `str | None`, which were previously required, and the `print_revert_report` function of the `tables_migrator` object within `WorkspaceContext` has been updated to accept the same optional parameters. `revert_migrated_tables` passes these optional parameters when calling the `revert_migrated_tables` method of `tables_migrator` within `ctx`. Additionally, a new dictionary called `reverse_seen` has been introduced, and the `_get_tables_to_revert` and `print_revert_report` functions now use it, providing more fine-grained control when reverting table migrations. The `delete_managed` parameter determines whether managed tables should be deleted. These changes allow users to revert a specific schema or table rather than all migrated tables in a workspace.
* Refactor view sequencing and return sequenced views if recursion is found ([#2499](#2499)). In this refactoring, the view sequencing for table migration has been improved and now returns the views sequenced so far when recursion is found, addressing issue [#2494](#2494); a sequencing sketch appears in the Changes section below.
* Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)). In this release, we have updated the version requirements for the `databricks-labs-lsql` package, changing them from greater than or equal to 0.5 and less than 0.9 to greater than or equal to 0.5 and less than 0.10. This update enables the use of newer versions of the package while maintaining compatibility with existing systems. The `databricks-labs-lsql` package is used for creating dashboards and managing SQL queries in Databricks. The pull request also includes detailed release notes, a comprehensive changelog, and a list of commits for the updated package. We recommend that all users of this package review the release notes and update to the new version to take advantage of the latest features and improvements.
* Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)). In this pull request, the `databricks-sdk` dependency has been updated from version `~=0.29.0` to `>=0.29,<0.31` to allow the latest version of the package. This update responds to the release of version `0.30.0` of the `databricks-sdk` library, which includes new features such as DataPlane support and partner support, along with bug fixes and internal changes. In addition to the updated dependency, several files have been changed, including `access.py`, `fixtures.py`, `test_access.py`, and `test_workflows.py`; the changes cover method calls, import statements, and test data to reflect the new version of the library. The `pyproject.toml` file has also been updated to the new dependency version. This pull request does not include any other changes.
* Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)). In this pull request, we are updating the `sqlglot` dependency from version `>=25.5.0,<25.12` to `>=25.5.0,<25.13`. This update allows us to use the latest version of the `sqlglot` library, which includes several new features and bug fixes; specifically, the new version adds support for `TryCast` generation and improvements to the `clickhouse` dialect. Note that this version range includes a breaking change related to treating `DATABASE` as `SCHEMA` in `exp.Create`, so it is crucial to thoroughly test the changes before merging, as breaking changes may affect existing functionality.
* Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)). In this pull request, we have updated the required version range of the `sqlglot` package from `>=25.5.0,<25.13` to `>=25.5.0,<25.15`. This change allows us to install the latest version of the package, which includes several bug fixes and new features, such as improved transpilation of nullable/non-nullable data types and support for TryCast generation in ClickHouse. The changelog for `sqlglot` provides a detailed list of changes in each release, and a list of commits made in the latest release is also included in the pull request. This update improves the functionality and reliability of our software by letting us take advantage of the latest features and fixes provided by `sqlglot`.
* Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)). In this release, we have updated the requirement range for the `sqlglot` dependency from `>=25.5.0,<25.15` to `>=25.5.0,<25.17`. This change resolves issues [#2452](#2452) and [#2451](#2451) and brings in several bug fixes and new features from `sqlglot` version 25.16.1, including support for timezones in `exp.TimeStrToTime`, transpiling `from_iso8601_timestamp` from Presto/Trino to DuckDB, and mapping `%e` to `%-d` in BigQuery. There are also changes to the parser and optimizer, as well as other bug fixes and refactors. This update does not introduce any major breaking changes and should not affect the functionality of the project. The `sqlglot` library is used for parsing, analyzing, and rewriting SQL queries, and the new version range provides improved functionality and reliability.
* Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)). This pull request updates the `sqlglot` library requirement to version 25.5.0 or greater, but less than 25.18, enabling the use of the latest version of `sqlglot` while maintaining compatibility with the current implementation. The changelog and commits for each release up to v25.17.0 are provided for reference, detailing bug fixes, new features, and breaking changes. It is important to review this pull request and ensure it aligns with the project's requirements before merging, in order to take advantage of the latest improvements and fixes in `sqlglot`.
* Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)). In this release, we have updated the required version of the `sqlglot` package in our project's dependencies: previously we required a version greater than or equal to 25.5.0 and less than 25.18, and we now require a version greater than or equal to 25.5.0 and less than 25.19.
This change was made automatically by Dependabot, a service that helps keep dependencies up to date, in order to permit the latest version of the `sqlglot` package. The pull request contains a detailed list of the changes made in the `sqlglot` package between versions 25.5.0 and 25.18.0, as well as a list of the commits made during this time; these details can be helpful for understanding the potential impact of the update on the project.
* [chore] make `GRANT` migration logic isolated to `MigrateGrants` component ([#2492](#2492)). In this release, the grant migration logic has been isolated in a separate `MigrateGrants` component, enhancing code modularity and maintainability. This new component, along with the `ACLMigrator`, is now responsible for migrating grants and Access Control Lists (ACLs). The `MigrateGrants` class takes grant loaders as input and applies grants to a Unity Catalog (UC) table based on a given source table; it is used in the `acl_migrator` method (a hedged sketch follows the dependency list below). The `ACLMigrator` class manages ACL migration for the migrated tables, taking instances of the necessary classes as arguments and setting ACLs for the migrated tables based on the migration status. These changes bring better separation of concerns, making the code easier to understand, test, and maintain.

Dependency updates:

* Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)).
* Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)).
* Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)).
* Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)).
* Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)).
* Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)).
* Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)).
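To make the role-naming change in [#2471](#2471) concrete, here is a minimal sketch of how a distinguishing suffix such as `metastore_id` or `location_name` could be appended to keep AWS role names unique. The function name, parameters, fallback value, and 64-character truncation are all illustrative assumptions, not the actual `_generate_role_name` implementation:

```python
# Hypothetical sketch of unique role-name generation (not the actual ucx code).
def generate_role_name(prefix: str, metastore_id: str | None, location_name: str | None) -> str:
    """Append metastore_id (or location_name as a fallback) so that roles
    created for different locations never collide on the same name."""
    suffix = metastore_id or location_name or "default"
    # AWS IAM role names are limited to 64 characters, so truncate defensively.
    return f"{prefix}-{suffix}"[:64]


# Example: two locations under the same prefix yield distinct role names.
print(generate_role_name("UCX-ROLE", None, "landing-zone"))  # UCX-ROLE-landing-zone
print(generate_role_name("UCX-ROLE", "1234-5678-90", None))  # UCX-ROLE-1234-5678-90
```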
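To clarify the caching behavior described in [#2497](#2497), the following is a minimal sketch of an LRU cache keyed by workspace path, including the invalidation that an overridden `unlink` would need. The class and method names are illustrative assumptions, not ucx's `_PathLruCache`:

```python
from collections import OrderedDict


class PathLruCache:
    """Minimal LRU cache keyed by workspace path (illustrative sketch)."""

    def __init__(self, max_entries: int = 128) -> None:
        self._max_entries = max_entries
        self._entries: OrderedDict[str, bytes] = OrderedDict()

    def get(self, path: str) -> bytes | None:
        if path not in self._entries:
            return None
        self._entries.move_to_end(path)  # mark as most recently used
        return self._entries[path]

    def put(self, path: str, data: bytes) -> None:
        self._entries[path] = data
        self._entries.move_to_end(path)
        if len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict the least recently used entry

    def invalidate(self, path: str) -> None:
        # Mirrors the unlink override: deleting a file must drop its cache entry.
        self._entries.pop(path, None)
```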
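The ordering fix from [#2430](#2430) reduces to "create before drop". A schematic sketch under assumed helpers; the `execute_sql` callable and the `DEEP CLONE` copy statement are illustrative stand-ins, not the actual `_recreate_table` body:

```python
from collections.abc import Callable


def move_table(execute_sql: Callable[[str], None], source: str, target: str) -> None:
    """Create the target first; drop the source only after creation succeeds."""
    execute_sql(f"CREATE TABLE {target} DEEP CLONE {source}")  # may raise, e.g. BadRequest
    # This line is reached only on success, so a failure above leaves `source` intact.
    execute_sql(f"DROP TABLE {source}")
```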
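The `startswith` fix in [#2456](#2456) amounts to prefix matching between external locations and role resource paths. A minimal sketch, with assumed input shapes (plain strings rather than ucx's actual objects):

```python
def identify_missing_paths(locations: list[str], resource_paths: list[str]) -> set[str]:
    """Return the external locations covered by no UC-compatible role.

    A location counts as covered when its path starts with one of the role
    resource paths; an exact match would wrongly miss nested locations.
    """
    missing: set[str] = set()
    for location in locations:
        if not any(location.startswith(path) for path in resource_paths):
            missing.add(location)
    return missing


# Example: the nested location is covered by the bucket-level resource path.
print(identify_missing_paths(
    ["s3://bucket/data/sales", "s3://other/logs"],
    ["s3://bucket/data"],
))  # {'s3://other/logs'}
```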
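Finally, for the component isolation in [#2492](#2492), a hedged sketch of the shape such a component could take: grant loaders in, grants applied to the migrated UC table. The constructor signature, the grant tuple format, and the GRANT statement are assumptions for illustration only:

```python
from collections.abc import Callable, Iterable

GrantLoader = Callable[[], Iterable[tuple[str, str]]]  # yields (principal, action) pairs


class MigrateGrantsSketch:
    """Illustrative stand-in for the MigrateGrants component."""

    def __init__(self, grant_loaders: list[GrantLoader], execute_sql: Callable[[str], None]) -> None:
        self._grant_loaders = grant_loaders
        self._execute_sql = execute_sql

    def apply(self, target_table: str) -> None:
        # Each loader supplies legacy grants; each grant is re-expressed on the UC table.
        for loader in self._grant_loaders:
            for principal, action in loader():
                self._execute_sql(f"GRANT {action} ON TABLE {target_table} TO `{principal}`")
```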
Changes
Refactor view sequencing and return sequenced views if recursion is found
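To make the sequencing behavior concrete, here is a minimal sketch under assumed inputs: a mapping from each view to the views it depends on, with names and types chosen for illustration rather than taken from the actual migration-sequencing API. Each pass emits the views whose dependencies are already sequenced; when a pass makes no progress, the remaining views depend on each other, and the views sequenced so far are returned instead of looping forever:

```python
def sequence_views(dependencies: dict[str, set[str]]) -> list[str]:
    """Batch-sequence views; on a dependency cycle, return the partial order."""
    sequenced: list[str] = []
    done: set[str] = set()
    remaining = dict(dependencies)
    while remaining:
        batch = [view for view, deps in remaining.items() if deps <= done]
        if not batch:
            # No view in this pass has all dependencies satisfied: the rest
            # form a cycle, so stop and return what was sequenced so far.
            break
        for view in batch:
            sequenced.append(view)
            done.add(view)
            del remaining[view]
    return sequenced


# Example: a linear chain sequences fully; a cycle returns only the resolvable prefix.
print(sequence_views({"v1": set(), "v2": {"v1"}, "v3": {"v2"}}))  # ['v1', 'v2', 'v3']
print(sequence_views({"a": set(), "b": {"c"}, "c": {"b"}}))       # ['a']
```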
Linked issues
Resolves #2494
Functionality
table-migration
Tests