Create inventory of service principals and direct files access in Azure #249

Closed
birbalgithub opened this issue Sep 20, 2023 · 4 comments · Fixed by #305
Labels: feat/crawler, good first issue, migrate/external, step/assessment

Comments

@birbalgithub

birbalgithub commented Sep 20, 2023

In Azure, data access is authorized using service principals, which can be configured in any of the following settings:
- Cluster conf
- Spark session
- Cluster policy
- DLT settings

Today, there is no automated way of knowing how many service principals are being used in the workspace or where they are being set (cluster, cluster policy, DLT, or notebook). As a result, it is hard for the customer to figure out how many STORAGE CREDENTIALS need to be created as part of the UC upgrade. Furthermore, files or mount points could be accessed directly using service principal credentials, so it is important to list all those files and mount points in order to estimate the EXTERNAL LOCATIONS that need to be created. It is also useful to list the clusters/cluster policies/DLT pipelines/notebooks wherever service principal credentials are being set, to identify their owners/users. This would help the customer map the required group permissions onto the Storage Credentials and External Locations.

We need a feature in the tool to create an inventory of all service principals and of the direct file/mount-point accesses currently in use in the workspace, along with the objects (clusters/cluster policies/DLT pipelines/notebooks) that use them. Direct file/mount-point access can be found by scanning code (notebook/Scala/Java/SQL), and the service principals can be found by scanning the following four settings (two illustrative sketches follow the list):

1. Cluster conf settings

   ```
   spark.hadoop.fs.azure.account.auth.type OAuth
   spark.hadoop.fs.azure.account.oauth.provider.type org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider
   spark.hadoop.fs.azure.account.oauth2.client.id <application-id>
   spark.hadoop.fs.azure.account.oauth2.client.secret {{secrets/<scope>/<key>}}
   spark.hadoop.fs.azure.account.oauth2.client.endpoint https://login.microsoftonline.com/<directory-id>/oauth2/token
   ```

2. Spark session level settings

   ```python
   spark.conf.set("fs.azure.account.auth.type", "OAuth")
   spark.conf.set("fs.azure.account.oauth.provider.type", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
   spark.conf.set("fs.azure.account.oauth2.client.id", "<application-id>")
   spark.conf.set("fs.azure.account.oauth2.client.secret", dbutils.secrets.get(scope="<scope>", key="<key>"))
   spark.conf.set("fs.azure.account.oauth2.client.endpoint", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
   ```

3. DLT settings

   ```json
   "configuration": {
     "spark.hadoop.fs.azure.account.auth.type": "OAuth",
     "spark.hadoop.fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
     "spark.hadoop.fs.azure.account.oauth2.client.id": "<application-id>",
     "spark.hadoop.fs.azure.account.oauth2.client.secret": "{{secrets/<scope>/<key>}}",
     "spark.hadoop.fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"
   }
   ```

4. Cluster policy level settings

   ```json
   {
     "spark_conf.fs.azure.account.auth.type": {
       "type": "fixed",
       "value": "OAuth",
       "hidden": true
     },
     "spark_conf.fs.azure.account.oauth.provider.type": {
       "type": "fixed",
       "value": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
       "hidden": true
     },
     "spark_conf.fs.azure.account.oauth2.client.id": {
       "type": "fixed",
       "value": "<application-id>",
       "hidden": true
     },
     "spark_conf.fs.azure.account.oauth2.client.secret": {
       "type": "fixed",
       "value": "{{secrets/<scope>/<key>}}",
       "hidden": true
     },
     "spark_conf.fs.azure.account.oauth2.client.endpoint": {
       "type": "fixed",
       "value": "https://login.microsoftonline.com/<directory-id>/oauth2/token",
       "hidden": true
     }
   }
   ```
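
As a rough illustration of the crawler half (points 1, 3, and 4), the sketch below walks cluster confs, DLT pipeline configurations, and cluster policy definitions via the `databricks-sdk` `WorkspaceClient`, collecting any `fs.azure.account.oauth2.client.id` values it finds. This is a minimal sketch, not the UCX implementation; the helper names and the inventory tuple shape are illustrative.

```python
import json
import re

from databricks.sdk import WorkspaceClient

# Any conf key carrying this marker identifies an Azure service principal,
# regardless of spark.hadoop. / spark_conf. prefixes or per-account suffixes.
SP_CLIENT_ID = re.compile(r"fs\.azure\.account\.oauth2\.client\.id")


def sp_client_ids(conf: dict) -> set:
    """Collect client-id values from a flat spark-conf style dict."""
    return {v for k, v in conf.items() if SP_CLIENT_ID.search(k) and v}


def crawl(ws: WorkspaceClient) -> list:
    inventory = []  # (object_type, object_id, service_principal_client_id)
    for cluster in ws.clusters.list():  # 1. cluster conf
        for client_id in sp_client_ids(cluster.spark_conf or {}):
            inventory.append(("cluster", cluster.cluster_id, client_id))
    for pipeline in ws.pipelines.list_pipelines():  # 3. DLT settings
        spec = ws.pipelines.get(pipeline.pipeline_id).spec
        conf = (spec.configuration if spec else None) or {}
        for client_id in sp_client_ids(conf):
            inventory.append(("pipeline", pipeline.pipeline_id, client_id))
    for policy in ws.cluster_policies.list():  # 4. cluster policy
        definition = json.loads(policy.definition or "{}")
        conf = {k: v.get("value", "") for k, v in definition.items()
                if isinstance(v, dict)}
        for client_id in sp_client_ids(conf):
            inventory.append(("cluster-policy", policy.policy_id, client_id))
    return inventory


if __name__ == "__main__":
    for row in crawl(WorkspaceClient()):
        print(*row, sep="\t")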
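
For the code-scanning half (setting 2 plus direct file/mount-point access), a first pass could simply grep exported notebook sources for session-level `spark.conf.set` calls and for `abfss://`/`/mnt/` paths. This is a hedged sketch only, assuming notebooks are exported in `SOURCE` format via the SDK; real code analysis needs a proper parser, and as the comments below note, this half was split into its own issue (#310).

```python
import base64
import re

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.workspace import ExportFormat

# Session-level SP settings set inside notebook code (setting 2 above).
SESSION_SP = re.compile(r"""spark\.conf\.set\(\s*["']fs\.azure\.account""")
# Direct file or mount-point access in code.
DIRECT_PATH = re.compile(r"""(abfss?://[^\s"')]+|(?:dbfs:)?/mnt/[^\s"')]+)""")


def scan_notebook(ws: WorkspaceClient, path: str) -> dict:
    """Grep one notebook's source for SP settings and direct data paths."""
    exported = ws.workspace.export(path, format=ExportFormat.SOURCE)
    source = base64.b64decode(exported.content).decode("utf-8", errors="replace")
    return {
        "path": path,
        "sets_sp_credentials": bool(SESSION_SP.search(source)),
        "direct_paths": sorted(set(DIRECT_PATH.findall(source))),
    }
```

Rows from both sketches could then be joined with the owners/users of the clusters, policies, pipelines, and notebooks to map permissions onto the planned Storage Credentials and External Locations.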

@zpappa

zpappa commented Sep 28, 2023

@birbalgithub @dipankarkush-db - don't spend time on code analysis; I have artifacts we are importing that already do this. Focus on the inventory based on the definitions in cluster conf/DLT.

@dipankarkush-db
Contributor

Awesome @zpappa. I was about to comment that we should split the code analysis (point 2 of the issue) into a separate issue. I have done points 1 and 4 and am now moving to 3. Thanks!

@nfx
Collaborator

nfx commented Sep 28, 2023

@zpappa can you create a separate issue for case 2 and link it here in a comment? Keep that issue focused on that specific scenario; I know you are working on a component for it, so when you open the PR, make it only add this functionality.

@dipankarkush-db
Contributor

Created a new issue for point number 2 - #310.

dipankarkush-db added 25 commits that referenced this issue Oct 5–8, 2023
@nfx nfx mentioned this issue Aug 30, 2024
nfx added a commit that referenced this issue Aug 30, 2024
* Added a check for No isolation shared clusters and MLR ([#2484](#2484)). This commit introduces a check for `No isolation shared clusters` utilizing MLR as part of the assessment workflow and cluster crawler, addressing issue [#846](#846). A new function, `is_mlr`, has been implemented to determine if the Spark version corresponds to an MLR cluster. If the cluster has no isolation and uses MLR, the assessment failure list is appended with an appropriate error message. A new unit test verifies the behavior of MLR clusters without isolation, enhancing the assessment workflow's accuracy in identifying unsupported configurations, and the change has been verified manually; no user documentation or new CLI commands, workflows, or tables were added.
* Added a section in migration dashboard to list the failed tables, etc ([#2406](#2406)). In this release, we have introduced a new logging message format for failed table migrations in the `TableMigrate` class, specifically impacting the `_migrate_external_table`, `_migrate_external_table_hiveserde_in_place`, `_migrate_dbfs_root_table`, `_migrate_table_create_ctas`, `_migrate_table_in_mount`, and `_migrate_acl` methods within the `table_migrate.py` file. This update employs the `failed-to-migrate` prefix in log messages for improved failure reason identification during table migrations, enhancing debugging capabilities. As part of this release, we have also developed a new SQL file, `05_1_failed_table_migration.sql`, which retrieves a list of failed table migrations by extracting messages with the 'failed-to-migrate:' prefix from the inventory.logs table and returning the corresponding message text. While this release does not include new methods or user documentation, it resolves issue [#1754](#1754) and has been manually tested with positive results in the staging environment, demonstrating its functionality.
* Added clean up activities when `migrate-credentials` cmd fails intermittently ([#2479](#2479)). This pull request enhances the robustness of the `migrate-credentials` command for Azure in the event of intermittent failures during the creation of access connectors and storage credentials. It introduces new methods, `delete_storage_credential` and `delete_access_connectors`, which are responsible for removing incomplete resources when errors occur. The `_migrate_service_principals` and `_create_storage_credentials_for_storage_accounts` methods now handle `PermissionDenied`, `NotFound`, and `BadRequest` exceptions, deleting created storage credentials and access connectors if exceptions occur. Additionally, error messages have been updated to guide users in resolving issues before attempting the operation again. The PR also modifies the `sp_migration` fixture in the `tests/unit/azure/test_credentials.py` file, simplifying the deletion process for access connectors and improving the testing of the `ServicePrincipalMigration` class. These changes address issue [#2362](#2362), ensuring clean-up activities in case of intermittent failures and improving the overall reliability of the system.
* Added standalone migrate ACLs ([#2284](#2284)). A new `migrate-acls` command has been introduced to facilitate the migration of Access Control Lists (ACLs) from a legacy metastore to a Unity Catalog (UC) metastore. The command, designed to work with HMS federation and other table migration scenarios, can be executed with optional flags `target-catalog` and `hms-fed` to specify the target catalog and migrate HMS-FED ACLs, respectively. The release also includes modifications to the `labs.yml` file, adding the new command and its details to the `commands` section. In addition, a new `ACLMigrator` class has been added to the `databricks.labs.ucx.contexts.application` module to handle ACL migration for tables in a standalone manner. A new test file, `test_migrate_acls.py`, contains unit tests for ACL migration in a Hive metastore, covering various scenarios and ensuring proper query generation. These features streamline and improve the functionality of ACL migration, offering better access control management for users.
* Appends metastore_id or location_name to roles for uniqueness ([#2471](#2471)). A new method, `_generate_role_name`, has been added to the `Access` class in the `aws/access.py` file of the `databricks/labs/ucx` module to generate unique names for AWS roles using a consistent naming convention. The `list_uc_roles` method has been updated to utilize this new method for creating role names. In response to issue [#2336](#2336), the `create_missing_principals` change enforces role uniqueness on AWS by modifying the `ExternalLocation` table to include `metastore_id` or `location_name` for uniqueness. To ensure proper cleanup, the `create_uber_principal` method has been updated to delete the instance profile if creating the cluster policy fails due to a `PermissionError`. Unit tests have been added to verify these changes, including tests for the new role name generation method and the updated `ExternalLocation` table. The `MetastoreAssignment` class is also imported in this diff, although its usage is not immediately clear. These changes aim to improve the creation of unique AWS roles for Databricks Labs UCX and enforce role uniqueness on AWS.
* Cache workspace content ([#2497](#2497)). In this release, we have implemented a caching mechanism for workspace content to improve load times and bypass rate limits. The `WorkspaceCache` class handles caching of workspace content, with the `_CachedIO` and `_PathLruCache` classes managing IO operation caching and LRU caching, respectively. The `_CachedPath` class, a subclass of `WorkspacePath`, handles caching of workspace paths. The `open` and `unlink` methods of `_CachedPath` have been overridden to cache results and remove corresponding cache entries. The `guess_encoding` function is used to determine the encoding of downloaded content. Unit tests have been added to ensure the proper functioning of the caching mechanism, including tests for cache reuse, invalidation, and encoding determination. This feature aims to enhance the performance of file operations, making the overall system more efficient for users.
* Changes the security mode for assessment cluster ([#2472](#2472)). In this release, the security mode of the `main` cluster assessment has been updated from LEGACY_SINGLE_USER to LEGACY_SINGLE_USER_STANDARD in the workflows.py file. This change disables passthrough and addresses issue [#1717](#1717). The new data security mode is defined in the compute.ClusterSpec object for the `main` job cluster by modifying the data_security_mode attribute. While no new methods have been introduced, existing functionality related to the cluster's security mode has been modified. Software engineers adopting this project should be aware of the security implications of this change, ensuring the appropriate data protection measures are in place. Manual testing has been conducted to verify the functionality of this update.
* Do not normalize cases when reformatting SQL queries in CI check ([#2495](#2495)). In this release, the CI workflow for pushing changes to the repository has been updated to improve the behavior of the SQL query reformatting step. Previously, case normalization of SQL queries was causing issues with case-sensitive columns, resulting in blocked CI checks. This release addresses the issue by adding the `--normalize-case false` flag to the `databricks labs lsql fmt` command, which disables case normalization. This modification allows the CI workflow to pass and ensures correct SQL query formatting, regardless of case sensitivity. The change impacts the assessment/interactive directory, specifically a cluster summary query for interactive assessments. This query involves a change in the ORDER BY clause, replacing a normalized case with the original case. Despite these changes, no new methods have been added, and existing functionality has been modified solely to improve CI efficiency and SQL query compatibility.
* Drop source table after successful table move not before ([#2430](#2430)). In this release, we have addressed an issue where the source table was being dropped before a new table was created, which could cause the creation process to fail and leave the source table unavailable. This problem has been resolved by modifying the `_recreate_table` method of the `TableMove` class in the `hive_metastore` package to drop the source table after the new table creation. The updated implementation ensures that the source table remains intact during the creation process, even in case of any issues. This change comes with integration tests and does not involve any modifications to user documentation, CLI commands, workflows, tables, or existing functionality. Additionally, a new test function `test_move_tables_table_properties_mismatch_preserves_original` has been added to `test_table_move.py`, which checks if the original table is preserved when there is a mismatch in table properties during the move operation. The changes also include adding the `pytest` library and the `BadRequest` exception from the `databricks.sdk.errors` package for the new test function. The imports section has been updated accordingly with the removal of `databricks.sdk.errors.NotFound` and the addition of `pytest` and `databricks.sdk.errors.BadRequest`.
* Enabled `principal-prefix-access` command to run as collection ([#2450](#2450)). This commit introduces several improvements to the `principal-prefix-access` command in our open-source library. A new flag `run-as-collection` has been added, allowing the command to run as a collection across multiple AWS accounts. A new `get_workspace_context` function has also been implemented, which encapsulates common functionalities and enhances code reusability. Additionally, the `get_workspace_contexts` method has been developed to retrieve a list of `WorkspaceContext` objects, making the command more efficient when handling collections of workspaces. Furthermore, the `install_on_account` method has been updated to use the new `get_workspace_contexts` method. The `principal-prefix-access` command has been enhanced to accept an optional `acc_client` argument, which is used to retrieve information about the assessment run. These changes improve the functionality and organization of the codebase, making it more efficient, flexible, and easier to maintain for users working with multiple AWS accounts and workspaces.
* Fixed Driver OOM error by increasing the min memory requirement for node from 16GB to 32 GB ([#2473](#2473)). A modification has been implemented in the `policy.py` file located in the `databricks/labs/ucx/installer` directory, which enhances the minimum memory requirement for the node type from 16GB to 32GB. This adjustment is intended to prevent driver out-of-memory (OOM) errors during assessments. The `_definition` function in the `policy` class has been updated to incorporate the new memory requirement, which will be employed for selecting a suitable node type. The rest of the code remains unchanged. This modification addresses issue [#2398](#2398). While the code has been tested, specific testing details are not provided in the commit message.
* Fixed issue when running create-missing-credential cmd tries to create the role again if already created ([#2456](#2456)). In this release, we have implemented a fix to address an issue in the `_identify_missing_paths` function within the `access.py` file of the `databricks/labs/ucx/aws` directory, where the `create-missing-credential` command was attempting to create a role again even if it had already been created. This issue was due to a mismatch in path comparison using the `match` function, which has now been updated to use the `startswith` function instead. This change ensures that the code checks if the path starts with the resource path, thereby resolving issue [#2413](#2413). The `_identify_missing_paths` function identifies missing paths by loading UC compatible roles and iterating through each external location. If a location matches any of the resource paths of the UC compatible roles, the `matching_role` variable is set to True, and the code continues to the next role. If the location does not match any of the resource paths, the `matching_role` variable is set to False. If a match is found, the code continues to the next external location. If no match is found for any of the UC compatible roles, then the location is added to the `missing_paths` set. The diff also includes a conditional check to return an empty list if the `missing_paths` set is empty. Additionally, tests have been added or modified to ensure the proper functioning of the updated code, including unit tests and integration tests. However, there is no mention of manual testing or verification on a staging environment. Overall, this update fixes a specific issue with the `create-missing-credential` command and includes updated tests to ensure proper functionality.
* Fixed issue with Interactive Dashboard not showing output ([#2476](#2476)). In this release, we have resolved an issue with the Interactive Dashboard not displaying output by fixing a bug in the query used for the dashboard. Previously, the query was joining on "request_params.clusterid" and selecting "request_params.clusterid" in the SELECT clause, but the correct field name is "request_params.clusterId". The query has been updated to use "request_params.clusterId" instead, both in the JOIN and SELECT clauses. These changes ensure that the Interactive Dashboard displays the correct output, improving the overall functionality and usability of the product. No new methods were added, and existing functionality was changed within the scope of the Interactive Dashboard query. Manual testing is recommended to ensure that the output is now displayed correctly. Additionally, a change has been made to the 'test_installation.py' integration test file to improve the performance of clusters by updating the `min_memory_gb` argument from 16 GB to 32 GB in the `test_job_cluster_policy` function.
* Fixed support for table/schema scope for the revert table cli command ([#2428](#2428)). In this release, we have enhanced the `revert table` CLI command to support table and schema scopes in the open-source library. The `revert_migrated_tables` function now accepts optional parameters `schema` and `table` of types str or None, which were previously required parameters. Similarly, the `print_revert_report` function in the `tables_migrator` object within `WorkspaceContext` has been updated to accept the same optional parameters. The `revert_migrated_tables` function now uses these optional parameters when calling the `revert_migrated_tables` method of `tables_migrator` within 'ctx'. Additionally, we have introduced a new dictionary called `reverse_seen` and modified the `_get_tables_to_revert` and `print_revert_report` functions to utilize this dictionary, providing more fine-grained control when reverting table migrations. The `delete_managed` parameter is used to determine if managed tables should be deleted. These changes allow users to specify a specific schema and table to revert, rather than reverting all migrated tables within a workspace.
* Refactor view sequencing and return sequenced views if recursion is found ([#2499](#2499)). In this refactored code, the view sequencing for table migration has been improved and now returns sequenced views if recursion is found, addressing issue [#249](#249).
* Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)). In this release, we have updated the version requirements for the `databricks-labs-lsql` package, changing it from greater than 0.5 and less than 0.9 to greater than 0.5 and less than 0.10. This update enables the use of newer versions of the package while maintaining compatibility with existing systems. The `databricks-labs-lsql` package is used for creating dashboards and managing SQL queries in Databricks. The pull request also includes detailed release notes, a comprehensive changelog, and a list of commits for the updated package. We recommend that all users of this package review the release notes and update to the new version to take advantage of the latest features and improvements.
* Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)). In this pull request, the `databricks-sdk` dependency has been updated from version `~=0.29.0` to `>=0.29,<0.31` to allow for the latest version of the package, which includes new features, bug fixes, internal changes, and other updates. This update is in response to the release of version `0.30.0` of the `databricks-sdk` library, which includes new features such as DataPlane support and partner support. In addition to the updated dependency, there have been changes to several files, including `access.py`, `fixtures.py`, `test_access.py`, and `test_workflows.py`. These changes include updates to method calls, import statements, and test data to reflect the new version of the `databricks-sdk` library. The `pyproject.toml` file has also been updated to reflect the new dependency version. This pull request does not include any other changes.
* Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)). In this pull request, we are updating the `sqlglot` dependency from version `>=25.5.0,<25.12` to `>=25.5.0,<25.13`. This update allows us to use the latest version of the `sqlglot` library, which includes several new features and bug fixes. Specifically, the new version includes support for `TryCast` generation and improvements to the `clickhouse` dialect. It is important to note that the previous version had a breaking change related to treating `DATABASE` as `SCHEMA` in `exp.Create`. Therefore, it is crucial to thoroughly test the changes before merging, as breaking changes may affect existing functionality.
* Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)). In this pull request, we have updated the required version range of the `sqlglot` package from `>=25.5.0,<25.13` to `>=25.5.0,<25.15`. This change allows us to install the latest version of the package, which includes several bug fixes and new features. These include improved transpilation of nullable/non-nullable data types and support for TryCast generation in ClickHouse. The changelog for `sqlglot` provides a detailed list of changes in each release, and a list of commits made in the latest release is also included in the pull request. This update will improve the functionality and reliability of our software, as we will now be able to take advantage of the latest features and fixes provided by `sqlglot`.
* Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)). In this release, we have updated the requirement range for the `sqlglot` dependency to '>=25.5.0,<25.17' from '<25.15,>=25.5.0'. This change resolves issues [#2452](#2452) and [#2451](#2451) and includes several bug fixes and new features in the `sqlglot` library version 25.16.1. The updated version includes support for timezone in exp.TimeStrToTime, transpiling from_iso8601_timestamp from presto/trino to duckdb, and mapping %e to %-d in BigQuery. Additionally, there are changes to the parser and optimizer, as well as other bug fixes and refactors. This update does not introduce any major breaking changes and should not affect the functionality of the project. The `sqlglot` library is used for parsing, analyzing, and rewriting SQL queries, and the new version range provides improved functionality and reliability.
* Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)). This pull request updates the sqlglot library requirement to version 25.5.0 or greater, but less than 25.18. By doing so, it enables the use of the latest version of sqlglot, while still maintaining compatibility with the current implementation. The changelog and commits for each release from v25.16.1 to v25.17.0 are provided for reference, detailing bug fixes, new features, and breaking changes. As a software engineer, it's important to review this pull request and ensure it aligns with the project's requirements before merging, to take advantage of the latest improvements and fixes in sqlglot.
* Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)). In this release, we have updated the required version of the `sqlglot` package in our project's dependencies. Previously, we required a version greater than or equal to 25.5.0 and less than 25.18, which has now been updated to require a version greater than or equal to 25.5.0 and less than 25.19. This change was made automatically by Dependabot, a service that helps to keep dependencies up to date, in order to permit the latest version of the `sqlglot` package. The pull request contains a detailed list of the changes made in the `sqlglot` package between versions 25.5.0 and 25.18.0, as well as a list of the commits that were made during this time. These details can be helpful for understanding the potential impact of the update on the project.
* [chore] make `GRANT` migration logic isolated to `MigrateGrants` component ([#2492](#2492)). In this release, the grant migration logic has been isolated to a separate `MigrateGrants` component, enhancing code modularity and maintainability. This new component, along with the `ACLMigrator`, is now responsible for handling grants and Access Control Lists (ACLs) migration. The `MigrateGrants` class takes grant loaders as input, applies grants to a Unity Catalog (UC) table based on a given source table, and is utilized in the `acl_migrator` method. The `ACLMigrator` class manages ACL migration for the migrated tables, taking instances of necessary classes as arguments and setting ACLs for the migrated tables based on the migration status. These changes bring better separation of concerns, making the code easier to understand, test, and maintain.

Dependency updates:

 * Updated databricks-sdk requirement from ~=0.29.0 to >=0.29,<0.31 ([#2417](#2417)).
 * Updated sqlglot requirement from <25.12,>=25.5.0 to >=25.5.0,<25.13 ([#2431](#2431)).
 * Updated sqlglot requirement from <25.13,>=25.5.0 to >=25.5.0,<25.15 ([#2453](#2453)).
 * Updated sqlglot requirement from <25.15,>=25.5.0 to >=25.5.0,<25.17 ([#2480](#2480)).
 * Updated databricks-labs-lsql requirement from <0.9,>=0.5 to >=0.5,<0.10 ([#2489](#2489)).
 * Updated sqlglot requirement from <25.17,>=25.5.0 to >=25.5.0,<25.18 ([#2488](#2488)).
 * Updated sqlglot requirement from <25.18,>=25.5.0 to >=25.5.0,<25.19 ([#2509](#2509)).
nfx added a commit that referenced this issue Aug 30, 2024