♻️ Refactor current code to use Python SDK #27

Merged
merged 86 commits into from
Jul 25, 2023
Commits
7405aa8
check for valid pat
nsenno-dbr Jun 8, 2023
fb31a1b
Merge branch 'main' into python-sdk
renardeinside Jun 8, 2023
8444b26
Merge branch 'main' into python-sdk
renardeinside Jun 9, 2023
65c8ef6
implement authconfig
renardeinside Jun 9, 2023
f2cedbe
get local workspace groups
nsenno-dbr Jun 14, 2023
6802bd0
save workbook
nsenno-dbr Jun 14, 2023
b4a3ad8
Merge branch 'python-sdk' of github.com:databricks/UC-Upgrade into py…
nsenno-dbr Jun 14, 2023
5532df3
remove scratches from repo
renardeinside Jul 17, 2023
291527c
fix linters
renardeinside Jul 17, 2023
5d60560
introduce structure
renardeinside Jul 17, 2023
8bd905d
add clear toolkit instructions
renardeinside Jul 17, 2023
5554f65
fix linter errors
renardeinside Jul 17, 2023
60a08c5
upgrade versions
renardeinside Jul 17, 2023
b4a70b4
Merge remote-tracking branch 'origin/main' into python-sdk
renardeinside Jul 17, 2023
8e6a1b6
fix exclusions
renardeinside Jul 17, 2023
0e38e8b
add runtime imports
renardeinside Jul 17, 2023
1937b30
add nice formatting and docs to the migration notebook
Jul 17, 2023
4240408
add verifications
renardeinside Jul 17, 2023
d893fc8
fix
renardeinside Jul 17, 2023
4c09c11
add creds
renardeinside Jul 17, 2023
0feb6d8
minor fixes
renardeinside Jul 19, 2023
79007c3
finish migration to hatch
renardeinside Jul 19, 2023
5471c0a
Merge remote-tracking branch 'origin/main' into python-sdk
renardeinside Jul 19, 2023
2b3560a
add hatch caching
renardeinside Jul 19, 2023
c610a2d
move logger to mixins
renardeinside Jul 19, 2023
68e5d5f
move logger to mixins
renardeinside Jul 19, 2023
1314a72
move ruff comment to the top
renardeinside Jul 19, 2023
b960141
exclude notebooks from ruff checks
renardeinside Jul 19, 2023
9750b02
fix readme
renardeinside Jul 19, 2023
02963c8
fix readme
renardeinside Jul 19, 2023
bbac931
fix readme
renardeinside Jul 19, 2023
83cd99c
improve dev docs
renardeinside Jul 19, 2023
304502f
remove context-related methods
renardeinside Jul 19, 2023
8b942de
provide installation mechanism for package in the notebook
renardeinside Jul 19, 2023
2986db2
fix imports
renardeinside Jul 19, 2023
aac5b0f
rename function
renardeinside Jul 19, 2023
06d99af
minor fixes
renardeinside Jul 19, 2023
744e31d
minor fixes
renardeinside Jul 19, 2023
0eb8ddd
fix imports
renardeinside Jul 19, 2023
52303d2
fix order
renardeinside Jul 19, 2023
7335fae
add path
renardeinside Jul 19, 2023
f97bf64
add path
renardeinside Jul 19, 2023
9458fb9
provide a new way to introduce dependencies
renardeinside Jul 19, 2023
81671dc
fix imports
renardeinside Jul 19, 2023
085a748
disable reloads
renardeinside Jul 19, 2023
4f5c382
add paths
renardeinside Jul 19, 2023
2794849
add builder and installator
renardeinside Jul 19, 2023
3073a94
add hatch installation
renardeinside Jul 19, 2023
74f1f07
add hatch installation
renardeinside Jul 19, 2023
f7aa254
fix usages
renardeinside Jul 19, 2023
f1d07c5
remove notebooks
renardeinside Jul 19, 2023
7259bd0
introduce tests
renardeinside Jul 20, 2023
e18ef85
fix readme
renardeinside Jul 20, 2023
7a48061
fix readme
renardeinside Jul 20, 2023
d48509c
save first e2e
renardeinside Jul 20, 2023
b1a2a72
add mc tests
renardeinside Jul 20, 2023
c6be9d4
add unit testing
renardeinside Jul 20, 2023
80235f0
add new mocking logic
renardeinside Jul 20, 2023
feb6d65
add config loader
renardeinside Jul 20, 2023
c681978
lint
renardeinside Jul 20, 2023
81cd637
silence pyspark broadcast warnings
renardeinside Jul 20, 2023
cec0213
fix str enum issues
renardeinside Jul 20, 2023
3cd8579
fix text messages
renardeinside Jul 20, 2023
2c613e0
add spark mock
renardeinside Jul 21, 2023
c9ea222
fix lint
renardeinside Jul 21, 2023
cb8bb50
fix strenum
renardeinside Jul 21, 2023
3d08568
add session adapter
renardeinside Jul 21, 2023
f45bf57
align methods to managers
renardeinside Jul 21, 2023
676fc6d
add crud ops for temp groups
renardeinside Jul 21, 2023
1f6cedd
remove temp dirs
renardeinside Jul 21, 2023
bf7bb00
add functionality to apply new permissions to temp groups
renardeinside Jul 24, 2023
2eaa103
apply linter
renardeinside Jul 24, 2023
96c1d34
reformat readme
renardeinside Jul 24, 2023
1b6e7ba
lint
renardeinside Jul 24, 2023
c2e2794
remove args
renardeinside Jul 24, 2023
93e1f03
add group migration logic
renardeinside Jul 25, 2023
7c920c1
add e2e tests and bugfixes
renardeinside Jul 25, 2023
33a35cf
update readme
renardeinside Jul 25, 2023
232873e
improve readme
renardeinside Jul 25, 2023
48f8992
fix readme and e2e tests
renardeinside Jul 25, 2023
3c4668e
include default actions into the integration tests
renardeinside Jul 25, 2023
1230b3c
add verifications to e2e tests
renardeinside Jul 25, 2023
54e2d2f
add e2e tests for permissions
renardeinside Jul 25, 2023
47798cb
lint
renardeinside Jul 25, 2023
d20462e
remove outdated comments
renardeinside Jul 25, 2023
cb1fbdd
finalize entitlements and roles
renardeinside Jul 25, 2023
39 changes: 15 additions & 24 deletions .github/workflows/push.yml
@@ -6,45 +6,36 @@ on:
push:
branches: [main]

env:
HATCH_VERSION: 1.7.0

jobs:
ci:
strategy:
matrix:
pyVersion: [ '3.9' ]
pyVersion: [ '3.10' ]
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
uses: actions/checkout@v3

- name: Unshallow
run: git fetch --prune --unshallow

- name: Install Python
uses: actions/setup-python@v4
with:
cache: 'pip'
cache-dependency-path: '**/pyproject.toml'
python-version: ${{ matrix.pyVersion }}


- name: Install Poetry
uses: snok/install-poetry@v1
with:
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true


- name: Load cache
id: cached-poetry-dependencies
uses: actions/cache@v3
with:
path: .venv
key: venv-${{ runner.os }}-${{ steps.setup-python.outputs.python-version }}-${{ hashFiles('**/poetry.lock') }}


- name: Install project dependencies
if: steps.cached-poetry-dependencies.outputs.cache-hit != 'true'
run: poetry install --no-interaction --no-root --with=dev

- name: Install hatch
run: pip install hatch==$HATCH_VERSION

- name: Verify linting
run: make verify
run: |
hatch run lint:verify

- name: Run unit tests
run: |
hatch run unit:test
7 changes: 6 additions & 1 deletion .gitignore
@@ -93,8 +93,9 @@ celerybeat.pid
*.sage.py

# Environments
.env
.env.admin
.venv
.env.*
env/
venv/
ENV/
@@ -134,3 +135,7 @@ cython_debug/

# ruff
.ruff_cache
/scratch

# dev files and scratches
dev/cleanup.py
11 changes: 0 additions & 11 deletions Makefile
@@ -1,11 +0,0 @@
lint:
	@echo "Linting the project code"
	poetry run black .
	poetry run isort .
	poetry run ruff . --fix

verify:
	@echo "Verifying the project code"
	poetry run black . --check
	poetry run isort . --check
	poetry run ruff .
137 changes: 122 additions & 15 deletions README.md
@@ -2,14 +2,17 @@

This repo contains various functions and utilities for UC Upgrade.


## Latest working version and how-to

Please note that the current project status is 🏗️ **WIP**, but we have a minimal set of already working utilities.

To run the notebooks, please use the latest LTS Databricks Runtime (non-ML), without Photon, in single-user cluster mode.
If you have Table ACL Clusters or SQL Warehouse where ACL have been defined, you should create a TableACL cluster to run this notebook

Please note that script is executed only on the driver node, therefore you'll need to use a Single Node Cluster with sufficient amount of cores (e.g. 16 cores).
> If you have Table ACL Clusters or SQL Warehouse where ACL have been defined, you should create a TableACL cluster to
> run this notebook.

Please note that the script is executed **only** on the driver node, therefore you'll need to use a Single Node Cluster
with a sufficient number of cores (e.g. 16 cores).

Recommended VM types are:

@@ -18,27 +21,131 @@ Recommended VM types are:
- GCP: `c2-standard-16`

**For now please switch to the `v0.0.1` tag in the GitHub to get the latest working version.**
**All instructions below are currently in WIP mode.**

## Group migration

During the UC adoption, it's critical to move the groups from the workspace level to the account level.

To deliver this migration, the following steps are performed:


| Step description | Relevant API method |
|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------|
| A set of groups to be migrated is identified (either via `groups.selected` config property, or automatically).<br/>Group existence is verified against the account level.<br/>**If there is no group on the account level, an error is thrown.**<br/>Backup groups are created on the workspace level. | `toolkit.prepare_groups_in_environment()` |
| Inventory table is cleaned up. | `toolkit.cleanup_inventory_table()` |
| Workspace local group permissions are inventorized and saved into a Delta Table. | `toolkit.inventorize_permissions()` |
| Backup groups are entitled with permissions from the inventory table. | `toolkit.apply_permissions_to_backup_groups()` |
| Workspace-level groups are deleted. Account-level groups are granted with access to the workspace.<br/>Workspace-level entitlements are synced from backup groups to newly added account-level groups. | `toolkit.replace_workspace_groups_with_account_groups()` |
| Account-level groups are entitled with workspace-level permissions from the inventory table. | `toolkit.apply_permissions_to_account_groups()` |
| Backup groups are deleted | `toolkit.delete_backup_groups()` |
| Inventory table is cleaned up. This step is optional. | `toolkit.cleanup_inventory_table()` |
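
The step sequence in the table above can be sketched end-to-end. This is only an illustration: the method names and their order come from the table, but the `GroupMigrationToolkit` stub below is a hypothetical stand-in, not the repo's confirmed entry point (the real toolkit talks to a workspace).

```python
# Sketch of the group-migration workflow; the stub only records call order.
class GroupMigrationToolkit:
    """Hypothetical stand-in that records which step ran when."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any toolkit.<step>() invocation is appended to self.calls.
        def _step():
            self.calls.append(name)
        return _step


def run_group_migration(toolkit):
    # Order matches the table above.
    toolkit.prepare_groups_in_environment()
    toolkit.cleanup_inventory_table()
    toolkit.inventorize_permissions()
    toolkit.apply_permissions_to_backup_groups()
    toolkit.replace_workspace_groups_with_account_groups()
    toolkit.apply_permissions_to_account_groups()
    toolkit.delete_backup_groups()
    toolkit.cleanup_inventory_table()  # optional final cleanup


toolkit = GroupMigrationToolkit()
run_group_migration(toolkit)
print(toolkit.calls[0])  # prepare_groups_in_environment
```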

## Permissions and entitlements that we inventorize

> Please note that inherited permissions will not be inventorized / migrated.
> We only cover direct permissions.

Group-level:

- [x] Entitlements (One of `workspace-access`, `databricks-sql-access`, `allow-cluster-create`, `allow-instance-pool-create`)
- [x] Roles (AWS Only, represents Instance Profile Access)

Compute infrastructure:

- [x] Clusters
- [ ] Cluster policies
- [ ] Pools
- [ ] Instance Profile (for AWS)

Workflows:

- [ ] Delta Live Tables
- [ ] Jobs

ML:

- [ ] MLflow experiments
- [ ] MLflow registry
- [ ] Legacy MLflow model endpoints (?)

SQL:

- [ ] Databricks SQL warehouses
- [ ] Dashboard
- [ ] Queries
- [ ] Alerts

Security:

- [ ] Tokens
- [ ] Passwords (for AWS)
- [ ] Secrets

Workspace:

- [ ] Notebooks in the Workspace FS
- [ ] Directories in the Workspace FS
- [ ] Files in the Workspace FS

Repos:

- [ ] User-level Repos
- [ ] Org-level Repos

## Local setup and development process
Data access:

- Install [poetry](https://python-poetry.org/)
- Run `poetry install` in the project directory
- Pin your IDE to use the newly created poetry environment
- [ ] Table ACLs

> Please note that you **don't** need to use `poetry` inside notebooks or in the Databricks workspace.
> It's only introduced to simplify local development.
## Development

Before running `git push`, don't forget to link your code with:
This section describes the setup and development process for the project.

### Local setup

- Install [hatch](https://github.com/pypa/hatch):

```shell
pip install hatch
```

- Create environment:

```shell
hatch env create
```

- Install dev dependencies:

```shell
hatch run pip install -e '.[dbconnect]'
```

- Pin your IDE to use the newly created virtual environment. You can get the python path with:

```shell
hatch run python -c "import sys; print(sys.executable)"
```

- You're good to go! 🎉

### Development process

Please note that you **don't** need to use `hatch` inside notebooks or in the Databricks workspace.
It's only introduced to simplify local development.

Write your code in the IDE. Please keep all relevant files under the `src/uc_migration_toolkit` directory.

Don't forget to test your code via:

```shell
make lint
hatch run test
```

### Details of package installation
Please note that all commits go through the CI process, which verifies linting. You can run linting locally via:

Since the package itself is managed with `poetry`, to re-use it inside the notebooks we're doing the following:
```shell
hatch run lint:fmt
```

1. Installing the package dependencies via poetry export
2. Adding the package itself to the notebook via `sys.path`
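
The two steps above (the pre-`hatch`, poetry-era mechanism) amount to a notebook cell like the following sketch; the checkout path is a hypothetical example, not a location mandated by the repo:

```python
import sys
from pathlib import Path

# Hypothetical location of the repo checkout inside the workspace;
# adjust to wherever the package source actually lives.
package_root = Path("/Workspace/Repos/me/UC-Upgrade/src")

# Prepend so the in-repo package shadows any installed copy.
if str(package_root) not in sys.path:
    sys.path.insert(0, str(package_root))

# After this, `import uc_migration_toolkit` resolves against the repo source.
```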

45 changes: 45 additions & 0 deletions dev/init_setup.py
@@ -0,0 +1,45 @@
from functools import partial
from pathlib import Path

from databricks.sdk import WorkspaceClient
from databricks.sdk.service.iam import ComplexValue
from dotenv import load_dotenv

from uc_migration_toolkit.config import RateLimitConfig
from uc_migration_toolkit.providers.logger import logger
from uc_migration_toolkit.utils import ThreadedExecution

Threader = partial(ThreadedExecution, num_threads=40, rate_limit=RateLimitConfig())


def _create_user(_ws: WorkspaceClient, uid: int):
    user_name = f"test-user-{uid}@example.com"
    potential_user = list(_ws.users.list(filter=f"userName eq '{user_name}'"))
    if potential_user:
        logger.debug(f"User {user_name} already exists, skipping its creation")
    else:
        _ws.users.create(
            active=True,
            user_name=user_name,
            display_name=f"test-user-{uid}",
            emails=[ComplexValue(display=None, primary=True, value=f"test-user-{uid}@example.com")],
        )


def _create_users(_ws: WorkspaceClient):
    executables = [partial(_create_user, _ws, uid) for uid in range(200)]
    Threader(executables).run()


if __name__ == "__main__":
    principal_env = Path(__file__).parent.parent / ".env.principal"
    if principal_env.exists():
        logger.info("Using credentials provided in .env.principal")
        load_dotenv(dotenv_path=principal_env)

    logger.debug("setting up the workspace client")
    ws = WorkspaceClient()
    user_info = ws.current_user.me()
    logger.debug("workspace client is set up")

    _create_users(ws)
14 changes: 14 additions & 0 deletions examples/migration_config.yml
@@ -0,0 +1,14 @@
inventory:
  table:
    catalog: main
    database: default
    name: uc_migration_inventory

with_table_acls: False

groups:
  selected: [ "analyst" ]

num_threads: 80
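
The example config above maps naturally onto typed config objects. A minimal sketch follows — the class and field names mirror the YAML keys but are assumptions for illustration, not the toolkit's actual config classes:

```python
from dataclasses import dataclass, field


@dataclass
class InventoryTable:
    catalog: str
    database: str
    name: str


@dataclass
class MigrationConfig:
    inventory_table: InventoryTable
    with_table_acls: bool
    selected_groups: list[str] = field(default_factory=list)
    num_threads: int = 80


# Values mirror examples/migration_config.yml above.
config = MigrationConfig(
    inventory_table=InventoryTable(catalog="main", database="default", name="uc_migration_inventory"),
    with_table_acls=False,
    selected_groups=["analyst"],
)
print(config.num_threads)  # 80
```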
