docs: add Getting Started with Databricks guide #7050

Merged on Sep 6, 2023 (1 commit)

3 changes: 2 additions & 1 deletion docs/docs-new/pages/product/getting-started/_meta.js
@@ -1,5 +1,6 @@
module.exports = {
"core": "Cube Core",
"cloud": "Cube Cloud",
"databricks": "Cube Cloud and Databricks",
"migrate-from-core": "Migrate from Cube Core"
}
15 changes: 15 additions & 0 deletions docs/docs-new/pages/product/getting-started/databricks.mdx
@@ -0,0 +1,15 @@
# Getting started with Cube Cloud and Databricks

This getting started guide will show you how to use Cube Cloud with Databricks.
You will learn how to:

- Load sample data into your Databricks account
- Connect Cube Cloud to Databricks
- Create your first Cube data model
- Connect to a BI tool to explore this model
- Create a React application with the Cube REST API

## Prerequisites

- [Cube Cloud account](https://cubecloud.dev/auth/signup)
- [Databricks account](https://www.databricks.com/try-databricks)
@@ -0,0 +1,7 @@
module.exports = {
"load-data": "Load data",
"connect-to-databricks": "Connect to Databricks",
"create-data-model": "Create data model",
"query-from-bi": "Query from BI",
"query-from-react-app": "Query from React"
}
@@ -0,0 +1,81 @@
# Connect to Databricks

In this section, we’ll create a Cube Cloud deployment and connect it to
Databricks. A deployment represents a data model, configuration, and managed
infrastructure.

To continue with this guide, you'll need to have a Cube Cloud account. If you
don't have one yet, [click here to sign up][cube-cloud-signup] for free.

First, [sign in to your Cube Cloud account][cube-cloud-signin]. Then,
click <Btn>Create Deployment</Btn>.

Give the deployment a name, select the cloud provider and region of your choice,
and click <Btn>Next</Btn>:

<Screenshot
alt="Cube Cloud Create Deployment Screen"
src="https://ucarecdn.com/2338323e-0db8-4224-8e7a-3b4daf9c60ec/"
/>

<SuccessBox>

Microsoft Azure is available in Cube Cloud on the
[Premium](https://cube.dev/pricing) tier. [Contact us](https://cube.dev/contact)
for details.

</SuccessBox>

## Set up a Cube project

Next, click <Btn>Create</Btn> to create a new project from scratch:

<Screenshot
alt="Cube Cloud Upload Project Screen"
src="https://ucarecdn.com/46b72b61-b650-4271-808d-55203f1c8d8b/"
/>

## Connect to your Databricks account

The last step is to connect Cube Cloud to Databricks. First, select it from the
grid:

<Screenshot
alt="Cube Cloud Setup Database Screen"
src="https://ucarecdn.com/1d656ba9-dd83-4ff4-a59e-8b5f97a9ddcc/"
/>

Then enter your Databricks credentials:

- **Access Token:** A personal access token for your Databricks account. [You
can generate one][databricks-docs-pat] in your Databricks account settings.
- **Databricks JDBC URL:** The JDBC URL for your Databricks SQL warehouse. [You
can find it][databricks-docs-jdbc-url] in the SQL warehouse settings screen.
- **Databricks Catalog:** The catalog where you uploaded the files in the
previous section. If left unspecified, the `default` catalog is used.

[databricks-docs-pat]:
https://docs.databricks.com/en/dev-tools/auth.html#databricks-personal-access-tokens-for-workspace-users
[databricks-docs-jdbc-url]:
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html#get-connection-details-for-a-sql-warehouse

Click <Btn>Apply</Btn>. Cube Cloud will test the connection and proceed to the
next step.
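
If the connection test fails, a malformed JDBC URL is one thing worth ruling
out. The sketch below is only an illustration: `parseDatabricksJdbcUrl` is a
hypothetical helper (not part of Cube or Databricks), and it assumes the URL has
the usual `jdbc:databricks://<host>:443/...;key=value;httpPath=...` shape. The
hostname and warehouse ID in the usage line are placeholders.

```javascript
// Hypothetical helper: checks that a string looks like a Databricks JDBC URL
// and extracts the host, port, and httpPath from it.
function parseDatabricksJdbcUrl(url) {
  // Assumption: accept "jdbc:databricks://" and the older "jdbc:spark://" prefix.
  const pattern =
    /^jdbc:(databricks|spark):\/\/([^:;\/]+)(?::(\d+))?(?:\/[^;]*)?;(.*)$/i;
  const match = pattern.exec(url.trim());
  if (!match) {
    throw new Error("Not a Databricks JDBC URL");
  }
  const [, , host, port, rest] = match;

  // The remainder is a list of semicolon-separated key=value pairs.
  const params = {};
  for (const pair of rest.split(";")) {
    const idx = pair.indexOf("=");
    if (idx === -1) continue; // skip empty or malformed segments
    params[pair.slice(0, idx).toLowerCase()] = pair.slice(idx + 1);
  }
  if (!params.httppath) {
    throw new Error("JDBC URL is missing httpPath");
  }
  // Assumption: fall back to port 443 when the URL does not specify one.
  return { host, port: Number(port ?? 443), httpPath: params.httppath };
}

// Placeholder values, for illustration only.
console.log(
  parseDatabricksJdbcUrl(
    "jdbc:databricks://example.cloud.databricks.com:443/default;transportMode=http;ssl=1;AuthMech=3;httpPath=/sql/1.0/warehouses/abc123"
  )
);
```

The check only looks at the shape of the URL; it cannot tell whether the SQL
warehouse exists or whether the access token is valid.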

## Generate data model from your Databricks schema

Cube can now generate a basic data model from your data warehouse, which helps
you get started with data modeling faster. Select all four tables in our catalog
and click through the data model generation wizard. We'll inspect the
generated files in the next section and start making changes to them.

[aws-docs-sec-group]:
https://docs.aws.amazon.com/vpc/latest/userguide/security-groups.html
[aws-docs-sec-group-rule]:
https://docs.aws.amazon.com/vpc/latest/userguide/security-group-rules.html
[cube-cloud-signin]: https://cubecloud.dev/auth
[cube-cloud-signup]: https://cubecloud.dev/auth/signup
[ref-conf-db]: /product/configuration/data-sources
[ref-getting-started-cloud-generate-models]:
/getting-started/cloud/generate-models
@@ -0,0 +1,213 @@
# Create your first data model

Cube follows a dataset-oriented data modeling approach, which is inspired by and
expands upon dimensional modeling, and provides a practical framework for
implementing it.

When building a data model in Cube, you work with two dataset-centric objects:
**cubes** and **views**. **Cubes** usually represent business entities such as
customers, line items, and orders. In cubes, you define all the calculations
within the measures and dimensions of these entities. Additionally, you define
relationships between cubes, such as "an order has many line items" or "a user
may place multiple orders."

**Views** sit on top of a data graph of cubes and create a facade of your entire
data model, with which data consumers can interact. You can think of views as
the final data products for your data consumers, such as BI users, data apps,
and AI agents. When building views, you select measures and dimensions from
different connected cubes and present them as a single dataset to BI tools or
data apps.

<Diagram
alt="Architecture diagram of queries being sent to cubes and views"
src="https://ucarecdn.com/bfc3e04a-b690-40bc-a6f8-14a9175fb4fd/"
/>

## Working with cubes

To begin building your data model, click on <Btn>Enter Development Mode</Btn> in
Cube Cloud. This will take you to your personal developer space, where you can
safely make changes to your data model without affecting the production
environment.

In the previous section, we generated four cubes. To see the data graph of these
four cubes and how they are connected to each other, click the <Btn>Show
Graph</Btn> button on the Data Model page.

Let's review the `orders` cube first and update it with additional dimensions
and measures.

Once you are in development mode, navigate to the <Btn>Data Model</Btn> page and
open the `orders.yml` file, located in the `model/cubes` directory in the left
sidebar.

You should see the following content in the `model/cubes/orders.yml` file:

```yaml
cubes:
  - name: orders
    sql_table: ECOM.ORDERS

    joins:
      - name: users
        sql: "{CUBE}.USER_ID = {users}.USER_ID"
        relationship: many_to_one

    dimensions:
      - name: status
        sql: STATUS
        type: string

      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

      - name: completed_at
        sql: COMPLETED_AT
        type: time

    measures:
      - name: count
        type: count
```

As you can see, we already have a `count` measure that we can use to calculate
the total count of our orders.

Let's add another measure to the `orders` cube that counts only
**completed orders**. The `status` dimension in the `orders` cube reflects the
three possible statuses: **processing**, **shipped**, or **completed**. We will
create a new measure, `completed_count`, by filtering on that dimension. To
do this, we will use the
[`filters` parameter](/product/data-modeling/reference/measures#filters) of the
measure and
[refer](/product/data-modeling/fundamentals/syntax#referring-to-objects) to the
existing dimension.

Add the following measure definition to your `model/cubes/orders.yml` file. It
should be included within the `measures` block.

```yaml
- name: completed_count
  type: count
  filters:
    - sql: "{CUBE}.status = 'completed'"
```

With these two measures in place, `count` and `completed_count`, we can create a
**derived measure**. Derived measures are measures that you can create based on
existing measures. Let's create the `completed_percentage` derived measure.

Add the following measure definition to your `model/cubes/orders.yml` file
within the `measures` block.

```yaml
- name: completed_percentage
  type: number
  sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
  format: percent
```
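
To make the arithmetic behind these measures concrete, here is a small
JavaScript sketch that computes the same three values over a handful of rows.
The sample rows and the `orderMeasures` function are hypothetical and exist only
for illustration; in Cube, the calculation runs as SQL in Databricks.

```javascript
// Hypothetical sample rows; the real data lives in the ECOM.ORDERS table.
const orders = [
  { id: 1, status: "completed" },
  { id: 2, status: "shipped" },
  { id: 3, status: "completed" },
  { id: 4, status: "processing" },
];

// Mirrors the three measures: count, completed_count, completed_percentage.
function orderMeasures(rows) {
  const count = rows.length;
  const completedCount = rows.filter((row) => row.status === "completed").length;
  // NULLIF({count}, 0) turns a zero denominator into NULL, so the division
  // yields NULL instead of an error. Here, null plays that role.
  const completedPercentage =
    count === 0 ? null : (completedCount / count) * 100.0;
  return { count, completedCount, completedPercentage };
}

console.log(orderMeasures(orders));
// { count: 4, completedCount: 2, completedPercentage: 50 }
```

With no rows at all, `completedPercentage` is `null`, which matches what the
`NULLIF` guard produces in SQL.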

Below you can see what your updated `orders` cube should look like with the two
new measures. Feel free to copy this code and paste it into your
`model/cubes/orders.yml` file.

```yaml
cubes:
  - name: orders
    sql_table: ECOM.ORDERS

    joins:
      - name: users
        sql: "{CUBE}.USER_ID = {users}.USER_ID"
        relationship: many_to_one

    dimensions:
      - name: status
        sql: STATUS
        type: string

      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

      - name: completed_at
        sql: COMPLETED_AT
        type: time

    measures:
      - name: count
        type: count

      - name: completed_count
        type: count
        filters:
          - sql: "{CUBE}.status = 'completed'"

      - name: completed_percentage
        type: number
        sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
        format: percent
```

Click <Btn>Save All</Btn> in the upper corner to save changes to the data model.
Now, you can navigate to Cube’s Playground. The Playground is a web-based tool
that allows you to query your data without connecting any tools or writing any
code. It's the fastest way to explore and test your data model.

You can select measures and dimensions from different cubes in the Playground,
including your newly created `completed_percentage` measure.

## Working with views

When building views, we recommend following entity-oriented design and
structuring your views around your business entities. Usually, cubes tend to be
normalized entities without duplicated or redundant members, while views are
denormalized entities where you pick as many measures and dimensions from
multiple cubes as needed to describe a business entity.

Let's create our first view, which will provide all the measures and
dimensions needed to explore orders. Views are usually located in the `views`
folder and have a `_view` suffix.

Create `model/views/orders_view.yml` with the following content:

```yaml
views:
  - name: orders_view

    cubes:
      - join_path: orders
        includes:
          - status
          - created_at
          - count
          - completed_count
          - completed_percentage

      - join_path: orders.users
        prefix: true
        includes:
          - city
          - age
          - state
```

When building views, you can leverage the `cubes` parameter, which enables you
to include measures and dimensions from other cubes in the view. You can build
your view by combining multiple joined cubes and specifying the path by which
they should be joined for that particular view.
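
As a rough illustration of how a data app might address members of this view,
the sketch below builds a query in the JSON format used by Cube's REST API.
Because the `users` cube is included with `prefix: true`, its members are
exposed with a `users_` prefix, for example `users_city`. The `buildLoadUrl`
helper and the base URL are hypothetical; substitute the REST API URL of your
own deployment.

```javascript
// Members are addressed as "<view>.<member>". Members that come from the
// `users` cube carry the `users_` prefix because of `prefix: true`.
const ordersByCityQuery = {
  measures: ["orders_view.count", "orders_view.completed_percentage"],
  dimensions: ["orders_view.status", "orders_view.users_city"],
};

// Hypothetical helper: builds a URL for the REST API `/v1/load` endpoint.
// `apiBaseUrl` is a placeholder for your deployment's REST API base URL.
function buildLoadUrl(apiBaseUrl, query) {
  const encoded = encodeURIComponent(JSON.stringify(query));
  return `${apiBaseUrl.replace(/\/$/, "")}/v1/load?query=${encoded}`;
}

console.log(buildLoadUrl("https://example.invalid/cubejs-api", ordersByCityQuery));
```

A real request also needs an `Authorization` header carrying an API token for
your deployment; that part is left out here.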

After saving, you can experiment with your newly created view in the Playground.
In the next section, we will learn how to query our `orders_view` using a BI
tool.
@@ -0,0 +1,34 @@
# Load data

The following steps will guide you through setting up a Databricks account and
uploading the demo dataset, which is stored as CSV files in a public S3 bucket.

First, download the following files to your local machine:

- [`line_items.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/line_items.csv)
- [`orders.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/orders.csv)
- [`users.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/users.csv)
- [`products.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/products.csv)
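
If you would rather script the download than click each link, a minimal Node.js
sketch could look like the following. It assumes Node.js 18 or later (for the
built-in `fetch`); the function names and the `./cube-tutorial-data` target
directory are arbitrary choices made for this example.

```javascript
const BUCKET_URL = "https://cube-tutorial.s3.us-east-2.amazonaws.com";
const FILES = ["line_items.csv", "orders.csv", "users.csv", "products.csv"];

// Returns the full URL of each CSV file listed above.
function datasetUrls() {
  return FILES.map((name) => `${BUCKET_URL}/${name}`);
}

// Downloads every file into `targetDir` (an arbitrary default for this sketch).
async function downloadDataset(targetDir = "./cube-tutorial-data") {
  const { mkdir, writeFile } = await import("node:fs/promises");
  const path = await import("node:path");
  await mkdir(targetDir, { recursive: true });

  for (const url of datasetUrls()) {
    const response = await fetch(url);
    if (!response.ok) {
      throw new Error(`Failed to download ${url}: HTTP ${response.status}`);
    }
    const fileName = url.split("/").pop();
    const contents = Buffer.from(await response.arrayBuffer());
    await writeFile(path.join(targetDir, fileName), contents);
  }
}

// Uncomment to run the download:
// downloadDataset().then(() => console.log("Done"));
```

The files end up on your local machine either way, ready to be uploaded through
the Databricks UI in the steps that follow.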

Next, let's ensure we have a SQL warehouse that is active. Log in to your
Databricks account, then from the sidebar, click on <Btn>SQL → SQL
Warehouses</Btn>:

<Screenshot
alt="Databricks SQL Warehouses screen"
src="https://ucarecdn.com/92e82ca3-0ca4-4064-8ed6-394e5a66e869/"
/>

<InfoBox>

Ensure the warehouse is active by checking its status; if it is inactive, click
<Btn>▶️</Btn> to start it.

</InfoBox>

Next, click <Btn>New → File upload</Btn> from the sidebar, and upload
`line_items.csv`. The UI will show a preview of the data within the file; when
ready, click <Btn>Create table</Btn>.

Repeat the above steps for the three other files.