docs: add Getting Started with Databricks guide (#7050)
hassankhan authored Sep 6, 2023
1 parent e4d8c16 commit 5516f23
Showing 8 changed files with 537 additions and 1 deletion.
3 changes: 2 additions & 1 deletion docs/docs-new/pages/product/getting-started/_meta.js
@@ -1,5 +1,6 @@
```js
module.exports = {
  "core": "Cube Core",
  "cloud": "Cube Cloud",
  "databricks": "Cube Cloud and Databricks",
  "migrate-from-core": "Migrate from Cube Core"
}
```
15 changes: 15 additions & 0 deletions docs/docs-new/pages/product/getting-started/databricks.mdx
@@ -0,0 +1,15 @@
# Getting started with Cube Cloud and Databricks

This getting started guide will show you how to use Cube Cloud with Databricks.
You will learn how to:

- Load sample data into your Databricks account
- Connect Cube Cloud to Databricks
- Create your first Cube data model
- Connect to a BI tool to explore this model
- Create a React application with the Cube REST API

## Prerequisites

- [Cube Cloud account](https://cubecloud.dev/auth/signup)
- [Databricks account](https://www.databricks.com/try-databricks)
@@ -0,0 +1,7 @@
```js
module.exports = {
  "load-data": "Load data",
  "connect-to-databricks": "Connect to Databricks",
  "create-data-model": "Create data model",
  "query-from-bi": "Query from BI",
  "query-from-react-app": "Query from React"
}
```
@@ -0,0 +1,81 @@
# Connect to Databricks

In this section, we’ll create a Cube Cloud deployment and connect it to
Databricks. A deployment represents a data model, configuration, and managed
infrastructure.

To continue with this guide, you'll need to have a Cube Cloud account. If you
don't have one yet, [click here to sign up][cube-cloud-signup] for free.

First, [sign in to your Cube Cloud account][cube-cloud-signin]. Then,
click <Btn>Create Deployment</Btn>:

Give the deployment a name, select the cloud provider and region of your choice,
and click <Btn>Next</Btn>:

<Screenshot
alt="Cube Cloud Create Deployment Screen"
src="https://ucarecdn.com/2338323e-0db8-4224-8e7a-3b4daf9c60ec/"
/>

<SuccessBox>

Microsoft Azure is available in Cube Cloud on the
[Premium](https://cube.dev/pricing) tier. [Contact us](https://cube.dev/contact)
for details.

</SuccessBox>

## Set up a Cube project

Next, click <Btn>Create</Btn> to create a new project from scratch:

<Screenshot
alt="Cube Cloud Upload Project Screen"
src="https://ucarecdn.com/46b72b61-b650-4271-808d-55203f1c8d8b/"
/>

## Connect to your Databricks

The last step is to connect Cube Cloud to Databricks. First, select it from the
grid:

<Screenshot
alt="Cube Cloud Setup Database Screen"
src="https://ucarecdn.com/1d656ba9-dd83-4ff4-a59e-8b5f97a9ddcc/"
/>

Then enter your Databricks credentials:

- **Access Token:** A personal access token for your Databricks account. [You
can generate one][databricks-docs-pat] in your Databricks account settings.
- **Databricks JDBC URL:** The JDBC URL for your Databricks SQL warehouse. [You
can find it][databricks-docs-jdbc-url] in the SQL warehouse settings screen.
- **Databricks Catalog:** This should match the same catalog where you uploaded
the files in the last section. If left unspecified, the `default` catalog is
used.

[databricks-docs-pat]:
https://docs.databricks.com/en/dev-tools/auth.html#databricks-personal-access-tokens-for-workspace-users
[databricks-docs-jdbc-url]:
https://docs.databricks.com/en/integrations/jdbc-odbc-bi.html#get-connection-details-for-a-sql-warehouse
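
For reference, these fields correspond to Cube's Databricks data source settings. Below is a minimal sketch of the equivalent environment variables for a self-hosted Cube deployment; the variable names come from Cube's data source configuration, and all values are placeholders. In Cube Cloud you enter the same values in the connection form instead.

```shell
# Sketch only: placeholder values, replace with your own workspace details.
CUBEJS_DB_TYPE=databricks-jdbc

# JDBC URL from the SQL warehouse connection details screen (placeholder host/path)
CUBEJS_DB_DATABRICKS_URL=jdbc:databricks://dbc-12345678-1234.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=/sql/1.0/warehouses/0123456789abcdef

# Personal access token generated in your Databricks account settings (placeholder)
CUBEJS_DB_DATABRICKS_TOKEN=dapi0123456789abcdef
```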

Click <Btn>Apply</Btn>; Cube Cloud will test the connection and proceed to the
next step.

## Generate data model from your Databricks schema

Cube can now generate a basic data model from your data warehouse, which helps
you get started with data modeling faster. Select all four tables in your
catalog and click through the data model generation wizard. We'll inspect these
generated files in the next section and start making changes to them.

[aws-docs-sec-group]:
https://docs.aws.amazon.com/vpc/latest/userguide/security-groups.html
[aws-docs-sec-group-rule]:
https://docs.aws.amazon.com/vpc/latest/userguide/security-group-rules.html
[cube-cloud-signin]: https://cubecloud.dev/auth
[cube-cloud-signup]: https://cubecloud.dev/auth/signup
[ref-conf-db]: /product/configuration/data-sources
[ref-getting-started-cloud-generate-models]:
/getting-started/cloud/generate-models
@@ -0,0 +1,213 @@
# Create your first data model

Cube follows a dataset-oriented data modeling approach, which is inspired by and
expands upon dimensional modeling, and provides a practical framework for
implementing it.

When building a data model in Cube, you work with two dataset-centric objects:
**cubes** and **views**. **Cubes** usually represent business entities such as
customers, line items, and orders. In cubes, you define all the calculations
within the measures and dimensions of these entities. Additionally, you define
relationships between cubes, such as "an order has many line items" or "a user
may place multiple orders."

**Views** sit on top of a data graph of cubes and create a facade of your entire
data model, with which data consumers can interact. You can think of views as
the final data products for your data consumers: BI users, data apps, AI
agents, etc. When building views, you select measures and dimensions from
different connected cubes and present them as a single dataset to BI or data
apps.

<Diagram
alt="Architecture diagram of queries being sent to cubes and views"
src="https://ucarecdn.com/bfc3e04a-b690-40bc-a6f8-14a9175fb4fd/"
/>

## Working with cubes

To begin building your data model, click on <Btn>Enter Development Mode</Btn> in
Cube Cloud. This will take you to your personal developer space, where you can
safely make changes to your data model without affecting the production
environment.

In the previous section, we generated four cubes. To see the data graph of these
four cubes and how they are connected to each other, click the <Btn>Show
Graph</Btn> button on the Data Model page.

Let's review the `orders` cube first and update it with additional dimensions
and measures.

Once you are in developer mode, navigate to the <Btn>Data Model</Btn> and click
on the `orders.yml` file in the left sidebar inside the `model/cubes` directory
to open it.

You should see the following content in the `model/cubes/orders.yml` file:

```yaml
cubes:
  - name: orders
    sql_table: ECOM.ORDERS

    joins:
      - name: users
        sql: "{CUBE}.USER_ID = {users}.USER_ID"
        relationship: many_to_one

    dimensions:
      - name: status
        sql: STATUS
        type: string

      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

      - name: completed_at
        sql: COMPLETED_AT
        type: time

    measures:
      - name: count
        type: count
```

As you can see, we already have a `count` measure that we can use to calculate
the total count of our orders.

Let's add an additional measure to the `orders` cube to calculate only
**completed orders**. The `status` dimension in the `orders` cube reflects the
three possible statuses: **processing**, **shipped**, or **completed**. We will
create a new measure `completed_count` by using a filter on that dimension. To
do this, we will use a
[filter parameter](/product/data-modeling/reference/measures#filters) of the
measure and
[refer](/product/data-modeling/fundamentals/syntax#referring-to-objects) to the
existing dimension.

Add the following measure definition to your `model/cubes/orders.yml` file. It
should be included within the `measures` block.

```yaml
- name: completed_count
  type: count
  filters:
    - sql: "{CUBE}.status = 'completed'"
```

With these two measures in place, `count` and `completed_count`, we can create a
**derived measure**. Derived measures are measures that you can create based on
existing measures. Let's create the `completed_percentage` derived measure.

Add the following measure definition to your `model/cubes/orders.yml` file
within the `measures` block.

```yaml
- name: completed_percentage
  type: number
  sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
  format: percent
```
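
To see why the `NULLIF` guard matters, here is the same calculation sketched in plain JavaScript with made-up numbers; this is our own illustration of the arithmetic, not code that Cube generates.

```javascript
// Sketch of the derived measure's arithmetic (illustrative only).
function completedPercentage(completedCount, count) {
  // Mirrors NULLIF(count, 0): avoid dividing by zero when there are no orders.
  if (count === 0) return null;
  return (completedCount / count) * 100.0;
}

console.log(completedPercentage(15, 60)); // 25
console.log(completedPercentage(0, 0));   // null
```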

Below you can see what your updated `orders` cube should look like with the two
new measures. Feel free to copy this code and paste it into your
`model/cubes/orders.yml` file.

```yaml
cubes:
  - name: orders
    sql_table: ECOM.ORDERS

    joins:
      - name: users
        sql: "{CUBE}.USER_ID = {users}.USER_ID"
        relationship: many_to_one

    dimensions:
      - name: status
        sql: STATUS
        type: string

      - name: id
        sql: ID
        type: number
        primary_key: true

      - name: created_at
        sql: CREATED_AT
        type: time

      - name: completed_at
        sql: COMPLETED_AT
        type: time

    measures:
      - name: count
        type: count

      - name: completed_count
        type: count
        filters:
          - sql: "{CUBE}.status = 'completed'"

      - name: completed_percentage
        type: number
        sql: "({completed_count} / NULLIF({count}, 0)) * 100.0"
        format: percent
```

Click <Btn>Save All</Btn> in the upper corner to save changes to the data model.
Now, you can navigate to Cube’s Playground. The Playground is a web-based tool
that allows you to query your data without connecting any tools or writing any
code. It's the fastest way to explore and test your data model.

You can select measures and dimensions from different cubes in the Playground,
including your newly created `completed_percentage` measure.
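
The same queries can also be run programmatically. As a sketch, Cube's REST API accepts a JSON query as a URL-encoded `query` parameter on the `/cubejs-api/v1/load` endpoint; the deployment URL and token below are placeholders, not values from this guide.

```javascript
// Build a Cube REST API query for the measures defined above.
const query = {
  measures: ["orders.count", "orders.completed_percentage"],
  dimensions: ["orders.status"],
};

// Cube's REST API expects the query as a URL-encoded JSON `query` parameter.
function buildLoadUrl(baseUrl, q) {
  return `${baseUrl}/cubejs-api/v1/load?query=${encodeURIComponent(JSON.stringify(q))}`;
}

const url = buildLoadUrl("https://example.cubecloud.dev", query);

// Placeholder auth token; your deployment's real endpoint and token are shown
// in Cube Cloud.
// fetch(url, { headers: { Authorization: "API_TOKEN" } })
//   .then((res) => res.json())
//   .then((json) => console.log(json.data));
```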

## Working with views

When building views, we recommend following entity-oriented design and
structuring your views around your business entities. Cubes tend to be
normalized entities without duplicated or redundant members, while views are
denormalized entities where you pick as many measures and dimensions from
multiple cubes as needed to describe a business entity.

Let's create our first view, which will provide all necessary measures and
dimensions to explore orders. Views are usually located in the `views` folder
and have a `_view` suffix.

Create `model/views/orders_view.yml` with the following content:

```yaml
views:
  - name: orders_view

    cubes:
      - join_path: orders
        includes:
          - status
          - created_at
          - count
          - completed_count
          - completed_percentage

      - join_path: orders.users
        prefix: true
        includes:
          - city
          - age
          - state
```

When building views, you can leverage the `cubes` parameter, which enables you
to include measures and dimensions from other cubes in the view. You can build
your view by combining multiple joined cubes and specifying the path by which
they should be joined for that particular view.
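
Note that `prefix: true` on the `orders.users` join path exposes the included `users` members in the view with the cube name as a prefix (for example, `users_city`). A small sketch of that naming rule, as our own illustration rather than Cube's implementation:

```javascript
// Illustrative sketch of how `prefix: true` shapes view member names.
function viewMemberNames(cubeName, includes, prefix) {
  return includes.map((m) => (prefix ? `${cubeName}_${m}` : m));
}

console.log(viewMemberNames("users", ["city", "age", "state"], true));
// ["users_city", "users_age", "users_state"]
```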

After saving, you can experiment with your newly created view in the Playground.
In the next section, we will learn how to query our `orders_view` using a BI
tool.
@@ -0,0 +1,34 @@
# Load data

The following steps will guide you through setting up a Databricks account and
uploading the demo dataset, which is stored as CSV files in a public S3 bucket.

First, download the following files to your local machine:

- [`line_items.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/line_items.csv)
- [`orders.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/orders.csv)
- [`users.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/users.csv)
- [`products.csv`](https://cube-tutorial.s3.us-east-2.amazonaws.com/products.csv)

Next, let's ensure we have a SQL warehouse that is active. Log in to your
Databricks account, then from the sidebar, click on <Btn>SQL → SQL
Warehouses</Btn>:

<Screenshot
alt="Databricks SQL Warehouses screen"
src="https://ucarecdn.com/92e82ca3-0ca4-4064-8ed6-394e5a66e869/"
/>

<InfoBox>

Ensure the warehouse is active by checking its status; if it is inactive, click
<Btn>▶️</Btn> to start it.

</InfoBox>

Next, click <Btn>New → File upload</Btn> from the sidebar, and upload
`line_items.csv`. The UI will show a preview of the data within the file; when
ready, click <Btn>Create table</Btn>.

Repeat the above steps for the three other files.