Refresh cloud documentation (#81)
proddata authored Aug 9, 2024
1 parent 627ee75 commit 775e408
Showing 17 changed files with 596 additions and 1,008 deletions.
Binary file removed: docs/_assets/img/cluster-export-tab-history.png
Binary file removed: docs/_assets/img/cluster-export.png
192 changes: 192 additions & 0 deletions docs/cluster/automation.md
(cluster-automation)=
# Automation

Automation in CrateDB Cloud allows users to streamline and manage routine
database operations efficiently. Two automation features are available: the
SQL Scheduler and Table Policies, both of which facilitate the maintenance
and optimization of database tasks.

:::{important}
- Automation is available for all newly deployed clusters.
- For existing clusters, the feature can be enabled on demand. (Contact
[support](https://support.crate.io/) for activation.)

Automation utilizes a dedicated database user `gc_admin` with full cluster
privileges to execute scheduled tasks and persists data in the `gc` schema.
:::

## SQL Scheduler

The SQL Scheduler automates routine database tasks by running SQL queries at
scheduled times (UTC). A job is defined by a description, a valid
[cron pattern](https://www.ibm.com/docs/en/db2oc?topic=task-unix-cron-format),
and a SQL statement, enabling a wide range of tasks. Users can manage these
jobs through the Cloud UI, adding, removing, editing, activating, and
deactivating them as needed.

### Use Cases

- Regularly updating or aggregating table data.
- Automating export and import of data.
- Deleting old/redundant data to maintain database efficiency.

### Accessing and Using the SQL Scheduler

The SQL Scheduler can be found under "Automation" in the left-hand
navigation menu. There are two tabs relevant to the SQL Scheduler:

**SQL Scheduler** shows a list of your existing jobs. In the list, you can
activate/deactivate each job with a toggle in the "Active" column. You can
also edit and delete jobs with buttons on the right side of the list.

![SQL Scheduler overview](../_assets/img/cluster-sql-scheduler-overview.png)


**Logs** shows a list of *scheduled* job runs, whether they succeeded or
failed, their execution time and run time, and the error in case a run was
unsuccessful. For failed runs, a detail view shows the executed query and a
stack trace. You can filter the logs by status or by a specific job.

![SQL Scheduler logs](../_assets/img/cluster-sql-scheduler-logs.png)

### Examples

#### Cleanup of Old Files

Cleanup tasks are a common use case for automated jobs. This example deletes
records older than 30 days from a specified table once a day:

```sql
DELETE FROM "sample_data"
WHERE "timestamp_column" < NOW() - INTERVAL '30 days';
```

How often you run it is up to you, but once a day is common for cleanup
jobs. The following cron expression runs every day at 2:30 PM UTC:

Schedule: `30 14 * * *`
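
For reference, the five fields of the pattern map to the schedule like this:

```text
30 14 * * *
│  │  │ │ └── day of week (* = any)
│  │  │ └──── month (* = any)
│  │  └────── day of month (* = any)
│  └───────── hour (14 = 2:00 PM UTC)
└──────────── minute (30)
```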

![SQL Scheduler cleanup example](../_assets/img/cluster-sql-scheduler-example-cleanup.png)

#### Copying Logs into a Persistent Table

Another useful pattern is copying data to another table for archival
purposes. This example copies rows from the system logs table into one of
our own tables.

```sql
CREATE TABLE IF NOT EXISTS "logs"."persistent_jobs_log" (
    "classification" OBJECT (DYNAMIC),
    "ended" TIMESTAMP WITH TIME ZONE,
    "error" TEXT,
    "id" TEXT,
    "node" OBJECT (DYNAMIC),
    "started" TIMESTAMP WITH TIME ZONE,
    "stmt" TEXT,
    "username" TEXT,
    PRIMARY KEY (id)
) CLUSTERED INTO 1 SHARDS;

INSERT INTO "logs"."persistent_jobs_log"
SELECT *
FROM sys.jobs_log
ON CONFLICT ("id") DO NOTHING;
```

In this example, we schedule the job to run every hour:

Schedule: `0 * * * *`

![SQL Scheduler copying example](../_assets/img/cluster-sql-scheduler-example-copying.png)
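
#### Exporting Data on a Schedule

The use cases above also mention automating data exports. As a minimal
sketch (the bucket URI is a placeholder, and S3 credentials must be supplied
as described in the CrateDB `COPY TO` documentation), a nightly job could
write a table to external storage:

```sql
-- Write the table as gzip-compressed files to an S3 bucket;
-- 'my-bucket' and the path are illustrative placeholders.
COPY "doc"."sample_data"
TO DIRECTORY 's3://my-bucket/exports/sample_data'
WITH (compression = 'gzip');
```

Schedule: `0 3 * * *` (every day at 3:00 AM UTC)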

:::{note}
Limitations and Known Issues:
* Only one job can run at a time; subsequent jobs will be queued until the
current one completes.
* Long-running jobs may block the execution of queued jobs, leading to
potential delays.
:::


## Table Policies

Table policies automate maintenance operations for **partitioned tables**.
Automated actions run daily, based on a pre-configured ruleset.

![Table policy list](../_assets/img/cluster-table-policy.png)

### Overview

The table policy overview can be found in the left-hand navigation menu under
"Automation". From the list of policies, you can create, delete, edit, or
(de)activate them. Logs of executed policies can be found in the "Logs" tab.

![Table policy logs](../_assets/img/cluster-table-policy-logs.png)

A new policy can be created with the "Add New Policy" button.

![Table policy creation](../_assets/img/cluster-table-policy-create.png)

After naming the policy and selecting the tables/schemas it applies to, you
must specify the time column. This column, a timestamp used for partitioning,
determines the data affected by the policy. The time column must be
consistently present across all targeted tables/schemas. You can apply the
policy to tables that lack the specified time column, but it will not be
executed for those. If your tables use different timestamp columns, consider
setting up a separate policy for each to ensure accuracy.

:::{note}
The "Time Column" must be of type `TIMESTAMP`.
:::

Next, a condition determines the affected partitions. The system is
time-based: a partition is eligible for action if the value of its partition
column is smaller than (`<`), or smaller than or equal to (`<=`), the current
date minus `n` days, months, or years.

### Actions

The following actions are supported:
* **Delete:** Deletes eligible partitions along with their data.
* **Set replicas:** Changes the replication factor of eligible partitions.
* **Force merge:** Merges segments on eligible partitions down to a specified number.

After filling out the form, a preview shows the affected schemas/tables and
the number of partitions that would be affected if the policy were executed
at this very moment.

### Examples

Consider a scenario where you have a table and want to optimize space on your
cluster. Older data (e.g., older than 30 days) may already have been
snapshotted and is accessed only infrequently, meaning it is not used for
live analytics; it might therefore be sufficient for it to exist just once in
the cluster, without replication. Additionally, you may not want to retain
data older than 60 days.

Assume the following table schema:

```sql
CREATE TABLE data_table (
    ts TIMESTAMP,
    ts_day GENERATED ALWAYS AS date_trunc('day', ts),
    val DOUBLE
) PARTITIONED BY (ts_day);
```

For the outlined scenario, the policies would be as follows:

**Policy 1 - Saving replica space:**
* **Time Column:** `ts_day`
* **Condition:** `older than 30 days`
* **Actions:** `Set replicas to 0.`

**Policy 2 - Data removal:**
* **Time Column:** `ts_day`
* **Condition:** `older than 60 days`
* **Actions:** `Delete eligible partition(s)`
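
For illustration, the combined effect of these policies on eligible
partitions is roughly equivalent to running the following statements by hand
(a sketch only; the policy engine performs the actual work, and the partition
value shown is just an example):

```sql
-- Policy 1: drop replicas for a partition older than 30 days
ALTER TABLE data_table PARTITION (ts_day = '2024-07-01')
SET (number_of_replicas = 0);

-- Policy 2: delete data older than 60 days; filtering on the partition
-- column drops whole partitions rather than individual rows
DELETE FROM data_table
WHERE ts_day < NOW() - INTERVAL '60 days';
```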
83 changes: 83 additions & 0 deletions docs/cluster/backups.md
(cluster-backups)=
# Backups

The Backups page can be found in the detailed view of your cluster; it lists
all existing backups and lets you restore them.

By default, a backup is made every hour. Backups are kept for 14 days. In
addition, the last 14 backups are kept indefinitely, regardless of the state
of your cluster.

![Cloud Console cluster backups page](../_assets/img/cluster-backups.png)

You can also control the schedule of your backups by clicking the *Edit
backup schedule* button.

![Cloud Console cluster backups edit page](../_assets/img/cluster-backups-edit.png)

Here you can create a custom schedule by selecting any number of hour slots.
Backups will be created at the selected times. At least one backup per day is
mandatory.

To restore a particular backup, click the *Restore* button. A popup window
with a SQL statement will appear. Run this statement in your Admin UI
console, either by copy-pasting it or by clicking *Run query in Admin UI*.
The latter takes you directly to the Admin UI console with the statement
pre-filled.

![Cloud Console cluster backups restore page](../_assets/img/cluster-backups-restore.png)

You can choose between restoring the cluster fully or only specific tables.
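
The statement generated by the popup is typically a `RESTORE SNAPSHOT`
command. As a hypothetical example (the repository, snapshot, and table names
below are placeholders; the popup supplies the real ones):

```sql
-- Restore a single table from a snapshot; all names are illustrative only.
RESTORE SNAPSHOT "system_backup"."20240809120000"
TABLE "doc"."sample_data"
WITH (wait_for_completion = true);
```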

(cluster-cloning)=
## Cluster Cloning

Cluster cloning is the process of duplicating all the data from a specific
snapshot into a different cluster. Creating the new cluster isn't part of the
cloning process; you need to create the target cluster yourself. You can
clone a cluster from the Backups page.

![Cloud Console cluster backup snapshots](../_assets/img/cluster-backups.png)

Choose a snapshot and click the *Clone* button. As with restoring a backup,
you can choose between cloning the whole cluster or only specific tables.

![Cloud Console cluster clone popup](../_assets/img/cluster-clone-popup.png)

:::{note}
Keep in mind that a full cluster clone will include users, views, privileges,
and everything else. Cloning also doesn't distinguish between cluster plans,
meaning you can clone from a CR2 to a CR1 plan or any other combination.
:::

(cluster-cloning-fail)=
## Failed cloning

Cloning can fail or behave unexpectedly under the following circumstances:

- Tables with the same names already exist in the target cluster as in the
  source snapshot. In this case, the entire clone operation fails.
- There isn't enough storage left on the target cluster to accommodate the
  tables you're trying to clone. In this case, the clone may be incomplete
  because the cluster runs out of storage.
- You're trying to clone an invalid or no-longer-existing snapshot. This can
  happen when cloning through
  [Croud](https://cratedb.com/docs/cloud/cli/en/latest/). In this case, the
  cloning fails.
- You're trying to restore a table that is not included in the snapshot.
  This can happen when restoring snapshots through
  [Croud](https://cratedb.com/docs/cloud/cli/en/latest/). In this case, the
  cloning fails.

A failed clone is indicated by a banner on the cluster overview screen.

![Cloud Console cluster failed cloning](../_assets/img/cluster-clone-failed.png)
30 changes: 30 additions & 0 deletions docs/cluster/console.md
(cluster-console)=
# Console

The Console in CrateDB Cloud allows users to execute SQL queries seamlessly
against their CrateDB cluster. Users with the "Organization Admin" role can
access the Console from the left-hand navigation menu within a cluster.

- **Table and Schema Tree View:** Easily navigate through your database
structure.
- **Client-Side Query Validation:** Ensure your SQL queries are correct before
execution.
- **Multiple Query Execution:** Run several queries in sequence.
- **Query History:** Access and manage your past queries.

:::{important}
- The Console is available for all newly deployed clusters.
- For older clusters, this feature can be enabled on demand. Contact
[support](https://support.crate.io/) for activation.

The Console currently utilizes a dedicated database user `gc_admin` with full
cluster privileges.
:::

:::{note}
**Multi-Query Execution:**
When running multiple queries at once, the Console executes them
sequentially, not within a single session or transaction. If one query fails,
the subsequent queries will not be executed. Session settings are currently
not persisted between queries.
:::
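
As a small illustration of sequential execution (the table name is
hypothetical), the following statements can be submitted together; the
`INSERT` runs only if the `CREATE TABLE` before it succeeds:

```sql
-- Submitted as one multi-query; executed in order, stopping on failure.
CREATE TABLE IF NOT EXISTS "doc"."console_demo" (
    "id" INTEGER PRIMARY KEY,
    "name" TEXT
);
INSERT INTO "doc"."console_demo" ("id", "name") VALUES (1, 'example');
```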
27 changes: 27 additions & 0 deletions docs/cluster/export.md
(cluster-export)=
# Export

The "Export" section allows users to download specific tables/views. When you
first visit the Export tab, you can specify the name of a table/view,
format (CSV, JSON, or Parquet) and whether you'd like your data to be
gzip compressed (recommended for CSV and JSON files).

:::{important}
- The size limit for exports is 1 GiB.
- Exports are kept for 3 days, then automatically deleted.
:::

:::{note}
**Limitations with Parquet**:
Parquet is a highly compressed data format for very efficient storage of
tabular data. Note that OBJECT and ARRAY columns in CrateDB are JSON-encoded
when exported to Parquet (effectively saving them as strings). This is due to
the complexity of encoding structs and lists in the Parquet format, where
determining the exact schema might not be possible. When re-importing such a
Parquet file, make sure to pre-create the table with the correct schema.
:::
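
For example, if an exported table contained an OBJECT column, you could
pre-create the target table before re-importing (a sketch; the table and
column names are hypothetical):

```sql
-- OBJECT data is stored in Parquet as JSON-encoded strings; pre-creating
-- the table ensures the intended column types are in place on re-import.
CREATE TABLE IF NOT EXISTS "doc"."sensor_readings" (
    "ts" TIMESTAMP WITH TIME ZONE,
    "payload" OBJECT (DYNAMIC)
);
```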



