Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature][Task Plugin] Add DVC task plugin for MLops scenario (#10372) #10407

Merged
merged 2 commits into from
Jun 21, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
125 changes: 125 additions & 0 deletions docs/docs/en/guide/task/dvc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,125 @@
# DVC Node

## Overview

[DVC (Data Version Control)](https://dvc.org) is an excellent open-source version control system for machine learning projects.

The DVC plugin is used to use the data version management function of DVC on DolphinScheduler, helping users to carry out data version management easily.

The plugin provides the following three functions:

- Init DVC: Initialize the Git repository as a DVC repository and bind the address where the data is stored to store the actual data.
- Upload: Add or update specific data to the repository and record the version tag.
- Download: Download a specific version of data from the repository.

## Create Task

- Click `Project -> Management-Project -> Name-Workflow Definition`, and click the "Create Workflow" button to enter the
DAG editing page.
- Drag from the toolbar <img src="../../../../img/tasks/icons/dvc.png" width="15"/> task node to canvas.

## Task Example

First, introduce some general parameters of DolphinScheduler:

- **Node name**: The node name in a workflow definition is unique.
- **Run flag**: Identifies whether this node schedules normally, if it does not need to execute, select
the `prohibition execution`.
- **Descriptive information**: Describe the function of the node.
- **Task priority**: When the number of worker threads is insufficient, execute in the order of priority from high
to low, and tasks with the same priority will execute in a first-in first-out order.
- **Worker grouping**: Assign tasks to the machines of the worker group to execute. If `Default` is selected,
randomly select a worker machine for execution.
- **Environment Name**: Configure the environment name in which run the script.
- **Times of failed retry attempts**: The number of times the task failed to resubmit.
- **Failed retry interval**: The time interval (unit minute) for resubmitting the task after a failed task.
- **Delayed execution time**: The time (unit minute) that a task delays in execution.
- **Timeout alarm**: Check the timeout alarm and timeout failure. When the task runs exceed the "timeout", an alarm
email will send and the task execution will fail.
- **Predecessor task**: Selecting a predecessor task for the current task, will set the selected predecessor task as
upstream of the current task.

Here are some specific parameters for the DVC plugin:

- **DVC Task Type** :Upload, Download or Init DVC。
- **DVC Repository** :The DVC repository address associated with the task execution.

### Init DVC

Initialize the Git repository as a DVC repository and add a new data remote to save data.

After the project is initialized, it is still a Git repository, but with DVC features added.

The data is not actually stored in a Git repository, but somewhere else, and DVC keeps track of the version and address of the data and handles this relationship.

![dvc_init](../../../../img/tasks/demo/dvc_init.png)

**Task Parameter**

- **Remote Store Url** :The actual data is stored at the address. You can learn about the supported storage types from the [DVC supported storage types](https://dvc.org/doc/command-reference/remote/add#supported-storage-types) .

The example above shows that:
Initialize repository `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git` as a DVC project and bind the remote storage address to `~/dvc`

### Upload

Used to upload and update data and record version numbers.

![dvc_upload](../../../../img/tasks/demo/dvc_upload.png)

**Task Parameter**

- **Data Path in DVC Repository** :The data will be uploaded to this path in the repository.
- **Data Path In Worker** :Data path to be uploaded.
- **Version** :After the data is uploaded, the version tag for the data will be saved in `git tag`.
- **Version Message** :Version Message.

The example above shows that:

Upload data `/home/data/iris` to the root directory of repository `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git`. The file or folder of data is named `iris`.

Then run `git tag "iris_1.0" -m "init iris data"`. Record the version tag `iris_1.0` and the version message `inir iris data`.

### Download

Used to download data for a specific version.

![dvc_download](../../../../img/tasks/demo/dvc_download.png)

**Task Parameter**

- **Data Path in DVC Repository** :The path to the data to download in the DVC repository.
- **Data Path In Worker** :Path for saving data after the file is downloaded to the local.
- **Version** :The version of the data to download.

The example above shows that:

Download the data for iris data at version `iris_1.0` in repository `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git` to the `~/dvc_test/iris`

## Environment to prepare

### Install DVC

Make sure you have installed DVC, if not, you can run `pip install dvc` command to install it.

Get the 'dvc' path and configure the environment variables.

The conda environment is used as an example:

Install python PIP on Conda and configure conda's environment variables so that the component can correctly find the 'DVC' command

```shell
which dvc
# >> ~/anaconda3/bin/dvc
```

You need to enter the admin account to configure a conda environment variable(Please
install [anaconda](https://docs.continuum.io/anaconda/install/)
or [miniconda](https://docs.conda.io/en/latest/miniconda.html#installing ) in advance).

![dvc_env_config](../../../../img/tasks/demo/dvc_env_config.png)

Note During the configuration task, select the conda environment created above. Otherwise, the program cannot find the
Conda environment.

![dvc_env_name](../../../../img/tasks/demo/dvc_env_name.png)
110 changes: 110 additions & 0 deletions docs/docs/zh/guide/task/dvc.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,110 @@
# DVC节点

## 综述

[DVC(Data Version Control)](https://dvc.org) 是一个MLops领域一个优秀的开机器学习版本管理系统。

DVC 组件用于在DS上使用DVC的数据版本管理功能,帮助用户简易地进行数据的版本管理。组件提供如下三个功能:

- Init DVC: 将git仓库初始化为DVC仓库,并绑定存储数据的地址用于存储实际的数据。
- Upload: 将特定数据添加或者更新到仓库中,并记录版本号。
- Download: 从仓库中下载特定版本的数据。

## 创建任务

- 点击项目管理-项目名称-工作流定义,点击“创建工作流”按钮,进入 DAG 编辑页面;
- 拖动工具栏的 <img src="../../../../img/tasks/icons/dvc.png" width="15"/> 任务节点到画板中。

## 任务样例

首先介绍一些DS通用参数

- **节点名称** :设置任务的名称。一个工作流定义中的节点名称是唯一的。
- **运行标志** :标识这个节点是否能正常调度,如果不需要执行,可以打开禁止执行开关。
- **描述** :描述该节点的功能。
- **任务优先级** :worker 线程数不足时,根据优先级从高到低依次执行,优先级一样时根据先进先出原则执行。
- **Worker 分组** :任务分配给 worker 组的机器执行,选择 Default,会随机选择一台 worker 机执行。
- **环境名称** :配置运行脚本的环境。
- **失败重试次数** :任务失败重新提交的次数。
- **失败重试间隔** :任务失败重新提交任务的时间间隔,以分钟为单位。
- **延迟执行时间** :任务延迟执行的时间,以分钟为单位。
- **超时告警** :勾选超时告警、超时失败,当任务超过"超时时长"后,会发送告警邮件并且任务执行失败。
- **前置任务** :选择当前任务的前置任务,会将被选择的前置任务设置为当前任务的上游。

以下是一些DVC 组件的常用参数

- **DVC任务类型** :可以选择 Upload、Download、Init DVC。
- **DVC仓库** :任务执行时关联的仓库地址。

### Init DVC


将git仓库初始化为DVC仓库, 并绑定数据储存的地方。

项目初始化后,仍然为git仓库,不过添加了DVC的特性。

实际上数据并不保存在git仓库,而是存储在另外的地方,DVC会跟踪数据的版本和地址,并处理好这个关系。

![dvc_init](../../../../img/tasks/demo/dvc_init.png)

**任务参数**

- **数据存储地址**
:实际的数据保存的地址,支持的类型可见 [DVC supported storage types](https://dvc.org/doc/command-reference/remote/add#supported-storage-types)

如上述例子表示: 将仓库 `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git` 初始化为DVC项目,并绑定远程储存地址为 `~/dvc`

### Upload

用于上传和更新数据,并记录版本号。

![dvc_upload](../../../../img/tasks/demo/dvc_upload.png)

**任务参数**

- **DVC仓库中的数据路径** :上传的数据保存到仓库的地址。
- **Worker中数据路径** :需要上传的数据的地址。
- **数据版本** :上传数据后,为该版本数据打上的版本号,会保存到 git tag 里面。
- **数据版本信息** :本次上传需要备注的信息。

如上述例子表示: 将数据 `/home/data/iris` 上传到仓库 `git@github.com:<YOUR-NAME-OR-ORG>/dvc-data-repository-example.git`
的根目录下,数据的文件/文件夹名字为`iris`。 然后执行 `git tag "iris_1.0" -m "init iris data"`。 记录版本号 `iris_1.0`和 版本信息 'inir iris data'

### Download

用于下载特定版本的数据。

![dvc_download](../../../../img/tasks/demo/dvc_download.png)

**任务参数**

- **DVC仓库中的数据路径** :需要下载数据在仓库中的路径。
- **Worker中数据路径** :数据下载到本地后的保存地址。
- **数据版本** :需要下载的数据的版本。

如上述例子表示: 将仓库 `git@github.com:xxxx/dvc-data-repository-example.git` 版本为 `iris_1.0` 的 iris 的数据下载到 `~/dvc_test/iris`

## 环境准备

### dvc 安装

确保你已经安装DVC可以使用`pip install dvc`进行安装。

获取dvc地址, 并配置环境变量

下面以 conda 上的 python pip 安装为例子,配置 conda 的环境变量,使得组件能正确找到`dvc`命令

```shell
which dvc
# >> ~/anaconda3/bin/dvc
```

你需要进入admin账户配置一个conda环境变量(请提前[安装anaconda](https://docs.continuum.io/anaconda/install/)
或者[安装miniconda](https://docs.conda.io/en/latest/miniconda.html#installing) )。

![dvc_env_config](../../../../img/tasks/demo/dvc_env_config.png)

后续注意配置任务时,环境选择上面创建的conda环境,否则程序会找不到conda环境。

![dvc_env_name](../../../../img/tasks/demo/dvc_env_name.png)
Binary file added docs/img/tasks/demo/dvc_download.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/img/tasks/demo/dvc_env_config.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/img/tasks/demo/dvc_env_name.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/img/tasks/demo/dvc_init.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/img/tasks/demo/dvc_upload.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added docs/img/tasks/icons/dvc.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Original file line number Diff line number Diff line change
Expand Up @@ -171,6 +171,12 @@
<artifactId>dolphinscheduler-task-openmldb</artifactId>
<version>${project.version}</version>
</dependency>

<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-task-dvc</artifactId>
<version>${project.version}</version>
</dependency>
</dependencies>

</project>
46 changes: 46 additions & 0 deletions dolphinscheduler-task-plugin/dolphinscheduler-task-dvc/pom.xml
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
<?xml version="1.0" encoding="UTF-8"?>
<!--
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~
~ http://www.apache.org/licenses/LICENSE-2.0
~
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
~ See the License for the specific language governing permissions and
~ limitations under the License.
-->
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<parent>
<artifactId>dolphinscheduler-task-plugin</artifactId>
<groupId>org.apache.dolphinscheduler</groupId>
<version>dev-SNAPSHOT</version>
</parent>
<modelVersion>4.0.0</modelVersion>

<artifactId>dolphinscheduler-task-dvc</artifactId>
<packaging>jar</packaging>

<dependencies>
<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-spi</artifactId>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.dolphinscheduler</groupId>
<artifactId>dolphinscheduler-task-api</artifactId>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-collections4</artifactId>
</dependency>
</dependencies>
</project>
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.dolphinscheduler.plugin.task.dvc;

public class DvcConstants {
private DvcConstants() {
throw new IllegalStateException("Utility class");
}

public static final String CHECK_AND_SET_DVC_REPO = "which dvc || { echo \"dvc does not exist\"; exit 1; }; DVC_REPO=%s";

public static final String SET_DATA_PATH = "DVC_DATA_PATH=%s";

public static final String SET_DATA_LOCATION = "DVC_DATA_LOCATION=%s";

public static final String SET_VERSION = "DVC_VERSION=%s";

public static final String SET_MESSAGE = "DVC_MESSAGE=\"%s\"";

public static final String GIT_CLONE_DVC_REPO = "git clone $DVC_REPO dvc-repository; cd dvc-repository; pwd";

public static final String DVC_AUTOSTAGE = "dvc config core.autostage true --local || exit 1";

public static final String DVC_ADD_DATA = "dvc add $DVC_DATA_PATH -v -o $DVC_DATA_LOCATION --to-remote || exit 1";

public static final String GIT_UPDATE_FOR_UPDATE_DATA = "git commit -am \"$DVC_MESSAGE\"\n" +
"git tag \"$DVC_VERSION\" -m \"$DVC_MESSAGE\"\n" +
"git push --all\n" +
"git push --tags";

public static final String DVC_DOWNLOAD = "dvc get $DVC_REPO $DVC_DATA_LOCATION -o $DVC_DATA_PATH -v --rev $DVC_VERSION";


public static final String DVC_INIT = "dvc init || exit 1";

public static final String DVC_ADD_REMOTE = "dvc remote add origin %s -d";

public static final String GIT_UPDATE_FOR_INIT_DVC = "git commit -am \"init dvc project and add remote\"; git push";

}

Loading