[Docs][DataQuality]: Add DataQuality Docs #9512

Merged 8 commits into apache:dev on Apr 21, 2022

zixi0825
Member

Purpose of the pull request

This pull request adds the data quality docs.

Contributor

@caishunfeng caishunfeng left a comment

@zhongjiajie
Member

Will take a look this weekend

docs/docs/en/guide/task/data-quality.md

Member

@zhongjiajie zhongjiajie left a comment

LGTM, waiting for CI

@zhongjiajie zhongjiajie self-requested a review April 21, 2022 06:50
@zhongjiajie zhongjiajie merged commit 337696e into apache:dev Apr 21, 2022
Member

@Tianqi-Dotes Tianqi-Dotes left a comment

Better to rewrite it.

## 1.1 Introduction

The data quality task is used to check the data accuracy during the integration and processing of data. Data quality tasks in this release include single-table checking, single-table custom SQL checking, multi-table accuracy, and two-table value comparisons. The running environment of the data quality task is Spark 2.4.0, and other versions have not been verified, and users can verify by themselves.
- The execution flow of the data quality task is as follows:
Member

The execution flow of
->
The execution logic of
or
The execution code logic of

# 1 Overview
## 1.1 Introduction

The data quality task is used to check the data accuracy during the integration and processing of data. Data quality tasks in this release include single-table checking, single-table custom SQL checking, multi-table accuracy, and two-table value comparisons. The running environment of the data quality task is Spark 2.4.0, and other versions have not been verified, and users can verify by themselves.
Member

better add a blank line.


> The user defines the task in the interface, and the user input value is stored in `TaskParam`
When running a task, `Master` will parse `TaskParam`, encapsulate the parameters required by `DataQualityTask` and send it to `Worker`.
Worker runs the data quality task. After the data quality task finishes running, it writes the statistical results to the specified storage engine. The current data quality task result is stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`
Member

t_ds_dq_execute_result table of dolphinscheduler
->
table t_ds_dq_execute_result of the database dolphinscheduler.

> The user defines the task in the interface, and the user input value is stored in `TaskParam`
When running a task, `Master` will parse `TaskParam`, encapsulate the parameters required by `DataQualityTask` and send it to `Worker`.
Worker runs the data quality task. After the data quality task finishes running, it writes the statistical results to the specified storage engine. The current data quality task result is stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`
`Worker` sends the task result to `Master`, after `Master` receives `TaskResponse`, it will judge whether the task type is `DataQualityTask`, if so, it will read the corresponding result from `t_ds_dq_execute_result` according to `taskInstanceId`, and then The result is judged according to the check mode, operator and threshold configured by the user. If the result is a failure, the corresponding operation, alarm or interruption will be performed according to the failure policy configured by the user.
Member

and then The result
->
and then the result

Member

check mode
->
check formula
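
The flow in the quoted paragraph can be pictured with a small sketch. It is an illustration only, not DolphinScheduler code: `RESULT_TABLE` is an in-memory stand-in for `t_ds_dq_execute_result`, and the `TaskParam` fields and the function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the t_ds_dq_execute_result table,
# keyed by task instance id.
RESULT_TABLE = {}


@dataclass
class TaskParam:
    # Illustrative fields only; the real TaskParam layout is not shown in the source.
    rule_name: str
    operator: str
    threshold: float


def master_prepare(task_param: TaskParam) -> dict:
    """Master parses TaskParam and packs what the worker needs."""
    return {"rule_name": task_param.rule_name}


def worker_run(task_instance_id: int, payload: dict, statistics_value: float) -> str:
    """Worker runs the check and writes the statistic to the result store."""
    RESULT_TABLE[task_instance_id] = {
        "rule_name": payload["rule_name"],
        "statistics_value": statistics_value,
    }
    # Task type reported back to the master in the response.
    return "DataQualityTask"


def master_on_response(task_instance_id: int, task_type: str):
    """Master looks up the stored result only for data quality tasks."""
    if task_type != "DataQualityTask":
        return None
    return RESULT_TABLE.get(task_instance_id)
```

The point of the sketch is the lookup by `taskInstanceId`: the worker does not return the statistic directly, the master reads it back from storage before judging it.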

```properties
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```

Please fill in `data-quality.jar.name` according to the actual package name,
Member

,
->
.
If these two lines are one sentence, end the first line with ',' and start the second ('if you ....') in lower case. (When we write a long sentence, we usually break the line at about 35 words.)
If they are two sentences, end the first with '.' and start the second in upper case.

- Src filter conditions: such as the title, it will also be used when counting the total number of rows in the table, optional
- Src table check column: drop-down to select check column name
- start time: the start time of a time range
- end time: the end time of a time range
Member

E

Member

use upper case at the start

- Src table check column: drop-down to select check column name
- start time: the start time of a time range
- end time: the end time of a time range
- Time Format: Set the corresponding time format
Member

Time Format: Set the corresponding time format
->
Time Format: set the corresponding time format

- [Actual/Expected]x100%
- [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, ! =
- Threshold: The value used in the formula for comparison
Member

The value used
->
the value
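
As a rough illustration of how the quoted check formulas, check operators, and threshold fit together, here is a minimal sketch. It covers only the two percentage formulas visible in the snippet; the dictionary keys and the function name `comparison_holds` are made up for this example, and the snippet does not say whether a true comparison means pass or fail, so the function only reports whether the comparison holds.

```python
import operator

# The two percentage formulas visible in the quoted snippet.
# Both divide by the expected value, so expected must not be zero.
FORMULAS = {
    "actual_over_expected": lambda actual, expected: actual / expected * 100,
    "diff_over_expected": lambda actual, expected: (expected - actual) / expected * 100,
}

# The check operators listed in the snippet, mapped to Python comparisons.
OPERATORS = {
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "!=": operator.ne,
}


def comparison_holds(formula, op, actual, expected, threshold):
    """Evaluate the formula, then compare its value with the threshold."""
    value = FORMULAS[formula](actual, expected)
    return OPERATORS[op](value, threshold)
```

For example, with an actual value of 90, an expected value of 100, and the second formula, the computed value is 10 percent, which is then compared with the threshold using the chosen operator.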

- Note: The SQL must be statistical SQL, such as counting the number of rows, calculating the maximum value, minimum value, etc.
- select max(a) as max_num from ${src_table}, the table name must be filled like this
- Src filter conditions: such as the title, it will also be used when counting the total number of rows in the table, optional
- Check method:
Member

Check method:
->
Check method: select a suitable check method.
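
To make the `${src_table}` note in the quoted snippet concrete, the sketch below substitutes a table name into the statement and runs it. It is only an illustration: it uses Python's `string.Template` and an in-memory SQLite database, whereas the real task runs on Spark and does its own substitution; the table `demo` and the helper `run_statistic` are invented for the example.

```python
import sqlite3
from string import Template

# Statement from the quoted snippet; ${src_table} is the required placeholder.
STAT_SQL = Template("select max(a) as max_num from ${src_table}")


def run_statistic(conn, src_table):
    """Fill in the table name and return the single statistic value."""
    # Illustration only: the table name is substituted as plain text,
    # so it must come from a trusted source.
    sql = STAT_SQL.substitute(src_table=src_table)
    return conn.execute(sql).fetchone()[0]


# Small demo table so the statement has something to aggregate.
conn = sqlite3.connect(":memory:")
conn.execute("create table demo (a integer)")
conn.executemany("insert into demo (a) values (?)", [(1,), (5,), (3,)])
max_num = run_statistic(conn, "demo")
```

A statistical statement like this returns one row with one value, which is what the check formula later compares against the expected value.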

- Threshold: The value used in the formula for comparison
- Failure strategy
- Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
- Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent
Member

What's the difference between alarm and alert?
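
For reference, the two failure strategies in the quoted lines can be read as the following sketch. It assumes "alert" and "alarm" mean the same notification, which the snippet does not confirm; the state strings `"SUCCESS"` and `"FAILURE"` and the function name are hypothetical.

```python
def apply_failure_strategy(check_failed: bool, strategy: str):
    """Return (task_state, send_notification) for a finished check."""
    if not check_failed:
        # The check passed, so the strategy is not consulted.
        return ("SUCCESS", False)
    if strategy == "Alert":
        # Check failed, task still reported as successful, notification sent.
        return ("SUCCESS", True)
    if strategy == "Blocking":
        # Check failed, task reported as failed, notification sent.
        return ("FAILURE", True)
    raise ValueError(f"unknown failure strategy: {strategy}")
```

Under this reading, the only difference between the two strategies is the resulting task state; both send a notification.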

@zhongjiajie
Member

@Tianqi-Dotes Sorry, I merged it directly because we released 3.0.0-alpha, and this doc must be on our website.

fengjian1129 pushed a commit to fengjian1129/dolphinscheduler that referenced this pull request Apr 23, 2022
Co-authored-by: Jiajie Zhong <zhongjiajie955@gmail.com>
4 participants