[Docs][DataQuality]: Add DataQuality Docs #9512
Conversation
LGTM, PTAL @Tianqi-Dotes @zhongjiajie
Will take a look this weekend
LGTM, waiting for CI
better rewrite it
## 1.1 Introduction

The data quality task is used to check the data accuracy during the integration and processing of data. Data quality tasks in this release include single-table checking, single-table custom SQL checking, multi-table accuracy, and two-table value comparisons. The running environment of the data quality task is Spark 2.4.0, and other versions have not been verified, and users can verify by themselves.
- The execution flow of the data quality task is as follows:
`The execution flow of` -> `The execution logic of` or `The execution code logic of`
# 1 Overview
## 1.1 Introduction

The data quality task is used to check the data accuracy during the integration and processing of data. Data quality tasks in this release include single-table checking, single-table custom SQL checking, multi-table accuracy, and two-table value comparisons. The running environment of the data quality task is Spark 2.4.0, and other versions have not been verified, and users can verify by themselves.
better add a blank line.
> The user defines the task in the interface, and the user input value is stored in `TaskParam`
When running a task, `Master` will parse `TaskParam`, encapsulate the parameters required by `DataQualityTask` and send it to `Worker`.
Worker runs the data quality task. After the data quality task finishes running, it writes the statistical results to the specified storage engine. The current data quality task result is stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`
`t_ds_dq_execute_result` table of `dolphinscheduler` -> table `t_ds_dq_execute_result` of the database `dolphinscheduler`.
> The user defines the task in the interface, and the user input value is stored in `TaskParam`
When running a task, `Master` will parse `TaskParam`, encapsulate the parameters required by `DataQualityTask` and send it to `Worker`.
Worker runs the data quality task. After the data quality task finishes running, it writes the statistical results to the specified storage engine. The current data quality task result is stored in the `t_ds_dq_execute_result` table of `dolphinscheduler`
`Worker` sends the task result to `Master`, after `Master` receives `TaskResponse`, it will judge whether the task type is `DataQualityTask`, if so, it will read the corresponding result from `t_ds_dq_execute_result` according to `taskInstanceId`, and then The result is judged according to the check mode, operator and threshold configured by the user. If the result is a failure, the corresponding operation, alarm or interruption will be performed according to the failure policy configured by the user.
`and then The result` -> `and then the result`
`check mode` -> `check formula`
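To make the flow in the quoted passage easier to follow, here is a minimal sketch of the judgment step, with made-up numbers and names (the class, the `FailureStrategy` enum, and the configured operator/threshold are illustrative assumptions, not the actual DolphinScheduler implementation):

```java
// Minimal, illustrative sketch of the judgment the Master performs after the
// Worker has written the statistics. All names and numbers are assumptions.
public class DataQualityJudgeSketch {

    enum FailureStrategy { ALERT, BLOCK }

    public static void main(String[] args) {
        // Statistics assumed to have been read back from t_ds_dq_execute_result.
        double actual = 9_900;     // e.g. rows that satisfied the rule
        double expected = 10_000;  // e.g. total rows (counted with the filter condition)

        // The two check formulas quoted in the docs.
        double accuracy = actual / expected * 100;                // [Actual/Expected] x 100% = 99%
        double errorRate = (expected - actual) / expected * 100;  // [(Expected-Actual)/Expected] x 100% = 1%

        // User-configured operator ">=" and threshold 95 applied to the first formula.
        // We assume here that the check passes when the configured comparison holds.
        boolean passed = accuracy >= 95;

        // The failure strategy decides what happens when the check does not pass.
        FailureStrategy strategy = FailureStrategy.ALERT;
        if (passed) {
            System.out.printf("DQ check passed: accuracy=%.1f%%, errorRate=%.1f%%%n", accuracy, errorRate);
        } else if (strategy == FailureStrategy.ALERT) {
            System.out.println("DQ check failed: an alert is sent, the task itself still ends successfully");
        } else {
            System.out.println("DQ check failed: an alert is sent and the task is marked as failed");
        }
    }
}
```

In words: the Worker produces the raw statistics, and the Master turns them into a pass/fail decision using the configured formula, operator, threshold, and failure strategy.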
```properties
data-quality.jar.name=dolphinscheduler-data-quality-dev-SNAPSHOT.jar
```

Please fill in `data-quality.jar.name` according to the actual package name,
`,` -> `.`
If these two lines form one sentence, use `,` at the end of the first one and start the second with a lower-case 'if you ...' (when we write a long sentence, we usually break the line at about 35 words). If they are two separate sentences, end the first with `.` and start the second with an upper-case letter.
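On the quoted snippet above: the value of `data-quality.jar.name` simply has to match the file name of the packaged data-quality jar. As a hypothetical example, if a release build produced `dolphinscheduler-data-quality-3.0.0-alpha.jar`, the property would be set to exactly that name.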
- Src filter conditions: such as the title, it will also be used when counting the total number of rows in the table, optional
- Src table check column: drop-down to select check column name
- start time: the start time of a time range
- end time: the end time of a time range
use upper case at the start
- Src table check column: drop-down to select check column name
- start time: the start time of a time range
- end time: the end time of a time range
- Time Format: Set the corresponding time format
`Time Format: Set the corresponding time format` -> `Time Format: set the corresponding time format`
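To make the quoted parameters concrete, an illustrative (entirely hypothetical) configuration could be: check column `user_id`, filter condition `status = 1`, start time `2022-01-01 00:00:00`, end time `2022-01-31 23:59:59`, and time format `yyyy-MM-dd HH:mm:ss`; as the quoted text notes, the filter condition is applied both to the checked rows and to the total row count.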
- [Actual/Expected]x100%
- [(Expected-Actual)/Expected]x100%
- Check operators: =, >, >=, <, <=, !=
- Threshold: The value used in the formula for comparison
`The value used` -> `the value`
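A worked example of the quoted formulas, with made-up numbers: if Actual = 9900 and Expected = 10000, then [Actual/Expected]x100% = 99% and [(Expected-Actual)/Expected]x100% = 1%; the configured operator and threshold are then applied to the chosen formula's result, e.g. evaluating whether 99 >= 95 for an operator of `>=` and a threshold of 95.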
- Note: The SQL must be statistical SQL, such as counting the number of rows, calculating the maximum value, minimum value, etc.
- select max(a) as max_num from ${src_table}, the table name must be filled like this
- Src filter conditions: such as the title, it will also be used when counting the total number of rows in the table, optional
- Check method:
`Check method:` -> `Check method: select a suitable check method.`
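On the custom statistics SQL mentioned above: a hypothetical example in the same style as the quoted one would be `select count(*) as total_count from ${src_table} where create_time >= '2022-01-01'`, i.e. an aggregate with an alias, with the source table always referenced through the `${src_table}` placeholder and the column names made up for illustration.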
- Threshold: The value used in the formula for comparison
- Failure strategy
  - Alert: The data quality task failed, the DolphinScheduler task result is successful, and an alert is sent
  - Blocking: The data quality task fails, the DolphinScheduler task result is failed, and an alarm is sent
what's the difference between alarm and alert
@Tianqi-Dotes Sorry, I merged it directly because we released 3.0.0-alpha and this doc must be on our website.
Co-authored-by: Jiajie Zhong <zhongjiajie955@gmail.com>
Purpose of the pull request
This pull request adds the data quality docs.