Removal of local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code> metric #31004

howardyoo · 2023-05-02T00:18:30Z

I'd like to start a discussion on what we should do with the Airflow metric
local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>, which is listed in https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.

This particular metric appears when a user runs a task locally with the Airflow Local Executor (https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/local.html); in a normal Airflow production environment this would not be the case. So its main use case would be a user running a DAG in a local development environment to develop and test something.

The metric is created when the local executor runs a task. Based on the task's exit code, it creates a counter whose name follows the pattern above, recording the job_id, dag_id, task_id, and the task's return code (0 or a higher integer). The value of the metric is the number of times the task exited with that return code.

Example

airflow.local_task_job.task_exit.162312.my_dag.task_1.0         value: 0
airflow.local_task_job.task_exit.162312.my_dag.task_1.1         value: 1
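
To make the naming pattern concrete, here is a minimal sketch of how such a counter could be built and incremented. It is not Airflow's implementation: `counters` and `record_task_exit` are hypothetical names, and the in-memory `Counter` only stands in for a StatsD-style backend. The `airflow.` prefix seen in the example above is left out.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a StatsD-style counter backend.
counters = Counter()


def record_task_exit(job_id: int, dag_id: str, task_id: str, return_code: int) -> str:
    """Increment the counter whose name follows the pattern from the post."""
    # Every distinct combination of the four fields yields a distinct name.
    name = f"local_task_job.task_exit.{job_id}.{dag_id}.{task_id}.{return_code}"
    counters[name] += 1
    return name


# A task that exits with return code 1 under job 162312.
record_task_exit(162312, "my_dag", "task_1", 1)
```

Because all four fields are part of the name, each new job_id or return code starts a separate counter rather than incrementing an existing one.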

I believe the original intent of this metric was to let users observe whether a locally executed DAG's tasks finish without errors (return_code = 0) or end with errors (return_code != 0). It could also help a user tell whether a locally executed DAG has finished or is hanging (no return code): if the metric does not show up in the time series database, the DAG may not have completed and may still be running.

However, this metric seems to introduce more problems than its usefulness justifies.

As the example above shows, this metric design creates a high-cardinality condition: a single DAG run produces multiple time series. job_id is an integer that identifies the particular job (in this case, the local executor?) and changes every time the DAG runs in a different process. In addition, putting the return code at the end of the metric name yields many time series whose names end mostly in 0 or 1, which increases the cardinality of the metrics data even further.
Good metric data arrives at regular intervals and carries information that helps users spot 'disruptive signals' and so detect problems. This counter may be reported at regular intervals, but its value does not change unless the local executor re-runs the DAG, and a re-run may even get a different job_id. In general, users end up with many flat time series that rarely change (only when something runs), so I usually consider this kind of data event data, not metric data.
Also, how useful is this metric in practice? When running something locally, the user would in most cases check the logs, which are much better for troubleshooting than metrics. Publishing this information as metrics feels like overkill: time series data is most useful when a large number of things are running and the data can be aggregated to detect 'anomalies' or visually correlated with other problems.
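
The cardinality point can be illustrated with a small sketch. The numbers are made up: it assumes 100 runs of the same task, each with a fresh job_id, and an arbitrary one-in-ten failure rate. The variant without job_id is shown only for comparison and is not an existing Airflow metric.

```python
# Hypothetical data: 100 runs of one task, each run getting a new job_id.
# Every tenth run is assumed to fail with return code 1.
runs = [(1000 + i, "my_dag", "task_1", 0 if i % 10 else 1) for i in range(100)]

# Series names following the pattern from the post (job_id in the name).
with_job_id = {
    f"local_task_job.task_exit.{job_id}.{dag_id}.{task_id}.{rc}"
    for job_id, dag_id, task_id, rc in runs
}

# Comparison only: the same names with job_id left out.
without_job_id = {
    f"local_task_job.task_exit.{dag_id}.{task_id}.{rc}"
    for _job_id, dag_id, task_id, rc in runs
}

print(len(with_job_id))     # one series per run
print(len(without_job_id))  # one series per (task, return code) pair
```

With job_id in the name, the series count grows with every run; without it, the count is bounded by the number of tasks times the number of distinct return codes.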

Given the above observations, I would like to hear what everybody in the Airflow community thinks. If there is a critical and compelling reason to keep this metric for some key use cases, I would like to hear about it.

If not, since this metric does more harm than good, I'd like to see whether we could remove it from the list of Airflow counter metrics in the future.

potiuk (Collaborator) · 2023-05-06T13:51:48Z

This particular metric appears when a user runs a task locally with the Airflow Local Executor (https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/local.html); in a normal Airflow production environment this would not be the case. So its main use case would be a user running a DAG in a local development environment to develop and test something.

Just to clarify some assumptions above (which I think are not entirely correct):

LocalExecutor is a perfectly valid, production-grade executor. A large number of Airflow installations use the local executor, and people should still be able to monitor them. A less well-known fact (though some people do this) is that you can run multiple Schedulers, each with its own LocalExecutor, and get a pretty scalable Airflow installation that way. So in many cases LocalExecutor is actually a pretty good solution for production.
Another less well-known fact: the Airflow Kubernetes executor actually uses LocalExecutor to run tasks. It starts a local executor in the Pod and executes the single Task with that executor.

However, I agree that the cardinality introduced by job_id is problematic and makes the metric next to useless. I would be for removing it entirely.

abullus · 2024-04-24T09:59:32Z

Just to add a voice of support - we started using KubernetesExecutors and this metric completely exploded the number of metrics we were producing. I'd be for removing it entirely (or at least putting it behind a flag).
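
One way to read "behind a flag" is a name-prefix block list checked before a metric is emitted. The following is a minimal sketch of that idea only; `BLOCKED_PREFIXES` and `should_emit` are hypothetical names, not Airflow settings or APIs.

```python
# Hypothetical block list: metric names starting with any of these prefixes
# are dropped before they reach the metrics backend.
BLOCKED_PREFIXES = ("local_task_job.task_exit",)


def should_emit(metric_name: str, blocked_prefixes=BLOCKED_PREFIXES) -> bool:
    """Return False when the metric name starts with a blocked prefix."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    return not metric_name.startswith(tuple(blocked_prefixes))


print(should_emit("local_task_job.task_exit.162312.my_dag.task_1.0"))
print(should_emit("scheduler_heartbeat"))
```

A prefix check like this drops every job_id and return-code variant at once, so the block list itself does not need to know the individual series names.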

howardyoo (Author) · 2024-04-24T13:00:16Z

In my opinion, attributes such as the return code introduce unnecessary time series, one per value, so I also agree that putting this information in metrics may not have been a great idea. It would be really nice if it could be part of the 'traces' instead, since traces can carry this information without exploding the number of time series in metrics.

NickMoignard · 2024-11-01T01:29:16Z

I am in agreement. This metric is a pain when running on Kubernetes.
