
[AIP-49] OpenTelemetry Traces for Apache Airflow Part 2 #40802

Merged
merged 8 commits into from
Jul 20, 2024

Conversation

howardyoo
Contributor

closes #37752

This is Part 2 of the PR work for AIP-49, OpenTelemetry support for Airflow. Last year, a group of contributors delivered the first phase of Airflow's commitment to OpenTelemetry by adding OTel metrics support. This PR covers the second phase of the OTel implementation for Airflow, providing instrumentation that produces spans and span logs for Airflow traces.
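For readers less familiar with tracing, here is a minimal sketch of what this kind of span instrumentation looks like with the plain OpenTelemetry Python API. The span name and attribute values below are made up for illustration and are not the exact names used in this PR:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Open a span around a unit of work (e.g. a dag run or task execution),
# attach identifying attributes, and record a span event with extra context.
with tracer.start_as_current_span("dag_run") as span:
    span.set_attribute("dag_id", "example_dag")
    span.set_attribute("run_id", "manual__2024-07-16")
    span.add_event(name="task_log", attributes={"message": "task started"})
```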

@kaxil
Member

kaxil commented Jul 16, 2024

@howardyoo Is this the final part? The cut-off for Airflow 2.10 is next week, and we plan to release 2.10 in mid-August.

In one of the last dev calls, there were some questions about AIP-49 (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow) and when you want to target Phase 2 of it.

cc @vikramkoka @potiuk who were interested in this during the dev call

@howardyoo
Contributor Author

> @howardyoo Is this the final part? The cut-off for Airflow 2.10 is next week, and we plan to release 2.10 in mid-August.
>
> In one of the last dev calls, there were some questions about AIP-49 (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow) and when you want to target Phase 2 of it.
>
> cc @vikramkoka @potiuk who were interested in this during the dev call

Hi @kaxil, yes - you're right. But honestly, I believe this could be included as part of Airflow 2.10 if the code gets timely reviews. I really don't see any reason why the second part should wait for AF 3.0, since the instrumentation and tracing code will not change at all. You mentioned that Part 2 could be put into AF 3.0, but I'm curious what the reason would be, given that having only Part 1 in AF 2.10 won't actually emit any traces at all.

Member

@potiuk potiuk left a comment


This looks good. Apart from the try_number comment, the tests should be fixed.

BTW, a hint to other reviewers - selecting "Hide whitespace" helps enormously in review, as there are a number of whitespace-only indentation changes.

@howardyoo
Contributor Author

howardyoo commented Jul 17, 2024 via email

@potiuk potiuk added this to the Airflow 2.10.0 milestone Jul 18, 2024
@potiuk
Member

potiuk commented Jul 18, 2024

Marked it for 2.10.0 but some tests are still failing

@howardyoo
Contributor Author

howardyoo commented Jul 18, 2024 via email

@potiuk
Member

potiuk commented Jul 18, 2024

LOOKS GOOD!

@potiuk
Member

potiuk commented Jul 18, 2024

Anyone else who would like to review it (@ferruzzi maybe?) (as a reminder - hiding whitespace makes it much, much easier).

@potiuk potiuk requested review from uranusjr and ferruzzi July 18, 2024 19:45
airflow/dag_processing/manager.py Outdated
airflow/dag_processing/manager.py Outdated
airflow/dag_processing/manager.py Outdated
airflow/executors/base_executor.py Outdated
airflow/jobs/job.py Outdated
airflow/traces/utils.py Outdated
howardyoo and others added 3 commits July 19, 2024 12:48
combine the same message.

Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
@potiuk potiuk force-pushed the otel/otel-trace-integration-2 branch from 39a603a to d33fe10 Compare July 19, 2024 10:48
@potiuk potiuk merged commit 0f4884c into apache:main Jul 20, 2024
80 checks passed
potiuk added a commit to potiuk/airflow that referenced this pull request Jul 22, 2024
The indentation change in apache#40802 broke heartbeating when running the standalone dag processor.

This PR fixes it.
potiuk added a commit that referenced this pull request Jul 22, 2024
…40929)

The indentation change in #40802 broke heartbeating when running the standalone dag processor.

This PR fixes it.
@ephraimbuddy ephraimbuddy added the changelog:skip label Jul 22, 2024
@kaxil
Member

kaxil commented Jul 24, 2024

@howardyoo Is there anything pending before we can mark AIP-49 complete?

@howardyoo
Contributor Author

howardyoo commented Jul 24, 2024 via email

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024

---------

Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
span.set_attribute("run_id", key.run_id)
span.set_attribute("task_id", key.task_id)
span.set_attribute("try_number", key.try_number)
span.set_attribute("command", str(command))
Member


What is the value of putting the command in the span? It simply duplicates (in a format that is of no use as it's a single string) the dag_id, run_id, task_id etc.

Contributor Author


> What is the value of putting the command in the span? It simply duplicates (in a format that is of no use as it's a single string) the dag_id, run_id, task_id etc.

I added the command as an attribute on the assumption that it might contain something beyond dag_id, run_id, task_id, etc. Since I didn't have a deep understanding of what the command can contain, it felt worth recording as part of the span. If there's no real value in instrumenting the details of 'command', we can remove that instrumentation in the next release.
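For illustration, removing that instrumentation would leave just the identifying attributes from the snippet above (a sketch of the suggested change, not the final code):

```python
span.set_attribute("run_id", key.run_id)
span.set_attribute("task_id", key.task_id)
span.set_attribute("try_number", key.try_number)
# "command" is no longer recorded: it only duplicated the identifiers above
# as a single string, which is hard to query in a tracing backend.
```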

Comment on lines +848 to +872
if conf.has_option("traces", "otel_task_log_event") and conf.getboolean(
    "traces", "otel_task_log_event"
):
    from airflow.utils.log.log_reader import TaskLogReader

    task_log_reader = TaskLogReader()
    if task_log_reader.supports_read:
        metadata: dict[str, Any] = {}
        logs, metadata = task_log_reader.read_log_chunks(ti, ti.try_number, metadata)
        if ti.hostname in dict(logs[0]):
            message = str(dict(logs[0])[ti.hostname]).replace("\\n", "\n")
            while metadata["end_of_log"] is False:
                logs, metadata = task_log_reader.read_log_chunks(
                    ti, ti.try_number - 1, metadata
                )
                if ti.hostname in dict(logs[0]):
                    message = message + str(dict(logs[0])[ti.hostname]).replace("\\n", "\n")
            if span.is_recording():
                span.add_event(
                    name="task_log",
                    attributes={
                        "message": message,
                        "metadata": str(metadata),
                    },
                )
Member

@ashb ashb Nov 7, 2024


@howardyoo @ferruzzi This is a huge no-no. The scheduler cannot do any processing that blocks the main scheduling loop for so long, and reading all of the logs is going to block the scheduler loop for a noticeable time.

This block needs reverting, I'm afraid -- it is not a feature that can exist in the scheduler.

@howardyoo
Contributor Author

howardyoo commented Nov 7, 2024 via email

@ashb
Member

ashb commented Nov 7, 2024

There could potentially be MBs of task logs. Sending them via an OTel span seems like a bad idea on general principle. OTel already has logging support, doesn't it?

@jedcunningham
Member

> There could potentially be MBs of task logs.

In the wild, we see logs big enough to OOM the webserver. I'd imagine the same would happen for the scheduler too. Definitely problematic, beyond just being slow.

@howardyoo
Contributor Author

howardyoo commented Nov 7, 2024 via email

@potiuk
Member

potiuk commented Nov 11, 2024

> However, I know some of the airflow users would like to have task logs sent out as span events - and would see those as good value (hence the implementation).

I see the value of it indeed, but I agree there should be a limit. For a huge percentage of cases (9X%) the logs will be really small and very useful to see immediately in the OTel span context, and for the rest it would be super useful to see just the beginning of the logs to get a bit more context.

How about having a reasonable (say 2K?) limit for the log size being sent, with some indication (an ellipsis) that it's not complete? Maybe later we could also connect it with our logging framework, so that such a message could also contain (maybe in structured form) a link to the log message accessible in whatever remote logging we have configured (and, for task logs, links to the Airflow UI where you could see the logs from tasks).
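A minimal sketch of what such a cap could look like at the point where the task log is attached to the span, reusing `span` and `message` from the snippet under review above. The 2K value, helper name, and `truncated` attribute are illustrative, not an agreed design:

```python
MAX_LOG_EVENT_CHARS = 2048  # illustrative limit, not an agreed value

def clip_for_span(message: str, limit: int = MAX_LOG_EVENT_CHARS) -> str:
    """Return the message capped at `limit` characters, marking that it was cut short."""
    if len(message) <= limit:
        return message
    return message[:limit] + " ..."

# where the task log is attached to the span
if span.is_recording():
    span.add_event(
        name="task_log",
        attributes={
            "message": clip_for_span(message),
            "truncated": len(message) > MAX_LOG_EVENT_CHARS,
        },
    )
```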

That would make OTel spans really, really useful as a first / main entry point for debugging application problems.

I really see OTel as the main way to a) make debugging of problems with Airflow easier and b) make it easier for us to help our users. One of the great features of OTel and tools like Jaeger is that they have export capabilities. Similarly to py-spy and memray flamegraphs, such OTel exports can be sent to us for further analysis when our users have OTel enabled - seeing even limited logs included in such exports would be a fantastic aid, allowing us to open an export in Jaeger, for example, and diagnose many issues much faster.

I think eventually we should even provide our users with information on how to set up OTel tools (Jaeger seems like an easy one) and how to create such exports so that we can analyse them (likely with some anonymisation/obfuscation options for sensitive names, for users who care about that, but I guess that should be possible with tools like Jaeger).

This is really part of #40975 - "Improve Airflow's debugging story" - which, as the survey run by @omkar-foss clearly showed, needs improvement. I see OTel as a big chance to give our users a fantastic, easy-to-set-up tool that provides us with much more data about the problems they are experiencing, allowing us to diagnose and fix them far faster.

@howardyoo
Contributor Author

howardyoo commented Nov 11, 2024 via email

@ferruzzi
Contributor

I like the idea of truncating the logs. The big question there would be whether we should head or tail the logs... if we are cutting them short, is it more useful to have the beginning or the end? I'd say maybe the end, but honestly, it's hard to say. Many times you have to scroll back to find the real root cause.
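For illustration, a configurable head/tail/both truncation could be as small as this helper (the function name and mode values are hypothetical, just to sketch the options):

```python
def truncate_log(message: str, limit: int, mode: str = "tail") -> str:
    """Keep at most `limit` characters of a log message, from the head, the tail, or both ends."""
    if len(message) <= limit:
        return message
    if mode == "head":
        return message[:limit] + " ..."
    if mode == "tail":
        return "... " + message[-limit:]
    # "both": keep half from each end so the start and the root cause near the end survive
    half = limit // 2
    return message[:half] + " ... " + message[-half:]
```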

Either way, if I may humbly make a request: can we please make an effort to refer to Traces and Metrics rather than calling both "OTel"? They are two distinct features that both use OTel, but I feel the recent discussion around improving Traces has undermined confidence in the OTel Metrics implementation, which AFAIK is not having any issues. There is a project underway to improve the Metrics docs that seems to be stalled because of confusion around the Traces discussion. The two are not related, and IMHO the discussion around Traces shouldn't be affecting the Metrics improvement project.

@dstandish
Contributor

You can't upload task logs from the scheduler loop - not even just a snippet of logs. It's not the place to do that. We can't be retrieving connections and fetching S3 blobs from the scheduler loop.

@howardyoo
Contributor Author

howardyoo commented Nov 12, 2024 via email

@dstandish
Contributor

> Yes, I actually thought about that as well! But for now, we may just have to decide on truncating, since it would require less work and thus be more robust. (Maybe in the future we could make it a configurable option to truncate by head, tail, or head+tail.) Yeah, traces and metrics should be treated differently, and we should try hard to reduce the confusion and noise around that.

Sorry but we can't be doing this in the scheduler

@howardyoo
Contributor Author

howardyoo commented Nov 12, 2024 via email

Labels
area:dev-tools
area:Executors-core (LocalExecutor & SequentialExecutor)
area:Scheduler (including HA (high availability) scheduler)
area:Triggerer
changelog:skip (Changes that should be skipped from the changelog (CI, tests, etc..))
Development

Successfully merging this pull request may close these issues.

[AIP-49] Airflow support for OTEL traces
8 participants