
[AIP-49] OpenTelemetry Traces for Apache Airflow Part 2 #40802

Merged
merged 8 commits into from
Jul 20, 2024

Conversation

howardyoo
Contributor

closes #37752

This is Part 2 of the PR work for AIP-49, OpenTelemetry support for Airflow. Last year, a group of contributors delivered the first phase of Airflow's commitment to OpenTelemetry by adding OTel metrics support. This PR covers the second phase of the OTel implementation for Airflow, providing instrumentation that produces spans and span logs for Airflow traces.
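For readers less familiar with tracing, here is a minimal sketch of what this kind of span instrumentation looks like with the plain OpenTelemetry Python API. The span name and attribute values below are made up for illustration and are not the exact names used in this PR:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Open a span around a unit of work (e.g. a dag run or task execution),
# attach identifying attributes, and record a span event with extra context.
with tracer.start_as_current_span("dag_run") as span:
    span.set_attribute("dag_id", "example_dag")
    span.set_attribute("run_id", "manual__2024-07-16")
    span.add_event(name="task_log", attributes={"message": "task started"})
```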

@kaxil
Member

kaxil commented Jul 16, 2024

@howardyoo Is this the final part? The cut-off for Airflow 2.10 is next week, and we plan to release 2.10 in mid-August.

In one of the last dev calls, there were some questions about AIP-49 (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow) and when you want to target Phase 2 of it.

cc @vikramkoka @potiuk who were interested in this during the dev call

@howardyoo
Contributor Author

> @howardyoo Is this the final part? The cut-off for Airflow 2.10 is next week, and we plan to release 2.10 in mid-August.
>
> In one of the last dev calls, there were some questions about AIP-49 (https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow) and when you want to target Phase 2 of it.
>
> cc @vikramkoka @potiuk who were interested in this during the dev call

Hi @kaxil, yes - you're right. But honestly, I believe this could be included as part of Airflow 2.10 if the code gets timely reviews. I really don't see any reason why the second part should wait for AF 3.0, since the instrumentation and tracing code will not change at all. You mentioned that Part 2 could be put into AF 3.0, but I'm curious what the reason would be, given that having only Part 1 in AF 2.10 won't actually emit any traces at all.

Member

@potiuk potiuk left a comment


This looks good. Apart from the try_number comment, the tests should be fixed.

BTW, a hint to other reviewers - selecting "Hide whitespace" helps enormously in review, as there are a number of whitespace-only indentation changes.

@howardyoo
Contributor Author

howardyoo commented Jul 17, 2024 via email

@potiuk potiuk added this to the Airflow 2.10.0 milestone Jul 18, 2024
@potiuk
Member

potiuk commented Jul 18, 2024

Marked it for 2.10.0 but some tests are still failing

@howardyoo
Contributor Author

howardyoo commented Jul 18, 2024 via email

@potiuk
Member

potiuk commented Jul 18, 2024

LOOKS GOOD!

@potiuk
Member

potiuk commented Jul 18, 2024

Anyone else who would like to review it (@ferruzzi maybe?) (as a reminder - hiding whitespace makes it much, much easier).

@potiuk potiuk requested review from uranusjr and ferruzzi July 18, 2024 19:45
airflow/dag_processing/manager.py Outdated
airflow/dag_processing/manager.py Outdated
airflow/dag_processing/manager.py Outdated
airflow/executors/base_executor.py Outdated
airflow/jobs/job.py Outdated
airflow/traces/utils.py Outdated
howardyoo and others added 3 commits July 19, 2024 12:48
combine the same message.

Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
@potiuk potiuk force-pushed the otel/otel-trace-integration-2 branch from 39a603a to d33fe10 Compare July 19, 2024 10:48
@potiuk potiuk merged commit 0f4884c into apache:main Jul 20, 2024
80 checks passed
potiuk added a commit to potiuk/airflow that referenced this pull request Jul 22, 2024
The indentation change in apache#40802 broke heartbeating when running the standalone dag processor.

This PR fixes it.
potiuk added a commit that referenced this pull request Jul 22, 2024
…40929)

The indentation change in #40802 broke heartbeating when running the standalone dag processor.

This PR fixes it.
@ephraimbuddy ephraimbuddy added the changelog:skip label Jul 22, 2024
@kaxil
Member

kaxil commented Jul 24, 2024

@howardyoo Is there anything pending before we can mark AIP-49 complete?

@howardyoo
Contributor Author

howardyoo commented Jul 24, 2024 via email

romsharon98 pushed a commit to romsharon98/airflow that referenced this pull request Jul 26, 2024

---------

Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
span.set_attribute("run_id", key.run_id)
span.set_attribute("task_id", key.task_id)
span.set_attribute("try_number", key.try_number)
span.set_attribute("command", str(command))
Member


What is the value of putting the command in the span? It simply duplicates (in a format that is of no use as it's a single string) the dag_id, run_id, task_id etc.

Contributor Author


> What is the value of putting the command in the span? It simply duplicates (in a format that is of no use as it's a single string) the dag_id, run_id, task_id etc.

I added the command as an attribute on the assumption that it might contain something beyond dag_id, run_id, task_id, etc. Since I didn't have a deep understanding of what the command can contain, it felt worth recording as part of the span. If there's no real value in instrumenting the details of 'command', we can remove that instrumentation in the next release.
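For illustration, removing that instrumentation would leave just the identifying attributes from the snippet above (a sketch of the suggested change, not the final code):

```python
span.set_attribute("run_id", key.run_id)
span.set_attribute("task_id", key.task_id)
span.set_attribute("try_number", key.try_number)
# "command" is no longer recorded: it only duplicated the identifiers above
# as a single string, which is hard to query in a tracing backend.
```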

Comment on lines +848 to +872
if conf.has_option("traces", "otel_task_log_event") and conf.getboolean(
    "traces", "otel_task_log_event"
):
    from airflow.utils.log.log_reader import TaskLogReader

    task_log_reader = TaskLogReader()
    if task_log_reader.supports_read:
        metadata: dict[str, Any] = {}
        logs, metadata = task_log_reader.read_log_chunks(ti, ti.try_number, metadata)
        if ti.hostname in dict(logs[0]):
            message = str(dict(logs[0])[ti.hostname]).replace("\\n", "\n")
            while metadata["end_of_log"] is False:
                logs, metadata = task_log_reader.read_log_chunks(
                    ti, ti.try_number - 1, metadata
                )
                if ti.hostname in dict(logs[0]):
                    message = message + str(dict(logs[0])[ti.hostname]).replace("\\n", "\n")
            if span.is_recording():
                span.add_event(
                    name="task_log",
                    attributes={
                        "message": message,
                        "metadata": str(metadata),
                    },
                )
Member

@ashb ashb Nov 7, 2024


@howardyoo @ferruzzi This is a huge no-no. The scheduler cannot do any processing that blocks the main scheduling loop for so long, and reading all of the logs is going to block the scheduler loop for a noticeable time.

This block needs reverting, I'm afraid -- it is not a feature that can exist in the scheduler.

@howardyoo
Contributor Author

howardyoo commented Nov 7, 2024 via email

@ashb
Member

ashb commented Nov 7, 2024

There could potentially be MBs of task logs. Sending them via an OTel span seems like a bad idea on general principle. OTel already has logging support, doesn't it?

@jedcunningham
Member

> There could potentially be MBs of task logs.

In the wild, we see logs big enough to OOM the webserver. I'd imagine the same would happen for the scheduler too. Definitely problematic, beyond just being slow.

@howardyoo
Contributor Author

howardyoo commented Nov 7, 2024 via email

@potiuk
Member

potiuk commented Nov 11, 2024

> However, I know some of the airflow users would like to have task logs sent out as span events - and would see those as good value (hence the implementation).

I see the value of it indeed, but I agree there should be a limit. For a huge percentage of cases (9X%) the logs will be really small and very useful to see immediately in the OTel span context, and for the rest it would be super useful to see just the beginning of the logs to get a bit more context.

How about having a reasonable (say 2K?) limit for the log size being sent, with some indication (an ellipsis) that it's not complete? Maybe later we could also connect it with our logging framework, so that such a message could also contain (maybe in structured form) a link to the log message accessible in whatever remote logging we have configured (and, for task logs, links to the Airflow UI where you could see the logs from tasks).
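A minimal sketch of what such a cap could look like at the point where the task log is attached to the span, reusing `span` and `message` from the snippet under review above. The 2K value, helper name, and `truncated` attribute are illustrative, not an agreed design:

```python
MAX_LOG_EVENT_CHARS = 2048  # illustrative limit, not an agreed value

def clip_for_span(message: str, limit: int = MAX_LOG_EVENT_CHARS) -> str:
    """Return the message capped at `limit` characters, marking that it was cut short."""
    if len(message) <= limit:
        return message
    return message[:limit] + " ..."

# where the task log is attached to the span
if span.is_recording():
    span.add_event(
        name="task_log",
        attributes={
            "message": clip_for_span(message),
            "truncated": len(message) > MAX_LOG_EVENT_CHARS,
        },
    )
```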

That would make OTel spans really, really useful as a first / main entry point for debugging application problems.

I really see OTel as the main way to a) make debugging of problems with Airflow easier and b) make it easier for us to help our users. One of the great features of OTel and tools like Jaeger is that they have export capabilities. Similarly to py-spy and memray flamegraphs, such OTel exports can be sent to us for further analysis when our users have OTel enabled - seeing even limited logs included in such exports would be a fantastic aid, allowing us to open an export in Jaeger, for example, and diagnose many issues much faster.

I think eventually we should even provide our users with information on how to set up OTel tools (Jaeger seems like an easy one) and how to create such exports so that we can analyse them (likely with some anonymisation/obfuscation options for sensitive names, for users who care about that, but I guess that should be possible with tools like Jaeger).

This is really part of #40975 - "Improve Airflow's debugging story" - which, as the survey run by @omkar-foss clearly showed, needs improvement. I see OTel as a big chance to give our users a fantastic, easy-to-set-up tool that provides us with much more data about the problems they are experiencing, allowing us to diagnose and fix them far faster.

@howardyoo
Contributor Author

howardyoo commented Nov 11, 2024 via email

@ferruzzi
Contributor

I like the idea of truncating the logs. The big question there would be whether we should head or tail the logs... if we are cutting them short, is it more useful to have the beginning or the end? I'd say maybe the end, but honestly, it's hard to say. Many times you have to scroll back to find the real root cause.
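For illustration, a configurable head/tail/both truncation could be as small as this helper (the function name and mode values are hypothetical, just to sketch the options):

```python
def truncate_log(message: str, limit: int, mode: str = "tail") -> str:
    """Keep at most `limit` characters of a log message, from the head, the tail, or both ends."""
    if len(message) <= limit:
        return message
    if mode == "head":
        return message[:limit] + " ..."
    if mode == "tail":
        return "... " + message[-limit:]
    # "both": keep half from each end so the start and the root cause near the end survive
    half = limit // 2
    return message[:half] + " ... " + message[-half:]
```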

Either way, if I may humbly make a request: can we please make an effort to refer to Traces and Metrics rather than calling both "OTel"? They are two distinct features that both use OTel, but I feel the recent discussion around improving Traces has undermined confidence in the OTel Metrics implementation, which AFAIK is not having any issues. There is a project underway to improve the Metrics docs that seems to be stalled because of confusion around the Traces discussion. The two are not related, and IMHO the discussion around Traces shouldn't be affecting the Metrics improvement project.

@dstandish
Contributor

You can't upload task logs from the scheduler loop - not even just a snippet of logs. It's not the place to do that. We can't be retrieving connections and fetching S3 blobs from the scheduler loop.

@howardyoo
Contributor Author

howardyoo commented Nov 12, 2024 via email

@dstandish
Contributor

> Yes, I actually thought about that as well! But for now, we may just have to decide on truncating, since it would require less work and thus be more robust. (Maybe in the future we could make it a configurable option to truncate by head, tail, or head+tail.) Yeah, traces and metrics should be treated differently, and we should try hard to reduce the confusion and noise around that.

Sorry but we can't be doing this in the scheduler

@howardyoo
Contributor Author

howardyoo commented Nov 12, 2024 via email

Labels
area:dev-tools
area:Executors-core (LocalExecutor & SequentialExecutor)
area:Scheduler (including HA (high availability) scheduler)
area:Triggerer
changelog:skip (Changes that should be skipped from the changelog (CI, tests, etc..))
Development

Successfully merging this pull request may close these issues.

[AIP-49] Airflow support for OTEL traces
8 participants