[AIP-49] OpenTelemetry Traces for Apache Airflow Part 2 #40802
Conversation
@howardyoo Is this the final part? The cut-off for Airflow 2.10 is next week and we plan to release 2.10 in mid-August. In one of the last dev calls, there were some questions on AIP-49 https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow about when you want to target Phase 2 of it. cc @vikramkoka @potiuk who were interested in this during the dev call |
Hi @kaxil, yes - you're right. But honestly, I believe this could be included as part of Airflow 2.10 if timely reviews are given to the code. I really do not see any reason why the second part should wait for AF 3.0, since the instrumentation and tracing code will not change at all. You mentioned that part 2 could be put into AF 3.0, but I am curious what the reason would be for pushing part 2 to AF 3.0, as having only part 1 in AF 2.10 won't actually emit any traces at all. |
This looks good. And except for try_number, the tests should be fixed.
BTW, a hint to other reviewers - selecting "hide whitespace" helps enormously in review, as there are a number of whitespace-only indentation changes.
cool, will modify the code to not subtract 1 and push it out :-). Thanks for the comments!
|
Marked it for 2.10.0 but some tests are still failing |
Thank you, Jarek. Will take a look at them ASAP.
|
LOOKS GOOD! |
Anyone else who would like to review it (@ferruzzi maybe?) (as a reminder - hiding whitespace makes it much, much easier). |
combine the same message. Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
The indentation change in apache#40802 broke heartbeating when running the standalone DAG processor. This PR fixes it back.
@howardyoo Is there anything pending before we can mark the AIP-49 complete? |
None, AIP-49 is completed.
|
--------- Co-authored-by: D. Ferruzzi <ferruzzi@amazon.com>
span.set_attribute("run_id", key.run_id) | ||
span.set_attribute("task_id", key.task_id) | ||
span.set_attribute("try_number", key.try_number) | ||
span.set_attribute("command", str(command)) |
What is the value of putting the command in the span? It simply duplicates (in a format that is of no use as it's a single string) the dag_id, run_id, task_id etc.
I put the command in as an attribute assuming it might contain something beyond dag_id, run_id, task_id, etc. Since I did not have a deep understanding of what the command can be, it felt worth recording as part of the span. If there's no real value in instrumenting the details of 'command', I'd say we can remove that instrumentation in the next release.
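For illustration, a minimal sketch of what the span could look like with only the structured identifiers and without the stringified command, assuming the OpenTelemetry Python API; the span name and the TaskKey stand-in are hypothetical (the real code presumably uses Airflow's TaskInstanceKey as `key`):

from dataclasses import dataclass

from opentelemetry import trace


@dataclass
class TaskKey:
    """Illustrative stand-in for the `key` object in the snippet above."""

    dag_id: str
    run_id: str
    task_id: str
    try_number: int


tracer = trace.get_tracer(__name__)
key = TaskKey("example_dag", "manual__2024-07-17T00:00:00", "example_task", 1)

# Record only the structured identifiers; a stringified command would largely
# duplicate these fields in a form that is hard to query on.
with tracer.start_as_current_span("queued_task_instance") as span:
    span.set_attribute("dag_id", key.dag_id)
    span.set_attribute("run_id", key.run_id)
    span.set_attribute("task_id", key.task_id)
    span.set_attribute("try_number", key.try_number)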
if conf.has_option("traces", "otel_task_log_event") and conf.getboolean(
    "traces", "otel_task_log_event"
):
    from airflow.utils.log.log_reader import TaskLogReader

    task_log_reader = TaskLogReader()
    if task_log_reader.supports_read:
        metadata: dict[str, Any] = {}
        logs, metadata = task_log_reader.read_log_chunks(ti, ti.try_number, metadata)
        if ti.hostname in dict(logs[0]):
            message = str(dict(logs[0])[ti.hostname]).replace("\\n", "\n")
            while metadata["end_of_log"] is False:
                logs, metadata = task_log_reader.read_log_chunks(
                    ti, ti.try_number - 1, metadata
                )
                if ti.hostname in dict(logs[0]):
                    message = message + str(dict(logs[0])[ti.hostname]).replace("\\n", "\n")
            if span.is_recording():
                span.add_event(
                    name="task_log",
                    attributes={
                        "message": message,
                        "metadata": str(metadata),
                    },
                )
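For context, the gate above reads an option from the "traces" section of the Airflow config. A hedged sketch of how it could be switched on for local testing, assuming Airflow's standard AIRFLOW__{SECTION}__{KEY} environment-variable mapping:

import os

# Maps onto `otel_task_log_event = True` under [traces] in airflow.cfg, which is
# what conf.getboolean("traces", "otel_task_log_event") in the snippet above reads.
os.environ["AIRFLOW__TRACES__OTEL_TASK_LOG_EVENT"] = "True"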
@howardyoo @ferruzzi This is a huge no-no. The scheduler cannot do any processing that will block the main scheduling loop for so long, and going and reading all of the logs is going to block the scheduler loop for a noticeable time.
This block needs reverting, I'm afraid -- it is not a feature that can exist in the scheduler.
I understand. So we can revert this, no problem.
In that case, would it be okay to do this in some async way, so that it does not block the scheduler loop but can still fetch the logs?
|
There could be MBs of task logs potentially. Sending them via an OTel span seems like a bad idea on general principle. OTel already has logging, doesn't it? |
In the wild, we see logs big enough to OOM the webserver. I'd imagine the same would happen for the scheduler too. Definitely problematic, beyond just being slow. |
I understand. However, I know some Airflow users would like to have task logs sent out as span events and would see those as good value (hence the implementation).
Currently, emitting task logs is configurable, meaning you can turn it off if it becomes problematic. How about some additional config options so that the feature is not removed but provided as an option:
- provide a way to limit the max size of the task log, e.g. emit only the first N characters (e.g. the first 32k characters) and trim off the rest, so that if users want to view the remainder of the log they can use the log link.
The OTEL log implementation is just yet another structure that closely resembles the span event, so simply using the log SDK won't resolve this. However, I believe having a default limit (we could settle on something like the first 64k characters), plus an option to lower or raise that limit based on the user's preference, could work.
Any opinions on this?
|
I see the value of it indeed, but I agree there should be a limit. For a huge percentage (9X%) of cases the logs will be really small and very useful to see immediately in the OTEL span context, and for the rest it would be super useful to see just the beginning of the logs to get a bit more context.

How about having a reasonable (say 2K?) limit for the log size being sent, with some indication (ellipsis) that it's not complete. Maybe later we could also connect it with our logging framework, so that such a message could also contain (maybe in structured form) a link to the log message accessible in whatever remote logging we have configured (and, for task logs, links to the Airflow UI where you can see the logs from tasks). That would make OTEL spans really, really useful as a first/main part of "application debugging" problems.

I really see OTEL as the main way which will a) make debugging of problems with Airflow easier, and b) make it easier for us to help our users. One of the great features of OTEL and tools like Jaeger is that they have export capabilities. Similarly to py-spy and memray flamegraphs, such OTEL exports can be sent to us for further analysis when our users have OTEL enabled - seeing even limited logs included in such exports would be a fantastic aid that would let us open such an export in Jaeger, for example, and diagnose many issues much faster.

I think eventually we should even provide our users with information on how to set up some OTEL tools (Jaeger seems like an easy one) and how to create such exports so that we can analyse them (likely with some anonymisation/obfuscation options for sensitive names for users who care about it, but I guess that should be possible with tools like Jaeger).

This is really part of #40975 - "Improve Airflow's debugging story" - which, as the survey run by @omkar-foss clearly showed, needs improvement. I see OTEL as a big chance to give our users a fantastic, easy-to-set-up tool that provides us with much more data about the problems they are experiencing and allows us to diagnose and fix them way faster. |
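A minimal sketch of the limit idea above; the 2K cap, the ellipsis marker, and the helper name are illustrative only and not part of this PR:

DEFAULT_MAX_LOG_EVENT_CHARS = 2048  # the "say 2K?" cap suggested above


def truncate_for_span_event(message: str, limit: int = DEFAULT_MAX_LOG_EVENT_CHARS) -> str:
    """Return at most `limit` characters, marking any cut with an ellipsis."""
    if len(message) <= limit:
        return message
    return message[:limit] + "… [truncated]"


# Example: a 10k-character log would be capped before being attached as the
# "message" attribute of a task_log span event; the full log stays in Airflow.
print(len(truncate_for_span_event("x" * 10_000)))  # 2048 plus the marker length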
I agree with Jarek on his message above. Just created an issue: #43868 for this, so that we can start working on it.
|
I like the idea of truncating the logs. The big question there would be whether we should head the logs or tail the logs... if we are cutting them short, is it more useful to have the beginning or the end? I'd say maybe the end, but honestly, it's so hard to say. Many times you have to scroll back to find the real root cause.

Either way, if I may humbly make a request, can we please make an effort to refer to Traces and Metrics rather than calling both OTel? They are two distinct features which both use OTel, but I feel like the recent discussion around improving Traces has undermined confidence in the OTel Metrics implementation, which AFAIK is not having any issues. There is a project underway to improve the Metrics docs which seems to be stalled because of confusion around the Traces discussion. The two are not related, and IMHO discussion around Traces shouldn't be affecting the Metrics improvement project. |
You can't upload task logs from the scheduler loop. Even just a snippet of logs. It's not the place to do that. Can't be retrieving connections and fetching s3 blobs from scheduler loop. |
Yes, I actually thought about that as well! But for now, we may just have to make a decision on truncating, because it would require less work and thus be more robust. (Maybe in the future we may want to make it a configurable option to truncate by head, tail, or head+tail.) Yeah, traces and metrics should be treated differently, and we should try hard to reduce the confusion and noise around that.
|
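A hedged sketch of the configurable head/tail/head+tail truncation mentioned above; the mode names and default size are illustrative, not an API in this PR:

def truncate_log(message: str, limit: int = 2048, mode: str = "head") -> str:
    """Keep the start ("head"), the end ("tail"), or both ends ("head+tail")."""
    if len(message) <= limit:
        return message
    marker = "\n… [truncated] …\n"
    if mode == "head":
        return message[:limit] + marker
    if mode == "tail":
        return marker + message[-limit:]
    # "head+tail": spend half of the budget on each end of the log
    half = limit // 2
    return message[:half] + marker + message[-half:]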
Sorry but we can't be doing this in the scheduler |
Okay, if you say so, we can remove this from the code. Thanks!
|
closes #37752
This is Part 2 of AIP-49, which adds OpenTelemetry support to Airflow. Last year, a group of contributors delivered the first part of Airflow's OpenTelemetry commitment by providing OTEL metrics support. This PR addresses Part 2 of the OTEL implementation for Airflow, which provides the instrumentation that produces spans and span events for traces in Airflow.