Improve Airflow's debugging story #40975

kaxil · 2024-07-23T22:20:06Z

Summary

As we prepare for the release of Airflow 3.0, one of the key areas that need significant enhancement is the debugging experience.

Current Challenges

Insufficient Logging: Logs are often fragmented, in some cases overly verbose or non-existent and lack sufficient detail to easily trace issues. We should do an audit of the existing logs.
Complex Tracebacks: Debugging stack traces can be difficult due to the complex nature of DAG (Directed Acyclic Graph) execution and requires a full-running Airflow. Airflow's dag.test and task.test does a good job already but we should see if we can do even better.
Error Handling: Current error messages are not always informative or actionable, making it hard to understand the root cause of failures. We should do an audit of the existing errors.
Tooling Integration: Lack of integration with modern debugging and observability tools hinders the debugging process. Can we create a listing tool or some capabilities in the Airflow CLI that catches obvious errors? airflow dags parse does a job at it, worth checking if it is sufficient or not.

Whoever takes on this task should conduct a user research on the mailing list, Slack, Meetup or Airflow Summit to identify other common debugging problems that can be fixed.

The text was updated successfully, but these errors were encountered:

potiuk · 2024-07-24T00:13:31Z

Let me add to it what I wrote about OTEL in the https://lists.apache.org/thread/b2bvn8sbxfncg9qpvry9w142944mnlj6 - this might be a great tool to hlep with things. I a not sure if I want to take lone ownership about that one - maybe there will be someone else who would like to take a look and explore things as well - but I am happy to be deeply involved in that one.

Dev-iL · 2024-07-24T00:14:46Z

I'd like to be involved in this effort in some capacity. At least: brainstorming, qa, and documentation.

omkar-foss · 2024-07-24T14:14:32Z

Happy to help out with some of the logging and error handling implementation.

The debug snapshot idea sounds very useful @potiuk. It may give a canonical view of the user's environment. I suppose Jaeger provides a similar tool called Anonymizer, which generates a shareable json of a trace - probably same one that you were referring to in your mail. We can build our own debug snapshot util, or can think of using this tool with Jaeger since it supports the existing OTEL metrics and traces.

kaxil · 2024-07-24T14:52:07Z

@Dev-iL Could I assign this GitHub issue to you? You can the lead the "scoping" part of this epic by talking to Jarek and others on Slack, mailing list and other venues and come back with a concrete proposal. Would you like to do that?

Dev-iL · 2024-07-24T15:12:38Z

@kaxil Honestly? It sounds a bit scary going from contributing minor patches to being responsible for an important feature in an upcoming release. I prefer to actively observe and learn, at least once, how something like this is done and take on a similar responsibility after I know how much time/work it requires.

kaxil · 2024-07-25T00:39:34Z

Absolutely, that's completely fine

@kaxil Honestly? It sounds a bit scary going from contributing minor patches to being responsible for an important feature in an upcoming release. I prefer to actively observe and learn, at least once, how something like this is done and take on a similar responsibility after I know how much time/work it requires.

@omkar-foss Do you want to take a stab at leading it?

omkar-foss · 2024-07-25T18:24:37Z

@omkar-foss Do you want to take a stab at leading it?

@kaxil I would love to take the lead on this, but right now I suppose I'm still a rookie in the ways of the Airflow community.

So for this one, I'll prefer to assist all of you in every way possible, while trying to get a better grasp of the processes, codebase etc. Hope that's okay, thanks for considering me though 😇

omkar-foss · 2024-07-31T20:29:47Z

Whoever takes on this task should conduct a user research on the mailing list, Slack, Meetup or Airflow Summit to identify other common debugging problems that can be fixed.

@kaxil Any idea if there's a predefined user research template that has been used for prior releases?

If not, I'd like to propose the following for conducting the survey:

We can create a survey form with questions pertaining to understanding the users' debugging journey. Probably can use something like SurveyMonkey.
We can have groups of questions in the form, each group for a section in "Current Challenges" above. For example, the groups could be Logging, Traceback, Error Handling, Tooling & Integrations.
We can have 3 to 5 questions for each group. Let's try to keep the survey as brief and concise as possible.
The survey form can be circulated on the mailing list, Slack, and other places.
We can collate the feedback from the survey form and prioritize items accordingly.

Please let me know your thoughts on this, thanks.

cc: @potiuk @Dev-iL

Dev-iL · 2024-08-01T14:10:13Z

@omkar-foss The main question is who the target audience of the research is, where possible answers are: maintainers, contributors, power users, general public, etc. Based on @kaxil's instructions, I'd say mostly power-users and above. If that is the case, I'm assuming most will be willing to participate in a survey, even if it has questions on topics people might not have an opinion on. If on the other hand, we're looking to get more participants, I think a literal survey is not the way, since people might open it, see how long it is, and just give up. That, of course, would be a terrible waste, because there are likely many use-cases that will not be represented.

For the above reason, I was thinking something like a feature voting platform (example1, example2) could be suitable - that way, if someone has a pain-point related to how a particular system works, they can look for existing posts or briefly explain what they have in mind (possibly with a template like a bug report) and allow others to vote or add to these suggestions. This also takes care of much of the aggregation work of the results.

omkar-foss · 2024-08-05T07:33:34Z

Hey @Dev-iL, I agree with your reasoning above. I checked out the sample Feature Upvote board that you've shared above and it surely feels simpler (and quicker) to submit compared to a regular survey form.

I suppose we'll need an initial list of features on the upvote board for the participants to vote, would be great to hear if you've any thoughts around it.

Not sure how much help I can be on this, but I'm here so feel free to tag me if you need any assistance! :)

potiuk · 2024-08-05T07:40:25Z

@omkar-foss The main question is who the target audience of the research is, where possible answers are: maintainers, contributors, power users, general public, etc. Based on @kaxil's instructions, I'd say mostly power-users and above.

I'd say mostly power-users - yes, but also the tooling and debuggability should be targeted for "new" users. I think power-users mostly know their ways - they can do remote debugging, they know how to connect their IDEs to the code, they are able to even use pdb, py-spy and other tools while remote shelling to container instances etc.

But the goal here is to shorten the path between "I wrote some DAG and it does not work" to "how do I most effectively find inspect and understand what's going on there" - for a user who just wrote their first few dags.

I think an assumption should be that that person has some Python experience, they have an IDE (PyCharm/ VSCode) and they are willing to follow some instructions on setting up things first - while ideally this should be one-time setup and they should be able to re-use it easily (and teach others how to do it).

If that is the case, I'm assuming most will be willing to participate in a survey, even if it has questions on topics people might not have an opinion on. If on the other hand, we're looking to get more participants, I think a literal survey is not the way, since people might open it, see how long it is, and just give up. That, of course, would be a terrible waste, because there are likely many use-cases that will not be represented.

I think yes - survey is a good idea if well prepared and those power-users might indeed be willing to share their experiences - we can even leverage the upcoming Airlfow summit and do some prices / recognition and generally a bit more fuss about it - so if we could do it still in August and maybe run the survey during the Summit as well, we could likely make it much more efficient.

Dev-iL · 2024-08-07T09:11:25Z

@potiuk @omkar-foss In the interest of moving ahead with this, I've made a google doc so we can start hashing out this survey collaboratively. Currently, it's publicly open for commenting - please send me your google account via slack so I could add you to the editors. If there are any privacy or other concerns, I don't mind moving the document to another platform.

kaxil · 2024-08-07T22:27:40Z

@Dev-iL Drop a mail to dev@airflow.apache.org too (Public archive: https://lists.apache.org/list.html?dev@airflow.apache.org). I am sure a lot of developer & users might want to add things to it as well as in Airflow's slack channel

Dev-iL · 2024-08-11T07:26:50Z

It's been a few days, and the document hasn't seen any activity (outside of my own placeholder ideas), nor did anyone approach me for editing rights. If this trend continues, we won't have the survey ready on time.

@kaxil I just saw your comment on the mailing list. My plan was to first iterate on the survey's structure in docs, move to form once satisfied, then circulate it for responses.

omkar-foss · 2024-08-12T02:42:42Z

I've made a google doc so we can start hashing out this survey collaboratively.

Doc looks good to me. Just one question/suggestion - will all questions be optional, or some mandatory, some optional? My suggestion would be to keep as many questions optional especially free text type questions (Q 2.4, 3.4, 4.4, 4.5). Reason being not all people will have feedback suiting each question.

Dev-iL · 2024-08-12T04:52:51Z

@omkar-foss don't suggest - decide. I, too, think questions should be mostly optional. As for the contents of the survey - I don't believe it's ready. It currently has questions asking about general sentiments on things, and I don't know how actionable it will be unless users answer the free text questions en masse.

I'll give you an example: suppose user satisfaction with the airflow documentation comes out as "medium" overall - what do you do about this? OTOH, suppose we had a multi-select question that mentioned airflow features introduced in the last few 2.x releases, asking if users find the examples provided for them sufficient - now that would be something actionable. See what I mean?

It needs the eyes of someone who knows airflow and its power user community better than I do, to know the right questions to ask, potentially about specific components, plugins, use-cases, etc, so that feedback is insightful and useful.

potiuk · 2024-08-13T23:17:16Z

I think none of the maintainers know "power users" well. Almost by definition, we are not running, nor maining airlfow and we do not have teams of people working together on DAGs. we are pretty much blind-folded when it comes to their needs and can at most guess what is troublesome for them or what can help them.

We mostly know how to debug Airflow itself, not how to debug Airflow DAGs. There are huge and significant differences for workflows, tooling and integration with IDEs. Same as with documentation - we are very POOR documentation writers, because a) we think about internals and not externals b) we have a lot of knowledge and assumptions that readers might not have and we might fail to explain it to them c) we tend to focus on HOW things are done not WHAT our users might want to learn form it.

That's why we NEED power users themselvs and ideally people who work in teams and have an opportunity to lead and decide on those questions and questionaire. We might definitely advise on decision making but we should not "lead" such process.

omkar-foss · 2024-08-15T05:40:12Z

That's why we NEED power users themselvs and ideally people who work in teams and have an opportunity to lead and decide on those questions and questionaire.

Yes, we're on the same page. We're now in the phase of collecting feedback on finalizing the survey draft on Airflow Slack, hoping for quicker response and finding users who use Airflow along with their teams. Starting with #contributors channel for now (most maintainers, less teams activity), we'll eventually check with #new-contributors, #user-troubleshooting, #documentation, etc. (least maintainers, most teams activity) in that order, as required.

Would be great if we all can continue this conversation from this issue to Airflow Slack (on #contributors channel) so we can discuss quicker and move closer to rolling out the final user survey.

amoghrajesh · 2024-08-16T05:46:52Z

I also got a chance to review the doc and make some suggestions to it.
@omkar-foss @Dev-iL I too would like to collaborate on this issue if that's ok by you.
I think one very good opportunity would be to create the questionnaire, get it reviewed by @kaxil / @potiuk like folks and also someone from the product side, and distribute it as a QR / link at the Airflow Summit because the Summit will have people from varying backgrounds, who are related in some way to Airflow

potiuk · 2024-08-21T13:16:41Z

I also looked at it - and actually I have a comment a bit contrary to those early comments of @amoghrajesh who insisted on "choice" answers.

Since we are not really sure about the debugging usage in a number of places I find the rating questions (Often/Rare/Satisfied etc.) telling us very little - especially that we also have no baseline to compare it.

I think this survey will be answered by a small number of people (not few 100s but few 10s maybe) so statistical aggregation of the data for such a small sample will be very misleading and useless - we will anyhow get mostly answers from people who are frustrated by their experiences, this is almost a given, so any stats based on the ranked answers will be a) super biased b) very little telling.

I think the biggest value of this survey is to get some concrete examples, stories, unknown to us ways how people are debugging Airlflow and the "free form" answer is absolutely most important insight we can get from it - we can learn for example that somoene uses x.y.z tool in this specific way, and that they miss that and this feature there - but we will never be able to ask the right question for it - especially one tha thave "rated" answer".

So I think pretty much all the questions there should be of the type:

I have no problems with it - this works fine
I have a problem with it - here is a detailed description

Or

I do not use any tools
i use some tools -> here describe what you are using

And I think the choice should be in most cases binary.

Otherwise I'd find very little value finding out that 15 of 20 people find that informations are often misleading without any additional explanation.

So I think all the questions that have 5 choices of satisfaction should be decresed to 2 choices ("not my problem/my problem) and the scond should be accompanied with obligatory explanation why. Yes it will make the survey longer to fill, and yes it will decrease the number of responses we get but I feel this will be way more useful for us.

kaxil · 2024-09-02T13:46:16Z

fyi, following are the docs that have actionable next steps based on the questions (and options) in the survey:

kaxil · 2024-09-02T17:07:29Z

The survey form is ready: https://s.apache.org/airflow-debugging-survey2024 , thanks to @Dev-iL , @omkar-foss & @amoghrajesh

kaxil · 2024-09-05T15:08:39Z

Thanks to @Dev-iL -- we have a QR code that links to the survey

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

This will allow him to interact with the GitHub project for sig-debugging: #40975

omkar-foss · 2024-10-18T16:44:37Z

Hi all! As per discussion, we'll be tracking all issues related to Airflow Debugging Story (based on debugging survey responses) on this project: https://github.com/orgs/apache/projects/421

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

potiuk · 2024-11-11T12:15:14Z

Also see #40802 (comment) discussion. I believe with OTEL and traces (and even including limited set of logs in the traces) we are closer to address big gap in debugging of Airflow where we can give our users a tool to provide us way more diagnostics information that will allow us to analyse, diagnose, and fix many problems much more efficiently.

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

kaxil added the airflow3.0:candidate Potential candidates for Airflow 3.0 label Jul 23, 2024

kaxil mentioned this issue Jul 23, 2024

Release Airflow 3.0 #39593

Open

10 tasks

dosubot bot added area:logging kind:feature Feature Requests labels Jul 23, 2024

kaxil assigned Dev-iL, amoghrajesh and omkar-foss Sep 2, 2024

kaxil added a commit to astronomer/airflow that referenced this issue Oct 18, 2024

Add omkar-foss to the triage team

f4d7841

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

kaxil mentioned this issue Oct 18, 2024

Add omkar-foss to the triage team #43169

Merged

kaxil added a commit that referenced this issue Oct 18, 2024

Add omkar-foss to the triage team (#43169)

7d6990f

This will allow him to interact with the GitHub project for sig-debugging: #40975

kaxil added this to Debugging Improvements - Airflow 3 Oct 18, 2024

kaxil moved this to In Progress in Debugging Improvements - Airflow 3 Oct 18, 2024

rawwar mentioned this issue Oct 19, 2024

Improve debug logging #43186

Open

2 tasks

harjeevanmaan pushed a commit to harjeevanmaan/airflow that referenced this issue Oct 23, 2024

Add omkar-foss to the triage team (apache#43169)

8843c72

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

PaulKobow7536 pushed a commit to PaulKobow7536/airflow that referenced this issue Oct 24, 2024

Add omkar-foss to the triage team (apache#43169)

87c1438

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

RNHTTR mentioned this issue Oct 30, 2024

Point users to a TaskInstance's Event Log if tasks are missing #43510

Open

2 tasks

potiuk mentioned this issue Nov 11, 2024

[AIP-49] OpenTelemetry Traces for Apache Airflow Part 2 #40802

Merged

omkar-foss mentioned this issue Nov 11, 2024

Create a basic guiding doc - How to Debug Your Airflow Deployment #43892

Open

2 tasks

ellisms pushed a commit to ellisms/airflow that referenced this issue Nov 13, 2024

Add omkar-foss to the triage team (apache#43169)

a745369

This will allow him to interact with the GitHub project for sig-debugging: apache#40975

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve Airflow's debugging story #40975

Improve Airflow's debugging story #40975

kaxil commented Jul 23, 2024

potiuk commented Jul 24, 2024

Dev-iL commented Jul 24, 2024

omkar-foss commented Jul 24, 2024 •

edited

Loading

kaxil commented Jul 24, 2024

Dev-iL commented Jul 24, 2024

kaxil commented Jul 25, 2024

omkar-foss commented Jul 25, 2024

omkar-foss commented Jul 31, 2024 •

edited

Loading

Dev-iL commented Aug 1, 2024

omkar-foss commented Aug 5, 2024

potiuk commented Aug 5, 2024

Dev-iL commented Aug 7, 2024

kaxil commented Aug 7, 2024 •

edited

Loading

Dev-iL commented Aug 11, 2024

omkar-foss commented Aug 12, 2024

Dev-iL commented Aug 12, 2024

potiuk commented Aug 13, 2024

omkar-foss commented Aug 15, 2024 •

edited

Loading

amoghrajesh commented Aug 16, 2024

potiuk commented Aug 21, 2024 •

edited

Loading

kaxil commented Sep 2, 2024 •

edited

Loading

kaxil commented Sep 2, 2024

kaxil commented Sep 5, 2024 •

edited

Loading

omkar-foss commented Oct 18, 2024

potiuk commented Nov 11, 2024

Improve Airflow's debugging story #40975

Improve Airflow's debugging story #40975

Comments

kaxil commented Jul 23, 2024

Summary

Current Challenges

potiuk commented Jul 24, 2024

Dev-iL commented Jul 24, 2024

omkar-foss commented Jul 24, 2024 • edited Loading

kaxil commented Jul 24, 2024

Dev-iL commented Jul 24, 2024

kaxil commented Jul 25, 2024

omkar-foss commented Jul 25, 2024

omkar-foss commented Jul 31, 2024 • edited Loading

Dev-iL commented Aug 1, 2024

omkar-foss commented Aug 5, 2024

potiuk commented Aug 5, 2024

Dev-iL commented Aug 7, 2024

kaxil commented Aug 7, 2024 • edited Loading

Dev-iL commented Aug 11, 2024

omkar-foss commented Aug 12, 2024

Dev-iL commented Aug 12, 2024

potiuk commented Aug 13, 2024

omkar-foss commented Aug 15, 2024 • edited Loading

amoghrajesh commented Aug 16, 2024

potiuk commented Aug 21, 2024 • edited Loading

kaxil commented Sep 2, 2024 • edited Loading

kaxil commented Sep 2, 2024

kaxil commented Sep 5, 2024 • edited Loading

omkar-foss commented Oct 18, 2024

potiuk commented Nov 11, 2024

omkar-foss commented Jul 24, 2024 •

edited

Loading

omkar-foss commented Jul 31, 2024 •

edited

Loading

kaxil commented Aug 7, 2024 •

edited

Loading

omkar-foss commented Aug 15, 2024 •

edited

Loading

potiuk commented Aug 21, 2024 •

edited

Loading

kaxil commented Sep 2, 2024 •

edited

Loading

kaxil commented Sep 5, 2024 •

edited

Loading