Option to dump OpenLineage events that correspond to dataset/namespace from web #1927

Closed
mobuchowski opened this issue Mar 29, 2022 · 7 comments · Fixed by #2070

@mobuchowski
Contributor

Having this feature would make debugging and replicating errors much faster.

@wslulciuc wslulciuc added this to the Roadmap milestone Mar 31, 2022
@wslulciuc
Member

I would add a lineage export cmd to the marquez-cli that would export the OL events from the lineage_events table. For example:

$ java -jar marquez.jar lineage export > lineage.json
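As a rough illustration of what such an export could produce for a single dataset or namespace, here is a minimal Python sketch. It is not Marquez code: the helper names `events_for_dataset` and `dump_events` are hypothetical, and it assumes the events are already available as dicts following the OpenLineage run event shape (`inputs` and `outputs` lists whose entries carry `namespace` and `name`).

```python
import json
import sys
from typing import Iterable, Iterator, Optional


def events_for_dataset(events: Iterable[dict], namespace: str,
                       name: Optional[str] = None) -> Iterator[dict]:
    """Yield events whose inputs or outputs reference the given dataset.

    Hypothetical helper: matches on the dataset namespace and, if given,
    the dataset name as well.
    """
    for event in events:
        # A missing or null inputs/outputs field is treated as an empty list.
        datasets = (event.get("inputs") or []) + (event.get("outputs") or [])
        for dataset in datasets:
            same_namespace = dataset.get("namespace") == namespace
            same_name = name is None or dataset.get("name") == name
            if same_namespace and same_name:
                yield event
                break  # emit each event at most once


def dump_events(events: Iterable[dict], out=sys.stdout) -> int:
    """Write one JSON event per line and return the number written."""
    count = 0
    for event in events:
        out.write(json.dumps(event, sort_keys=True) + "\n")
        count += 1
    return count
```

Writing one event per line keeps the dump easy to grep and to replay against another instance when reproducing an error.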

@howardyoo
Collaborator

It would also be nice if the Marquez UI had a small button at the bottom that, when pressed, reveals a console panel showing all the raw OL events it has received, acting as a kind of debug console. Typically, users would use Marquez to visualize the events (after they get stored in its backend DB), but they may also want to monitor how the OL messages are actually being received.

@rossturk
Collaborator

rossturk commented Aug 9, 2022

Perhaps there could be a sortable/filterable/paginated table in the UI for lineage events.

There have been many times where I just wanted to see "what got emitted" or "did my pipeline do it right?" It would be an excellent debugging tool, and could help people build OL integrations more quickly.

Does an API exist that could support this? If not, one could possibly serve both a) data export and b) debugging use cases.

@mobuchowski
Contributor Author

mobuchowski commented Aug 10, 2022

Does an API exist that could support this? If not, one could possibly serve both a) data export and b) debugging use cases.

@wslulciuc @howardyoo @rossturk Do we want to

  1. display all events, perhaps sorted by time descending to see latest events?
  2. display all events, but filtered to a particular namespace we're looking at?
  3. display only events that relate to a particular job or dataset?

I think the API would look different depending on the decision here. Of course, the first option is the simplest.

@howardyoo
Collaborator

@rossturk, BTW, there is now a workaround for this issue: put the OL proxy between the client side and Marquez to eavesdrop on the raw events that are received. The setup would be to have the OL proxy in front and set up its http type (OpenLineage/OpenLineage@0e4a670) as the streaming target.
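For anyone who only wants to see what a client emits, the same idea can be sketched with a throwaway HTTP sink that prints whatever is POSTed to it. This is a stand-in for illustration, not the OL proxy and not its configuration: the names `RECEIVED`, `record`, `EventSink`, and `make_server` are hypothetical, and the sink accepts any path instead of modelling Marquez's API.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED = []  # raw events, kept in arrival order


def record(body: bytes) -> int:
    """Store and print one raw event; return the HTTP status to send back."""
    try:
        event = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return 400
    RECEIVED.append(event)
    print(json.dumps(event, indent=2))
    return 200


class EventSink(BaseHTTPRequestHandler):
    """Accepts POSTs on any path and hands the body to record()."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status = record(self.rfile.read(length))
        self.send_response(status)
        self.end_headers()

    def log_message(self, format, *args):
        pass  # silence the default access log so only events are printed


def make_server(port: int = 0) -> HTTPServer:
    """Bind the sink to localhost; port 0 lets the OS pick a free port."""
    return HTTPServer(("127.0.0.1", port), EventSink)
```

Calling `make_server(8080).serve_forever()` and pointing a client's HTTP transport at that address would print each event as it arrives; nothing is forwarded to Marquez in this sketch.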

@howardyoo
Collaborator

Does an API exist that could support this? If not, one could possibly serve both a) data export and b) debugging use cases.

@wslulciuc @howardyoo @rossturk Do we want to

  1. display all events, perhaps sorted by time descending to see latest events?
  2. display all events, but filtered to a particular namespace we're looking at?
  3. display only events that relate to a particular job or dataset?

I think the API would look different depending on the decision here. Of course, the first option is the simplest.

  1. I would want the events sorted by time descending to see the latest events; that would typically be how users skim through the dumps.
  2. Filtering based on namespace could be an option, if possible, and would certainly be very helpful.
  3. Filtering based on a particular job or dataset may be good to have, but I don't think it is a must-have.

Rather, I think it would be more useful to be able to filter on event types (like COMPLETE, FAIL, etc.) or on a particular time period.
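To make the options above concrete, here is a minimal Python sketch of the query semantics being discussed: newest first, with optional filters on namespace, event type, and time period, plus simple paging. It is not the Marquez API. The function name `query_events` and its parameters are hypothetical, the namespace filter is assumed to apply to the job namespace, and every `eventTime` is assumed to carry a UTC offset or a trailing `Z`.

```python
from datetime import datetime
from typing import Collection, Iterable, List, Optional


def parse_time(value: str) -> datetime:
    """Parse an ISO 8601 eventTime, accepting a trailing "Z" for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def query_events(events: Iterable[dict],
                 namespace: Optional[str] = None,
                 event_types: Optional[Collection[str]] = None,
                 after: Optional[datetime] = None,
                 before: Optional[datetime] = None,
                 limit: int = 100,
                 offset: int = 0) -> List[dict]:
    """Return matching events sorted by eventTime, newest first.

    `after` is inclusive and `before` is exclusive, so adjacent time
    windows do not return the same event twice.
    """
    selected = []
    for event in events:
        when = parse_time(event["eventTime"])
        if namespace is not None and event["job"]["namespace"] != namespace:
            continue
        if event_types is not None and event.get("eventType") not in event_types:
            continue
        if after is not None and when < after:
            continue
        if before is not None and when >= before:
            continue
        selected.append((when, event))
    # Sort on the parsed timestamp only, descending.
    selected.sort(key=lambda pair: pair[0], reverse=True)
    return [event for _, event in selected[offset:offset + limit]]
```

With this shape, option 1 is the call with no filters, and the namespace, event type, and time period filters compose freely with paging, which is roughly what a filterable table in the UI would need.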

@rossturk
Collaborator

rossturk commented Aug 10, 2022

I think the answer to @mobuchowski is: yes to all three. As a user, I want all three of those things.

What I'm imagining is a table with filter/sort controls at the top and page controls at the bottom. The columns could more or less match the underlying DB table.

I agree that filtering on dataset or job is more difficult and marginally less interesting 👍

@mobuchowski mobuchowski self-assigned this Aug 11, 2022
@mobuchowski mobuchowski moved this to In Progress in Marquez Aug 12, 2022
Repository owner moved this from In Progress to Done in Marquez Sep 15, 2022