Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added AzureBlobFileSystem support for StructuredDatasets #1109

Merged
merged 2 commits into from
Jul 25, 2022

Conversation

MorpheusXAUT
Copy link
Contributor

TL;DR

This PR adds support for storing StructuredDatasets using AzureBlobFileSystem (abfs).

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

As discussed before, we've added support for abfs (using adlfs/stow under the hood) for StructuredDatasets by adding it to the registered protocols for transformers.
I've also noticed one file (plugins/flytekit-spark/flytekitplugins/spark/sd_transformers.py) which still used string constants and didn't support GCS either. As we're currently not using Spark anywhere, I wasn't able to verify this change though, so I can revert those lines if you'd prefer.

Tracking Issue

flyteorg/flyte#2709

Follow-up issue

NA

@welcome
Copy link

welcome bot commented Jul 21, 2022

Thank you for opening this pull request! 🙌

These tips will help get your PR across the finish line:

  • Most of the repos have a PR template; if not, fill it out to the best of your knowledge.
  • Sign off your commits (Reference: DCO Guide).

Signed-off-by: Nick Müller <nmueller@blackshark.ai>
@MorpheusXAUT MorpheusXAUT force-pushed the structured-datasets-abfs branch from 81b3370 to 82da680 Compare July 21, 2022 19:43
@codecov
Copy link

codecov bot commented Jul 21, 2022

Codecov Report

Merging #1109 (f5bdbb3) into master (c6e7237) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #1109   +/-   ##
=======================================
  Coverage   86.95%   86.95%           
=======================================
  Files         276      276           
  Lines       25492    25493    +1     
  Branches     2865     2865           
=======================================
+ Hits        22167    22168    +1     
  Misses       2847     2847           
  Partials      478      478           
Impacted Files Coverage Δ
flytekit/types/structured/basic_dfs.py 100.00% <100.00%> (ø)
flytekit/types/structured/structured_dataset.py 93.06% <100.00%> (+0.02%) ⬆️
flytekit/deck/deck.py 82.97% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c6e7237...f5bdbb3. Read the comment docs.

@wild-endeavor
Copy link
Contributor

hey @MorpheusXAUT do you also have the data persistence plugin for abfs? is that what you meant by you previously added support for it under the hood?

@MorpheusXAUT
Copy link
Contributor Author

MorpheusXAUT commented Jul 21, 2022

hey @MorpheusXAUT do you also have the data persistence plugin for abfs? is that what you meant by you previously added support for it under the hood?

@wild-endeavor oh, good point 🤔 I believe that still hasn't been cleaned up & contributed back to this repo, sorry about that. 😕
I'll try to get that sorted first thing tomorrow. Should we add that to the same PR or create a new one?

@wild-endeavor
Copy link
Contributor

this pr is fine. thanks!

Signed-off-by: Nick Müller <nmueller@blackshark.ai>
@MorpheusXAUT
Copy link
Contributor Author

MorpheusXAUT commented Jul 22, 2022

@wild-endeavor Looking at it again, I was actually mistaken last night.

We don't use an extra (custom-written) datapersistence plugin for abfs, but instead make use of fsspec and its adlfs (albeit a fork at the moment, since we're waiting for a PR to be merged).

I've added adlfs to flytekit-data-fsspec plugin's extras_require map, the version constraint matches the version we're using/have tested it with.
I believe that's everything required on flytekit's/the data persistance plugin's side as the actual import for adlfs was configured in our projects.

EDIT: not quite sure why that one check failed, looks unrelated to me?

@wild-endeavor
Copy link
Contributor

rerunning the failed onnx job. not entirely sure what's happening there.

Comment on lines 40 to 44
BIGQUERY = "bq"
S3 = "s3"
ABFS = "abfs"
GCS = "gs"
LOCAL = "/"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it time to turn this into an enum, @wild-endeavor ?

@wild-endeavor wild-endeavor merged commit 3127735 into flyteorg:master Jul 25, 2022
@welcome
Copy link

welcome bot commented Jul 25, 2022

Congrats on merging your first pull request! 🎉

@MorpheusXAUT MorpheusXAUT deleted the structured-datasets-abfs branch July 25, 2022 21:26
wild-endeavor pushed a commit that referenced this pull request Aug 2, 2022
Signed-off-by: Nick Müller <nmueller@blackshark.ai>
wild-endeavor pushed a commit that referenced this pull request Aug 2, 2022
Signed-off-by: Nick Müller <nmueller@blackshark.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants