IO Implementation using Go CDK #176

Open
wants to merge 5 commits into
base: main

Conversation

loicalleyne

Extends PR #111

Implements #92. The Go CDK has well-maintained implementations for accessing object stores such as S3, Azure, and GCS via an io/fs.FS-like interface. However, its file interface supports neither the io.ReaderAt interface nor the Seek() function that Iceberg-Go requires for files. Furthermore, the File components are private, so we copied the wrappers and implemented the remaining functions directly inside Iceberg-Go.
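
To illustrate the gap described above, here is a minimal sketch of how an io.ReaderAt can be layered on top of a store that only offers ranged reads. It uses only the standard library: `rangeOpener`, `memStore`, and `rangeReaderAt` are hypothetical names invented for this example, and `memStore` is an in-memory stand-in for a bucket. It is not the code from this PR.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"io"
)

// rangeOpener is a hypothetical stand-in for the part of a bucket API
// that can open a reader over a byte range of an object.
type rangeOpener interface {
	NewRangeReader(ctx context.Context, key string, offset, length int64) (io.ReadCloser, error)
}

// memStore is an in-memory rangeOpener used only for this sketch.
type memStore map[string][]byte

func (m memStore) NewRangeReader(_ context.Context, key string, offset, length int64) (io.ReadCloser, error) {
	data, ok := m[key]
	if !ok {
		return nil, fmt.Errorf("object %q not found", key)
	}
	if offset < 0 || offset > int64(len(data)) {
		return nil, fmt.Errorf("offset %d out of range", offset)
	}
	end := offset + length
	if end > int64(len(data)) {
		end = int64(len(data))
	}
	return io.NopCloser(bytes.NewReader(data[offset:end])), nil
}

// rangeReaderAt implements io.ReaderAt by issuing one ranged read per call,
// so it keeps no shared read position between calls.
type rangeReaderAt struct {
	ctx  context.Context
	src  rangeOpener
	key  string
	size int64
}

func newRangeReaderAt(src rangeOpener, key string, size int64) *rangeReaderAt {
	return &rangeReaderAt{ctx: context.Background(), src: src, key: key, size: size}
}

func (r *rangeReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, fmt.Errorf("negative offset %d", off)
	}
	if off >= r.size {
		return 0, io.EOF
	}
	rc, err := r.src.NewRangeReader(r.ctx, r.key, off, int64(len(p)))
	if err != nil {
		return 0, err
	}
	defer rc.Close()
	n, err := io.ReadFull(rc, p)
	if err == io.ErrUnexpectedEOF {
		// io.ReaderAt must return a non-nil error on a short read; use io.EOF.
		err = io.EOF
	}
	return n, err
}

func main() {
	store := memStore{"data/file.parquet": []byte("hello world")}
	r := newRangeReaderAt(store, "data/file.parquet", 11)
	buf := make([]byte, 5)
	n, err := r.ReadAt(buf, 6)
	fmt.Println(n, string(buf[:n]), err)
}
```

One ranged request per ReadAt call is simple and safe for concurrent use, at the cost of a round trip per call against a real object store.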

In addition, we add support for S3 read IO using the CDK, with an extra property to choose between the existing implementation and the new one.

GCS connection options can be passed in the properties map.

Signed-off-by: Loïc Alleyne <loicalleyne@gmail.com>
@loicalleyne
Author

@dwilson1988 I saw your note about wanting to work on the CDK features. If you're able to provide some feedback, that would be great.

@dwilson1988
Contributor

@loicalleyne - happy to take a look. We use this internally in some of our software with Parquet and implemented a ReaderAt. I'll do a more thorough review when I get a chance, but my first thought was to leave it completely separate from the blob.Bucket implementation and let the Create/New funcs simply accept a *blob.Bucket, leaving the rest as an exercise for the user. This keeps it more or less completely isolated from the implementation. Thoughts on this direction?

@loicalleyne
Author

My goal today was just to "get something on paper" to move this forward, since the other PR has been stalled since July. I used the other PR as a starting point, so I mostly followed the existing patterns. I'm very open to moving things around if it makes sense. Do you have any idea how your approach would work with the interfaces defined in io.go?

@dwilson1988
Contributor

Understood! I'll dig into your last question and get back to you.

@dwilson1988
Contributor

dwilson1988 commented Oct 16, 2024

Okay, played around a bit and here's where my head is at.

The main reason I'd like to isolate the creation of a *blob.Bucket is that I've found the particular implementation of bucket access can get tricky. Rather than support it in this package for all situations, we could support the most common usage in io.LoadFS/inferFileIOFromSchema and change io.CreateBlobFileIO to accept a *url.URL and a *blob.Bucket. This lets a user open a bucket with whatever implementation they choose (GCS, Azure, S3, MinIO, Mem, FileSystem, etc.), and there's less code here to maintain.

What I came up with is changing CreateBlobFileIO to:

// CreateBlobFileIO creates a new BlobFileIO instance
func CreateBlobFileIO(parsed *url.URL, bucket *blob.Bucket) *BlobFileIO {
	ctx := context.Background()
	return &BlobFileIO{Bucket: bucket, ctx: ctx, opts: &blob.ReaderOptions{}, prefix: parsed.Host + parsed.Path}
}

The URL is still critical there, but now we don't have to concern ourselves with credentials to open the bucket except for in LoadFS.
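
For illustration, the prefix expression in that snippet (parsed.Host + parsed.Path) can be tried out with the standard library alone. The helper name `blobPrefix` and the example locations are assumptions made up for this sketch; they do not come from the PR.

```go
package main

import (
	"fmt"
	"net/url"
)

// blobPrefix mirrors the prefix expression from the CreateBlobFileIO
// snippet (parsed.Host + parsed.Path). The helper name is hypothetical.
func blobPrefix(location string) (string, error) {
	parsed, err := url.Parse(location)
	if err != nil {
		return "", err
	}
	// For a location such as s3://my-bucket/warehouse/db, the bucket name
	// is parsed as Host and the object path as Path.
	return parsed.Host + parsed.Path, nil
}

func main() {
	// Example locations are illustrative only.
	for _, loc := range []string{"s3://my-bucket/warehouse/db", "gs://my-bucket"} {
		prefix, err := blobPrefix(loc)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Printf("%s -> %q\n", loc, prefix)
	}
}
```

This suggests why the URL remains necessary even when the caller supplies the bucket: the bucket handle alone does not say which prefix the table location maps to.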

Thoughts on this?

Signed-off-by: Loïc Alleyne <loicalleyne@gmail.com>
@loicalleyne
Author

@dwilson1988
Sounds good, I've made the changes, please take a look.

Signed-off-by: Loïc Alleyne <loicalleyne@gmail.com>
Contributor

@dwilson1988 dwilson1988 left a comment

@loicalleyne, this looks really good to me! I'm not a maintainer of this repo, so I can't give the final word or anything, but this is exactly the direction I was thinking.

I'm happy to give azure a go after this is merged.

@zeroshade

io/blob.go Outdated
Comment on lines 177 to 183
// BlobFileIO represents a file system backed by a bucket in object store. It implements the `iceberg-go/io.FileIO` interface.
type BlobFileIO struct {
	*blob.Bucket
	ctx    context.Context
	opts   *blob.ReaderOptions
	prefix string
}
Member

I tend to be more conservative in what we actually export. Is there any need to export this type, as opposed to just letting it be used through the interfaces?

Author

Changed to unexported
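
As a rough sketch of the pattern discussed here, an unexported type can still be handed to callers through an exported constructor that returns an interface. Everything below is hypothetical and simplified for illustration: the `IO` interface, the `NewBlobFileIO` constructor, and the map-backed storage are stand-ins, not the signatures used in iceberg-go.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// IO is a hypothetical, simplified stand-in for an exported file IO interface.
type IO interface {
	Open(name string) (io.ReadCloser, error)
}

// blobFileIO is unexported: callers can only reach it through IO.
// The map is an in-memory stand-in for a bucket.
type blobFileIO struct {
	prefix  string
	objects map[string]string
}

func (b *blobFileIO) Open(name string) (io.ReadCloser, error) {
	key := b.prefix + "/" + name
	body, ok := b.objects[key]
	if !ok {
		return nil, fmt.Errorf("open %s: object not found", key)
	}
	return io.NopCloser(strings.NewReader(body)), nil
}

// NewBlobFileIO (hypothetical name) returns the interface, so the concrete
// type can change later without breaking callers.
func NewBlobFileIO(prefix string, objects map[string]string) IO {
	return &blobFileIO{prefix: prefix, objects: objects}
}

func main() {
	fsys := NewBlobFileIO("bucket", map[string]string{"bucket/a.txt": "hi"})
	f, err := fsys.Open("a.txt")
	if err != nil {
		fmt.Println("open failed:", err)
		return
	}
	defer f.Close()
	data, err := io.ReadAll(f)
	fmt.Println(string(data), err)
}
```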

io/blob.go Outdated
Comment on lines 201 to 203
// Open a Blob from a Bucket using the BlobFileIO. Note this
// function is copied from blob.Bucket.Open, but extended to
// return an iceberg-go/io.File instance instead of io/fs.File
Member

Could we just wrap and extend blob.Bucket.Open instead of duplicating it here?

Author

Looking at it again, it has to be a copy because the CDK iofsFileInfo is unexported and doesn't support the io.ReaderAt interface.

Contributor

This could be accomplished if gocloud changed this to an io.ReadSeeker, which would slightly alter the public interface. Otherwise, I can't see a way to do this without copying:

https://github.com/google/go-cloud/blob/0d91a59252ba29091dad4a127be6b654797ff026/blob/blob_fs.go#L54
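
To make the io.ReadSeeker idea concrete: if an opened file did expose io.ReadSeeker, an io.ReaderAt could be layered on top roughly as sketched below. This is a standard-library-only illustration under that assumption; `seekReaderAt` and `newSeekReaderAtFromString` are hypothetical names, and the sketch is not taken from gocloud or from this PR.

```go
package main

import (
	"fmt"
	"io"
	"strings"
	"sync"
)

// seekReaderAt adapts an io.ReadSeeker to io.ReaderAt. It moves the
// underlying seek position, so it assumes exclusive ownership of the
// reader; the mutex serializes concurrent ReadAt calls.
type seekReaderAt struct {
	mu sync.Mutex
	rs io.ReadSeeker
}

func (s *seekReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, fmt.Errorf("negative offset %d", off)
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, err := s.rs.Seek(off, io.SeekStart); err != nil {
		return 0, err
	}
	n, err := io.ReadFull(s.rs, p)
	if err == io.ErrUnexpectedEOF {
		// io.ReaderAt must return a non-nil error on a short read; use io.EOF.
		err = io.EOF
	}
	return n, err
}

// newSeekReaderAtFromString is a demo helper that wraps an in-memory reader.
func newSeekReaderAtFromString(content string) *seekReaderAt {
	return &seekReaderAt{rs: strings.NewReader(content)}
}

func main() {
	r := newSeekReaderAtFromString("hello world")
	buf := make([]byte, 5)
	n, err := r.ReadAt(buf, 6)
	fmt.Println(n, string(buf[:n]), err)
}
```

The trade-off is that every ReadAt call holds the lock for a seek plus a read, so parallel readers are serialized; independent ranged reads avoid that at the cost of one request per call.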

@dwilson1988
Contributor

@loicalleyne is this still on your radar?

@loicalleyne
Author

Hi @dwilson1988
Yes, I'm wrapping up some work on another project and will be jumping back on this in a day or two.

@dwilson1988
Contributor

Cool - just checking. I'll be patient. 🙂

@loicalleyne
Author

@dwilson1988 made the suggested changes, there's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into, maybe you could take a look?

@dwilson1988
Contributor

@dwilson1988 made the suggested changes, there's a deprecation warning on the S3 config EndpointResolver methods that I haven't had time to look into, maybe you could take a look?

Yes, can probably take a look next week

@loicalleyne
Author

Hi @dwilson1988, do you think you'll have time to take a look at this?

@dwilson1988
Contributor

Hi @dwilson1988, do you think you'll have time to take a look at this?

I opened a PR on your branch earlier today
