Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add s3 adapter #1643

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

Add s3 adapter #1643

wants to merge 2 commits into from

Conversation

barbuz
Copy link

@barbuz barbuz commented Feb 16, 2024

Since it was impossible to load a Package stored on s3 this adds an adapter to the AWS Plugin that previously only had a loader.

It would be useful to merge #1632 first, as I was not able to run the tests before applying those changes.

Note that this allows loading a Package, but it is still not possible to load Resources contained in a datapackage file that are stored on s3. i.e. the following test would still fail:

@mock_s3
def test_s3_package_resource(bucket_name):
    # Write
    client = boto3.resource("s3", region_name="us-east-1")
    bucket = client.create_bucket(Bucket=bucket_name, ACL="public-read")  # type: ignore
    bucket.put_object(
        ACL="private",
        Body=open("data/package.json", "rb"),
        Bucket=bucket_name,
        ContentType="text/json",
        Key="prefix/package.json",
    )
    bucket.put_object(
        ACL="private",
        Body=open("data/table.csv", "rb"),
        Bucket=bucket_name,
        ContentType="text/csv",
        Key="prefix/table.csv",
    )

    # Read
    package = Package(f"s3://{bucket_name}/prefix/package.json")

    assert package.get_table_resource("name").read_rows() == [
        {"id": 1, "name": "english"},
        {"id": 2, "name": "中国人"},
    ]

The first issue for that is in the remote loader, that greedily captures anything that helpers.is_remote_path recognises as remote, including s3 URLs. This can easily be patched, but then I found out that for some reason those resources get assigned a scheme of file rather than a scheme of s3, so they are not loaded by the AWS loader; I was not able to find why this happens, it seems to be some dynamically generated code.

@roll you have marked that issue as "good first issue" but I ended up getting quite confused by the Plugin system when trying to solve it. Do you have any pointers that could be useful in finding what is happening to these Resource descriptors?

@roll
Copy link
Member

roll commented Apr 29, 2024

Hi @barbuz, sorry for the confusion and thanks for you help!

Do you think if this one is good to go (I think it might need some documentation as well) or you would like to add support for resources as well in this PR?

The tests have been fixed so it can be tested on CI

@barbuz
Copy link
Author

barbuz commented Apr 30, 2024

Hi! I can try to add some documentation to the AwsAdapter class.
It would be great to add support for resources as well but as I said I would need some pointers because I could not figure out how to make it work.
So if you have any suggestion on that regard please share them, otherwise I think this can be merged in and it will still be better than nothing :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cannot read datapackage from s3
2 participants