Cannot read datapackage from s3 #1596

Open
barbuz opened this issue Oct 4, 2023 · 2 comments · May be fixed by #1643
Labels
bug, good first issue

Comments

@barbuz

barbuz commented Oct 4, 2023

Overview

I want to use Frictionless data packages to provide metadata about some collections hosted on S3, but I'm running into problems when trying to read these files.
I can load the data fine as a Resource, and I can even validate it against a local Table Schema, but loading the data package itself from S3 fails with the following error:

>>> pak = frictionless.Package('s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json')
Traceback (most recent call last):
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 306, in metadata_retrieve
    response = session.get(descriptor, stream=True)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 695, in send
    adapter = self.get_adapter(url=request.url)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/requests/sessions.py", line 792, in get_adapter
    raise InvalidSchema(f"No connection adapters were found for {url!r}")
requests.exceptions.InvalidSchema: No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/package/factory.py", line 38, in __call__
    cls.from_descriptor(source, basepath=basepath, **options),  # type: ignore
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 162, in from_descriptor
    descriptor = cls.metadata_retrieve(descriptor)
  File "/home/leo/miniconda3/lib/python3.10/site-packages/frictionless/metadata.py", line 324, in metadata_retrieve
    raise FrictionlessException(Error(note=note)) from exception
frictionless.exception.FrictionlessException: [package-error] The data package has an error: cannot retrieve metadata "s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json" because "No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/datapackage.json'"
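
The traceback shows the s3:// URL being passed to requests, which has no connection adapter for that scheme. One possible workaround, sketched below, is to fetch the descriptor yourself and hand Frictionless a plain dict instead of the URL. This is only a sketch under assumptions: `split_s3_url` and `load_s3_descriptor` are hypothetical helper names, boto3 is assumed to be installed, and the bucket is assumed to allow anonymous reads. It has not been run against the bucket in this report.

```python
import json
from urllib.parse import urlparse


def split_s3_url(url):
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    parsed = urlparse(url)
    if parsed.scheme != "s3" or not parsed.netloc or not parsed.path.strip("/"):
        raise ValueError(f"not a complete s3:// URL: {url!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def load_s3_descriptor(url, client=None):
    """Fetch a JSON descriptor from S3 and return it as a dict."""
    if client is None:
        # Assumption: boto3 is installed and the bucket is publicly readable,
        # so requests are sent unsigned (no AWS credentials needed).
        import boto3
        from botocore import UNSIGNED
        from botocore.config import Config

        client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    bucket, key = split_s3_url(url)
    body = client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)
```

With frictionless installed, the returned dict could then be passed to `frictionless.Package.from_descriptor(descriptor)`. Whether the resources listed inside the descriptor resolve afterwards depends on their paths, which this sketch does not address.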

I have also tried opening a local copy of the datapackage with its resource path pointing to s3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet, but then the validation fails with:

>>> pak.validate()
{'valid': False,
 'stats': {'tasks': 1, 'errors': 1, 'warnings': 0, 'seconds': 0.057},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'data',
            'type': 'table',
            'valid': False,
            'place': 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet',
            'labels': [],
            'stats': {'errors': 1, 'warnings': 0, 'seconds': 0.026},
            'warnings': [],
            'errors': [{'type': 'source-error',
                        'title': 'Source Error',
                        'description': 'Data reading error because of not '
                                       'supported or inconsistent contents.',
                        'message': 'The data source has not supported or has '
                                   'inconsistent contents: '
                                   's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet',
                        'tags': [],
                        'note': 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/data.parquet/part.0.parquet'}]}]}
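
Note that the top-level 'errors' list in this report is empty and the actual failure sits under 'tasks'. A small helper like the one below can surface those nested errors. It is a sketch that assumes a plain dict shaped like the output printed above; `nested_errors` is a hypothetical name, not part of Frictionless.

```python
def nested_errors(report):
    """Yield (place, error type, note) for every task-level error in a report dict."""
    for task in report.get("tasks", []):
        for error in task.get("errors", []):
            yield task.get("place"), error.get("type"), error.get("note")
```

Applied to the report above, this would yield a single entry with type 'source-error' for the part.0.parquet resource.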

Finally, I experimented with the CLI and hit the same errors there. In particular, validating the remote data against a local tableschema.json file worked, but when the tableschema was also hosted on S3 I got the error "No connection adapters were found for 's3://rimrep-data-public-development/csiro-seltmp-baseline-surveys-jul22/tableschema.json'".
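
Since validation against a local tableschema.json worked, one stopgap is to stage the remote metadata file locally before running the CLI. The sketch below assumes a caller-supplied `fetch` callable that returns the file's bytes for a given URL (for example, something built on boto3); both `stage_locally` and `fetch` are hypothetical names used for illustration.

```python
import os
import tempfile
from urllib.parse import urlparse


def stage_locally(url, fetch, directory=None):
    """Write the bytes returned by fetch(url) to a local file and return its path."""
    # Use a fresh temporary directory unless the caller provides one.
    directory = directory or tempfile.mkdtemp(prefix="frictionless-s3-")
    # Keep the remote file name so the local copy is recognisable.
    name = os.path.basename(urlparse(url).path) or "metadata.json"
    path = os.path.join(directory, name)
    with open(path, "wb") as handle:
        handle.write(fetch(url))
    return path
```

The returned path can then be used wherever the local tableschema.json was passed in the experiment that worked.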

All the files used here should be public, so you can try replicating the issue. Please let me know if I'm doing something wrong or if this is an actual bug.

@PeterBaker0

We have a similar use case.

I replicated this issue and tried various combinations and couldn't get it to resolve correctly.

Does the AWS plugin expose everything needed to validate a whole data package, or does it only work at the Resource level, as in the guide here? https://framework.frictionlessdata.io/docs/schemes/aws.html

@roll added the bug and good first issue labels on Nov 22, 2023
@roll
Member

roll commented Nov 22, 2023

Thanks for reporting!

On Feb 16, 2024, @barbuz linked a pull request that will close this issue
3 participants