Simple support for remote web stores #7325

Merged

Changes from all commits (108 commits)
c3db55b
Initial implementation
qqmyers Oct 13, 2020
e8c1578
null check on dateString
qqmyers Oct 13, 2020
00d53ee
adjust incoming identifier for HttpOverlay drivers
qqmyers Oct 14, 2020
94921bd
support overlay case
qqmyers Oct 14, 2020
cbdd35c
document need to update for overlay case
qqmyers Oct 14, 2020
11535bd
keep owner for getStorageIO call for HttpOverlay case
qqmyers Oct 14, 2020
1800575
typos
qqmyers Oct 14, 2020
239d5a8
debug logging
qqmyers Oct 14, 2020
e86c2d0
more logging
qqmyers Oct 14, 2020
0062c68
fix storageidentifier parsing/updating
qqmyers Oct 14, 2020
d6a5f65
more info about errors handled by ThrowableHandler
qqmyers Oct 14, 2020
d821b62
fine debug to show size
qqmyers Oct 14, 2020
1a8f0f1
actually instantiate an HttpClient !
qqmyers Oct 14, 2020
ad86e4c
algorithm fixes and logging
qqmyers Oct 14, 2020
4a9f209
log exception
qqmyers Oct 15, 2020
b339583
support auxPath for direct/overlay case
qqmyers Oct 15, 2020
5131e5e
create dir when needed for aux
qqmyers Oct 15, 2020
afa37ef
S3 flag to distinguish overlap and direct-upload cases
qqmyers Oct 15, 2020
6aaabe2
fix s3 storagelocation
qqmyers Oct 15, 2020
bd37c2e
Revert "fix s3 storagelocation"
qqmyers Oct 15, 2020
14a1196
fine logging
qqmyers Oct 15, 2020
e47eed7
fix storagelocation issues
qqmyers Oct 15, 2020
8497b2b
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Nov 3, 2020
5c8cb1a
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Nov 13, 2020
b253ab2
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Dec 10, 2020
140ffaa
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Jan 8, 2021
9b14433
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Feb 23, 2021
e72c4e5
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Apr 7, 2021
0ea4cf9
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Apr 13, 2021
6fa5e90
Merge remote-tracking branch 'IQSS/develop' into
qqmyers May 20, 2021
257349a
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Aug 4, 2021
41dedcb
format/cleanup
qqmyers Aug 5, 2021
7881a70
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Sep 3, 2021
e7ddf86
fix for get dataset logo with overlay store
qqmyers Sep 7, 2021
6b9cdef
update to check store type
qqmyers Sep 7, 2021
60d7d0d
refactor to support support addFiles api from #7901
qqmyers Sep 7, 2021
da133ec
refactor UI code
qqmyers Sep 7, 2021
c719a88
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Oct 13, 2021
76bfee2
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Jan 25, 2022
bbc7e32
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Feb 2, 2022
cc763f8
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Mar 21, 2022
6dded83
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Mar 30, 2022
ce6bafe
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Apr 6, 2022
5a823ab
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Apr 14, 2022
c5246d2
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Apr 29, 2022
7b68d57
Refactor to RemoteOverlay, use constants for store types/sep
qqmyers Apr 29, 2022
bebc275
refactor strings to RemoteOverlay
qqmyers Apr 29, 2022
edc9152
add basic support for remote tag/label in file table
qqmyers Apr 29, 2022
648ee1c
start doc changes
qqmyers Apr 29, 2022
570e97a
documentation, tweak to new branding property names
qqmyers Apr 29, 2022
7e590ad
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Apr 29, 2022
62b5488
typo
qqmyers Apr 29, 2022
e62a163
fix tabs in preexisting code
qqmyers Apr 29, 2022
3d3aab6
typos
qqmyers May 10, 2022
6bd92d6
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Jun 8, 2022
9133de7
cut/paste logic error re: remote tag
qqmyers Jun 8, 2022
1080031
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Jun 15, 2022
e8c3ed3
force lowercase for hash values - that's what is generated internally
qqmyers Jul 5, 2022
1bad2f3
log mismatched checksum values
qqmyers Jul 5, 2022
4441795
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Jul 25, 2022
37e2581
refactor for download redirect in remoteoverlaystore
qqmyers Jul 8, 2022
a401048
refactor to allow URL token substitution outside tools framework
qqmyers Jun 21, 2022
e23fb30
support passthrough for uploading files
qqmyers Jul 26, 2022
bcad012
Merge remote-tracking branch 'IQSS/develop' into
qqmyers Aug 2, 2022
c3db1ba
doc typo
qqmyers Aug 3, 2022
751a829
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Aug 3, 2022
846d866
Apply suggestions from code review
qqmyers Aug 4, 2022
8c6b31a
switch to hyphens per review
qqmyers Aug 4, 2022
984254a
reduce variations on trusted remote store
qqmyers Aug 4, 2022
c3bbfec
add signer tests, flip param order so sign/validate match, fix val bug
qqmyers Aug 4, 2022
56f7676
update secret-key, cleanup
qqmyers Aug 5, 2022
1e4a724
Add tests/add support for local file base store tests
qqmyers Aug 5, 2022
5705e67
add an API test for local dev/testing #7324
pdurbin Aug 5, 2022
0902975
sign even for internal access
qqmyers Aug 5, 2022
ab90c16
Merge branch 'IQSS/7324_TRSA-HTTP-store' of https://github.com/Global…
qqmyers Aug 5, 2022
7e9d066
add some validation and test
qqmyers Aug 5, 2022
db4192e
typo in method name
qqmyers Aug 5, 2022
0b424f6
Merge branch 'develop' into IQSS/7324_TRSA-HTTP-store #7324
pdurbin Aug 8, 2022
c4eee7c
add curl example #7324
pdurbin Aug 8, 2022
c688e99
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Aug 8, 2022
e40ea11
Merge branch 'IQSS/7324_TRSA-HTTP-store' of https://github.com/Global…
qqmyers Aug 8, 2022
0ce597a
Error handling or default on required params
qqmyers Aug 8, 2022
800eca2
sanity check to make sure driver being specified in addFile exists
qqmyers Aug 8, 2022
f730afa
only get value from json once
qqmyers Aug 8, 2022
1583788
update RemoteStoreIT test to show JVM options used #7324
pdurbin Aug 9, 2022
25b4059
add separate downloadRedirectEnabled for aux objects method
qqmyers Aug 9, 2022
504ca17
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Aug 9, 2022
e6fb485
add logic to check base store download redirect for aux objects
qqmyers Aug 9, 2022
0fd56cf
minor error meg and comment changes
qqmyers Aug 9, 2022
085770a
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Aug 10, 2022
361018f
remove cruft from tests #7324
pdurbin Aug 10, 2022
cee4f9d
Added a note about limitations of what's in the PR.
qqmyers Aug 10, 2022
c4f6fa5
Merge branch 'IQSS/7324_TRSA-HTTP-store' of https://github.com/Global…
qqmyers Aug 10, 2022
909b9c7
use single file API call /add
qqmyers Aug 16, 2022
5f633e4
copy non-globus parts from #8891 per review request
qqmyers Aug 16, 2022
cb1755d
add missing label
qqmyers Aug 16, 2022
7f990dc
Merge remote-tracking branch 'IQSS/develop' into IQSS/7324_TRSA-HTTP-…
qqmyers Aug 16, 2022
3d9418e
Handle null file size per QA discussion
qqmyers Aug 16, 2022
643b924
add checking w.r.t. dataset storage driver/base driver
qqmyers Aug 17, 2022
7301c62
add remote store in direct access to support sending file delete call
qqmyers Aug 17, 2022
0da52fc
typo
qqmyers Aug 17, 2022
45aa976
fix for delete
qqmyers Aug 17, 2022
94ffcbf
update to docs per QA
qqmyers Aug 17, 2022
708637d
keep remote and base identifiers in getStorageLocation, fix base config
qqmyers Aug 17, 2022
37bba52
add direct link to s3 call
qqmyers Aug 17, 2022
38856ef
fix base store config/related test that missed
qqmyers Aug 17, 2022
e72def0
Add test for bad remote URLs
qqmyers Aug 18, 2022
70a8b3b
note re 404 URLs
qqmyers Aug 18, 2022
doc/sphinx-guides/source/api/native-api.rst: 30 additions, 0 deletions
@@ -1411,7 +1411,37 @@ In practice, you only need one of the ``dataset_id`` or the ``persistentId``.
print '-' * 40
print r.json()
print r.status_code

.. _add-remote-file-api:

Add a Remote File to a Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If your Dataverse installation has been configured to support :ref:`trusted-remote-storage`,
you can add files from remote URLs to datasets. These remote files appear in your Dataverse
installation as if they were ordinary files but are stored remotely.

The location of the remote file is specified in the ``storageIdentifier`` field of the JSON you supply.
The base URL of the file comes from the "store" (e.g. "trsa" in the example below) and is followed by the
path to the file (e.g. "themes/custom..."). For example, if the ``trsa`` store were configured with a
hypothetical base URL of ``https://qdr.syr.edu``, the ``storageIdentifier`` below would resolve to
``https://qdr.syr.edu/themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png``.

In the JSON example below, all fields are required except for ``description``. Other optional fields are shown under :ref:`add-file-api`.

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_ID=doi:10.5072/FK2/J8SJZB
export JSON_DATA='{"description":"A remote image.","storageIdentifier":"trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png","checksumType":"MD5","md5Hash":"509ef88afa907eaf2c17c1c8d8fde77e","label":"testlogo.png","fileName":"testlogo.png","mimeType":"image/png"}'

curl -H "X-Dataverse-key: $API_TOKEN" -X POST "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_ID" -F "jsonData=$JSON_DATA"

The fully expanded example above (without environment variables) looks like this:

.. code-block:: bash

curl -H "X-Dataverse-key: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" -X POST "https://demo.dataverse.org/api/datasets/:persistentId/add?persistentId=doi:10.5072/FK2/J8SJZB" -F 'jsonData={"description":"A remote image.","storageIdentifier":"trsa://themes/custom/qdr/images/CoreTrustSeal-logo-transparent.png","checksumType":"MD5","md5Hash":"509ef88afa907eaf2c17c1c8d8fde77e","label":"testlogo.png","fileName":"testlogo.png","mimeType":"image/png"}'

Report the data (file) size of a Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Expand Down
doc/sphinx-guides/source/developers/big-data-support.rst: 63 additions, 4 deletions
@@ -1,19 +1,19 @@
Big Data Support
================

-Big data support is highly experimental. Eventually this content will move to the Installation Guide.
+Big data support includes some highly experimental options. Eventually more of this content will move to the Installation Guide.

.. contents:: |toctitle|
:local:

-Various components need to be installed and/or configured for big data support.
+Various components will need to be installed and/or configured for big data support via the methods described below.

S3 Direct Upload and Download
-----------------------------

A lightweight option for supporting file sizes beyond a few gigabytes - a size that can cause performance issues when uploaded through a Dataverse installation itself - is to configure an S3 store to provide direct upload and download via 'pre-signed URLs'. When these options are configured, file uploads and downloads are made directly to and from a configured S3 store using secure (https) connections that enforce a Dataverse installation's access controls. (The upload and download URLs are signed with a unique key that only allows access for a short time period and a Dataverse installation will only generate such a URL if the user has permission to upload/download the specific file in question.)

-This option can handle files >40GB and could be appropriate for files up to a TB. Other options can scale farther, but this option has the advantages that it is simple to configure and does not require any user training - uploads and downloads are done via the same interface as normal uploads to a Dataverse installation.
+This option can handle files >300GB and could be appropriate for files up to a TB or larger. Other options can scale farther, but this option has the advantages that it is simple to configure and does not require any user training - uploads and downloads are done via the same interface as normal uploads to a Dataverse installation.

To configure these options, an administrator must set two JVM options for the Dataverse installation using the same process as for other configuration options:

@@ -32,7 +32,7 @@ For AWS, the minimum allowed part size is 5*1024*1024 bytes and the maximum is 5

It is also possible to set file upload size limits per store. See the :MaxFileUploadSizeInBytes setting described in the :doc:`/installation/config` guide.

-At present, one potential drawback for direct-upload is that files are only partially 'ingested', tabular and FITS files are processed, but zip files are not unzipped, and the file contents are not inspected to evaluate their mimetype. This could be appropriate for large files, or it may be useful to completely turn off ingest processing for performance reasons (ingest processing requires a copy of the file to be retrieved by the Dataverse installation from the S3 store). A store using direct upload can be configured to disable all ingest processing for files above a given size limit:
+At present, one potential drawback for direct-upload is that files are only partially 'ingested' - tabular and FITS files are processed, but zip files are not unzipped, and the file contents are not inspected to evaluate their mimetype. This could be appropriate for large files, or it may be useful to completely turn off ingest processing for performance reasons (ingest processing requires a copy of the file to be retrieved by the Dataverse installation from the S3 store). A store using direct upload can be configured to disable all ingest processing for files above a given size limit:

``./asadmin create-jvm-options "-Ddataverse.files.<id>.ingestsizelimit=<size in bytes>"``

@@ -61,6 +61,65 @@ Alternatively, you can enable CORS using the AWS S3 web interface, using json-en

Since the direct upload mechanism creates the final file rather than an intermediate temporary file, user actions, such as neither saving nor canceling an upload session before closing the browser page, can leave an abandoned file in the store. The direct upload mechanism attempts to use S3 Tags to aid in identifying/removing such files. Upon upload, files are given a "dv-state":"temp" tag which is removed when the dataset changes are saved and the new file(s) are added in the Dataverse installation. Note that not all S3 implementations support Tags: Minio does not. With such stores, direct upload works, but Tags are not used.

Trusted Remote Storage with the ``remote`` Store Type
-----------------------------------------------------

For very large and/or very sensitive data, it may not make sense to transfer or copy files to Dataverse at all. The experimental ``remote`` store type in the Dataverse software now supports this use case.

With this storage option, Dataverse stores a URL reference for the file rather than transferring the file bytes to a store managed directly by Dataverse. Basic configuration for a remote store is described at :ref:`file-storage` in the Configuration Guide.
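
For reference, a minimal configuration for a hypothetical store with id ``trsa`` might look like the sketch below. The exact option set is documented at :ref:`file-storage`; the label, base URL, and base store values here are assumptions.

.. code-block:: bash

# A sketch only - see the file-storage documentation for the authoritative option list.
# Note that colons in values must be escaped when using asadmin create-jvm-options.
./asadmin create-jvm-options "-Ddataverse.files.trsa.type=remote"
./asadmin create-jvm-options "-Ddataverse.files.trsa.label=TrustedRemote"
./asadmin create-jvm-options "-Ddataverse.files.trsa.base-url=https\://remotestore.example.org"
./asadmin create-jvm-options "-Ddataverse.files.trsa.base-store=file"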

Once the store is configured, it can be assigned to a collection or to individual datasets, as with other stores. In a dataset using this store, users can reference remote files, which will then appear in the same basic way as other datafiles.

Currently, remote files can only be added via the API. Users can also upload smaller files via the UI or API; these will be stored in the configured base store.

If the store has been configured with a remote-store-name or remote-store-url, the dataset file table will include this information for remote files. These provide a visual indicator that the files are not managed directly by Dataverse and are stored/managed by a trusted remote store.
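
A sketch of setting those two options for the hypothetical ``trsa`` store (the name and URL below are assumptions):

.. code-block:: bash

# Displayed in the file table for remote files in datasets using this store.
./asadmin create-jvm-options "-Ddataverse.files.trsa.remote-store-name=DemoRemoteStore"
./asadmin create-jvm-options "-Ddataverse.files.trsa.remote-store-url=https\://remotestore.example.org"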

Rather than sending the file bytes, the client adds metadata for the remote file using the "jsonData" parameter.
jsonData normally includes information such as a file description, tags, provenance, whether the file is restricted, etc. For remote references, the jsonData object must also include values for:

* "storageIdentifier" - String, as specified in prior calls
* "fileName" - String
* "mimeType" - String
* fixity/checksum: either:

* "md5Hash" - String with MD5 hash value, or
* "checksum" - Json Object with "@type" field specifying the algorithm used and "@value" field with the value from that algorithm, both Strings

The allowed checksum algorithms are defined by the edu.harvard.iq.dataverse.DataFile.CheckSumType class and currently include MD5, SHA-1, SHA-256, and SHA-512.

(The remote store leverages the same JSON upload syntax as the last step in direct upload to S3 described in the :ref:`Adding the Uploaded file to the Dataset <direct-add-to-dataset-api>` section of the :doc:`/developers/s3-direct-upload-api`.)

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_IDENTIFIER=doi:10.5072/FK27U7YBV
export JSON_DATA='{"description":"My description.","directoryLabel":"data/subdir1","categories":["Data"],"restrict":"false","storageIdentifier":"trs://images/dataverse_project_logo.svg","fileName":"dataverse_logo.svg","mimeType":"image/svg+xml","checksum":{"@type":"SHA-1","@value":"123456"}}'

curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/add?persistentId=$PERSISTENT_IDENTIFIER" -F "jsonData=$JSON_DATA"

The variant that allows multiple files to be added at once, discussed in the :doc:`/developers/s3-direct-upload-api` document, can also be used; a sketch follows.
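
A sketch of that variant, assuming the ``/addFiles`` endpoint described in that document (the file paths and checksum values below are hypothetical):

.. code-block:: bash

export API_TOKEN=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
export SERVER_URL=https://demo.dataverse.org
export PERSISTENT_IDENTIFIER=doi:10.5072/FK27U7YBV
# jsonData holds a JSON array, one object per remote file being referenced.
export JSON_DATA='[{"storageIdentifier":"trs://images/file1.svg","fileName":"file1.svg","mimeType":"image/svg+xml","checksum":{"@type":"SHA-1","@value":"123456"}},{"storageIdentifier":"trs://images/file2.svg","fileName":"file2.svg","mimeType":"image/svg+xml","checksum":{"@type":"SHA-1","@value":"789abc"}}]'

curl -X POST -H "X-Dataverse-key: $API_TOKEN" "$SERVER_URL/api/datasets/:persistentId/addFiles?persistentId=$PERSISTENT_IDENTIFIER" -F "jsonData=$JSON_DATA"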

Considerations:

* Remote stores are configured with a base-url which limits what files can be referenced, i.e. the absolute URL for the file is <base-url>/<path in storageIdentifier>.
* The current store will not prevent you from providing a relative URL that results in a 404 when resolved (i.e. if you make a typo). You should check that the file exists at the location you specify: by trying to download it in Dataverse, by checking that Dataverse was able to get the file size (which it does via a HEAD call to that location), or by manually trying the URL in your browser (see the sketch after this list).
* Admins are trusting the organization managing the site/service at base-url to maintain the referenced files for as long as the Dataverse instance needs them. Formal agreements are recommended for production use.
* For large files, direct-download should always be used with a remote store. (Otherwise the Dataverse installation will be involved in the download.)
* For simple websites, a remote store should be marked public, which will turn off restriction and embargo functionality in Dataverse (since Dataverse cannot restrict access to the file on the remote website).
* Remote stores can be configured with a secret-key. This key will be used to sign URLs when Dataverse retrieves the file content or redirects a user for download. If the remote service is able to validate the signature and reject invalid requests, the remote store mechanism can be used to manage restricted and embargoed files, access requests in Dataverse, etc. Dataverse contains Java code that validates these signatures, which could be used, for example, to create a validation proxy in front of a web server to allow Dataverse to manage access. The secret-key is a shared secret between Dataverse and the remote service and is not shared with, or accessible by, users or those with access to their machines.
* Sophisticated remote services may wish to register file URLs that do not directly reference the file contents (bytes) but instead direct the user to a website where further information about the remote service's download process can be found.
* Due to the current design, ingest cannot be done on remote files and administrators should disable ingest when using a remote store. This can be done by setting the ingest size limit for the store to 0 and/or using the recently added option to not perform tabular ingest on upload.
* Dataverse will normally try to access the file contents itself, i.e. for ingest (in future versions), full-text indexing, thumbnail creation, etc. This processing may not be desirable for large/sensitive data, and, for the case where the URL does not reference the file itself, would not be possible. At present, administrators should configure the relevant size limits to avoid such actions.
* The current implementation of remote stores is experimental in the sense that future work to enhance it is planned. This work may result in changes to how the store works and lead to additional work when upgrading for sites that start using this mechanism now.
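
As noted in the considerations above, a quick way to sanity-check a remote reference is to resolve it the way Dataverse does and issue a HEAD request. A sketch, assuming a store with id ``trs`` and a hypothetical base-url:

.. code-block:: bash

# storageIdentifier registered for the file: trs://images/dataverse_project_logo.svg
# base-url configured for the store (hypothetical): https://remotestore.example.org
# absolute URL = <base-url>/<path in storageIdentifier>
curl -I "https://remotestore.example.org/images/dataverse_project_logo.svg"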

To configure the options mentioned above, an administrator must set the relevant JVM options for the Dataverse installation using the same process as for other configuration options:

``./asadmin create-jvm-options "-Ddataverse.files.<id>.download-redirect=true"``
``./asadmin create-jvm-options "-Ddataverse.files.<id>.secret-key=somelongrandomalphanumerickeythelongerthebetter123456"``
``./asadmin create-jvm-options "-Ddataverse.files.<id>.public=true"``
``./asadmin create-jvm-options "-Ddataverse.files.<id>.ingestsizelimit=<size in bytes>"``

Data Capture Module (DCM)
-------------------------

Expand Down
doc/sphinx-guides/source/developers/s3-direct-upload-api.rst: 3 additions, 1 deletion
@@ -88,6 +88,8 @@ If the client is unable to complete the multipart upload, it should call the abo
curl -X DELETE "$SERVER_URL/api/datasets/mpload?..."


.. _direct-add-to-dataset-api:

Adding the Uploaded file to the Dataset
---------------------------------------

@@ -117,7 +119,7 @@ Note that this API call can be used independently of the others, e.g. supporting
With current S3 stores the object identifier must be in the correct bucket for the store, include the PID authority/identifier of the parent dataset, and be guaranteed unique, and the supplied storage identifier must be prefaced with the store identifier used in the Dataverse installation, as with the internally generated examples above.

To add multiple Uploaded Files to the Dataset
--------------------------------------------------
+---------------------------------------------

Once the files exist in the s3 bucket, a final API call is needed to add all the files to the Dataset. In this API call, additional metadata is added using the "jsonData" parameter.
jsonData normally includes information such as a file description, tags, provenance, whether the file is restricted, etc. For direct uploads, the jsonData object must also include values for: