GDCC/8749 S3 Archiver #8751

Merged
37 changes: 35 additions & 2 deletions doc/sphinx-guides/source/installation/config.rst
@@ -1081,7 +1081,9 @@ These archival Bags include all of the files and metadata in a given dataset ver

The Dataverse Software offers an internal archive workflow which may be configured as a PostPublication workflow via an admin API call to manually submit previously published Datasets and prior versions to a configured archive such as Chronopolis. The workflow creates a `JSON-LD <http://www.openarchives.org/ore/0.9/jsonld>`_ serialized `OAI-ORE <https://www.openarchives.org/ore/>`_ map file, which is also available as a metadata export format in the Dataverse Software web interface.

At present, the DPNSubmitToArchiveCommand, LocalSubmitToArchiveCommand, and GoogleCloudSubmitToArchive are the only implementations extending the AbstractSubmitToArchiveCommand and using the configurable mechanisms discussed below.
At present, archiving classes include the DuraCloudSubmitToArchiveCommand, LocalSubmitToArchiveCommand, GoogleCloudSubmitToArchive, and S3SubmitToArchiveCommand, which all extend the AbstractSubmitToArchiveCommand and use the configurable mechanisms discussed below.

All current options support the archival status APIs, and the same status is shown in the dataset page version table (visible to contributors and others who can view the unpublished dataset, with more detail available to superusers).
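
For example, a superuser could query the archival status of a given dataset version with a call along these lines (a hedged sketch; the exact endpoint path, permissions, and placeholders here are assumptions rather than authoritative API documentation):

``curl -H "X-Dataverse-key: $API_TOKEN" "http://localhost:8080/api/datasets/$DATASET_ID/$VERSION/archivalStatus"``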

.. _Duracloud Configuration:

@@ -1144,7 +1146,7 @@ ArchiverClassName - the fully qualified class to be used for archiving. For exam
Google Cloud Configuration
++++++++++++++++++++++++++

The Google Cloud Archiver can send archival Bags to a bucket in Google's cloud, including those in the 'Coldline' storage class (cheaper, with slower access)
The Google Cloud Archiver can send Dataverse Archival Bags to a bucket in Google's cloud, including those in the 'Coldline' storage class (cheaper, with slower access)

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.GoogleCloudSubmitToArchiveCommand"``

@@ -1168,6 +1170,31 @@ For example:

``cp <your key file> /usr/local/payara5/glassfish/domains/domain1/files/googlecloudkey.json``

.. _S3 Archiver Configuration:

S3 Configuration
++++++++++++++++

The S3 Archiver can send Dataverse Archival Bags to a bucket at any S3 endpoint. The configuration for the S3 Archiver is independent of any S3 store that may be configured in Dataverse and may, for example, leverage colder (cheaper, slower access) storage.

``curl http://localhost:8080/api/admin/settings/:ArchiverClassName -X PUT -d "edu.harvard.iq.dataverse.engine.command.impl.S3SubmitToArchiveCommand"``

``curl http://localhost:8080/api/admin/settings/:ArchiverSettings -X PUT -d ":S3ArchiverConfig, :BagGeneratorThreads"``

The S3 Archiver defines one custom setting, a required :S3ArchiverConfig. It can also use the :BagGeneratorThreads setting as described in the DuraCloud Configuration section above.
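
For example, the optional thread count could be raised like this (the value shown is purely illustrative):

``curl http://localhost:8080/api/admin/settings/:BagGeneratorThreads -X PUT -d '4'``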

The credentials for your S3 account can be stored in a profile in a standard credentials file (e.g. ~/.aws/credentials) referenced via a "profile" key in the :S3ArchiverConfig setting (which defaults to the "default" entry), or can be supplied via MicroProfile settings as described for S3 stores (dataverse.s3archiver.access-key and dataverse.s3archiver.secret-key).
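
For example, a credentials file entry for the "archiver" profile used in an example further below might look like the following (a sketch using the standard AWS credentials file key names; the values are placeholders)::

    [archiver]
    aws_access_key_id = <your-access-key>
    aws_secret_access_key = <your-secret-key>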

The :S3ArchiverConfig setting is a JSON object that must include an "s3_bucket_name" and may include additional S3-related parameters as described for S3 Stores, including "profile", "connection-pool-size", "custom-endpoint-url", "custom-endpoint-region", "path-style-access", "payload-signing", and "chunked-encoding".

\:S3ArchiverConfig - minimally includes the name of the bucket to use. For example:

``curl http://localhost:8080/api/admin/settings/:S3ArchiverConfig -X PUT -d '{"s3_bucket_name":"archival-bucket"}'``

\:S3ArchiverConfig - example that also sets the name of an S3 profile to use:

``curl http://localhost:8080/api/admin/settings/:S3ArchiverConfig -X PUT -d '{"s3_bucket_name":"archival-bucket", "profile":"archiver"}'``
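
A hedged example combining several of the optional parameters listed above, for a bucket at a non-AWS endpoint (the endpoint URL and values are placeholders):

``curl http://localhost:8080/api/admin/settings/:S3ArchiverConfig -X PUT -d '{"s3_bucket_name":"archival-bucket", "profile":"archiver", "custom-endpoint-url":"https://s3.example.org", "path-style-access":"true"}'``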

.. _Archiving API Call:

API Calls
@@ -2665,6 +2692,12 @@ This is the local file system path to be used with the LocalSubmitToArchiveComma

These are the bucket and project names to be used with the GoogleCloudSubmitToArchiveCommand class. Further information is in the :ref:`Google Cloud Configuration` section above.

:S3ArchiverConfig
+++++++++++++++++

This is the JSON configuration object setting to be used with the S3SubmitToArchiveCommand class. Further information is in the :ref:`S3 Archiver Configuration` section above.


.. _:InstallationName:

:InstallationName
11 changes: 3 additions & 8 deletions src/main/java/edu/harvard/iq/dataverse/DatasetPage.java
@@ -5599,7 +5599,7 @@ public boolean isArchivable() {
archivable = ((Boolean) m.invoke(null, params) == true);
} catch (ClassNotFoundException | IllegalAccessException | IllegalArgumentException
| InvocationTargetException | NoSuchMethodException | SecurityException e) {
logger.warning("Failed to call is Archivable on configured archiver class: " + className);
logger.warning("Failed to call isArchivable on configured archiver class: " + className);
e.printStackTrace();
}
}
@@ -5635,7 +5635,7 @@ public boolean isVersionArchivable() {
}
} catch (ClassNotFoundException | IllegalAccessException | IllegalArgumentException
| InvocationTargetException | NoSuchMethodException | SecurityException e) {
logger.warning("Failed to call is Archivable on configured archiver class: " + className);
logger.warning("Failed to call isSingleVersion on configured archiver class: " + className);
e.printStackTrace();
}
}
@@ -5646,12 +5646,7 @@ public boolean isVersionArchivable() {

public boolean isSomeVersionArchived() {
if (someVersionArchived == null) {
someVersionArchived = false;
for (DatasetVersion dv : dataset.getVersions()) {
if (dv.getArchivalCopyLocation() != null) {
someVersionArchived = true;
}
}
someVersionArchived = ArchiverUtil.isSomeVersionArchived(dataset);
}
return someVersionArchived;
}
@@ -1,5 +1,7 @@
package edu.harvard.iq.dataverse.engine.command.impl;

import edu.harvard.iq.dataverse.DOIDataCiteRegisterService;
import edu.harvard.iq.dataverse.DataCitation;
import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.DvObject;
@@ -94,6 +96,13 @@ public String describe() {
return super.describe() + "DatasetVersion: [" + version.getId() + " (v"
+ version.getFriendlyVersionNumber()+")]";
}

String getDataCiteXml(DatasetVersion dv) {
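// Build the DataCite metadata for this dataset version and render it as the DataCite XML included in the archival Bag.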
DataCitation dc = new DataCitation(dv);
Map<String, String> metadata = dc.getDataCiteMetadata();
return DOIDataCiteRegisterService.getMetadataFromDvObject(dv.getDataset().getGlobalId().asString(), metadata,
dv.getDataset());
}

public Thread startBagThread(DatasetVersion dv, PipedInputStream in, DigestInputStream digestInputStream2,
String dataciteXml, ApiToken token) throws IOException, InterruptedException {
@@ -160,7 +169,7 @@ public void run() {
}
return bagThread;
}

public static boolean isArchivable(Dataset dataset, SettingsWrapper settingsWrapper) {
return true;
}
@@ -1,7 +1,5 @@
package edu.harvard.iq.dataverse.engine.command.impl;

import edu.harvard.iq.dataverse.DOIDataCiteRegisterService;
import edu.harvard.iq.dataverse.DataCitation;
import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.DatasetLock.Reason;
@@ -108,10 +106,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t
if (!store.spaceExists(spaceName)) {
store.createSpace(spaceName);
}
DataCitation dc = new DataCitation(dv);
Map<String, String> metadata = dc.getDataCiteMetadata();
String dataciteXml = DOIDataCiteRegisterService.getMetadataFromDvObject(
dv.getDataset().getGlobalId().asString(), metadata, dv.getDataset());
String dataciteXml = getDataCiteXml(dv);

MessageDigest messageDigest = MessageDigest.getInstance("MD5");
try (PipedInputStream dataciteIn = new PipedInputStream();
@@ -1,7 +1,5 @@
package edu.harvard.iq.dataverse.engine.command.impl;

import edu.harvard.iq.dataverse.DOIDataCiteRegisterService;
import edu.harvard.iq.dataverse.DataCitation;
import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.DatasetLock.Reason;
@@ -73,10 +71,7 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t
String spaceName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-')
.replace('.', '-').toLowerCase();

DataCitation dc = new DataCitation(dv);
Map<String, String> metadata = dc.getDataCiteMetadata();
String dataciteXml = DOIDataCiteRegisterService.getMetadataFromDvObject(
dv.getDataset().getGlobalId().asString(), metadata, dv.getDataset());
String dataciteXml = getDataCiteXml(dv);
MessageDigest messageDigest = MessageDigest.getInstance("MD5");
try (PipedInputStream dataciteIn = new PipedInputStream();
DigestInputStream digestInputStream = new DigestInputStream(dataciteIn, messageDigest)) {
@@ -1,7 +1,5 @@
package edu.harvard.iq.dataverse.engine.command.impl;

import edu.harvard.iq.dataverse.DOIDataCiteRegisterService;
import edu.harvard.iq.dataverse.DataCitation;
import edu.harvard.iq.dataverse.Dataset;
import edu.harvard.iq.dataverse.DatasetVersion;
import edu.harvard.iq.dataverse.DatasetLock.Reason;
@@ -58,18 +56,16 @@ public WorkflowStepResult performArchiveSubmission(DatasetVersion dv, ApiToken t
String spaceName = dataset.getGlobalId().asString().replace(':', '-').replace('/', '-')
.replace('.', '-').toLowerCase();

DataCitation dc = new DataCitation(dv);
Map<String, String> metadata = dc.getDataCiteMetadata();
String dataciteXml = DOIDataCiteRegisterService
.getMetadataFromDvObject(dv.getDataset().getGlobalId().asString(), metadata, dv.getDataset());

String dataciteXml = getDataCiteXml(dv);

FileUtils.writeStringToFile(
new File(localPath + "/" + spaceName + "-datacite.v" + dv.getFriendlyVersionNumber() + ".xml"),
dataciteXml, StandardCharsets.UTF_8);
BagGenerator bagger = new BagGenerator(new OREMap(dv, false), dataciteXml);
bagger.setNumConnections(getNumberOfBagGeneratorThreads());
bagger.setAuthenticationKey(token.getTokenString());
zipName = localPath + "/" + spaceName + "v" + dv.getFriendlyVersionNumber() + ".zip";
//ToDo: generateBag(File f, true) seems to do the same thing (with a .tmp extension) - since we don't have to use a stream here, could probably just reuse the existing code?
bagger.generateBag(new FileOutputStream(zipName + ".partial"));

File srcFile = new File(zipName + ".partial");