-
Notifications
You must be signed in to change notification settings - Fork 493
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TDL/7493 Make BagGenerator threads configurable #8606
TDL/7493 Make BagGenerator threads configurable #8606
Conversation
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
…om/TexasDigitalLibrary/dataverse.git into IQSS/7493-batch_archiving_API_call
checks to see if space exists
IQSS/8422-default_to_allow_dataset_level_choice_of_metadatalanguage
TDL/7493-batch_archiving_API_call Conflicts: modules/dataverse-parent/pom.xml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good. It seems like all the ParallelScatterZipCreator
code is already in place and what this pull request is doing is reducing the default number of threads from 8 to 2 and making it configurable.
I didn't actually run the code but I navigated around it a bit.
I found a few small things we should probably fix up so I'm hitting "request changes" for now. Details are in the review.
@@ -180,7 +180,7 @@ archiver | |||
|
|||
A step that sends an archival copy of a Dataset Version to a configured archiver, e.g. the DuraCloud interface of Chronopolis. See the `DuraCloud/Chronopolis Integration documentation <http://guides.dataverse.org/en/latest/admin/integrations.html#id15>`_ for further detail. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be nice to fix this "#id15" link while were in here. (It links to the table of contents rather than the actual content below the toc.) We could add a custom anchor and use :ref:
to link to it.
@@ -115,6 +115,7 @@ public class BagGenerator { | |||
private boolean usetemp = false; | |||
|
|||
private int numConnections = 8; | |||
public static final String BAG_GENERATOR_THREADS = ":BagGeneratorThreads"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having a String here, can we keep all the database settings centralized in the Key
enum at SettingsServiceBean?
I tried this locally and the following seems to work:
public static final String BAG_GENERATOR_THREADS = SettingsServiceBean.Key.BagGeneratorThreads.toString();
(I had to import SettingsServiceBean.)
I see that DuraCloudSubmitToArchiveCommand also has strings like :DuraCloudPort, :DuraCloudHost, and :DuraCloudContext hard-coded, but we could centralize those later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've purposely avoided that so far with the hope that at some point these archivers will be packaged separately. Right now they are loaded via reflection (you specify the classname in a property) and there should be no references to the specific Archiver classes or their properties in the main codebase. (As you note though we do have their properties in the master list in the guides so far.)
( I hope it isn't too much work for these classes and their dependencies to be pulled out into separate jars but it isn't something I know how to do off-hand - perhaps @poikilotherm?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I guess I was under the impression that since these archiver classes/commands are in the main code base that they have to be.
It would be nice to figure out how to extract them to a different git repo. It would be a great success story of modularity in Dataverse.
Once they are truly extracted from the main code base, I suppose their documentation should be extracted too, just like we do with external tools.
It sounds like we're not there yet. That's fine. Thanks for the cleanup you did.
|
||
A Dataverse installation can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to the `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_ | ||
A Dataverse installation can be configured to submit a copy of published Datasets, packaged as `Research Data Alliance conformant <https://www.rd-alliance.org/system/files/Research%20Data%20Repository%20Interoperability%20WG%20-%20Final%20Recommendations_reviewed_0.pdf>`_ zipped `BagIt <https://tools.ietf.org/html/draft-kunze-bagit-17>`_ bags to the `Chronopolis <https://libraries.ucsd.edu/chronopolis/>`_ via `DuraCloud <https://duraspace.org/duracloud/>`_, to a local file system, or to `Google Cloud Storage<https://cloud.google.com/storage>`_. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is failing with:
/github/workspace/doc/sphinx-guides/source/admin/integrations.rst:165:Unknown target name: "google cloud storagehttps://cloud.google.com/storage".
@@ -115,6 +115,7 @@ public class BagGenerator { | |||
private boolean usetemp = false; | |||
|
|||
private int numConnections = 8; | |||
public static final String BAG_GENERATOR_THREADS = ":BagGeneratorThreads"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. I guess I was under the impression that since these archiver classes/commands are in the main code base that they have to be.
It would be nice to figure out how to extract them to a different git repo. It would be a great success story of modularity in Dataverse.
Once they are truly extracted from the main code base, I suppose their documentation should be extracted too, just like we do with external tools.
It sounds like we're not there yet. That's fine. Thanks for the cleanup you did.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we fix make html
( TexasDigitalLibrary#2 ) I'm ready to move this to QA. (I did also throw a wording suggestion in this review.)
I haven't tested this at all but the changes make sense and are pretty minor. Again, we're reducing the default number of threads from 8 to 2 and making it configurable.
Co-authored-by: Philip Durbin <philipdurbin@gmail.com>
…ttps://github.com/TexasDigitalLibrary/dataverse.git into TDL/7493-make_BagGenerate_threads_configurable_only
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! Again, I haven't tested it but the changes make sense.
What this PR does / why we need it: Improves scaling of Archiving and allows it to use more or fewer threads depending on machine size
Which issue(s) this PR closes:
Closes #8602
Special notes for your reviewer:
Suggestions on how to test this: Configure archiving (e.g. with LocalSubmitToArchiveCommand), change the :BagGeneratorThreads setting and compare the load and/or watch the log to see how many files are being zipped in parallel.
Does this PR introduce a user interface change? If mockups are available, please link/include them here: No
Is there a release notes update needed for this change?: multiple Bag/Archiving changes coming - perhaps one overall release note?
Additional documentation: guides updated