Extra documentation entries for the validate files across dataset admin API (#6558)
landreev committed Apr 7, 2020
1 parent c3fbad2 commit 9ecb82c
Showing 3 changed files with 21 additions and 3 deletions.
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/admin/troubleshooting.rst
@@ -29,7 +29,7 @@ The following are two real life examples of problems that have resulted in corru
1. Botched file deletes - while a datafile is in DRAFT, attempting to delete it from the dataset involves deleting both the ``DataFile`` database table entry and the physical file. (Deleting a datafile from a *published* version merely removes it from the future versions - but keeps the file in the dataset). The problem we've observed in the early versions of Dataverse was a *partially successful* delete, where the database transaction would fail (for whatever reason), but only after the physical file had already been deleted from the filesystem - leaving a datafile entry in the dataset with the corresponding physical file missing. We believe we have addressed the issue that was making this condition possible, so it shouldn't happen again - but there may be a datafile in this state in your database. Assuming the user's intent was in fact to delete the file, the easiest solution is simply to confirm it and purge the datafile entity from the database. Otherwise the file needs to be restored from backups, or obtained from the user and copied back into storage.
2. Another issue we've observed: a failed tabular data ingest that leaves the datafile un-ingested, but with the physical file already replaced by the generated tab-delimited version of the data. This datafile will fail the validation because the checksum in the database is that of the file in the original format (Stata, SPSS, etc.) as uploaded by the user, while the physical file is now the converted version. Luckily, this is easily reversible, since the uploaded original should be saved in your storage with the ``.orig`` extension. Simply swapping the ``.orig`` copy with the main file associated with the datafile will fix it (see the sketch after this list). Similarly, we believe this condition should not happen again in Dataverse versions 4.20+, but you may have some legacy cases on your server.
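For the second case, the swap is a plain file operation in storage. A minimal sketch for a filesystem store, using a hypothetical dataset directory and a hypothetical storage identifier of ``123-bbb`` (the actual layout depends on your storage driver and configuration)::

  # hypothetical dataset directory - adjust to your storage layout
  cd /path/to/files/10.5072/FK2/XXXXX
  # set the generated tab-delimited copy aside, then restore the uploaded original
  mv 123-bbb 123-bbb.generated
  mv 123-bbb.orig 123-bbb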

The validation API will stop after encountering the first file that does not pass the validation. You can consult the server log file for the error messages indicating which file has failed. But you will likely want to review and verify all the files in the dataset before you unlock it.
The validation API will stop after encountering the first file that does not pass validation. Of course, you will want to review and verify all the files in the dataset before you unlock it. We recommend using the ``/api/admin/validate/dataset/files/{id}`` API. It will go through all the files in the specified dataset and report which ones have failed validation. See :ref:`Physical Files Validation in a Dataset <dataset-files-validation-api>` in the :doc:`/api/native-api` section of the API Guide.
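A usage sketch, with a placeholder DOI (note that the endpoint is part of the admin API, which installations typically restrict to localhost)::

  curl "http://localhost:8080/api/admin/validate/dataset/files/:persistentId/?persistentId=doi:10.5072/FK2/XXXXX"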

Someone Created Spam Datasets and I Need to Delete Them
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
21 changes: 19 additions & 2 deletions doc/sphinx-guides/source/api/native-api.rst
@@ -2887,6 +2887,25 @@ Recalculate the check sum value of a datafile, by supplying the file's dat
Validate an existing check sum value against one newly calculated from the saved file::
curl -H X-Dataverse-key:$API_TOKEN -X POST $SERVER_URL/api/admin/validateDataFileHashValue/{fileId}
.. _dataset-files-validation-api:
Physical Files Validation in a Dataset
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The following validates all the physical files in the specified dataset, by recalculating the checksums and comparing them against the values saved in the database::
$SERVER_URL/api/admin/validate/dataset/files/{datasetId}
It will report the specific files that have failed the validation. For example::
curl http://localhost:8080/api/admin/validate/dataset/files/:persistentId/?persistentId=doi:10.5072/FK2/XXXXX
{"dataFiles": [
{"datafileId":2658,"storageIdentifier":"file://123-aaa","status":"valid"},
{"datafileId":2659,"storageIdentifier":"file://123-bbb","status":"invalid","errorMessage":"Checksum mismatch for datafile id 2669"},
{"datafileId":2659,"storageIdentifier":"file://123-ccc","status":"valid"}
]
}
These calls are only available to superusers.
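If your installation does expose the admin API beyond localhost, a superuser API token can be passed the same way as in the checksum examples above (a sketch, assuming ``$API_TOKEN`` belongs to a superuser and ``$SERVER_URL`` points at your installation)::

  curl -H "X-Dataverse-key:$API_TOKEN" "$SERVER_URL/api/admin/validate/dataset/files/{datasetId}"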
@@ -2928,8 +2947,6 @@ Note that if you are attempting to validate a very large number of datasets in y
asadmin set server-config.network-config.protocols.protocol.http-listener-1.http.request-timeout-seconds=3600
Workflows
~~~~~~~~~
1 change: 1 addition & 0 deletions src/main/java/edu/harvard/iq/dataverse/api/Admin.java
@@ -1069,6 +1069,7 @@ public void write(OutputStream os) throws IOException,

JsonObjectBuilder output = Json.createObjectBuilder();
output.add("datafileId", dataFile.getId());
output.add("storageIdentifier", dataFile.getStorageIdentifier());


try {
