New API option to get the download size of the dataset version files without the original tabular files size #9960
Conversation
…oadSize datasets API endpoint
…inal tab file sizes
if (dataset == null) {
    // should never happen - must indicate some data corruption in the database
    throw new CommandException(BundleUtil.getStringFromBundle("datasets.api.listing.error"), this);
}

logger.fine("getDataverseStorageSize called on " + dataset.getDisplayName());
I have changed this order to avoid an NPE when logging.
…s for /downloadsize API endpoint
@Context HttpHeaders headers) {
    return response(req -> {
        DatasetVersion datasetVersion = getDatasetVersionOrDie(req, version, findDatasetOrDie(dvIdtf), uriInfo, headers);
        Long datasetStorageSize = ignoreOriginalTabularSize ? DatasetUtil.getDownloadSizeNumeric(datasetVersion, false)
Hmm. Backward compatibility is important... although I honestly have no idea how popular this api is (nobody seems to be using it on our instance here), and if it was originally implemented that way (to count both types) because somebody specifically requested it/had a reason to count both, or if it was just implemented like that unintentionally.
I'm ok with leaving its default behavior backward compatible...
However, I'm assuming that the new UI will work the same way as the current page - with the user having a choice between downloading either of the 2 types, but never both at the same time, correct? In that case we need this api to also be able to only count the original sizes. So, maybe instead of the `ignoreOriginalTabularSize` boolean, something like `mode=(all|original|archival)`, with `all` being the default?
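If it helps, here is a minimal, self-contained sketch of how such a `mode` parameter could be parsed. The class and enum names are hypothetical illustrations, not the actual Dataverse implementation:

```java
import java.util.Locale;

// Hypothetical sketch of parsing a mode=(All|Original|Archival) query parameter.
// Names are illustrative only; they are not taken from the Dataverse codebase.
public class ModeParamSketch {

    public enum FileDownloadSizeMode { ALL, ORIGINAL, ARCHIVAL }

    // Case-insensitive parse. Returns null for unknown values so the caller
    // can respond with 400 Bad Request; a missing parameter falls back to the
    // backward-compatible default of counting both sizes.
    public static FileDownloadSizeMode parseMode(String raw) {
        if (raw == null || raw.isEmpty()) {
            return FileDownloadSizeMode.ALL;
        }
        try {
            return FileDownloadSizeMode.valueOf(raw.toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(parseMode(null));       // ALL
        System.out.println(parseMode("original")); // ORIGINAL
        System.out.println(parseMode("bogus"));    // null
    }
}
```

With this shape, unrecognized modes are distinguishable from the default, so the endpoint can reject them explicitly instead of silently falling back.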
Aside from my earlier comment, my bigger concern - at least in the longer term - is performance. I feel like the best solution would be to use custom database queries instead. I'm not insisting that this needs to be done in this PR, but we need to somehow document this as a todo that needs to be addressed. I'm assuming it will need to be addressed for the MVP - we do have monstrous datasets in our prod. with 25K and 40+K files.

Thank you for your detailed comments. Maybe I have been too strict with backwards compatibility, which on this endpoint is not so important. I agree with what you mention about performance; it is something I had not considered at all, so I appreciate you mentioning it. I like the approach of using database queries + `mode=(all|original|archival)` and would like to at least evaluate the implementation for this PR. I will let you know. Thank you!

From some quick archaeological digging, that "downloadsize" api was indeed added by request from an outside project, who wanted to use it in their own custom UI app they were building to work with Dataverse; and it was loosely matching their use case (they wanted to download everything, original and archival copies of the tab. files). It was obviously never anticipated that we'd be building our own SPA UI that would want to use this api. I don't know if that project is still around, and if that backward compatibility is needed (but I can ask around). But, as I said, I'm still ok with preserving it, just in case.

Yeah. The use case still makes sense at least: "I'm an integrating system (Whole Tale, etc) and I want to provide some expectation in my UI about how long something will take when pulling files from Dataverse" From #6524 (comment)
Just want to mention here that, instead of creating custom database queries for calculating these sizes, another alternative would be to try and utilize the Eclipse left join "hints", to avoid the follow-up database queries on the datafiles and datatables; so, instead of doing …

etc. That would look up all the datafiles and datatables in one query, without the N additional queries, and then we can iterate through the list as before. (There's a chance that you don't need the …) The only way to find out which approach is better would be to test and time it on a dataset with 10Ks of files. A custom query would almost certainly be faster; but the difference may or may not be significant. The left-join approach above may feel a little cleaner, but I'm assuming you pay an extra cost in memory.

I may need to take back the "bare minimum" part though - if we ever use that left join hint approach, we have to add hints for any other 1:1 relationships there - or else we are back to the EJB executing N additional queries for them...
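For context, the left-join "hints" approach discussed above could look roughly like the following. This is only a sketch: `eclipselink.left-join-fetch` is a real EclipseLink query hint, but the entity and relation names (`FileMetadata`, `fm.dataFile.dataTables`, etc.) are assumptions based on this thread, not verified against the Dataverse codebase.

```java
// Sketch only: attach EclipseLink left-join-fetch hints so the file list and
// its related entities come back in a single query instead of N+1 queries.
// Entity and relation paths here are assumptions, not verified Dataverse code.
List<FileMetadata> files = em.createQuery(
        "SELECT fm FROM FileMetadata fm WHERE fm.datasetVersion.id = :versionId",
        FileMetadata.class)
    .setParameter("versionId", datasetVersion.getId())
    // Every 1:1 or 1:N relation touched later needs its own hint, or
    // EclipseLink falls back to issuing N follow-up queries for it.
    .setHint("eclipselink.left-join-fetch", "fm.dataFile")
    .setHint("eclipselink.left-join-fetch", "fm.dataFile.dataTables")
    .getResultList();
```

This matches the "bare minimum" caveat above: the hints only pay off if every relation the size calculation touches is covered.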
[edited:] Interesting, I never used QueryDSL myself, so I was thinking of writing these queries by hand. It definitely looks cleaner; and it's kind of neat that you can write …

In case you saw the first version of this comment, I was a little confused trying to read the screenshot above; I ended up building the branch, looking at the Postgres log, and playing with QueryDSL some more.

A minor thing, but is the …

(I'm ready to approve the PR, looks good!)
uploadTabularFileResponse.then().assertThat().statusCode(OK.getStatusCode());

// Get the original tabular file size
int tabularOriginalSize = Integer.parseInt(uploadTabularFileResponse.getBody().jsonPath().getString("data.files[0].dataFile.filesize"));
Minor thing, but why not just `= 157`? (the known size of `src/test/resources/tab/test.tab`).
It's a tiny file that should get ingested almost instantly. There may be a non-zero chance of the size in the response above already being the size of the generated archival file (I'm not sure... but it should be safest to use the known size, I think).
I agree. If it makes it safer, it's worth updating.
Sorry, my bad - if you have read it already, please ignore my last comment about the query - I missed that …
.from(fileMetadata)
.where(fileMetadata.datasetVersion.id.eq(datasetVersion.getId()))
.from(dataTable)
.where(fileMetadata.dataFile.dataTables.isNotEmpty().and(dataTable.dataFile.eq(fileMetadata.dataFile)))
Yes, `.where(dataTable.dataFile.eq(fileMetadata.dataFile))` should be sufficient - and it simplifies the resulting query a bit.
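For illustration, the simplified query might look roughly like this. The select expression, field names (e.g. `originalFileSize`), and `queryFactory` setup are assumptions, not taken verbatim from the PR:

```java
// Sketch only - surrounding setup and field names are assumptions.
// The isNotEmpty() guard is redundant: the equality join on
// dataTable.dataFile already restricts the result to tabular files.
Long totalOriginalSize = queryFactory
        .select(dataTable.originalFileSize.sum())
        .from(fileMetadata)
        .from(dataTable)
        .where(fileMetadata.datasetVersion.id.eq(datasetVersion.getId()))
        .where(dataTable.dataFile.eq(fileMetadata.dataFile))
        .fetchOne();
```

In QueryDSL, successive `.where()` calls are combined with AND, so dropping the `isNotEmpty()` term changes the generated SQL but not the result set.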
…SS/dataverse into 9958-dataset-files-size
📦 Pushed preview images as …
🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.
Perfect, thank you!
    return error(Response.Status.BAD_REQUEST, "Invalid mode: " + mode);
}
DatasetVersion datasetVersion = getDatasetVersionOrDie(req, version, findDatasetOrDie(dvIdtf), uriInfo, headers);
long datasetStorageSize = datasetVersionFilesServiceBean.getFilesDownloadSize(datasetVersion, fileDownloadSizeMode);
Hey, I'm wondering if I gave you bad advice, to move this size calculation out of the command. The command provided one extra function - the authorization (requiring the `ViewUnpublishedDataset` permission on draft versions) - which we no longer appear to be enforcing in the above (?).
I can't imagine this information, this download size, being truly sensitive... but we apparently used to require that permission, so at the very least we need to confirm if it is necessary or not.
Sorry for the extra confusion.
(I'll ask on slack)
Uh, never mind? That authorization check is still being performed, inside `getDatasetVersionOrDie()`.
@GPortas I am seeing the same download size for all variations of the endpoint.
Merged commit 3d3a553 into 9852-files-api-deaccessioned-datasets
What this PR does / why we need it:
Added a new optional query parameter "mode" to the "getDownloadSize" API endpoint ("api/datasets/{identifier}/versions/{versionId}/downloadsize").
This parameter applies a filter criteria to the operation and supports the following values:
- All (Default): Includes both archival and original sizes for tabular files
- Archival: Includes only the archival size for tabular files
- Original: Includes only the original size for tabular files
Which issue(s) this PR closes:
Special notes for your reviewer:
Suggestions on how to test this:
Upload tabular files to a dataset and test the endpoint via curl.
Ignoring the tabular original sizes:
curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=Archival"
Including the tabular original and archival sizes (current behavior):
curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize"
The above output should be the same as for:
curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=All"
Count only the original sizes (ignores archival size for tab files):
curl -H "X-Dataverse-key:xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx" "http://localhost:8080/api/datasets/24/versions/1.0/downloadsize?mode=Original"
Does this PR introduce a user interface change? If mockups are available, please link/include them here:
No
Is there a release notes update needed for this change?:
Yes
Additional documentation:
No