Bean Validation errors calling file redetect api endpoint #8821

Closed
matthew-a-dunlap opened this issue Jun 29, 2022 · 5 comments · Fixed by #8835

@matthew-a-dunlap
Contributor

During development of CORE2, I've been using pyDataverse to handle our Dataverse interactions.

One aspect of this is uploading files. We ran into #8344, which causes the MIME type not to be set. Because we want to support older installations, I'm shooting for a solution that doesn't require the fix pushed by @landreev (though I'm glad it exists!).

The solution I've tried is to call the redetect endpoint to get the correct file type. This works and there are no errors thrown in the response... BUT there are concerning messages now appearing in our logs. Note this is on our S3-based test Dataverse running 5.3:

file_redetect_error.log

I'm curious if anyone over at IQSS has insight as to what might be causing this? Maybe this is a pyDataverse issue but it seems like the calls are pretty straightforward. We are concerned specifically that all these warnings indicate that something is corrupting the metadata in our database.

Thanks much!

P.S. In case it helps, here are the responses from a few of our calls to the pyDataverse upload_datafile and redetect_file_type functions:

2022-06-29 19:18:08 DEBUG    [dataverse:059] Manuscript 32, upload_file_response {'_content': b'{"status":"OK","data":{"files":[{"description":"","label":"LagodnyJonesKochEnns_MainAnalysis.dta","restricted":false,"version":1,"datasetVersionId":32164,"dataFile":{"id":7519737,"persistentId":"","pidURL":"","filename":"LagodnyJonesKochEnns_MainAnalysis.dta","contentType":"application/x-stata-14","filesize":317240,"description":"","storageIdentifier":"s3://dataverse-awstest-dev:181b0e63abf-550fb746720c","rootDataFileId":-1,"md5":"9009bf1fb8fa1a8a1388c8feec250857","checksum":{"type":"MD5","value":"9009bf1fb8fa1a8a1388c8feec250857"},"creationDate":"2022-06-29"}}]}}', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'Date': 'Wed, 29 Jun 2022 19:18:06 GMT', 'Server': 'Apache/2.4.37 (Red Hat Enterprise Linux) OpenSSL/1.1.1k', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'PUT, GET, POST, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Content-Type, X-Dataverse-Key', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '570', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}, 'raw': <urllib3.response.HTTPResponse object at 0x7f9968364790>, 'url': 'https://dataverse-awstest.irss.unc.edu/api/v1/datasets/:persistentId/add?persistentId=doi:10.33563/FK2/B4PIIQ&User-Agent=pydataverse&key=feac0f49-c19a-42da-abb0-88ec3778e824', 'encoding': 'UTF-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(seconds=1, microseconds=747831), 'request': <PreparedRequest [POST]>, 'connection': <requests.adapters.HTTPAdapter object at 0x7f99a9132c40>}
2022-06-29 19:18:09 DEBUG    [dataverse:064] Manuscript 32, redetect_response {'_content': b'{"status":"OK","data":{"dryRun":false,"oldContentType":"application/x-stata-14","newContentType":"application/x-stata-14"}}', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'Date': 'Wed, 29 Jun 2022 19:18:08 GMT', 'Server': 'Apache/2.4.37 (Red Hat Enterprise Linux) OpenSSL/1.1.1k', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'PUT, GET, POST, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Content-Type, X-Dataverse-Key', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '123', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}, 'raw': <urllib3.response.HTTPResponse object at 0x7f99988da130>, 'url': 'https://dataverse-awstest.irss.unc.edu/api/v1/files/7519737/redetect?dryRun=false&User-Agent=pydataverse&key=feac0f49-c19a-42da-abb0-88ec3778e824', 'encoding': 'UTF-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(seconds=1, microseconds=63350), 'request': <PreparedRequest [POST]>, 'connection': <requests.adapters.HTTPAdapter object at 0x7f99b97d5250>}
2022-06-29 19:18:10 DEBUG    [dataverse:059] Manuscript 32, upload_file_response {'_content': b'{"status":"OK","data":{"files":[{"description":"","label":"LagodnyJonesKochEnns_StatePolicyMood_Codebook.pdf","restricted":false,"version":1,"datasetVersionId":32164,"dataFile":{"id":7519738,"persistentId":"","pidURL":"","filename":"LagodnyJonesKochEnns_StatePolicyMood_Codebook.pdf","contentType":"text/plain","filesize":96596,"description":"","storageIdentifier":"s3://dataverse-awstest-dev:181b0e643cf-6953673c070b","rootDataFileId":-1,"md5":"c21ceab2fd4bdd1d34065d2da91d6651","checksum":{"type":"MD5","value":"c21ceab2fd4bdd1d34065d2da91d6651"},"creationDate":"2022-06-29"}}]}}', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'Date': 'Wed, 29 Jun 2022 19:18:09 GMT', 'Server': 'Apache/2.4.37 (Red Hat Enterprise Linux) OpenSSL/1.1.1k', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'PUT, GET, POST, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Content-Type, X-Dataverse-Key', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '581', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}, 'raw': <urllib3.response.HTTPResponse object at 0x7f9968388700>, 'url': 'https://dataverse-awstest.irss.unc.edu/api/v1/datasets/:persistentId/add?persistentId=doi:10.33563/FK2/B4PIIQ&User-Agent=pydataverse&key=feac0f49-c19a-42da-abb0-88ec3778e824', 'encoding': 'UTF-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(microseconds=968581), 'request': <PreparedRequest [POST]>, 'connection': <requests.adapters.HTTPAdapter object at 0x7f995823e2e0>}
2022-06-29 19:18:11 DEBUG    [dataverse:064] Manuscript 32, redetect_response {'_content': b'{"status":"OK","data":{"dryRun":false,"oldContentType":"text/plain","newContentType":"application/pdf"}}', '_content_consumed': True, '_next': None, 'status_code': 200, 'headers': {'Date': 'Wed, 29 Jun 2022 19:18:10 GMT', 'Server': 'Apache/2.4.37 (Red Hat Enterprise Linux) OpenSSL/1.1.1k', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Methods': 'PUT, GET, POST, DELETE, OPTIONS', 'Access-Control-Allow-Headers': 'Content-Type, X-Dataverse-Key', 'Content-Type': 'application/json;charset=UTF-8', 'Content-Length': '104', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}, 'raw': <urllib3.response.HTTPResponse object at 0x7f99988daf10>, 'url': 'https://dataverse-awstest.irss.unc.edu/api/v1/files/7519738/redetect?dryRun=false&User-Agent=pydataverse&key=feac0f49-c19a-42da-abb0-88ec3778e824', 'encoding': 'UTF-8', 'history': [], 'reason': 'OK', 'cookies': <RequestsCookieJar[]>, 'elapsed': datetime.timedelta(microseconds=949062), 'request': <PreparedRequest [POST]>, 'connection': <requests.adapters.HTTPAdapter object at 0x7f995827cee0>}
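
For anyone reading along, here is a minimal sketch of how the redetect call and its response can be handled without pyDataverse. The URL shape and the `oldContentType`/`newContentType` fields are taken from the logged responses above; the helper names `build_redetect_url` and `type_changed` are hypothetical, and authentication (for example the `X-Dataverse-Key` header listed in the response headers) is left to the caller. Defaulting to `dryRun=true` is my own choice so that nothing is written until the result has been inspected.

```python
from urllib.parse import urlencode


def build_redetect_url(base_url: str, file_id: int, dry_run: bool = True) -> str:
    """Build the redetect URL in the same shape as the logged requests.

    The helper name is hypothetical; only the path and the dryRun
    parameter come from the log lines quoted in this issue.
    """
    query = urlencode({"dryRun": str(dry_run).lower()})
    return f"{base_url.rstrip('/')}/api/v1/files/{file_id}/redetect?{query}"


def type_changed(response_json: dict) -> bool:
    """Return True when redetect reports a different content type."""
    data = response_json["data"]
    return data["oldContentType"] != data["newContentType"]
```

The resulting URL would be sent as a POST, as in the logged requests. Applied to the second logged redetect response (text/plain to application/pdf), `type_changed` returns True; for the first one (application/x-stata-14 unchanged) it returns False.
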
@matthew-a-dunlap
Contributor Author

Ah, whoops! I just saw that this is quite possibly the same issue as #7527. I'll leave this up until it gets looked at, but if I should move my issue there, let me know.

@matthew-a-dunlap
Contributor Author

matthew-a-dunlap commented Jun 29, 2022

I found a solution, which was to fork pyDataverse and manually add the mime-type to the file posts. I may create a pyDataverse PR to allow these to be passed in.

I'm going to leave this up for now because the issues I had related to #7527 might be helpful. But feel free to close or delete if that makes sense.

Edit: I've created a fork that allows passing the MIME type to Dataverse. This is needed for uploading to old installations: https://github.com/OdumInstitute/pyDataverse/tree/mime_type_upload
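
The idea of sending the MIME type explicitly can be sketched roughly as follows. This is not the fork's actual code or pyDataverse's API; `multipart_file_field` is a hypothetical helper, and guessing the type from the filename via the standard-library `mimetypes` module is an assumption made for illustration. The returned mapping uses the `(filename, content, content_type)` tuple form that the `requests` library accepts for its `files=` argument.

```python
import mimetypes
from typing import Optional


def multipart_file_field(filename: str, content: bytes,
                         mime_type: Optional[str] = None) -> dict:
    """Build a multipart 'file' field that carries an explicit MIME type.

    Hypothetical helper: when no type is supplied, fall back to a guess
    based on the file extension, then to a generic binary type.
    """
    if mime_type is None:
        guessed, _encoding = mimetypes.guess_type(filename)
        mime_type = guessed or "application/octet-stream"
    # Tuple form understood by requests: (filename, fileobj_or_bytes, content_type)
    return {"file": (filename, content, mime_type)}
```

Extensions that `mimetypes` does not know should be passed explicitly, for example `multipart_file_field("analysis.dta", data, "application/x-stata-14")`.
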

@landreev
Contributor

landreev commented Sep 12, 2022

I did miss this issue back in June (on account of being on vacation, most likely).

I also made a PR into pyDataverse, adding a way to explicitly supply the mime type as an argument on file upload (gdcc/pyDataverse#142), in parallel with the fix for handling of upload calls without type headers on the Dataverse side (#8392). It was never merged; I couldn't tell from your comments in gdcc/pyDataverse#118 if you saw and/or tried that. (I haven't looked yet, but it sounds like your pyDataverse fork does the same thing.)

@landreev
Contributor

landreev commented Sep 12, 2022

(the info below may not be of interest/practical value to you - you seem to have worked around it anyway - but should be useful for us/the dev. team in the context of the overall cleanup of the redetect functionality)

The message at the top of the log - tmp is a file extension Dataverse doesn't know about... - that is the bug being fixed in #8835. Meaning once it's merged, it may or may not fix the redetect issue as reported above, depending on whether the filename had a recognized extension.

The error in the attached server log does look like a real constraint violation. Almost certainly the result of the redetect code trying to save the DataFile with the contentType set to null. I need to take a closer look at the code, but it appears that the method simply needs a null check in the end, before trying to save the new type in the database. Will make a PR if that's the case. (There are other things in the log there - like some index errors further down in the stack trace - but those are a result of not being able to save the datafile in the db). [Edit: #8835 already contains this null check!].
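
As a rough illustration of the null check described above (Dataverse itself is Java and this is not its code; `content_type_to_save` is a hypothetical name), the guard amounts to keeping the existing type whenever detection returns nothing:

```python
from typing import Optional


def content_type_to_save(current: str, detected: Optional[str]) -> str:
    """Pick the content type to persist after a redetect attempt.

    Sketch only: if detection produced no result, keep the current
    value instead of writing a null into the database.
    """
    if detected is None:
        return current
    return detected
```
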

Interestingly, @donsizemore and I were looking at a virtually identical stack trace a few months ago, thrown by an attempt to recalculate the md5 of an old Odum file. Don traced it to a newline character in the mimetype of that file, which violated ^.*/.*$ in the constraint.
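
To see why a stray newline trips that pattern, here is a small Python sketch. It assumes the constraint is evaluated as a whole-string match, which `re.fullmatch` approximates here; `satisfies_constraint` is a hypothetical name and not part of Dataverse. Because `.` does not match a newline by default, a value ending in a newline cannot be matched in full.

```python
import re

# Pattern quoted in the comment above; everything else here is illustrative.
CONTENT_TYPE_PATTERN = re.compile(r"^.*/.*$")


def satisfies_constraint(content_type: str) -> bool:
    """Return True when the whole string matches the content type pattern."""
    return CONTENT_TYPE_PATTERN.fullmatch(content_type) is not None
```

With this sketch, `"application/pdf"` passes, while `"text/plain\n"` (trailing newline) and `"tmp"` (no slash) both fail.
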

@landreev
Contributor

I apologize for muddying the waters here. This is indeed the same issue as #7527, and #8835 fixes it.

@pdurbin pdurbin added this to the 5.12 milestone Sep 19, 2022