DB Constraint to prevent duplicate DvObjects for the same physical file #6522
I started looking into this yesterday and was able to cleanly fix the 3 most affected datasets: 10.7910/DVN/IYGVYS, 10.7910/DVN/LQ2KFZ and 10.7910/DVN/8A1XO3. These had ~1300 duplicated datafiles between the 3 of them (!).
In the 3 datasets above, all the affected files followed the same pattern: they were uploaded in large batches (from 250 to 550 files); they all ended up assigned sequential ids in the 2N range, where N is the size of the batch; and each file ended up with 2 identical dvobjects/datafiles. I can think of various fanciful scenarios that could trigger it... But how much do we want to invest in figuring out how it happened, as opposed to just slapping the unique constraint on the storageidentifier?
All the existing duplicates have been removed.
@landreev any chance you could include your SQL queries (and cleanup methods) in the PR? (and thank you!)
Yes, definitely. The way I did it involved a lot of manual labor (unfortunately); I'm now trying to package all the queries etc. in some better-organized way that could be passed on to the other installations. Have you actually seen this condition in your database? I hope not; I'm still hoping this was something that happened because of our unique load conditions, as part of the specific system instability we were experiencing here. But I'm preparing for the worst: the possibility that at least some of the remote installations have experienced this as well. :(
@donsizemore A quick test to check whether you have been affected by this issue: first, count the total number of datafiles.
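A sketch of what that first query would look like (reconstructed here; it assumes datafiles are the dvobject rows with dtype = 'DataFile'):

```sql
-- Total number of datafile dvobjects:
SELECT COUNT(id) FROM dvobject WHERE dtype = 'DataFile';
```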
Then count the number of distinct datafile storageidentifiers within datasets, and see if you get the same number:
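And a sketch of the second query (again assuming the storageidentifier column lives on dvobject, with owner_id pointing to the owning dataset):

```sql
-- Distinct (dataset, storage identifier) pairs; with no duplicates present,
-- this should match the total count above:
SELECT COUNT(DISTINCT (owner_id, storageidentifier))
FROM dvobject
WHERE dtype = 'DataFile';
```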
Of course, if this is your production database, you may get two different numbers because somebody has just uploaded and/or deleted some files... Running the queries on a snapshot copy of the database eliminates that problem. If it looks like you have some duplicate dvObjects, please let me know.
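If the two counts do differ, a query along these lines (a sketch, under the same schema assumptions as above) should list the duplicate groups themselves:

```sql
-- For each duplicated (dataset, storage identifier) pair, list all the
-- dvobject ids involved; the lowest id in each group is presumably the
-- original upload:
SELECT owner_id,
       storageidentifier,
       ARRAY_AGG(id ORDER BY id) AS dvobject_ids
FROM dvobject
WHERE dtype = 'DataFile'
GROUP BY owner_id, storageidentifier
HAVING COUNT(*) > 1;
```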
We upgraded from 4.11 to 4.16 four days ago...
OK, it looks like you have 8 duplicates. I'll send further instructions shortly.
I realized (remembered) why we didn't put a unique constraint on the storageidentifier field in the first place: they were not, in fact, historically unique. In its current form, the storage identifier is a product of a timestamp in milliseconds and a reasonably long random string, so it's more or less guaranteed to be unique. But that wasn't always the case in the old days (DVN/VDC?): since each dataset has its own storage folder, the file name only needed to be unique within the dataset. So we have a number of grandfathered/migrated files with non-unique storageidentifiers in the database.

So I guess instead of making the field unique, we need to make the combination of the storageidentifier and the owning dataset id unique. Although I'm debating whether we should just go ahead and force uniqueness on the field itself, and migrate/rename any files with non-unique names. Does anyone have an opinion on how to proceed? (@donsizemore: yours is likely the one installation outside of ours that's old enough to have such legacy files)
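For reference, the two options being weighed would look roughly like this (a sketch only; the constraint names are made up, and neither ALTER will succeed while duplicate rows are still present):

```sql
-- Option 1: uniqueness within a dataset (tolerates the legacy reuse of
-- file names across datasets):
ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier
  UNIQUE (owner_id, storageidentifier);

-- Option 2: global uniqueness (would first require migrating/renaming the
-- grandfathered files with non-unique names):
ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier_global
  UNIQUE (storageidentifier);
```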
The earliest dupes in our database were from late 2018. And that was before 4.11.
@landreev thanks for bringing this up at standup. What I don't have a sense for is how much effort it would be (and how much risk there would be) to migrate/rename those old non-unique storage identifiers from the old days. If it's relatively easy, it feels cleaner to me to migrate them and apply a single constraint on the storageidentifier field itself.
diagnostics script plus "pre-release note", to be sent preemptively to partner installations, so that if the issue with the datafile duplicates is detected we can start assisting them with cleanup before they upgrade to the next release.
…hat we'll need to send out to the remote installations). #6522
…e out what to do with harvested files. (#6522)
@djbrooke I'd like to reopen this one, pending the decision next week on how (or whether) to proceed with the database constraint.
OK, I didn't know if we should reopen or add a new issue for the specific harvesting case. Reopening now...
As discussed, closing the issue, with the new issue #7451 opened to finish the one remaining task: resolve the issue with non-unique "storageidentifiers" for harvested files in legacy databases, then add to the next release a flyway script that applies the constraint to any databases that don't have it yet.
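A minimal sketch of what such a flyway script could contain (the file name and constraint name below are illustrative, not the actual ones; the DROP ... IF EXISTS is one way to keep the script safe to run on databases that already have the constraint):

```sql
-- e.g. V5.x__6522-storageidentifier-constraint.sql (name hypothetical)
ALTER TABLE dvobject
  DROP CONSTRAINT IF EXISTS unq_dvobject_storageidentifier;
ALTER TABLE dvobject
  ADD CONSTRAINT unq_dvobject_storageidentifier
  UNIQUE (owner_id, storageidentifier);
```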
Another issue with files, discovered in production in recent days:
There appear to be a number of instances in the db where TWO dvobjects/datafiles are associated with the same storage identifier (i.e., the same physical file).
I have not yet identified the exact scenario that makes it possible. But it must be something going wrong during upload and save.
Opening the issue to diagnose and fix the problem.
As always with things like this, this will involve both code fixes and cleanup.