6522 duplicate dvobjects #6612

Merged
merged 6 commits into from
Feb 10, 2020
27 changes: 27 additions & 0 deletions doc/release-notes/6522-datafile-duplicates.md
@@ -0,0 +1,27 @@
In this Dataverse release, we are adding a database constraint to
prevent duplicate DataFile objects pointing to the same physical file
from being created.

Before this release can be deployed, your database must be checked
for any such duplicates that may already exist. If present,
the duplicates will need to be deleted, and the integrity of the
stored physical files verified.

(We notified the community about this issue ahead of the release,
so you may have already addressed it. If so, please disregard
this release note.)

Please run the diagnostic script provided at
https://github.com/IQSS/dataverse/raw/develop/scripts/issues/6522/find_duplicates.sh.
The script relies on the PostgreSQL utility `psql` to access the
database. You will need to edit the credentials at the top of the script
to match your database configuration.
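
As a rough sketch of the download step (not part of the official instructions): the helper name `fetch_script` and the variable `SCRIPT_URL` are made up for illustration, and `curl` is assumed to be installed. The URL is the one given above.

```shell
#!/bin/sh
# Hypothetical helper: download the diagnostic script into the
# current directory. SCRIPT_URL is the address quoted in this note.
SCRIPT_URL="https://github.com/IQSS/dataverse/raw/develop/scripts/issues/6522/find_duplicates.sh"

fetch_script() {
    # -f: fail on HTTP errors, -L: follow redirects, -o: output file name
    curl -fL -o find_duplicates.sh "$SCRIPT_URL"
}
```

After calling `fetch_script`, edit the `pg_*` variables at the top of `find_duplicates.sh`, then run it with `sh find_duplicates.sh`.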

If this issue is not present in your database, you will see a message
`... no duplicate dvObjects in your database. Your installation is
ready to be upgraded to Dataverse 4.20`.

If duplicates are detected, the script will print further instructions.
Please send us the output it produces. We will then assist you
in resolving this problem in your database.

23 changes: 23 additions & 0 deletions scripts/issues/6522/PRE-RELEASE-INFO.txt
@@ -0,0 +1,23 @@
In the next Dataverse release, we are adding a database constraint to
prevent duplicate DataFile objects pointing to the same physical file
from being created.

Before the next release can be deployed, your database must be checked
for any such duplicates that may already exist. If present,
the duplicates will need to be deleted, and the integrity of the
stored physical files verified.

Please run the diagnostic script provided at
https://github.com/IQSS/dataverse/raw/develop/scripts/issues/6522/find_duplicates.sh.
The script relies on the PostgreSQL utility psql to access the
database. You will need to edit the credentials at the top of the script
to match your database configuration.

If this issue is not present in your database, you will see a message
"... no duplicate dvObjects in your database. Your installation is
ready to be upgraded to Dataverse 4.20".

If duplicates are detected, the script will print further instructions.
Please send us the output it produces. We will then assist you
in resolving this problem in your database.

77 changes: 77 additions & 0 deletions scripts/issues/6522/find_duplicates.sh
@@ -0,0 +1,77 @@
#!/bin/sh

# begin config
# PostgreSQL credentials:
# edit the following lines so that psql can talk to your database
pg_host=localhost
pg_port=5432
pg_user=dvnapp
pg_db=dvndb
# you can leave the password blank, if Postgres is configured
# to accept connections without auth:
pg_pass=
# psql executable, add full path if necessary:
PSQL_EXEC=psql

# end config

PG_QUERY_0="SELECT COUNT(DISTINCT o.id) FROM datafile f, dataset s, dvobject p, dvobject o WHERE s.id = p.id AND o.id = f.id AND o.owner_id = s.id AND s.harvestingclient_id IS null AND o.storageidentifier IS NOT null"

PG_QUERY_1="SELECT s.id, o.storageidentifier FROM datafile f, dataset s, dvobject o WHERE o.id = f.id AND o.owner_id = s.id AND s.harvestingclient_id IS null AND o.storageidentifier IS NOT null ORDER BY o.storageidentifier, s.id"

PG_QUERY_2="SELECT p.authority, p.identifier, o.storageidentifier, o.id, o.createdate, f.contenttype FROM datafile f, dvobject p, dvobject o WHERE o.id = f.id AND o.owner_id = p.id AND o.storageidentifier='%s' ORDER by o.id"

PGPASSWORD=$pg_pass; export PGPASSWORD

echo "Checking the number of non-harvested datafiles in the database..."

NUM_DATAFILES=`${PSQL_EXEC} -h ${pg_host} -p ${pg_port} -U ${pg_user} -d ${pg_db} -tA -F ' ' -c "${PG_QUERY_0}"`
if [ $? != 0 ]
then
    echo "FAILED to execute psql! Check the credentials and try again?"
    echo "exiting..."
    echo
    echo "the command line that failed:"
    echo "${PSQL_EXEC} -h ${pg_host} -p ${pg_port} -U ${pg_user} -d ${pg_db} -tA -F ' ' -c \"${PG_QUERY_0}\""
    exit 1
fi

echo $NUM_DATAFILES total.

echo "Let's check if any storage identifiers are referenced more than once within the same dataset:"

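# uniq -c prefixes each distinct adjacent line with its count;
# awk keeps the storage identifier of any line seen more than once.
${PSQL_EXEC} -h ${pg_host} -p ${pg_port} -U ${pg_user} -d ${pg_db} -tA -F ' ' -c "${PG_QUERY_1}" |
    uniq -c |
    awk '{if ($1 > 1) print $NF}' > /tmp/storageidentifiers.tmp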

NUM_CONFIRMED=`cat /tmp/storageidentifiers.tmp | wc -l`

if [ $NUM_CONFIRMED -eq 0 ]
then
    echo
    echo "Good news - it appears that there are NO duplicate dvObjects in your database."
    echo "Your installation is ready to be upgraded to Dataverse 4.20."
    echo
    exit 0
fi

echo "The following storage identifiers appear to be referenced from multiple DvObjects:"
cat /tmp/storageidentifiers.tmp
echo "(output saved in /tmp/storageidentifiers.tmp)"

echo "Looking up details for the affected datafiles:"

cat /tmp/storageidentifiers.tmp | while read -r si
do
    # substitute the storage identifier for the %s placeholder in PG_QUERY_2
    PG_QUERY_SI=`printf "${PG_QUERY_2}" "$si"`
    ${PSQL_EXEC} -h ${pg_host} -p ${pg_port} -U ${pg_user} -d ${pg_db} -tA -F ' ' -c "${PG_QUERY_SI}"
done | tee /tmp/duplicates_info.tmp

echo "(output saved in /tmp/duplicates_info.tmp)"

echo
echo "Please send the output above to Dataverse support."
echo "We will assist you in the database cleanup that needs to happen "
echo "before your installation can be upgraded to Dataverse 4.20."
echo "We apologize for any inconvenience."
echo
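
The duplicate detection in the script above comes down to a `uniq -c | awk` pipeline over lines of the form `<dataset id> <storage identifier>`. The following is a minimal, self-contained sketch of that technique on made-up sample rows. The function name `find_duplicate_identifiers` and the identifiers are illustrative only, and an explicit `sort` is added here so the sketch does not depend on the order of its input.

```shell
#!/bin/sh
# Illustrative only: the sample rows are invented, laid out like the
# "<dataset id> <storage identifier>" lines that PG_QUERY_1 produces.
find_duplicate_identifiers() {
    # sort makes identical lines adjacent; uniq -c prefixes each
    # distinct line with its count; awk prints the last field (the
    # storage identifier) of every line that occurred more than once.
    sort | uniq -c | awk '$1 > 1 { print $NF }'
}

printf '%s\n' \
    '10 16f2a-aaa' \
    '10 16f2a-bbb' \
    '12 16f2a-bbb' \
    '10 16f2a-bbb' \
    '12 16f2a-ccc' | find_duplicate_identifiers
# prints: 16f2a-bbb
```

Only `16f2a-bbb` is reported, because it appears twice within dataset 10. Its single occurrence in dataset 12 is a different (dataset, identifier) pair and is not counted as a duplicate.
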
2 changes: 1 addition & 1 deletion src/main/java/edu/harvard/iq/dataverse/DvObject.java
@@ -46,7 +46,7 @@
, @Index(columnList="owner_id")
, @Index(columnList="creator_id")
, @Index(columnList="releaseuser_id")},
- uniqueConstraints = @UniqueConstraint(columnNames = {"authority,protocol,identifier"}))
+ uniqueConstraints = {@UniqueConstraint(columnNames = {"authority,protocol,identifier"}),@UniqueConstraint(columnNames = {"owner_id,storageidentifier"})})
public abstract class DvObject extends DataverseEntity implements java.io.Serializable {

public static final String DATAVERSE_DTYPE_STRING = "Dataverse";
@@ -0,0 +1 @@
ALTER TABLE dvobject ADD CONSTRAINT unq_dvobject_storageidentifier UNIQUE(owner_id, storageidentifier);
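
The `ALTER TABLE` statement above will fail if any two rows already share the same `(owner_id, storageidentifier)` pair. As a hedged sketch, not part of this PR, one way to check for such rows beforehand is a `GROUP BY ... HAVING` query run through `psql`. The function name `constraint_is_safe` and the variable `DUP_QUERY` are hypothetical; the connection defaults mirror those in `find_duplicates.sh`. Rows with a null `owner_id` or `storageidentifier` are excluded, since PostgreSQL treats nulls as distinct in a `UNIQUE` constraint.

```shell
#!/bin/sh
# Hypothetical pre-check: list (owner_id, storageidentifier) pairs
# that occur more than once and would violate the new constraint.
DUP_QUERY="SELECT owner_id, storageidentifier, COUNT(*) FROM dvobject WHERE owner_id IS NOT NULL AND storageidentifier IS NOT NULL GROUP BY owner_id, storageidentifier HAVING COUNT(*) > 1"

# Returns 0 if no violating pairs exist, 1 if some were found
# (and prints them), 2 if psql itself failed.
constraint_is_safe() {
    rows=$(psql -h "${pg_host:-localhost}" -p "${pg_port:-5432}" \
                -U "${pg_user:-dvnapp}" -d "${pg_db:-dvndb}" \
                -tA -F ' ' -c "${DUP_QUERY}") || return 2
    if [ -n "$rows" ]; then
        echo "$rows"
        return 1
    fi
    return 0
}
```

A caller might run `constraint_is_safe && echo "safe to add the constraint"`. This query looks at every row in `dvobject`, whereas the diagnostic script restricts itself to datafiles in non-harvested datasets, so the two checks can report different results.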