API: As a data owner, I want to "uningest" a tabular data file so that files that should not have been ingested are saved appropriately #3766

Closed
landreev opened this issue Apr 11, 2017 · 10 comments


@landreev
Contributor

We occasionally get requests from users to revert their ingested tabular files to their original state (example: RT 247789 - https://help.hmdc.harvard.edu/Ticket/Display.html?id=247789). Some data were never meant to be tabular; this is particularly common with Excel spreadsheets. For example, authors may use a spreadsheet to list their bibliographical references; having something like that automatically converted to tabular format, with an invitation to "explore" it in TwoRavens, is not what they want.

(There is an open ticket #2199, to allow users to skip tabular ingest on a new file; this issue deals with a file that's already ingested).

The process is fairly straightforward. These uningest requests are currently handled by running a command line script. Several things need to happen:

  • delete the datatable object and its child objects - datavariables, summarystatistics, etc.;
  • recalculate the version UNF;
  • replace the tabular file with the saved original;
  • remove any derivative files;
  • restore the original size, mime type and file name.

All these steps simply need to be reimplemented inside the app.

Some thought will need to go into how to add this option to the UI. (Is it an extra button shown to users with the edit permission on the dataset? Should we use the existing checkboxes, with an extra option in the pulldown menu under "Edit Files"?)

@djbrooke djbrooke changed the title Allow dataset owners to "uningest" tabular data files Allow dataset owners to "uningest" tabular data files Apr 12, 2017
@djbrooke djbrooke changed the title Allow dataset owners to "uningest" tabular data files API: Allow "uningest" of tabular data files Jan 25, 2018
@djbrooke
Contributor

For now, switching this up to cover a new administrative/curation API endpoint instead of a user-facing feature.
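
For reference, an endpoint like that would presumably be invoked along these lines. This is a hypothetical sketch: the exact path and the superuser API-token requirement are assumptions, not the final design.

# hypothetical invocation of the new uningest endpoint; the path and
# the superuser-token requirement are assumptions, not the final design
export API_TOKEN='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'  # a superuser's token
export FILE_ID=12345                                     # datafile database id

curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
  "http://localhost:8080/api/files/$FILE_ID/uningest"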

@djbrooke djbrooke changed the title API: Allow "uningest" of tabular data files API: As a dataverse administrator, I want to "uningest" a tabular data file so that codebooks can be saved appropriately Jan 31, 2018
@djbrooke djbrooke changed the title API: As a dataverse administrator, I want to "uningest" a tabular data file so that codebooks can be saved appropriately API: As a data owner, I want to "uningest" a tabular data file so that codebooks can be saved appropriately Jan 31, 2018
@djbrooke djbrooke changed the title API: As a data owner, I want to "uningest" a tabular data file so that codebooks can be saved appropriately API: As a data owner, I want to "uningest" a tabular data file so that files that should not have been ingested are saved appropriately Jan 31, 2018
@djbrooke
Contributor

djbrooke commented Feb 2, 2018

@landreev - in backlog grooming this week, you mentioned you'd share a script in this issue.

@djbrooke
Contributor

djbrooke commented Apr 4, 2018

@landreev - we're taking this on this sprint, can you drop in that script? Thanks!

@landreev
Contributor Author

Missed the message above - sorry.
Here's the script we are currently using in production to uningest a tabular file; creating an API call to do this would involve implementing the same 9 or 10 steps from the script (which, in turn, simply reverses what happens during a tabular ingest).

The script takes the (database) id of the datafile.
Note that it calls 2 other scripts that live on the prod. system (in /usr/local/dataverse-admin/bin):
runsql - runs an arbitrary SQL query in postgres; if the query produces output, it prints it out tab-delimited;
lookup_file_by_id.sh - does what the name suggests; for our purposes, it translates the database id of the datafile into the S3 storage identifier - for example, "s3://dataverse-bucket/10.7910/DVN/PQ7IUE/15ea0342996-41b3ee674df0" - that the AWS command line utility understands.
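
(Neither helper is part of Dataverse itself. For context, runsql is essentially a thin wrapper around psql; a minimal sketch of what it might look like - the database name, the user, and the handling of the -q flag here are assumptions:)

#!/bin/sh
# minimal sketch of a runsql-style wrapper (an approximation, not the
# actual production script): runs the given SQL against the dataverse
# database; any query output comes back tab-delimited, with no headers
[ "$1" = "-q" ] && shift    # the production script is invoked with -q

psql -U dvnapp -d dvndb -qtA -F "$(printf '\t')" -c "$1"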

The script:

#!/bin/sh

# takes the datafile db id as the argument
# (see 247789-uningest-tabular/README.txt)

fileid=$1

if [ $fileid"x" = "x" ]
then
    echo "usge: ./uningest.sh <DATAFILE ID>"
    exit 1
fi

PATH=/usr/local/dataverse-admin/bin:$PATH; export PATH

filename=`echo $fileid | lookup_file_by_id.sh | grep '^FILE NAME' | awk '{print $4}'`

datatableid=`runsql -q 'SELECT id FROM datatable WHERE datafile_id='$fileid`

originaltype=`runsql -q 'SELECT originalfileformat FROM datatable where datafile_id = '$fileid`

echo "ORIGINAL TYPE: " $originaltype

# delete the tabular data associated with the file:

runsql -q "DELETE FROM summarystatistic WHERE datavariable_id in (SELECT id FROM datavariable WHERE datatable_id=$datatableid)"

runsql -q "DELETE FROM datavariable WHERE datatable_id=$datatableid"
    
runsql -q "DELETE FROM datatable WHERE id=$datatableid"

# and reset the ingest stats: 

runsql -q 'DELETE FROM ingestreport WHERE datafile_id = '$fileid
runsql -q "UPDATE datafile SET ingeststatus='A' where id=$fileid"

# reset the size (from the size of the generated tabular file, back to the size of the original)

originalsize=`aws s3 ls ${filename}.orig | awk '{print $3}'`
echo "ORIGINAL SIZE: " $originalsize

runsql -q "UPDATE datafile SET filesize=$originalsize WHERE id=$fileid"

# replace the generated tabular file with the preserved original: 

echo aws s3 mv $filename.orig $filename
aws s3 mv $filename.orig $filename

# fix the mime type: 

runsql -q "UPDATE datafile SET contenttype='$originaltype' WHERE id=$fileid"

case $originaltype in
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
	originalextension="xlsx"
	;;
application/stata*)
	originalextension="dta"
	;;
application/x-stata*)
	originalextension="dta"
	;;
application/x-spss-sav)
	originalextension="sav"
	;;
application/x-spss-por)
	originalextension="por"
	;;
application/x-rlang-transport)
	originalextension="RData"
	;;
text/csv)
	originalextension="csv"
	;;
*)
	originalextension="unknown"
	;;
esac

echo "restoring the ORIGINAL EXTENSION: "$originalextension

# Change the filename extension from ".tab" back to what it was before the file was ingested:

runsql -q "SELECT id,label FROM filemetadata WHERE datafile_id=$fileid" | sed 's/\.tab//' | while read fmid label
do
    echo "filemetadata id: "$fmid", changing label to "$label"."$originalextension
    runsql -q "UPDATE filemetadata SET label='${label}.${originalextension}' WHERE id=$fmid"
done


# finally, fix/recalculate the UNFs of the versions that have this file:

runsql -q "SELECT datasetversion_id FROM filemetadata WHERE datafile_id=$fileid" | while read versionid
do
    runsql -q 'UPDATE datasetversion SET unf=null WHERE id='$versionid
    curl -X POST http://localhost:8080/api/admin/datasets/integrity/$versionid/fixmissingunf?forceRecalculate=true
done
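
For example, to uningest the datafile with the database id 12345 (a made-up id):

./uningest.sh 12345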

@landreev
Contributor Author

landreev commented Apr 10, 2018

To clarify: whoever ends up working on this, you don't need to literally reimplement this script in Java 1:1! It's provided for reference, to list everything that needs to be done. But it is an admin script written specifically for our prod. system. It assumes that files live on S3 - the app code should make no such assumption; simply use the StorageIO framework to replace the tabular file with the stored original. Likewise, you don't need to write code that generates the file extension from the stored original mime type - there is already a method in FileUtil that does that. Etc.

@landreev landreev removed their assignment May 8, 2018
@kcondon kcondon self-assigned this May 8, 2018
@kcondon
Contributor

kcondon commented May 8, 2018

Issues/questions so far:

  • API endpoint needs to be documented.
  • What should happen when an ingested file is tabular but also geospatial? Should the geospatial aspects be deleted? The Dataverse side is deleted, but the WorldMap side remains.
  • Extra ingest-related files remain: .RData and .prep.
  • Export files generated from the ingested version remain, so they still contain variable info.
  • Tabular tags remain after uningest (a quick check is sketched below).
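
A couple of these leftovers are easy to check for. A sketch, reusing the $fileid and $filename variables from the uningest script above, and assuming that tabular tags live in a datafiletag table keyed on datafile_id, and that derivative files follow the same <storage-identifier>.<suffix> naming pattern as the saved original:

# check for leftover tabular tags (the table/column names are
# assumptions based on the Dataverse schema, not confirmed here)
runsql -q "SELECT id, type FROM datafiletag WHERE datafile_id=$fileid"

# check for leftover derivative files next to the main S3 object
aws s3 ls "${filename}.prep"
aws s3 ls "${filename}.RData"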

@landreev
Contributor Author

landreev commented May 9, 2018

@sekmiller Are you sure it was necessary to do that .merge() on the FileMetadata in order to get the new extension to stick?
It was working for me consistently without that merge. (We do a merge on the DatasetVersion that contains that FileMetadata two lines down...)
Well, I don't think it should hurt either... Looks OK otherwise.

@sekmiller
Contributor

It wasn't working at all for me without it.
