API: As a data owner, I want to "uningest" a tabular data file so that files that should not have been ingested are saved appropriately #3766

Closed
landreev opened this issue Apr 11, 2017 · 10 comments


@landreev
Contributor

We occasionally get requests from users to revert their ingested tabular files to their original state (example: RT 247789 - https://help.hmdc.harvard.edu/Ticket/Display.html?id=247789). Some data were never meant to be tabular; this is particularly common with Excel spreadsheets. For example, authors may use a spreadsheet to list their bibliographical references; having something like that automatically converted to tabular format, with an invitation to "explore" it in TwoRavens, is not what they want.

(There is an open ticket #2199, to allow users to skip tabular ingest on a new file; this issue deals with a file that's already ingested).

The process is fairly straightforward. These uningest requests are currently handled by running a command line script. Several things need to happen:

  • delete the datatable object and its child objects - datavariables, summarystatistics, etc.;
  • recalculate the version UNF;
  • replace the tabular file with the saved original;
  • remove any derivative files;
  • restore the original size, mime type and file name.

All these steps simply need to be reimplemented inside the app.

Some thought will need to go into how to add this option to the UI. (Is it an extra button shown to users with the edit permission on the dataset? Should we use the existing checkboxes, with an extra option in the pulldown menu under "Edit Files"?)

@djbrooke djbrooke changed the title Allow dataset owners to "uningest" tabular data files Allow dataset owners to "uningest" tabular data files Apr 12, 2017
@djbrooke djbrooke changed the title Allow dataset owners to "uningest" tabular data files API: Allow "uningest" of tabular data files Jan 25, 2018
@djbrooke
Contributor

For now, switching this up to cover a new administrative/curation API endpoint instead of a user-facing feature.
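
For reference, an endpoint like that would presumably be invoked along these lines. This is a hypothetical sketch: the exact path and the superuser API-token requirement are assumptions, not the final design.

# hypothetical invocation of the new uningest endpoint; the path and
# the superuser-token requirement are assumptions, not the final design
export API_TOKEN='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'  # a superuser's token
export FILE_ID=12345                                     # datafile database id

curl -H "X-Dataverse-key: $API_TOKEN" -X POST \
  "http://localhost:8080/api/files/$FILE_ID/uningest"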

@djbrooke djbrooke changed the title API: Allow "uningest" of tabular data files API: As a dataverse administrator, I want to "uningest" a tabular data file so that codebooks can be saved appropriately Jan 31, 2018
@djbrooke djbrooke changed the title API: As a dataverse administrator, I want to "uningest" a tabular data file so that codebooks can be saved appropriately API: As a data owner, I want to "uningest" a tabular data file so that codebooks can be saved appropriately Jan 31, 2018
@djbrooke djbrooke changed the title API: As a data owner, I want to "uningest" a tabular data file so that codebooks can be saved appropriately API: As a data owner, I want to "uningest" a tabular data file so that files that should not have been ingested are saved appropriately Jan 31, 2018
@djbrooke
Contributor

djbrooke commented Feb 2, 2018

@landreev - in backlog grooming this week, you mentioned you'd share a script in this issue.

@djbrooke
Contributor

djbrooke commented Apr 4, 2018

@landreev - we're taking this on this sprint, can you drop in that script? Thanks!

@landreev
Contributor Author

Missed the message above - sorry.
Here's the script we are currently using in production to uningest a tabular file; creating an API call to do this would involve implementing the same 9 or 10 steps from the script (which, in turn, simply reverses what happens during a tabular ingest).

The script takes the (database) id of the datafile.
Note that it calls 2 other scripts that live on the prod. system (in /usr/local/dataverse-admin/bin):
runsql - runs an arbitrary SQL query in postgres; if the query produces output, it prints it out tab-delimited;
lookup_file_by_id.sh - does what the name suggests; for our purposes, it translates the database id of the datafile into the S3 storage identifier - for example, "s3://dataverse-bucket/10.7910/DVN/PQ7IUE/15ea0342996-41b3ee674df0" - that the AWS command line utility understands.
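
(Neither helper is part of Dataverse itself. For context, runsql is essentially a thin wrapper around psql; a minimal sketch of what it might look like - the database name, the user, and the handling of the -q flag here are assumptions:)

#!/bin/sh
# minimal sketch of a runsql-style wrapper (an approximation, not the
# actual production script): runs the given SQL against the dataverse
# database; any query output comes back tab-delimited, with no headers
[ "$1" = "-q" ] && shift    # the production script is invoked with -q

psql -U dvnapp -d dvndb -qtA -F "$(printf '\t')" -c "$1"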

The script:

#!/bin/sh

# takes the datafile db id as the argument
# (see 247789-uningest-tabular/README.txt)

fileid=$1

if [ $fileid"x" = "x" ]
then
    echo "usge: ./uningest.sh <DATAFILE ID>"
    exit 1
fi

PATH=/usr/local/dataverse-admin/bin:$PATH; export PATH

filename=`echo $fileid | lookup_file_by_id.sh | grep '^FILE NAME' | awk '{print $4}'`

datatableid=`runsql -q 'SELECT id FROM datatable WHERE datafile_id='$fileid`

originaltype=`runsql -q 'SELECT originalfileformat FROM datatable where datafile_id = '$fileid`

echo "ORIGINAL TYPE: " $originaltype

# delete the tabular data associated with the file:

runsql -q "DELETE FROM summarystatistic WHERE datavariable_id in (SELECT id FROM datavariable WHERE datatable_id=$datatableid)"

runsql -q "DELETE FROM datavariable WHERE datatable_id=$datatableid"
    
runsql -q "DELETE FROM datatable WHERE id=$datatableid"

# and reset the ingest stats: 

runsql -q 'DELETE FROM ingestreport WHERE datafile_id = '$fileid
runsql -q "UPDATE datafile SET ingeststatus='A' where id=$fileid"

# reset the size (from the size of the generated tabular file, back to the size of the original)

originalsize=`aws s3 ls ${filename}.orig | awk '{print $3}'`
echo "ORIGINAL SIZE: " $originalsize

runsql -q "UPDATE datafile SET filesize=$originalsize WHERE id=$fileid"

# replace the generated tabular file with the preserved original: 

echo aws s3 mv $filename.orig $filename
aws s3 mv $filename.orig $filename

# fix the mime type: 

runsql -q "UPDATE datafile SET contenttype='$originaltype' WHERE id=$fileid"

case $originaltype in
application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
	originalextension="xlsx"
	;;
application/stata*)
	originalextension="dta"
	;;
application/x-stata*)
	originalextension="dta"
	;;
application/x-spss-sav)
	originalextension="sav"
	;;
application/x-spss-por)
	originalextension="por"
	;;
application/x-rlang-transport)
	originalextension="RData"
	;;
text/csv)
	originalextension="csv"
	;;
*)
	originalextension="unknown"
	;;
esac

echo "restoring the ORIGINAL EXTENSION: "$originalextension

# Change the filename extension from ".tab" back to what it was before the file was ingested:

runsql -q "SELECT id,label FROM filemetadata WHERE datafile_id=$fileid" | sed 's/\.tab//' | while read fmid label
do
    echo "filemetadata id: "$fmid", changing label to "$label"."$originalextension
    runsql -q "UPDATE filemetadata SET label='${label}.${originalextension}' WHERE id=$fmid"
done


# finally, fix/recalculate the UNFs of the versions that have this file:

runsql -q "SELECT datasetversion_id FROM filemetadata WHERE datafile_id=$fileid" | while read versionid
do
    runsql -q 'UPDATE datasetversion SET unf=null WHERE id='$versionid
    curl -X POST http://localhost:8080/api/admin/datasets/integrity/$versionid/fixmissingunf?forceRecalculate=true
done
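
For example, to uningest the datafile with the database id 12345 (a made-up id):

./uningest.sh 12345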

@landreev
Contributor Author

landreev commented Apr 10, 2018

To clarify: whoever ends up working on this, you don't need to literally reimplement this script in Java 1:1! It's provided for reference, to list everything that needs to be done. But it is an admin script written specifically for our prod. system. It assumes that files live on S3 - the app code should make no such assumption; simply use the StorageIO framework to replace the tabular file with the stored original. Likewise, you don't need to write code that generates the file extension from the stored original mime type - there is already a method in FileUtil that does that. Etc.

@landreev landreev removed their assignment May 8, 2018
@kcondon kcondon self-assigned this May 8, 2018
@kcondon
Contributor

kcondon commented May 8, 2018

Issues/questions so far:

  • API endpoint needs to be documented.
  • What should happen when an ingested file is tabular but also geospatial? Should the geospatial aspects be deleted? The Dataverse side is deleted, but the WorldMap side remains.
  • Extra ingest-related files remain: .RData and .prep.
  • Export files generated from the ingested version remain, so they still contain variable info.
  • Tabular tags remain after uningest (a quick check is sketched below).
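
A couple of these leftovers are easy to check for. A sketch, reusing the $fileid and $filename variables from the uningest script above, and assuming that tabular tags live in a datafiletag table keyed on datafile_id, and that derivative files follow the same <storage-identifier>.<suffix> naming pattern as the saved original:

# check for leftover tabular tags (the table/column names are
# assumptions based on the Dataverse schema, not confirmed here)
runsql -q "SELECT id, type FROM datafiletag WHERE datafile_id=$fileid"

# check for leftover derivative files next to the main S3 object
aws s3 ls "${filename}.prep"
aws s3 ls "${filename}.RData"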

@landreev
Contributor Author

landreev commented May 9, 2018

@sekmiller Are you sure it was necessary to do that .merge() on the FileMetadata in order to get the new extension to stick?
It was working for me consistently without that merge. (We do a merge on the DatasetVersion that contains that FileMetadata two lines down...)
Well, I don't think it should hurt either... Looks OK otherwise.

@sekmiller
Contributor

It wasn't working at all for me without it.
