Merge pull request #11054 from QualitativeDataRepository/IQSS/10108-StataMimeTypeRefinementForDIrectUpload

IQSS/10108 Stata mimetype refinement for direct upload
ofahimIQSS authored Feb 3, 2025
2 parents 18a837d + df068fa commit 6be4f20
Showing 7 changed files with 362 additions and 86 deletions.
@@ -0,0 +1 @@
The version of Stata files is now detected during S3 direct upload (as it was for normal uploads), allowing ingest of Stata 14 and 15 files that have been uploaded directly. See [the guides](https://dataverse-guide--11054.org.readthedocs.build/en/11054/developers/big-data-support.html#features-that-are-disabled-if-s3-direct-upload-is-enabled), #10108, and #11054.
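For context, the refinement amounts to sniffing a handful of leading bytes. Below is a minimal, self-contained sketch of the idea (an illustration, not the committed implementation): the release-number mapping follows the published DTA specs (117 introduced by Stata 13, 118 by Stata 14, 119 by Stata 15), and the MIME strings mirror Dataverse's version-suffixed Stata types.

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class StataVersionSniff {

        private static final String MARKER = "<stata_dta><header><release>";
        // MARKER (28 bytes) plus a 3-digit release number is all we need to read
        private static final int HEADER_BYTES = MARKER.length() + 3;

        /** Returns a version-specific MIME type, or null if this is not a Stata 13+ file. */
        public static String sniff(Path dta) throws IOException {
            byte[] head = new byte[HEADER_BYTES];
            try (InputStream in = Files.newInputStream(dta)) {
                if (in.readNBytes(head, 0, HEADER_BYTES) < HEADER_BYTES) {
                    return null; // too short to carry the new-style XML header
                }
            }
            String header = new String(head, StandardCharsets.US_ASCII);
            if (!header.startsWith(MARKER)) {
                return null; // Stata 12 and earlier use a binary header instead
            }
            switch (header.substring(MARKER.length())) {
                case "117": return "application/x-stata-13";
                case "118": return "application/x-stata-14";
                case "119": return "application/x-stata-15";
                default:    return null;
            }
        }
    }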
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/api/native-api.rst
@@ -3872,7 +3872,7 @@ The fully expanded example above (without environment variables) looks like this
Currently the following methods are used to detect file types:
- The file type detected by the browser (or sent via API).
- Custom code that reads the first few bytes. As explained at :ref:`s3-direct-upload-features-disabled`, this method of file type detection is not utilized during direct upload to S3, since by nature of direct upload Dataverse never sees the contents of the file. However, this code is utilized when the "redetect" API is used.
- Custom code that reads the first few bytes. As explained at :ref:`s3-direct-upload-features-disabled`, most of these methods are not utilized during direct upload to S3, since by nature of direct upload Dataverse never sees the contents of the file. However, this code is utilized when the "redetect" API is used.
- JHOVE: https://jhove.openpreservation.org . Note that the same applies about direct upload to S3 and the "redetect" API.
- The file extension (e.g. ".ipybn") is used, defined in a file called ``MimeTypeDetectionByFileExtension.properties``.
- The file name (e.g. "Dockerfile") is used, defined in a file called ``MimeTypeDetectionByFileName.properties``.
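For reference, the "redetect" API mentioned above can be exercised directly. A minimal Java sketch (assumptions: the POST /api/files/{id}/redetect?dryRun=true endpoint documented under "Redetect File Type" in this guide, an API token in the API_TOKEN environment variable, and a placeholder server URL and file id):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RedetectSketch {
        public static void main(String[] args) throws Exception {
            String server = "https://demo.dataverse.org"; // assumption: your installation's URL
            String fileId = "24";                          // assumption: a file database id
            String apiToken = System.getenv("API_TOKEN");

            // dryRun=true reports the newly detected type without saving it
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(server + "/api/files/" + fileId + "/redetect?dryRun=true"))
                    .header("X-Dataverse-key", apiToken)
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + ": " + response.body());
        }
    }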
2 changes: 1 addition & 1 deletion doc/sphinx-guides/source/developers/big-data-support.rst
@@ -44,7 +44,7 @@ Features that are Disabled if S3 Direct Upload is Enabled
The following features are disabled when S3 direct upload is enabled.

- Unzipping of zip files. (See :ref:`compressed-files`.)
- Detection of file type based on JHOVE and custom code that reads the first few bytes. (See :ref:`redetect-file-type`.)
- Detection of file type based on JHOVE and custom code that reads the first few bytes, except for the refinement of Stata file types to include the version. (See :ref:`redetect-file-type`.)
- Extraction of metadata from FITS files. (See :ref:`fits`.)
- Creation of NcML auxiliary files (See :ref:`netcdf-and-hdf5`.)
- Extraction of a geospatial bounding box from NetCDF and HDF5 files (see :ref:`netcdf-and-hdf5`) unless :ref:`dataverse.netcdf.geo-extract-s3-direct-upload` is set to true.
@@ -53,7 +53,7 @@
import static edu.harvard.iq.dataverse.util.FileUtil.MIME_TYPE_UNDETERMINED_DEFAULT;
import static edu.harvard.iq.dataverse.util.FileUtil.createIngestFailureReport;
import static edu.harvard.iq.dataverse.util.FileUtil.determineFileType;
import static edu.harvard.iq.dataverse.util.FileUtil.determineFileTypeByNameAndExtension;
import static edu.harvard.iq.dataverse.util.FileUtil.determineRemoteFileType;
import static edu.harvard.iq.dataverse.util.FileUtil.getFilesTempDirectory;
import static edu.harvard.iq.dataverse.util.FileUtil.saveInputStreamInTempFile;
import static edu.harvard.iq.dataverse.util.FileUtil.useRecognizedType;
@@ -574,6 +574,8 @@ public CreateDataFileResult execute(CommandContext ctxt) throws CommandException
} else {
// Direct upload.

finalType = StringUtils.isBlank(suppliedContentType) ? FileUtil.MIME_TYPE_UNDETERMINED_DEFAULT : suppliedContentType;

// Since this is a direct upload, and therefore no temp file associated
// with it, we may, OR MAY NOT know the size of the file. If this is
// a direct upload via the UI, the page must have already looked up
@@ -593,18 +595,6 @@ public CreateDataFileResult execute(CommandContext ctxt) throws CommandException
}
}

// Default to suppliedContentType if set or the overall undetermined default if a contenttype isn't supplied
finalType = StringUtils.isBlank(suppliedContentType) ? FileUtil.MIME_TYPE_UNDETERMINED_DEFAULT : suppliedContentType;
String type = determineFileTypeByNameAndExtension(fileName);
if (!StringUtils.isBlank(type)) {
//Use rules for deciding when to trust browser supplied type
if (useRecognizedType(finalType, type)) {
finalType = type;
}
logger.fine("Supplied type: " + suppliedContentType + ", finalType: " + finalType);
}


}

// Finally, if none of the special cases above were applicable (or
@@ -635,6 +625,30 @@ public CreateDataFileResult execute(CommandContext ctxt) throws CommandException
DataFile datafile = FileUtil.createSingleDataFile(version, newFile, newStorageIdentifier, fileName, finalType, newCheckSumType, newCheckSum);

if (datafile != null) {
if (newStorageIdentifier != null) {
// Direct upload case
// Improve the MIMEType
// Need the owner for the StorageIO class to get the file/S3 path from the
// storageIdentifier
// Currently the owner is null at this point, but this flag avoids breaking
// this logic if that changes in the future
boolean ownerSet = datafile.getOwner() != null;
if (!ownerSet) {
datafile.setOwner(version.getDataset());
}
String type = determineRemoteFileType(datafile, fileName);
if (!StringUtils.isBlank(type)) {
// Use rules for deciding when to trust browser supplied type
if (useRecognizedType(finalType, type)) {
datafile.setContentType(type);
}
logger.fine("Supplied type: " + suppliedContentType + ", detected type: " + type);
}
// Restore the null owner so this block doesn't leave state changed
if (!ownerSet) {
datafile.setOwner(null);
}
}

if (warningMessage != null) {
createIngestFailureReport(datafile, warningMessage);
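The owner set/reset above is a save-and-restore idiom: StorageIO resolves the file's S3 path through datafile.getOwner(), which has not been assigned yet at this point in the command. A slightly more defensive variant of the same calls (a sketch, not the committed code) moves the restore into a finally block so that an exception during detection cannot leave the temporary owner in place:

    boolean ownerWasSet = datafile.getOwner() != null;
    if (!ownerWasSet) {
        datafile.setOwner(version.getDataset()); // StorageIO needs the owner to build the S3 path
    }
    try {
        String detected = determineRemoteFileType(datafile, fileName); // remote byte-range sniff
        if (!StringUtils.isBlank(detected) && useRecognizedType(finalType, detected)) {
            datafile.setContentType(detected);
        }
    } finally {
        if (!ownerWasSet) {
            datafile.setOwner(null); // restore: the real owner is assigned later
        }
    }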
@@ -143,13 +143,29 @@ public String[] getTestFormatSet() {
return this.testFormatSet;
}

/*ToDo
* Rather than making these tests just methods, perhaps they could be implemented as
* classes inheriting a common interface. In addition to the existing test*format methods,
* the interface could include a method indicating whether the test requires the whole
* file or, if not, how many bytes are needed. That would make it easier to decide
* whether to use a test on direct/remote uploads, where retrieving a big file may not
* be worth it, but retrieving the 42 bytes needed for a Stata check or the ~491 bytes
* needed for a POR check could be.
*
* Could also add a method to indicate which mimetypes the test can identify/refine, which
* might make it possible to replace FileUtil.useRecognizedType(String, String) at some point.
*
* It might also make sense to make this interface broader than just the current ingestable
* types, e.g. to support the NetCDF, graphML, and other checks in the same framework. (Some
* of these might only support using a file rather than a byte buffer, though.)
*/
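A hypothetical rendering of the interface floated in the ToDo above (all names are invented for illustration; none exist in the codebase):

    import java.nio.ByteBuffer;
    import java.util.Set;

    public interface TabularFormatTest {

        /** The MIME type detected (or refined) from the buffer, or null if no match. */
        String test(ByteBuffer buff);

        /**
         * Bytes needed from the start of the file, or -1 if the whole file is required.
         * Lets a caller decide whether a remote range-read is worthwhile (e.g. ~42 bytes
         * for the Stata check, ~491 bytes for the POR check).
         */
        int bytesNeeded();

        /**
         * MIME types this test can identify or refine; a possible path toward replacing
         * FileUtil.useRecognizedType(String, String).
         */
        Set<String> refinableMimeTypes();
    }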

// test methods start here ------------------------------------------------
/**
* test this byte buffer against SPSS-SAV spec
*
*
*/
public String testSAVformat(MappedByteBuffer buff) {
public String testSAVformat(ByteBuffer buff) {
String result = null;
buff.rewind();
boolean DEBUG = false;
@@ -192,7 +208,7 @@ public String testSAVformat(MappedByteBuffer buff) {
* test this byte buffer against STATA DTA spec
*
*/
public String testDTAformat(MappedByteBuffer buff) {
public String testDTAformat(ByteBuffer buff) {
String result = null;
buff.rewind();
boolean DEBUG = false;
@@ -311,7 +327,7 @@ public String testDTAformat(MappedByteBuffer buff) {
* test this byte buffer against SAS Transport(XPT) spec
*
*/
public String testXPTformat(MappedByteBuffer buff) {
public String testXPTformat(ByteBuffer buff) {
String result = null;
buff.rewind();
boolean DEBUG = false;
@@ -359,7 +375,7 @@ public String testXPTformat(MappedByteBuffer buff) {
* test this byte buffer against SPSS Portable (POR) spec
*
*/
public String testPORformat(MappedByteBuffer buff) {
public String testPORformat(ByteBuffer buff) {
String result = null;
buff.rewind();
boolean DEBUG = false;
@@ -525,7 +541,7 @@ public String testPORformat(MappedByteBuffer buff) {
* test this byte buffer against R data file
*
*/
public String testRDAformat(MappedByteBuffer buff) {
public String testRDAformat(ByteBuffer buff) {
String result = null;
buff.rewind();

@@ -607,11 +623,10 @@ public String testRDAformat(MappedByteBuffer buff) {

// public instance methods ------------------------------------------------
public String detectTabularDataFormat(File fh) {
boolean DEBUG = false;
String readableFormatType = null;

FileChannel srcChannel = null;
FileInputStream inp = null;

try {
// set-up a FileChannel instance for a given file object
inp = new FileInputStream(fh);
@@ -621,63 +636,7 @@ public String detectTabularDataFormat(File fh) {

// create a read-only MappedByteBuffer
MappedByteBuffer buff = srcChannel.map(FileChannel.MapMode.READ_ONLY, 0, buffer_size);

//this.printHexDump(buff, "hex dump of the byte-buffer");

buff.rewind();
dbgLog.fine("before the for loop");
for (String fmt : this.getTestFormatSet()) {

// get a test method
Method mthd = testMethods.get(fmt);
//dbgLog.info("mthd: " + mthd.getName());

try {
// invoke this method
Object retobj = mthd.invoke(this, buff);
String result = (String) retobj;

if (result != null) {
dbgLog.fine("result for (" + fmt + ")=" + result);
if (DEBUG) {
out.println("result for (" + fmt + ")=" + result);
}
if (readableFileTypes.contains(result)) {
readableFormatType = result;
}
dbgLog.fine("readableFormatType=" + readableFormatType);
} else {
dbgLog.fine("null was returned for " + fmt + " test");
if (DEBUG) {
out.println("null was returned for " + fmt + " test");
}
}
} catch (InvocationTargetException e) {
Throwable cause = e.getCause();
// added null check because of "homemade.zip" from https://redmine.hmdc.harvard.edu/issues/3273
if (cause.getMessage() != null) {
err.format(cause.getMessage());
e.printStackTrace();
} else {
dbgLog.info("cause.getMessage() was null for " + e);
e.printStackTrace();
}
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (BufferUnderflowException e){
dbgLog.info("BufferUnderflowException " + e);
e.printStackTrace();
}

if (readableFormatType != null) {
break;
}
}

// help garbage-collect the mapped buffer sooner, to avoid the jvm
// holding onto the underlying file unnecessarily:
buff = null;

return detectTabularDataFormat(buff);
} catch (FileNotFoundException fe) {
dbgLog.fine("exception detected: file was not found");
fe.printStackTrace();
@@ -688,8 +647,73 @@ public String detectTabularDataFormat(File fh) {
IOUtils.closeQuietly(srcChannel);
IOUtils.closeQuietly(inp);
}
return null;
}

public String detectTabularDataFormat(ByteBuffer buff) {
boolean DEBUG = false;
String readableFormatType = null;

// this.printHexDump(buff, "hex dump of the byte-buffer");

buff.rewind();
dbgLog.fine("before the for loop");
for (String fmt : this.getTestFormatSet()) {

// get a test method
Method mthd = testMethods.get(fmt);
// dbgLog.info("mthd: " + mthd.getName());

try {
// invoke this method
Object retobj = mthd.invoke(this, buff);
String result = (String) retobj;

if (result != null) {
dbgLog.fine("result for (" + fmt + ")=" + result);
if (DEBUG) {
out.println("result for (" + fmt + ")=" + result);
}
if (readableFileTypes.contains(result)) {
readableFormatType = result;
}
dbgLog.fine("readableFormatType=" + readableFormatType);
} else {
dbgLog.fine("null was returned for " + fmt + " test");
if (DEBUG) {
out.println("null was returned for " + fmt + " test");
}
}
} catch (InvocationTargetException e) {
Throwable cause = e.getCause();
// added null check because of "homemade.zip" from
// https://redmine.hmdc.harvard.edu/issues/3273
if (cause.getMessage() != null) {
err.format(cause.getMessage());
e.printStackTrace();
} else {
dbgLog.info("cause.getMessage() was null for " + e);
e.printStackTrace();
}
} catch (IllegalAccessException e) {
e.printStackTrace();
} catch (BufferUnderflowException e) {
dbgLog.info("BufferUnderflowException " + e);
e.printStackTrace();
}

if (readableFormatType != null) {
break;
}
}

// help garbage-collect the mapped buffer sooner, to avoid the jvm
// holding onto the underlying file unnecessarily:
buff = null;

return readableFormatType;
}
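Splitting detection into a File overload and a ByteBuffer overload is what allows the direct-upload path to run the same tests on a small, remotely fetched prefix. A hypothetical caller (fetchFirstBytes is a stand-in for an S3 ranged GET; the no-argument IngestableDataChecker constructor is assumed):

    import java.nio.ByteBuffer;

    public class RemoteSniffSketch {
        /** Runs the buffer-based detection on a prefix fetched from remote storage. */
        public static String detect(byte[] filePrefix) {
            // a heap buffer is fine now that the tests accept any ByteBuffer
            ByteBuffer buff = ByteBuffer.wrap(filePrefix);
            return new IngestableDataChecker().detectTabularDataFormat(buff);
        }
    }

    // usage: byte[] head = fetchFirstBytes(storageLocation, 491); // ~491 bytes covers the POR check
    //        String mime = RemoteSniffSketch.detect(head);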


/**
* identify the first 5 bytes
@@ -737,7 +761,7 @@ private long getBufferSize(FileChannel fileChannel) {
return BUFFER_SIZE;
}

private int getGzipBufferSize(MappedByteBuffer buff) {
private int getGzipBufferSize(ByteBuffer buff) {
int GZIP_BUFFER_SIZE = 120;
/*
note: