Develop S3 storage driver #3921
Comments
The AWS import now points to a version of the SDK in which all of its dependencies are uniquely renamed, so as not to cause conflicts. Going forward we should look into updating Glassfish and using the normal SDK.
Allows you to upload a file and have it show up in an AWS S3 bucket. The code is far from complete.
Here are some notes on the state of the S3 code as I leave for vacation. The code is definitely far from complete, but I was able to port over some sample code from the S3 example and get it working in S3AccessIO.java. You should be able to upload a data file and have it show up in the S3 bucket (the naming is not correct yet, though). Right now the code deletes the bucket before every upload, which obviously needs to change. Note that the import in the pom is for aws-java-sdk-bom. This version takes all the AWS SDK dependencies and renames them so there are no package conflicts. We were having issues with Glassfish's version of Jackson conflicting with the minimum AWS requirements. There may be a version of this dependency that only pulls in the S3 code; I didn't get to check. I'm not certain what approach we should take with regard to folder structure. AWS has a limit of 100 buckets per account (you can apply for an increase, though), so we probably want only one bucket for the full Dataverse application. There are no true folders in AWS, but if you name things with a folder structure they show up as folders.
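To make the "one bucket, folder-like names" idea concrete, here is a minimal sketch of how an object key could be assembled. The class name `S3Keys`, the method `objectKey`, and the DOI-style prefix in the example are all hypothetical and chosen for illustration only; they are not taken from S3AccessIO.java, and the actual naming scheme is still undecided.

```java
import java.util.Objects;

/** Hypothetical helper: builds S3 object keys that read like folder paths. */
final class S3Keys {
    private S3Keys() {}

    /**
     * Joins a dataset prefix and a file name with '/'.
     * The layout (dataset prefix + file name) is an assumption for illustration.
     */
    static String objectKey(String datasetPrefix, String fileName) {
        Objects.requireNonNull(datasetPrefix, "datasetPrefix");
        Objects.requireNonNull(fileName, "fileName");
        // Strip leading/trailing slashes so the key never starts with "/" or contains "//" at the join.
        String prefix = datasetPrefix.replaceAll("^/+|/+$", "");
        String name = fileName.replaceAll("^/+", "");
        if (name.isEmpty()) {
            throw new IllegalArgumentException("fileName must not be empty");
        }
        return prefix.isEmpty() ? name : prefix + "/" + name;
    }

    public static void main(String[] args) {
        // Example prefix is made up; any per-dataset identifier would do.
        System.out.println(objectKey("10.5072/FK2/ABC123", "data.tab"));
    }
}
```

With a key like this, every file lives in the same bucket, so the per-account bucket limit is not a concern, while the slashes still let tools group files by dataset.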
@ferrys and I just discussed the code as of 39375c8. @ferrys is pretty sure that for Swift we should continue to store files in a single container rather than using Swift's folder structure features, which are complicated. I'm just concerned that someday we may have duplicate file names in different directories. We don't know a lot about how S3 works, but she said that Swift will also show files in folders if you name them with slashes.
I looked into both Swift and S3 in terms of folder structure, and it seems you can store multiple files with the same name in the same bucket/container as long as they are in different folders. So we shouldn't have a problem there. @pdurbin I think we could definitely research altering the Swift implementation as well, but I don't know much about how the Swift API deals with folders within containers, so I would say that is out of scope.
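The reason this works can be sketched with a small in-memory stand-in. The `FlatBucket` class below is hypothetical and does not call S3 or Swift; it only models the assumption that a bucket/container is a flat map from full key to object, with "folders" derived from the part of the key before the first slash.

```java
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory stand-in for a bucket/container: a flat map from full key to content. */
final class FlatBucket {
    private final Map<String, String> objects = new TreeMap<>();

    void put(String key, String content) {
        // The full key is the identity, so the same file name under another prefix is a different object.
        objects.put(key, content);
    }

    String get(String key) {
        return objects.get(key);
    }

    int size() {
        return objects.size();
    }

    /** Emulates a delimiter-style listing: the distinct first-level "folders". */
    SortedSet<String> topLevelFolders() {
        SortedSet<String> folders = new TreeSet<>();
        for (String key : objects.keySet()) {
            int slash = key.indexOf('/');
            if (slash > 0) {
                folders.add(key.substring(0, slash + 1));
            }
        }
        return folders;
    }

    public static void main(String[] args) {
        FlatBucket bucket = new FlatBucket();
        bucket.put("dataset-1/data.csv", "first");
        bucket.put("dataset-2/data.csv", "second");
        System.out.println(bucket.size());            // both objects are kept
        System.out.println(bucket.topLevelFolders()); // the derived "folders"
    }
}
```

Two uploads named `data.csv` only collide if their full keys are identical, which is why a per-dataset prefix is enough to keep them apart.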
In order to configure the credentials for AWS, you need access to your AWS Access Key ID and your AWS Secret Access Key. Once you have them both, you should run |
- The dataset directory is still created locally when S3 is configured, though it is empty. This is probably a holdover from putting export files in both places. Otherwise ready to go.
Issues: |
…ter. When AWS S3 storage support was introduced back in 2017 in IQSS#3921, the team experienced problems with the Jackson library bundled with Glassfish (2.3 instead of the 2.6 minimum). By switching to the complete AWS SDK bundle, the Jackson library shipped inside that bundle was used and the problems were avoided. This led to a bigger WAR than necessary (~20 MB) and made a workaround necessary: removing some AWS-specific `javamail.providers` via WAR file manipulation to avoid email problems.
This commit:
* removes the WAR file hacking
* makes use of the S3 part of the SDK only, reducing the WAR size
* enables proper <dependencyManagement> for the sake of avoiding dependency convergence problems.
People unaware of direct and transitive dependencies and how to manage them are kindly requested to have a look at the Maven docs and tutorials:
* https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html
* https://www.davidjhay.com/maven-dependency-management
* https://maven.apache.org/enforcer/maven-enforcer-plugin/index.html
This came up during the meeting with the LTS, when the plans for moving Dataverse production to the cloud were discussed. S3 will have to be supported as the main production storage mechanism, since local disk space will no longer be available.
The existing swift driver can be used as a basic model.
#3919, just opened, is not a direct dependency, but for the purposes of this production move both issues will need to be addressed at the same time. For the same reason as above, we will not be able to continue storing dataset-level files such as cached exports locally, because local file system space will no longer be available.