Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

8524 adding mechanism for storing tab. files with variable headers #10282

Merged
merged 17 commits into from
Feb 7, 2024

Conversation

landreev
Copy link
Contributor

@landreev landreev commented Jan 30, 2024

What this PR does / why we need it:

See the opening comment in the issue for the back story and the rationale for implementing this feature.
Note that the original title of the issue was "Stop storing ingested tabular data files WITHOUT the variable name header". I changed it to the current "Add mechanism for storing ..." etc. This is not something we can arbitrarily stop doing, the way we rearrange database tables with a deployment of a new version of the application. For example, the IQSS production instance has more than a TB of ingested tabular files. Adding these headers to all of these files in S3 storage would be a time-consuming effort, putting it mildly. Even for a site with 1/10 of that amount and the files stored on the filesystem, this would be too serious a task to be made a required upgrade step.

So what we want is an option to have tab. files stored with that header, alongside with any legacy files stored without it.

This PR adds the following:

  • A new configuration setting that makes ingest store produced tab-delimited files with headers.
  • Changes all the relevant code in the subsystems that access saved tab-delimited data as to produce the identical end results in terms of the content, transparent to the end user. I.e., regardless of how the file is stored, the "old" or the "new" way, the user downloading a tab-delimited file should be seeing the same exact thing as before (with the header), the subsetting code that extracts the individual variable columns for calculating summary stats should produce identical vectors, etc.
  • The Access API can now take advantage of Direct Download for S3-stored tab-delimited files saved with this extra line up top - since it no longer has to generate it and attach it on the fly. (This is the major/most important perk of this feature!).

What's NOT part of this PR:

  • There was not enough time in this sprint to provide an API for converting existing files. This will need to be addressed separately (opening a new issue). I'm hoping it can still be added as part of 6.2. An existing file could also be converted outside of the application; but that would be a bit of an adventure, and should not be recommended to other instances. (But note that this is still a useful feature, even when it applies to new files only. For example, it may come handy for the MOC-POC effort, if we need to read any tabular files from storage directly).

It's important to note that for an instance deploying this new code, but choosing not to enable the new setting, nothing changes in the way the ingest and everything else functions on their system.

Which issue(s) this PR closes:

Closes #8524

Special notes for your reviewer:

You will notice that my implementation makes changes in all the individual ingest plugins for all the supported file formats. Intuitively, it would be less code to change/fewer places for things to go wrong, to add this header in one place, somewhere in IngestServiceBean, once the tab-delimited file and the list of variables have been produced by a specific plugin. It just felt wrong to implement it like that, because there is really no efficient way to "insert" an extra line at the beginning of a file. In practice, the only way to do this is to re-write the entire file on disk: first write the extra header line into a new empty file, then add the rest of the content from the source file. Which amounts to temporarily doubling the size of this temp. data. So, I chose to instead modify the plugins to add the headers right away, as the tab-delimited files are first written on disk. This does result in more places to potentially introduce an error. But, as you will see, the actual changes are really trivial in each plugin.

Suggestions on how to test this:

Ingest a few files, with and without the new setting enabled. Compare both the end result for the user; and the physical files as stored on the filesystem.
The integration test added in the PR is an example of a simple test scenario.

Does this PR introduce a user interface change? If mockups are available, please link/include them here:

Is there a release notes update needed for this change?:

Additional documentation:

@coveralls
Copy link

coveralls commented Jan 30, 2024

Coverage Status

coverage: 20.268% (+0.1%) from 20.143%
when pulling e0dc198 on 8524-store-tabular-files-with-varheaders
into a3edca4 on develop.

This comment has been minimized.

This comment has been minimized.

@landreev landreev changed the title 8524 adding mechanism for store tab. files with variable headers 8524 adding mechanism for storing tab. files with variable headers Jan 31, 2024

This comment has been minimized.

@landreev landreev marked this pull request as ready for review January 31, 2024 16:00
@landreev landreev added the Size: 30 A percentage of a sprint. 21 hours. (formerly size:33) label Jan 31, 2024

This comment has been minimized.

This comment has been minimized.

…o the list of settings documented in the config guide while I was at it. #8524
… etc. git history can be consulted if anyone is curious about what we used to do here. #8524

This comment has been minimized.

1 similar comment

This comment has been minimized.

This comment has been minimized.

Copy link

github-actions bot commented Feb 1, 2024

📦 Pushed preview images as

ghcr.io/gdcc/dataverse:8524-store-tabular-files-with-varheaders
ghcr.io/gdcc/configbaker:8524-store-tabular-files-with-varheaders

🚢 See on GHCR. Use by referencing with full name as printed above, mind the registry name.

@sekmiller sekmiller assigned sekmiller and unassigned sekmiller Feb 1, 2024
@sekmiller sekmiller removed their assignment Feb 6, 2024
@stevenwinship stevenwinship self-assigned this Feb 7, 2024
@stevenwinship stevenwinship merged commit bec3945 into develop Feb 7, 2024
20 checks passed
@stevenwinship stevenwinship deleted the 8524-store-tabular-files-with-varheaders branch February 7, 2024 16:51
@pdurbin pdurbin added this to the 6.2 milestone Feb 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Size: 30 A percentage of a sprint. 21 hours. (formerly size:33)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add mechanism for storing ingested tabular data files WITHOUT the variable name header
5 participants