Add yellow-fever to manifest #1083

genehack · 2024-12-05T19:25:32Z

Also bumps resource index revision number in testing and production configs.

Note: needs to merge semi-concurrently with the work for nextstrain/yellow-fever#25.

joverlee521

The changes look good to me, but I was extremely confused why yellow-fever was not showing up at https://nextstrain-s-add-yellow-zarguy.herokuapp.com/pathogens

Turns out the current yellow-fever dataset on S3 does not have a version id

$ aws s3api list-object-versions --bucket nextstrain-data --prefix yellow-fever_meta.json
{
    "Versions": [
        {
            "ETag": "\"1390b7046efbd1f0150bbf2656206c0c\"",
            "Size": 758,
            "StorageClass": "STANDARD",
            "Key": "yellow-fever_meta.json",
            "VersionId": "null",
            "IsLatest": true,
            "LastModified": "2017-05-03T21:30:48.000Z"
        }
    ],
    "RequestCharged": null
}

So it gets filtered out during the resource index generation. Is there a way to add a version id to an S3 object or do we have to download and re-upload it?

jameshadfield · 2024-12-06T01:57:21Z

Turns out the current yellow-fever dataset on S3 does not have a version id

(From memory) that's because it predates us using versioning in the bucket. I'd suggest uploading a current YF build (that one's from 2017) to the bucket.

joverlee521 · 2024-12-06T17:56:51Z

Turns out the current yellow-fever dataset on S3 does not have a version id

(From memory) that's because it predates us using versioning in the bucket. I'd suggest uploading a current YF build (that one's from 2017) to the bucket.

Ah that makes sense! If we want to retain the version history on nextstrain.org/pathogens that links back to the 2017 build, it would still need the version id though.

data/manifest_core.json

genehack · 2024-12-06T19:06:44Z

Turns out the current yellow-fever dataset on S3 does not have a version id

(From memory) that's because it predates us using versioning in the bucket. I'd suggest uploading a current YF build (that one's from 2017) to the bucket.

Ah that makes sense! If we want to retain the version history on nextstrain.org/pathogens that links back to the 2017 build, it would still need the version id though.

As this Stack Overflow post points out, the object does have a version id: it's the literal string "null" (and a pox upon the AWS engineer who decided to do this) — so, depending on how our code expects that version id to work, things might work out okay?

Also bumps resource index revision number in testing and production configs, and updates an old comment that was not accurate now that yellow-fever is in the manifest.

jameshadfield · 2024-12-06T20:09:02Z

Ah that makes sense! If we want to retain the version history on nextstrain.org/pathogens that links back to the 2017 build, it would still need the version id though

Only Amazon S3 generates version IDs, and they cannot be edited, and similarly for last modification date. So we can re-upload, and it'll get a (new) version ID, but then it'll look like a 2024 dataset to the resource indexer. If you really want to show this as from 2017 you could extend the indexer to allow a "special cases" mapping of versionId to date.

joverlee521 · 2024-12-06T21:13:41Z

As this Stack Overflow post points out, the object does have a version id: it's the literal string "null" (and a pox upon the AWS engineer who decided to do this) — so, depending on how our code expects that version id to work, things might work out okay?

Ah yes, the distinction between null and "null" 🤦‍♀️

I forgot the resource indexer doesn't check the actual S3 objects. It pulls metadata from the S3 inventory, which reports the "null" version id as an empty string "" in the third column of the CSV:

"nextstrain-data","yellow-fever_meta.json","","true","false","2017-05-03T21:30:48.000Z","1390b7046efbd1f0150bbf2656206c0c"

The empty string matches the falsy check and gets filtered out of the resource index.

So we can re-upload, and it'll get a (new) version ID, but then it'll look like a 2024 dataset to the resource indexer. If you really want to show this as from 2017 you could extend the indexer to allow a "special cases" mapping of versionId to date.

@genehack how important is it to show the dataset is from 2017 in the snapshots chart?

genehack · 2024-12-06T22:53:34Z

Ah yes, the distinction between null and "null" 🤦‍♀️

You ~~love~~ hate to see it, right?

So we can re-upload, and it'll get a (new) version ID, but then it'll look like a 2024 dataset to the resource indexer. If you really want to show this as from 2017 you could extend the indexer to allow a "special cases" mapping of versionId to date.

@genehack how important is it to show the dataset is from 2017 in the snapshots chart?

I mean, it would certainly be nice to do so — I wonder if the user metadata thing mentioned in that SO post would be a way to get a version ID without changing the last modified date…

I may experiment with that over the weekend…

genehack · 2024-12-09T23:20:34Z

@genehack how important is it to show the dataset is from 2017 in the snapshots chart?

I mean, it would certainly be nice to do so — I wonder if the user metadata thing mentioned in that SO post would be a way to get a version ID without changing the last modified date…

I may experiment with that over the weekend…

So, I have mucked around with this in a test bucket, and done some more web searching, and TL;DR I don't see any way to get a version ID on that that S3 object that doesn't also change the last-modified date. I will chat with @trvrb to confirm this, but I suspect if the two options are "display the old dataset but with a modern date" and "don't display the old dataset at all", the latter is better.

I wonder if the resource indexer could be updated so that it didn't turn the "null" string into the empty string, and instead passed on it on through into the CSV? @joverlee521, thoughts?

joverlee521 · 2024-12-09T23:39:40Z

I wonder if the resource indexer could be updated so that it didn't turn the "null" string into the empty string, and instead passed on it on through into the CSV?

The CSV is generated by AWS S3 inventory so we don't have control over the empty string version id. I found some docs on converting the empty strings to null strings, but it seems like a manual operation not a config setting.

genehack · 2024-12-10T00:05:50Z

The CSV is generated by AWS S3 inventory so we don't have control over the empty string version id. I found some docs on converting the empty strings to null strings, but it seems like a manual operation not a config setting.

womp womp

tsibley · 2024-12-11T16:54:59Z

If we want to preserve that original yellow-fever version, we can make our indexer not filter out objects with VersionId == "" in the S3 inventory.

genehack · 2024-12-11T23:50:26Z

If we want to preserve that original yellow-fever version, we can make our indexer not filter out objects with VersionId == "" in the S3 inventory.

…which will also surface all the other things that are currently dropped due to being old/unversioned. That feels like the pull tab on a can of worms?

genehack · 2024-12-16T23:14:52Z

After reflection and much discussion, the consensus (which I agree with) was that surfacing that old data was not going to be worth the trouble. Going ahead and merging this; will enable yellow-fever automation and fire off a build next.

genehack requested a review from joverlee521 December 5, 2024 19:25

nextstrain-bot temporarily deployed to nextstrain-s-add-yellow-zarguy December 5, 2024 19:25 Inactive

genehack mentioned this pull request Dec 5, 2024

Final finalization nextstrain/yellow-fever#25

Closed

6 tasks

joverlee521 reviewed Dec 6, 2024

View reviewed changes

victorlin reviewed Dec 6, 2024

View reviewed changes

data/manifest_core.json Show resolved Hide resolved

data/manifest_core.json Show resolved Hide resolved

Add yellow-fever to manifest

86a6850

Also bumps resource index revision number in testing and production configs, and updates an old comment that was not accurate now that yellow-fever is in the manifest.

genehack force-pushed the add-yellow-fever-to-manifest branch from 26b9565 to 86a6850 Compare December 6, 2024 19:13

genehack temporarily deployed to nextstrain-s-add-yellow-zarguy December 6, 2024 19:13 Inactive

victorlin approved these changes Dec 6, 2024

View reviewed changes

genehack merged commit a0cde63 into master Dec 16, 2024
8 checks passed

genehack deleted the add-yellow-fever-to-manifest branch December 16, 2024 23:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add yellow-fever to manifest #1083

Add yellow-fever to manifest #1083

genehack commented Dec 5, 2024

joverlee521 left a comment

jameshadfield commented Dec 6, 2024

joverlee521 commented Dec 6, 2024

genehack commented Dec 6, 2024

jameshadfield commented Dec 6, 2024

joverlee521 commented Dec 6, 2024

genehack commented Dec 6, 2024

genehack commented Dec 9, 2024

joverlee521 commented Dec 9, 2024

genehack commented Dec 10, 2024

tsibley commented Dec 11, 2024

genehack commented Dec 11, 2024

genehack commented Dec 16, 2024

Add yellow-fever to manifest #1083

Add yellow-fever to manifest #1083

Conversation

genehack commented Dec 5, 2024

joverlee521 left a comment

Choose a reason for hiding this comment

jameshadfield commented Dec 6, 2024

joverlee521 commented Dec 6, 2024

genehack commented Dec 6, 2024

jameshadfield commented Dec 6, 2024

joverlee521 commented Dec 6, 2024

genehack commented Dec 6, 2024

genehack commented Dec 9, 2024

joverlee521 commented Dec 9, 2024

genehack commented Dec 10, 2024

tsibley commented Dec 11, 2024

genehack commented Dec 11, 2024

genehack commented Dec 16, 2024