track images referenced inside of (iDigBio) Darwin Core archives #71

Closed
jhpoelen opened this issue Sep 2, 2020 · 8 comments
Labels: enhancement (New feature or request)

jhpoelen commented Sep 2, 2020

Darwin Core archives may contain URLs to images. These URLs reference images that are typically stored outside of the Darwin Core archive.

Currently, Preston does not track images referenced inside Darwin Core archives.

Suggest extending Preston to include image tracking.

Additionally, extend support to track iDigBio thumbnail and web-optimized images in addition to the raw image referenced in the indexed dataset (separate into a different issue if needed).

@jhpoelen jhpoelen changed the title track images referenced inside of (idigbio) Darwin Core archives track images referenced inside of (iDigBio) Darwin Core archives Sep 2, 2020

jhpoelen commented Sep 2, 2020

Note that
https://api.idigbio.org/v2/media/ab0e3f3d-b758-418b-8cf1-ed85d893fd65?size=webview

redirects to

https://s.idigbio.org/idigbio-images-prod-webview/400c92a515dbeefb3eef8526dbcbb5e2.jpg

and both resolve to the same content, with hash://sha256/5b960b8282c39f835d65b8574ef1c02a0ed387802f57077250ce926678f11f3b .

Similarly,
https://api.idigbio.org/v2/media/ab0e3f3d-b758-418b-8cf1-ed85d893fd65?size=thumbnail

resolves to
https://s.idigbio.org/idigbio-images-prod-thumbnail/400c92a515dbeefb3eef8526dbcbb5e2.jpg

with hash://sha256/37b4504da1b471f35acd0b39a3e7bb5a3f711d60d2eec22220fffc79e0d69b15
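
The hash://sha256/... identifiers above are the SHA-256 digest of the image bytes, written as a URI. A minimal sketch for computing one by hand, assuming GNU coreutils' sha256sum is available; content_id is a hypothetical helper name, not a Preston command:

```shell
# Print a content id of the form hash://sha256/<hex> for the bytes read from stdin.
content_id() {
  # sha256sum prints "<hex>  -" for stdin; keep only the hex digest.
  printf 'hash://sha256/%s\n' "$(sha256sum | cut -d ' ' -f 1)"
}

# Offline demonstration with a fixed input.
printf 'hello' | content_id

# With network access, the same helper could be fed by curl, for example:
#   curl -sL "https://api.idigbio.org/v2/media/ab0e3f3d-b758-418b-8cf1-ed85d893fd65?size=thumbnail" | content_id
```

Because the digest depends only on the bytes, the API URL and the URL it redirects to should yield the same identifier as long as they serve the same file.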

jhpoelen commented Sep 2, 2020

Also,

https://api.idigbio.org/v2/media/ab0e3f3d-b758-418b-8cf1-ed85d893fd65?size=fullsize

resolves to

https://s.idigbio.org/idigbio-images-prod-fullsize/400c92a515dbeefb3eef8526dbcbb5e2.jpg

with hash://sha256/e35df5a12bc6b4f977b815c9ec5d35dedbab1ba01b6aaef23db10ecc4ca4d7c8

And, the raw original image at

http://data.huh.harvard.edu/23097057-6561-4dbe-81ec-1be2133f1b7d/image

has the same hash as the fullsize image served by iDigBio:

hash://sha256/e35df5a12bc6b4f977b815c9ec5d35dedbab1ba01b6aaef23db10ecc4ca4d7c8

With byte count:

37802973
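
The three iDigBio API URLs quoted in this thread differ only in the size query parameter. A minimal sketch of that pattern, assuming thumbnail, webview, and fullsize are the only values of interest (they are the only ones shown here); idigbio_media_url is a hypothetical helper name:

```shell
# Build the iDigBio media API URL for a media record uuid ($1) and a size ($2).
idigbio_media_url() {
  case "$2" in
    thumbnail|webview|fullsize) ;;
    *) echo "unsupported size: $2" >&2; return 1 ;;
  esac
  printf 'https://api.idigbio.org/v2/media/%s?size=%s\n' "$1" "$2"
}

# List the three derivative URLs for the media record discussed in this thread.
for size in thumbnail webview fullsize; do
  idigbio_media_url "ab0e3f3d-b758-418b-8cf1-ed85d893fd65" "$size"
done
```

Each printed URL could then be passed to preston track, or fetched with curl -L to follow the redirect.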

jhpoelen commented Sep 2, 2020

Related API call:

https://api.idigbio.org/v2/view/records/c98df68c-a32b-4c8e-b9e8-e57d20e67dea

{
  "data": {
    "dcterms:accessRights": "https://huh.harvard.edu/access-digital-reproductions-works-public-domain",
    "dcterms:language": "en",
    "dcterms:license": "https://huh.harvard.edu/pages/use",
    "dcterms:modified": "2020-02-11 16:50:48.0",
    "dcterms:references": "http://data.huh.harvard.edu/23097057-6561-4dbe-81ec-1be2133f1b7d/image",
    "dcterms:rightsHolder": "President and Fellows of Harvard College",
    "dcterms:type": "http://purl.org/dc/dcmitype/PhysicalObject",
    "dwc:Identification": [
      {
        "coreid": "23097057-6561-4dbe-81ec-1be2133f1b7d",
        "dwc:family": "Orchidaceae",
        "dwc:genus": "Aa",
        "dwc:scientificName": "Aa fiebrigii (Schlechter) Schlechter",
        "dwc:scientificNameAuthorship": "(Schlechter) Schlechter",
        "dwc:specificEpithet": "fiebrigii",
        "dwc:taxonRank": "Species"
      }
    ],
    "dwc:ResourceRelationship": [
      {
        "coreid": "23097057-6561-4dbe-81ec-1be2133f1b7d",
        "dwc:relatedResourceID": "http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/23097057-6561-4dbe-81ec-1be2133f1b7d",
        "dwc:relationshipOfResource": "sameAs"
      }
    ],
    "dwc:associatedMedia": "[see Simple Media extension]",
    "dwc:basisOfRecord": "PreservedSpecimen",
    "dwc:catalogNumber": "barcode-02162424",
    "dwc:collectionCode": "AMES",
    "dwc:collectionID": "urn:lsid:biocol.org:col:15408",
    "dwc:continent": "South America",
    "dwc:country": "Argentina",
    "dwc:countryCode": "AR",
    "dwc:datasetName": "Harvard University Herbaria: All Records",
    "dwc:disposition": "in collection",
    "dwc:dynamicProperties": "{\"huh_taxonomic_group\": \"Vascular\", \"huh_project_id\": 9, \"huh_project_name\": \"Plants on Edge/Endless Forms\"}",
    "dwc:eventDate": "1927-01",
    "dwc:family": "Orchidaceae",
    "dwc:fieldNumber": "4782",
    "dwc:genus": "Aa",
    "dwc:higherGeography": "South America;Argentina;Tucumán;",
    "dwc:institutionCode": "AMES",
    "dwc:institutionID": "urn:lsid:biocol.org:col:15408",
    "dwc:locality": "Dept. Chichigasti, Santa Rosa.",
    "dwc:month": "1",
    "dwc:occurrenceID": "23097057-6561-4dbe-81ec-1be2133f1b7d",
    "dwc:occurrenceStatus": "present",
    "dwc:otherCatalogNumbers": "AMES-accession-39219",
    "dwc:preparations": "Sheet",
    "dwc:recordNumber": "4782",
    "dwc:recordedBy": "Sant. Venturi",
    "dwc:reproductiveCondition": "NotDetermined",
    "dwc:scientificName": "Aa fiebrigii (Schlechter) Schlechter",
    "dwc:scientificNameAuthorship": "(Schlechter) Schlechter",
    "dwc:sex": "undetermined",
    "dwc:specificEpithet": "fiebrigii",
    "dwc:stateProvince": "Tucumán",
    "dwc:verbatimElevation": "3600 m.",
    "dwc:verbatimLocality": "Dept. Chichigasti, Santa Rosa.",
    "dwc:year": "1927",
    "id": "23097057-6561-4dbe-81ec-1be2133f1b7d"
  },
  "etag": "e5133811b819f5c82fd3c66c027a97dfc00de4dc",
  "links": {
    "mediarecords": [
      "https://api.idigbio.org/v2/view/mediarecord/ab0e3f3d-b758-418b-8cf1-ed85d893fd65"
    ],
    "recordsets": [
      "https://api.idigbio.org/v2/view/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6"
    ]
  },
  "modified": "2020-08-23T07:11:28.837888",
  "recordIds": [
    "7450a9e3-ef95-4f9e-8260-09b498d2c5e6\\23097057-6561-4dbe-81ec-1be2133f1b7d",
    "7450a9e3-ef95-4f9e-8260-09b498d2c5e6\\http://purl.oclc.org/net/edu.harvard.huh/guid/uuid/23097057-6561-4dbe-81ec-1be2133f1b7d"
  ],
  "type": "records",
  "uuid": "c98df68c-a32b-4c8e-b9e8-e57d20e67dea",
  "version": 0
}

and https://api.idigbio.org/v2/view/mediarecords/ab0e3f3d-b758-418b-8cf1-ed85d893fd65

{
  "data": {
    "coreid": "23097057-6561-4dbe-81ec-1be2133f1b7d",
    "dcterms:created": "2020-01-14 00:00:00.0",
    "dcterms:format": "image/jpeg",
    "dcterms:identifier": "http://data.huh.harvard.edu/23097057-6561-4dbe-81ec-1be2133f1b7d/image",
    "dcterms:license": "https://huh.harvard.edu/access-digital-reproductions-works-public-domain",
    "dcterms:references": "http://data.huh.harvard.edu/23097057-6561-4dbe-81ec-1be2133f1b7d",
    "dcterms:rightsHolder": "President and Fellows of Harvard College",
    "dcterms:type": "StillImage"
  },
  "etag": "f381092ccb300cf531878cb39ed3be62f19de77f",
  "links": {
    "records": [
      "https://api.idigbio.org/v2/view/record/c98df68c-a32b-4c8e-b9e8-e57d20e67dea"
    ],
    "recordsets": [
      "https://api.idigbio.org/v2/view/recordsets/7450a9e3-ef95-4f9e-8260-09b498d2c5e6"
    ]
  },
  "modified": "2020-08-23T07:11:28.837888",
  "recordIds": [
    "7450a9e3-ef95-4f9e-8260-09b498d2c5e6\\media\\http://data.huh.harvard.edu/23097057-6561-4dbe-81ec-1be2133f1b7d/image"
  ],
  "type": "mediarecords",
  "uuid": "ab0e3f3d-b758-418b-8cf1-ed85d893fd65",
  "version": 0
}

jhpoelen commented Sep 2, 2020

Note that images are retrieved by the MD5 hash of the image:

$ curl -L "https://s.idigbio.org/idigbio-images-prod-fullsize/400c92a515dbeefb3eef8526dbcbb5e2.jpg" | md5sum
400c92a515dbeefb3eef8526dbcbb5e2

with URL pattern:

https://s.idigbio.org/idigbio-images-prod-fullsize/[md5 of image].jpg
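
The URL pattern above can be sketched as a small helper. This assumes GNU coreutils' md5sum, and it assumes the thumbnail and webview buckets follow the same naming as the fullsize one, which is inferred only from the URLs quoted in this thread. In those URLs the file name is identical for all three sizes and matches the MD5 of the fullsize bytes, so the input should be the fullsize image, not a derivative. idigbio_store_url is a hypothetical name:

```shell
# Read the fullsize image bytes from stdin and print the expected storage URL
# for the requested size ($1: thumbnail, webview, or fullsize).
idigbio_store_url() {
  # md5sum prints "<hex>  -" for stdin; keep only the hex digest.
  md5="$(md5sum | cut -d ' ' -f 1)"
  printf 'https://s.idigbio.org/idigbio-images-prod-%s/%s.jpg\n' "$1" "$md5"
}

# Offline demonstration with a fixed input instead of a real image.
printf 'hello' | idigbio_store_url fullsize
```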

@jhpoelen
Member Author

Note that, for some reason, the iDigBio API responds with a 502 server error when serving beyond 100k items. See known issue iDigBio/idigbio-search-api#32 .

@jhpoelen
Member Author

Using the newly introduced preston dwc-stream (#148), you can now track images inside Darwin Core archives, as demonstrated in the UCSB-IZC example at #148 (comment).

Example for extracting image URLs for UC Santa Barbara's (@seltmann) invertebrate zoology collection:

preston track "https://serv.biokic.asu.edu/ecdysis/content/dwca/UCSB-IZC_DwC-A.zip"\
 | preston dwc-stream\
 | grep "http://rs.tdwg.org/ac/terms/Multimedia"\
 | jq --raw-output '.["http://rs.tdwg.org/ac/terms/goodQualityAccessURI"], .["http://rs.tdwg.org/ac/terms/accessURI"]'\
 | sort\
 | uniq\
 > ucsb-izc-image-urls.txt

with

$ cat ucsb-izc-image-urls.txt | wc -l
45530
$ head ucsb-izc-image-urls.txt
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000001.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000001.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000002.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000002.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000003.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000003.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000003_lg.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000004.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC00000004.jpg
https://serv.biokic.asu.edu/imglib/ecdysis/UCSB_IZC/UCSB-IZC00000/UCSB-IZC_00000004_lg.jpg
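
One caveat with the jq step above: when a Multimedia record lacks one of the two terms, jq --raw-output prints the literal word null for it, so the URL list may contain a null line. A minimal sketch of a filter that could sit between jq and sort; drop_missing is a hypothetical helper name, and whether this particular archive produces such lines has not been checked here:

```shell
# Remove lines that are empty or consist only of the word "null".
drop_missing() {
  grep -Ev '^(null)?$'
}

# Offline demonstration with placeholder URLs (example.org is not a real image host).
printf '%s\n' 'https://example.org/a.jpg' 'null' '' 'https://example.org/b.jpg' | drop_missing
```

An alternative would be to filter inside jq itself with select(. != null).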

ucsb-izc-image-urls.txt

Now, tracking all image URLs would be:

cat ucsb-izc-image-urls.txt\
 | xargs -L100 preston track
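
Before starting a long download, the batching done by xargs -L100 can be previewed without touching the network. A minimal sketch that substitutes echo for the real command and counts the URLs per invocation; preview_batches is a hypothetical helper name, and the input below is placeholder data, not the real URL list:

```shell
# Show how many URLs each "preston track" invocation would receive.
preview_batches() {
  # echo stands in for the real command; each output line is one invocation,
  # and its first two fields are the words "preston" and "track".
  xargs -L100 echo preston track \
    | awk '{ print "batch " NR ": " (NF - 2) " urls" }'
}

# 250 placeholder lines are split into batches of 100, 100, and 50.
seq 1 250 | preview_batches
```

Running it as cat ucsb-izc-image-urls.txt | preview_batches would show the batches for the actual list.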

So, putting it together, you'd be able to track the UCSB-IZC archive and its images using:

preston track "https://serv.biokic.asu.edu/ecdysis/content/dwca/UCSB-IZC_DwC-A.zip"\
 | preston dwc-stream\
 | grep "http://rs.tdwg.org/ac/terms/Multimedia"\
 | jq --raw-output '.["http://rs.tdwg.org/ac/terms/goodQualityAccessURI"], .["http://rs.tdwg.org/ac/terms/accessURI"]'\
 | sort\
 | uniq\
 | xargs -L100 preston track

Originally posted by @jhpoelen in #148 (comment)
