[Proposal]: Dump ReferenceFileSystem spec for ZarrTiffStore that can be read natively as zarr #56

Closed
manzt opened this issue Jan 23, 2021 · 21 comments
Labels: enhancement (New feature or request)

manzt commented Jan 23, 2021

Thank you so much for your work on this project. I just came across the experimental aszarr and ZarrTiffStore and am so excited! I'd written some one-off stores wrapping tifffile to read different pyramidal images as zarr (for napari), but having this in tifffile is incredible!

I'm curious if you've seen the proposed JSON specification for describing a ReferenceFileSystem? Asking naively, and a bit selfishly, would it be possible to detect whether a ZarrTiffStore can be natively read by zarr and "export" one of these references?

I work on web-based tools for visualizing OME-TIFF / Zarr data, and it would be really useful to quickly create these references.

Here is an example viewing a multiscale tiff on the web using zarr.js, and this is the python script I wrote with the newest version of tifffile to generate the reference. I wonder if there is some way to generalize this script, but I don't have the familiarity with the underlying formats to know if this is a silly idea.

I notice that ZarrTiffStore handles all compression, so at a minimum you'd need to detect whether a chunk's compression is supported in zarr.

cgohlke commented Jan 24, 2021

I am aware of ReferenceFileSystem. I was holding back on implementing it because of the experimental status and the many "features" found in TIFF that can't be mapped to zarr (AFAIK):

  • incomplete chunks
  • compressors, e.g. LZW, JPEG, JPEG2000
  • "filters", e.g. floating point predictor, bitorder reversal, packed integers
  • dtypes, e.g. float24
  • multi-file
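
As an illustration of the compression hurdle, here is a minimal sketch of how such a compatibility check might look (the set of compression tags is illustrative, not exhaustive; the file name is a placeholder):

import tifffile

# Illustrative subset: TIFF compression tags assumed to have numcodecs
# counterparts (1 = uncompressed, 8 and 32946 = zlib/deflate).
ZARR_COMPATIBLE = {1, 8, 32946}

with tifffile.TiffFile('data.tiff') as tif:
    for page in tif.pages:
        if int(page.compression) not in ZARR_COMPATIBLE:
            raise ValueError(
                f'compression {page.compression!r} has no zarr codec')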


manzt commented Jan 24, 2021

Makes sense, thanks for the response. I think the multi-file issue could be accommodated by the spec, but I agree the other "features" are incompatible. It's unrealistic to try to map all tiffs to zarr, but it would be useful to translate the reference store for the subset that can be mapped, e.g.:

import json

from tifffile import imread
from lib_i_wish_existed import TiffStore2Zarr

with imread("data.tiff", aszarr=True) as store:
    converter = TiffStore2Zarr(store, tiff_url)
    ref = converter.translate()  # raises an exception if the tiff can't be mapped directly to zarr
    with open("data.tiff_offsets.json", "w") as f:
        json.dump(ref, f)

I'm just leaving the above for reference. Maybe there is a way that tifffile could consolidate certain "features" for the pages in a store to make this (in)compatibility with zarr easier to detect. Either way, the current zarr additions to tifffile have made it substantially easier to explore this idea, so thanks a lot!

cgohlke added the enhancement label on Jan 24, 2021
joshmoore commented

@cgohlke, if I may add: fsspec-reference-maker is definitely experimental, but having your input at this stage would be invaluable. From zarr-developers/zarr-python#556 (comment), if there are any spec changes that would help to support viable TIFF edge cases, it'd be good to capture them.

(And either way, I'm still excited by aszarr.)

martindurant commented Jan 27, 2021

Also, adding additional numcodecs would be generally useful and, as far as I understand, not hard where existing Python or C libs are available.
(EDIT: I mean codecs and transforms in the numcodecs library, on which zarr depends, if that wasn't obvious.)
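
For context, a minimal sketch of the numcodecs codec interface that zarr builds on, using the existing Zlib codec:

import numcodecs
import numpy as np

# numcodecs codecs expose a simple encode/decode pair that zarr uses
# for both compressors and filters.
codec = numcodecs.Zlib(level=1)
data = np.arange(256, dtype='uint8').tobytes()
assert bytes(codec.decode(codec.encode(data))) == data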

jakirkham commented

Just to add to Martin's comment, Numcodecs ships both conda and wheel binary packages, so hopefully this makes it a bit easier to use downstream without needing to worry about compiling. We are also looking into making Numcodecs a pure Python package now that Blosc has wheel packages (in addition to conda).

cgohlke commented Mar 13, 2021

I have started working on this issue but am having some trouble testing. The following code runs on my system without raising an exception, but the data returned by zarr seems random, and the web server logs do not show any access to the .tif file. The file is a zlib-compressed pyramidal OME-TIFF. I verified that the offsets and byte counts in the JSON file are correct. A manual range request using the requests library works. Any idea? Is there a way to test the ReferenceFileSystem on a local file system?

import zarr  # 2.6.1
import fsspec  # 0.8.7
import tifffile  # 2021.3.dev

localpath = ''
filename = 'test.ome.tif'
url = 'https://www.lfd.uci.edu/'

# create the reference file
with tifffile.imread(localpath + filename, aszarr=True) as store:
    with open(localpath + filename + '.json', 'w') as fh:
        store.write_fsspec(fh, url)

# open the reference file from web server
mapper = fsspec.get_mapper(
    'reference://',
    references=url + filename + '.json',
    target_protocol='https',
)

zgrp = zarr.open(mapper, mode='r')
print(zgrp[9].info)  # print info of last level
im = zgrp[9][:]  # <- random data

Output:

Name               : /9
Type               : zarr.core.Array
Data type          : uint8
Shape              : (142, 136, 3)
Chunk shape        : (256, 256, 3)
Order              : C
Read-only          : True
Compressor         : Zlib(level=1)
Store type         : fsspec.mapping.FSMap
No. bytes          : 57936 (56.6K)
Chunks initialized : 1/1
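
One plausible way to test against the local file system (an assumption on my part, mirroring the call above): write the reference file with a local directory as the target URL and use target_protocol='file'. File names and paths below are placeholders.

import fsspec
import tifffile
import zarr

# Write a reference file whose chunk targets are local paths.
with tifffile.imread('test.ome.tif', aszarr=True) as store:
    with open('test.ome.tif.json', 'w') as fh:
        store.write_fsspec(fh, '/data/')  # assumes the .tif lives in /data/

mapper = fsspec.get_mapper(
    'reference://',
    references='test.ome.tif.json',
    target_protocol='file',
)
zgrp = zarr.open(mapper, mode='r')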

martindurant commented

I have so far looked at one key:

"9/0.0": ["https://www.lfd.uci.edu/test.ome.tif", 6276016176, 33105]
(the one you used, the very last key!)

If I directly make an HTTPFileSystem and access this:

import fsspec
import numcodecs

h = fsspec.filesystem('https')
out = h.cat_file("https://www.lfd.uci.edu/test.ome.tif", 6276016176, 6276016176 + 33105)
bb = numcodecs.Zlib().decode(out)
assert len(bb) == 256 * 256 * 3
assert mapper["9/0.0"] == out  # mapper from the snippet above

(if it were random data, no way zlib would happen to give that number of bytes out)

I notice from the zarr info (taken from mapper["9/.zarray"]) that the chunk is larger than the whole array, which is suspicious. This is the last of the datasets, and the number of bytes appears to match the chunk size, which would be more pixels than are needed. This whole chunk appears to end only 665 bytes before the end of the whole file (I don't know if TIFF has some footer metadata).

Are other chunks coming through correctly? For the other blocks I am seeing a lot of zeros.

cgohlke commented Mar 13, 2021

(if it were random data, no way zlib would happen to give that number of bytes out)

Yes, by "random" I meant that I got different numbers every time.

the chunk is larger than the whole array, which is suspicious.

That should be OK according to the TIFF specification. The chunk data in the file should be complete.

I don't know if TIFF has some footer metadata.

The OME-XML is written at the end.

Are other chunks coming through correctly?

No, I tried two other levels.

As mentioned, the requests library works:

from matplotlib import pyplot
import numpy
import requests
import zlib

headers = {'Range': 'bytes=6276016176-6276049280'}
r = requests.get('https://www.lfd.uci.edu/test.ome.tif', headers=headers, stream=True)
data = b''.join(chunk for chunk in r.iter_content(1024))
d = zlib.decompress(data)
im = numpy.frombuffer(d, dtype='uint8').reshape(256, 256, 3)
pyplot.imshow(im)
pyplot.show()

cgohlke commented Mar 13, 2021

Apparently the key in the reference JSON file must be "9/0.0.0". Using "9/0.0", a KeyError is raised but ignored, and an empty, uninitialized array is returned.
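
For illustration, a small sketch of the zarr v2 chunk key convention: one index per array dimension, joined by dots, so a 3-d array uses keys like "0.0.0".

import zarr

# The last level has shape (142, 136, 3) with chunks (256, 256, 3),
# so its single chunk gets the three-dimensional key '0.0.0'.
z = zarr.zeros((142, 136, 3), chunks=(256, 256, 3), dtype='u1')
z[:] = 1
print(sorted(z.store))  # ['.zarray', '0.0.0']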

cgohlke commented Mar 13, 2021

It works now. I can visualize the multiscales zarr Group created from the fsspec ReferenceFileSystem using napari.

martindurant commented

Apparently the key in the reference JSON file must be "9/0.0.0". Using "9/0.0", a KeyError is raised but ignored, and an empty, uninitialized array is returned.

Is this correct, then? Indeed, there are three dimensions, even if the last dimension only ever has one chunk. zarr doesn't know about images having a colour dimension.

The KeyError would be the right thing to raise (the key is not found in the mapper), and zarr interprets this as "file missing, so use the default fill value", in this case zeros. Zarr allows you to have arrays where most or even none of the blocks have data on disk, so the logical size can be much bigger than the stored size.
The mapper allows you to specify which exceptions are translated into KeyError via the missing_exceptions argument, which by default includes FileNotFoundError.
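
A minimal sketch of that fill-value behaviour with an in-memory store:

import zarr

# Chunk '1' is never written; reading it yields fill_value instead of an error.
z = zarr.open({}, mode='w', shape=(4,), chunks=(2,), dtype='i4', fill_value=-1)
z[:2] = [1, 2]  # materializes chunk '0' only
print(z[:])     # [ 1  2 -1 -1]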

cgohlke commented Mar 13, 2021

Thank you! Makes sense. I'm setting the fill value to None/null. That's why the chunks were uninitialized.

cgohlke commented Mar 15, 2021

I ran into another issue while testing multi-series TIFF files. It seems it is not possible to use more than one FSMap instance (?). In the following example, the remote TIFF file is never accessed because of a silent RuntimeError: Timeout context manager should be used inside a task. The mapper created first works; the second mapper doesn't. What am I missing?

import fsspec
import zarr

# map series 1
mapper1 = fsspec.get_mapper(
    'reference://',
    references='http://localhost:8080/test_zarr_fsspec.ome.tif.s1.json',
    target_protocol='http',
)
# map series 2
mapper2 = fsspec.get_mapper(
    'reference://',
    references='http://localhost:8080/test_zarr_fsspec.ome.tif.s2.json',
    target_protocol='http',
)
za = zarr.open(mapper2, mode='r')
print(za.info)
print(za[:])    # <- zeroed data

Output

Type               : zarr.core.Array
Data type          : uint8
Shape              : (3, 219, 301)
Chunk shape        : (1, 219, 301)
Order              : C
Read-only          : True
Compressor         : None
Store type         : fsspec.mapping.FSMap
No. bytes          : 197757 (193.1K)
Chunks initialized : 3/3

[[[0 0 0 ... 0 0 0]
<snip>
  [0 0 0 ... 0 0 0]]]

test_zarr_fsspec.ome.tif.s1.json

{
  ".zattrs": "{}",
  ".zarray": "{\n \"chunks\": [\n  219,\n  301,\n  3\n ],\n \"compressor\": null,\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  219,\n  301,\n  3\n ],\n \"zarr_format\": 2\n}",
  "0.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 261136, 197757]
}

test_zarr_fsspec.ome.tif.s2.json

{
  ".zattrs": "{}",
  ".zarray": "{\n \"chunks\": [\n  1,\n  219,\n  301\n ],\n \"compressor\": null,\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  3,\n  219,\n  301\n ],\n \"zarr_format\": 2\n}",
  "0.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 459136, 65919],
  "1.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 525055, 65919],
  "2.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 590974, 65919]
}

Traceback from re-raising the RuntimeError in fsspec\mapping.py:

  File "test_issue56.py", line 16, in <module>
    print(za[:])
  File "X:\Python38\lib\site-packages\zarr\core.py", line 571, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "X:\Python38\lib\site-packages\zarr\core.py", line 696, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "X:\Python38\lib\site-packages\zarr\core.py", line 739, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "X:\Python38\lib\site-packages\zarr\core.py", line 1034, in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
  File "X:\Python38\lib\site-packages\zarr\core.py", line 1691, in _chunk_getitems
    cdatas = self.chunk_store.getitems(ckeys, on_error="omit")
  File "X:\Python38\lib\site-packages\fsspec\mapping.py", line 91, in getitems
    raise out['0.0.0']  # re-raise RuntimeError
  File "X:\Python38\lib\site-packages\fsspec\implementations\reference.py", line 90, in _cat_file
    return await self.fs._cat_file(url, start=start, end=end)
  File "X:\Python38\lib\site-packages\fsspec\implementations\http.py", line 168, in _cat_file
    async with self.session.get(url, **kw) as r:
  File "X:\Python38\lib\site-packages\aiohttp\client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "X:\Python38\lib\site-packages\aiohttp\client.py", line 448, in _request
    with timer:
  File "X:\Python38\lib\site-packages\aiohttp\helpers.py", line 635, in __enter__
    raise RuntimeError(
RuntimeError: Timeout context manager should be used inside a task

martindurant commented

That exception is a new one for me, and doesn't make much sense to me...

I have been trying to simplify the async handling in fsspec; would you mind trying with the fsspec/filesystem_spec#572 version of fsspec (git+https://github.com/martindurant/filesystem_spec.git@ioloop_massage2)?

cgohlke commented Mar 16, 2021

fsspec/filesystem_spec#572 fixes the issue for me. The tests pass now. Thank you very much!

martindurant commented

PS: I don't know if you have been following fsspec/kerchunk#17, which establishes a more formal spec for the content of the references JSON file, with some features to make that file more compact. The ReferenceFileSystem implementation (PR) will be backwards compatible.

cgohlke commented Mar 16, 2021

Yes, I've seen version 1 of the specification. Using a template for the URL will make the file more compact. But for now I'm going to release tifffile with experimental version 0 support.
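
For reference, a sketch of what a version 1 reference file with a URL template might look like, reusing the offsets from the s1 example above (field names follow the draft spec and may change):

import json

refs_v1 = {
    'version': 1,
    'templates': {'u': 'http://localhost:8080/test_zarr_fsspec.ome.tif'},
    'refs': {
        '.zarray': '{...}',  # zarr metadata JSON, as in the version 0 files above
        '0.0.0': ['{{u}}', 261136, 197757],  # template expands to the full URL
    },
}
with open('test_zarr_fsspec.ome.tif.v1.json', 'w') as fh:
    json.dump(refs_v1, fh, indent=1)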

cgohlke commented Mar 16, 2021

Tifffile-2021.3.16 adds a store method (ZarrTiffStore.write_fsspec) and a script (tiff2fsspec) to write ReferenceFileSystem JSON files for TIFF files:

with tifffile.imread(tiff_filename, aszarr=True) as store:
    store.write_fsspec(tiff_filename + '.json', url)
$ python -m tifffile.tiff2fsspec --help
usage: tiff2fsspec [-h] [--out OUT] [--series SERIES] [--level LEVEL] [--key KEY] [--chunkmode CHUNKMODE] tifffile url

Write fsspec ReferenceFileSystem for TIFF file.

positional arguments:
  tifffile              path to the local TIFF input file
  url                   remote URL of TIFF file without file name

optional arguments:
  -h, --help            show this help message and exit
  --out OUT             path to the JSON output file
  --series SERIES       index of series in file
  --level LEVEL         index of level in series
  --key KEY             index of page in file or series
  --chunkmode CHUNKMODE
                        mode used for chunking {None, pages}
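
For example, a plausible invocation (file name and URL are placeholders):

$ python -m tifffile.tiff2fsspec test.ome.tif https://www.lfd.uci.edu/ --out test.ome.tif.json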

A ValueError is raised if the TIFF file uses a feature that is not supported by zarr or numcodecs, e.g.:

  1. PackBits, LZW, JPEG, or JPEG2000 compression
  2. any "filters", e.g. predictors, bitorder, packed integers
  3. float24 dtype
  4. JPEGTables
  5. incomplete chunks, e.g. if imagelength % rowsperstrip != 0
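
A minimal sketch of guarding against these cases when generating references (file name and URL are placeholders):

import tifffile

with tifffile.imread('data.tiff', aszarr=True) as store:
    try:
        store.write_fsspec('data.tiff.json', 'https://example.com/')
    except ValueError as exc:
        print(f'cannot map this TIFF to zarr: {exc}')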

The JSON files can get quite large. One of the local WSI test files contains over 23 million tiles and the JSON file is larger than 1.5 GB.

manzt commented Mar 17, 2021

@cgohlke Thanks for the release! I tried out the CLI for a couple of images and it worked well. One issue is that I don't think the endianness in the .zarray metadata reflects the endianness on disk. I have a big-endian uint16 multiscale OME-TIFF and had to manually swap the bytes returned from zarr.js.

Interactive notebook: https://observablehq.com/d/16524d8e7fd4f9ef


I have shared the reference in a gist.

I think this is likely due to ZarrStore using sys.byteorder:

tifffile/tifffile/tifffile.py

Lines 8155 to 8161 in b69ddd4

def _dtype(dtype):
    """Return dtype as string with native byte order."""
    if dtype.itemsize == 1:
        byteorder = '|'
    else:
        byteorder = {'big': '>', 'little': '<'}[sys.byteorder]
    return byteorder + dtype.str[1:]
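
One way the fix could look (illustrative only; the extra parameter is hypothetical): derive the byte order from the TIFF file itself instead of sys.byteorder.

def _dtype(dtype, byteorder):
    """Return dtype as string with the file's byte order ('<' or '>')."""
    if dtype.itemsize == 1:
        return '|' + dtype.str[1:]
    return byteorder + dtype.str[1:]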

martindurant commented

I would love to see this functionality in a blog article somewhere.

cgohlke commented Mar 17, 2021

One issue is that I don't think endianness in the .zarray metadata reflects the endianness on disk.

You are right. Fixed in v2021.3.17.
