[Proposal]: Dump ReferenceFileSystem spec for ZarrTiffStore that can be read natively as zarr #56

Closed
manzt opened this issue Jan 23, 2021 · 21 comments
Labels: enhancement (New feature or request)

manzt commented Jan 23, 2021

Thank you so much for your work on this project. I just came across the experimental aszarr and ZarrTiffStore and am so excited! I'd written some one-off stores wrapping tifffile to read different pyramidal images as zarr (for napari), but having this in tifffile is incredible!

I'm curious if you've seen the proposed JSON specification for describing a ReferenceFileSystem? Asking naively, and a bit selfishly, would it be possible to detect whether a ZarrTiffStore can be natively read by zarr and "export" one of these references?

I work on web-based tools for visualizing OME-TIFF / Zarr data, and it would be really useful to quickly create these references.

Here is an example viewing a multiscale tiff on the web using zarr.js, and this is the python script I wrote with the newest version of tifffile to generate the reference. I wonder if there is some way to generalize this script, but I don't have the familiarity with the underlying formats to know if this is a silly idea.

I notice that ZarrTiffStore handles all compression, so at a minimum you'd need to detect whether a chunk's compression is supported in zarr.

cgohlke commented Jan 24, 2021

I am aware of ReferenceFileSystem. I was holding back on implementing it because of the experimental status and the many "features" found in TIFF that can't be mapped to zarr (AFAIK):

  • incomplete chunks
  • compressors, e.g. LZW, JPEG, JPEG2000
  • "filters", e.g. floating point predictor, bitorder reversal, packed integers
  • dtypes, e.g. float24
  • multi-file
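
As an illustration of the compression hurdle, here is a minimal sketch of how such a compatibility check might look (the set of compression tags is illustrative, not exhaustive; the file name is a placeholder):

import tifffile

# Illustrative subset: TIFF compression tags assumed to have numcodecs
# counterparts (1 = uncompressed, 8 and 32946 = zlib/deflate).
ZARR_COMPATIBLE = {1, 8, 32946}

with tifffile.TiffFile('data.tiff') as tif:
    for page in tif.pages:
        if int(page.compression) not in ZARR_COMPATIBLE:
            raise ValueError(
                f'compression {page.compression!r} has no zarr codec')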


manzt commented Jan 24, 2021

Makes sense, thanks for the response. I think the multi-file issue could be accommodated by the spec, but I agree the other "features" are incompatible. It's unrealistic to try to map all tiffs to zarr, but it would be useful to translate the reference store for the subset that can be mapped, e.g.:

import json

from tifffile import imread
from lib_i_wish_existed import TiffStore2Zarr

with imread("data.tiff", aszarr=True) as store:
    converter = TiffStore2Zarr(store, tiff_url)
    ref = converter.translate()  # raises an exception if the tiff can't be mapped directly to zarr
    with open("data.tiff_offsets.json", "w") as f:
        json.dump(ref, f)

I'm just leaving the above for reference. Maybe there is a way that tifffile could consolidate certain "features" for the pages in a store to make this (in)compatibility with zarr easier to detect. Either way, the current zarr additions to tifffile have made it substantially easier to explore this idea, so thanks a lot!

cgohlke added the enhancement label on Jan 24, 2021
joshmoore commented

@cgohlke, if I may add: fsspec-reference-maker is definitely experimental, but having your input at this stage would be invaluable. From zarr-developers/zarr-python#556 (comment), if there are any spec changes that would help to support viable TIFF edge cases, it'd be good to capture them.

(And either way, I'm still excited by aszarr.)

martindurant commented Jan 27, 2021

Also, adding additional numcodecs would be generally useful and, as far as I understand, not hard where existing Python or C libs are available.
(EDIT: I mean codecs and transforms in the numcodecs library, on which zarr depends, if that wasn't obvious.)
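
For context, a minimal sketch of the numcodecs codec interface that zarr builds on, using the existing Zlib codec:

import numcodecs
import numpy as np

# numcodecs codecs expose a simple encode/decode pair that zarr uses
# for both compressors and filters.
codec = numcodecs.Zlib(level=1)
data = np.arange(256, dtype='uint8').tobytes()
assert bytes(codec.decode(codec.encode(data))) == data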

jakirkham commented

Just to add to Martin's comment, Numcodecs ships both conda and wheel binary packages, so hopefully this makes it a bit easier to use downstream without needing to worry about compiling. We are also looking into making Numcodecs a pure Python package now that Blosc has wheel packages (in addition to conda).

cgohlke commented Mar 13, 2021

I have started working on this issue but am having some trouble testing. The following code runs on my system without raising an exception, but the data returned by zarr seems random, and the web server logs do not show any access to the .tif file. The file is a zlib-compressed pyramidal OME-TIFF. I verified that the offsets and byte counts in the JSON file are correct. A manual range request using the requests library works. Any idea? Is there a way to test the ReferenceFileSystem on a local file system?

import zarr  # 2.6.1
import fsspec  # 0.8.7
import tifffile  # 2021.3.dev

localpath = ''
filename = 'test.ome.tif'
url = 'https://www.lfd.uci.edu/'

# create the reference file
with tifffile.imread(localpath + filename, aszarr=True) as store:
    with open(localpath + filename + '.json', 'w') as fh:
        store.write_fsspec(fh, url)

# open the reference file from web server
mapper = fsspec.get_mapper(
    'reference://',
    references=url + filename + '.json',
    target_protocol='https',
)

zgrp = zarr.open(mapper, mode='r')
print(zgrp[9].info)  # print info of last level
im = zgrp[9][:]  # <- random data

Output:

Name               : /9
Type               : zarr.core.Array
Data type          : uint8
Shape              : (142, 136, 3)
Chunk shape        : (256, 256, 3)
Order              : C
Read-only          : True
Compressor         : Zlib(level=1)
Store type         : fsspec.mapping.FSMap
No. bytes          : 57936 (56.6K)
Chunks initialized : 1/1
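
One plausible way to test against the local file system (an assumption on my part, mirroring the call above): write the reference file with a local directory as the target URL and use target_protocol='file'. File names and paths below are placeholders.

import fsspec
import tifffile
import zarr

# Write a reference file whose chunk targets are local paths.
with tifffile.imread('test.ome.tif', aszarr=True) as store:
    with open('test.ome.tif.json', 'w') as fh:
        store.write_fsspec(fh, '/data/')  # assumes the .tif lives in /data/

mapper = fsspec.get_mapper(
    'reference://',
    references='test.ome.tif.json',
    target_protocol='file',
)
zgrp = zarr.open(mapper, mode='r')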

martindurant commented

I have so far looked at one key:

"9/0.0": ["https://www.lfd.uci.edu/test.ome.tif", 6276016176, 33105]
(the one you used, the very last key!)

If I directly make an HTTPFileSystem and access this:

import fsspec
import numcodecs

h = fsspec.filesystem('https')
out = h.cat_file("https://www.lfd.uci.edu/test.ome.tif", 6276016176, 6276016176 + 33105)
bb = numcodecs.Zlib().decode(out)
assert len(bb) == 256 * 256 * 3
assert mapper["9/0.0"] == out  # mapper from the snippet above

(if it were random data, no way zlib would happen to give that number of bytes out)

I notice from the zarr info (taken from mapper["9/.zarray"]) that the chunk is larger than the whole array, which is suspicious. This is the last of the datasets, and the number of bytes appears to match the chunk size, which would be more pixels than are needed. This whole chunk appears to end only 665 bytes before the end of the whole file (I don't know if TIFF has some footer metadata).

Are other chunks coming through correctly? For the other blocks I am seeing a lot of zeros.

cgohlke commented Mar 13, 2021

(if it were random data, no way zlib would happen to give that number of bytes out)

Yes, by "random" I meant that I got different numbers every time.

the chunk is larger than the whole array, which is suspicious.

That should be OK according to the TIFF specification. The chunk data in the file should be complete.

I don't know if TIFF has some footer metadata.

The OME-XML is written at the end.

Are other chunks coming through correctly?

No, I tried two other levels.

As mentioned, the requests library works:

from matplotlib import pyplot
import numpy
import requests
import zlib

headers = {'Range': 'bytes=6276016176-6276049280'}
r = requests.get('https://www.lfd.uci.edu/test.ome.tif', headers=headers, stream=True)
data = b''.join(chunk for chunk in r.iter_content(1024))
d = zlib.decompress(data)
im = numpy.frombuffer(d, dtype='uint8').reshape(256, 256, 3)
pyplot.imshow(im)
pyplot.show()

cgohlke commented Mar 13, 2021

Apparently the key in the reference JSON file must be "9/0.0.0". Using "9/0.0", a KeyError is raised but ignored, and an empty, uninitialized array is returned.
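
For illustration, a small sketch of the zarr v2 chunk key convention: one index per array dimension, joined by dots, so a 3-d array uses keys like "0.0.0".

import zarr

# The last level has shape (142, 136, 3) with chunks (256, 256, 3),
# so its single chunk gets the three-dimensional key '0.0.0'.
z = zarr.zeros((142, 136, 3), chunks=(256, 256, 3), dtype='u1')
z[:] = 1
print(sorted(z.store))  # ['.zarray', '0.0.0']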

cgohlke commented Mar 13, 2021

It works now. I can visualize the multiscales zarr Group created from the fsspec ReferenceFileSystem using napari.

martindurant commented

Apparently the key in the reference JSON file must be "9/0.0.0". Using "9/0.0", a KeyError is raised but ignored, and an empty, uninitialized array is returned.

Is this correct, then? Indeed, there are three dimensions, even if the last dimension only ever has one chunk. zarr doesn't know about images having a colour dimension.

The KeyError would be the right thing to raise (the key is not found in the mapper), and zarr interprets this as "file missing, so use the default fill value", in this case zeros. Zarr allows you to have arrays where most or even none of the blocks have data on disk, so the logical size can be much bigger than the stored size.
The mapper allows you to specify which exceptions are translated into KeyError via the missing_exceptions argument, which by default includes FileNotFoundError.
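
A minimal sketch of that fill-value behaviour with an in-memory store:

import zarr

# Chunk '1' is never written; reading it yields fill_value instead of an error.
z = zarr.open({}, mode='w', shape=(4,), chunks=(2,), dtype='i4', fill_value=-1)
z[:2] = [1, 2]  # materializes chunk '0' only
print(z[:])     # [ 1  2 -1 -1]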

cgohlke commented Mar 13, 2021

Thank you! Makes sense. I'm setting the fill value to None/null. That's why the chunks were uninitialized.

cgohlke commented Mar 15, 2021

I ran into another issue while testing multi-series TIFF files. It seems it is not possible to use more than one FSMap instance (?). In the following example, the remote TIFF file is never accessed because of a silent RuntimeError: Timeout context manager should be used inside a task. The mapper created first works; the second mapper doesn't. What am I missing?

import fsspec
import zarr

# map series 1
mapper1 = fsspec.get_mapper(
    'reference://',
    references='http://localhost:8080/test_zarr_fsspec.ome.tif.s1.json',
    target_protocol='http',
)
# map series 2
mapper2 = fsspec.get_mapper(
    'reference://',
    references='http://localhost:8080/test_zarr_fsspec.ome.tif.s2.json',
    target_protocol='http',
)
za = zarr.open(mapper2, mode='r')
print(za.info)
print(za[:])    # <- zeroed data

Output

Type               : zarr.core.Array
Data type          : uint8
Shape              : (3, 219, 301)
Chunk shape        : (1, 219, 301)
Order              : C
Read-only          : True
Compressor         : None
Store type         : fsspec.mapping.FSMap
No. bytes          : 197757 (193.1K)
Chunks initialized : 3/3

[[[0 0 0 ... 0 0 0]
<snip>
  [0 0 0 ... 0 0 0]]]

test_zarr_fsspec.ome.tif.s1.json

{
  ".zattrs": "{}",
  ".zarray": "{\n \"chunks\": [\n  219,\n  301,\n  3\n ],\n \"compressor\": null,\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  219,\n  301,\n  3\n ],\n \"zarr_format\": 2\n}",
  "0.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 261136, 197757]
}

test_zarr_fsspec.ome.tif.s2.json

{
  ".zattrs": "{}",
  ".zarray": "{\n \"chunks\": [\n  1,\n  219,\n  301\n ],\n \"compressor\": null,\n \"dtype\": \"|u1\",\n \"fill_value\": 0,\n \"filters\": null,\n \"order\": \"C\",\n \"shape\": [\n  3,\n  219,\n  301\n ],\n \"zarr_format\": 2\n}",
  "0.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 459136, 65919],
  "1.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 525055, 65919],
  "2.0.0": ["http://localhost:8080/test_zarr_fsspec.ome.tif", 590974, 65919]
}

Traceback from re-raising the RuntimeError in fsspec\mapping.py:

  File "test_issue56.py", line 16, in <module>
    print(za[:])
  File "X:\Python38\lib\site-packages\zarr\core.py", line 571, in __getitem__
    return self.get_basic_selection(selection, fields=fields)
  File "X:\Python38\lib\site-packages\zarr\core.py", line 696, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out,
  File "X:\Python38\lib\site-packages\zarr\core.py", line 739, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "X:\Python38\lib\site-packages\zarr\core.py", line 1034, in _get_selection
    self._chunk_getitems(lchunk_coords, lchunk_selection, out, lout_selection,
  File "X:\Python38\lib\site-packages\zarr\core.py", line 1691, in _chunk_getitems
    cdatas = self.chunk_store.getitems(ckeys, on_error="omit")
  File "X:\Python38\lib\site-packages\fsspec\mapping.py", line 91, in getitems
    raise out['0.0.0']  # re-raise RuntimeError
  File "X:\Python38\lib\site-packages\fsspec\implementations\reference.py", line 90, in _cat_file
    return await self.fs._cat_file(url, start=start, end=end)
  File "X:\Python38\lib\site-packages\fsspec\implementations\http.py", line 168, in _cat_file
    async with self.session.get(url, **kw) as r:
  File "X:\Python38\lib\site-packages\aiohttp\client.py", line 1117, in __aenter__
    self._resp = await self._coro
  File "X:\Python38\lib\site-packages\aiohttp\client.py", line 448, in _request
    with timer:
  File "X:\Python38\lib\site-packages\aiohttp\helpers.py", line 635, in __enter__
    raise RuntimeError(
RuntimeError: Timeout context manager should be used inside a task

martindurant commented

That exception is a new one for me, and doesn't make much sense to me...

I have been trying to simplify the async handling in fsspec; would you mind trying with the fsspec/filesystem_spec#572 version of fsspec (git+https://github.com/martindurant/filesystem_spec.git@ioloop_massage2)?

cgohlke commented Mar 16, 2021

fsspec/filesystem_spec#572 fixes the issue for me. The tests pass now. Thank you very much!

martindurant commented

PS: I don't know if you have been following fsspec/kerchunk#17, which establishes a more formal spec for the content of the references JSON file, with some features to make that file more compact. The ReferenceFileSystem implementation (PR) will be backwards compatible.

cgohlke commented Mar 16, 2021

Yes, I've seen version 1 of the specification. Using a template for the URL will make the file more compact. But for now I'm going to release tifffile with experimental version 0 support.
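
For reference, a sketch of what a version 1 reference file with a URL template might look like, reusing the offsets from the s1 example above (field names follow the draft spec and may change):

import json

refs_v1 = {
    'version': 1,
    'templates': {'u': 'http://localhost:8080/test_zarr_fsspec.ome.tif'},
    'refs': {
        '.zarray': '{...}',  # zarr metadata JSON, as in the version 0 files above
        '0.0.0': ['{{u}}', 261136, 197757],  # template expands to the full URL
    },
}
with open('test_zarr_fsspec.ome.tif.v1.json', 'w') as fh:
    json.dump(refs_v1, fh, indent=1)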

cgohlke commented Mar 16, 2021

Tifffile-2021.3.16 adds a store method (ZarrTiffStore.write_fsspec) and a script (tiff2fsspec) to write ReferenceFileSystem JSON files for TIFF files:

with tifffile.imread(tiff_filename, aszarr=True) as store:
    store.write_fsspec(tiff_filename + '.json', url)
$ python -m tifffile.tiff2fsspec --help
usage: tiff2fsspec [-h] [--out OUT] [--series SERIES] [--level LEVEL] [--key KEY] [--chunkmode CHUNKMODE] tifffile url

Write fsspec ReferenceFileSystem for TIFF file.

positional arguments:
  tifffile              path to the local TIFF input file
  url                   remote URL of TIFF file without file name

optional arguments:
  -h, --help            show this help message and exit
  --out OUT             path to the JSON output file
  --series SERIES       index of series in file
  --level LEVEL         index of level in series
  --key KEY             index of page in file or series
  --chunkmode CHUNKMODE
                        mode used for chunking {None, pages}
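
For example, a plausible invocation (file name and URL are placeholders):

$ python -m tifffile.tiff2fsspec test.ome.tif https://www.lfd.uci.edu/ --out test.ome.tif.json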

A ValueError is raised if the TIFF file uses a feature that is not supported by zarr or numcodecs, e.g.:

  1. PackBits, LZW, JPEG, or JPEG2000 compression
  2. any "filters", e.g. predictors, bitorder, packed integers
  3. float24 dtype
  4. JPEGTables
  5. incomplete chunks, e.g. if imagelength % rowsperstrip != 0
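
A minimal sketch of guarding against these cases when generating references (file name and URL are placeholders):

import tifffile

with tifffile.imread('data.tiff', aszarr=True) as store:
    try:
        store.write_fsspec('data.tiff.json', 'https://example.com/')
    except ValueError as exc:
        print(f'cannot map this TIFF to zarr: {exc}')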

The JSON files can get quite large. One of the local WSI test files contains over 23 million tiles and the JSON file is larger than 1.5 GB.

manzt commented Mar 17, 2021

@cgohlke Thanks for the release! I tried out the CLI for a couple of images and it worked well. One issue is that I don't think the endianness in the .zarray metadata reflects the endianness on disk. I have a big-endian uint16 multiscale OME-TIFF and had to manually swap the bytes returned from zarr.js.

Interactive notebook: https://observablehq.com/d/16524d8e7fd4f9ef


I have shared the reference in a gist.

I think this is likely due to ZarrStore using sys.byteorder:

tifffile/tifffile/tifffile.py

Lines 8155 to 8161 in b69ddd4

def _dtype(dtype):
    """Return dtype as string with native byte order."""
    if dtype.itemsize == 1:
        byteorder = '|'
    else:
        byteorder = {'big': '>', 'little': '<'}[sys.byteorder]
    return byteorder + dtype.str[1:]
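
One way the fix could look (illustrative only; the extra parameter is hypothetical): derive the byte order from the TIFF file itself instead of sys.byteorder.

def _dtype(dtype, byteorder):
    """Return dtype as string with the file's byte order ('<' or '>')."""
    if dtype.itemsize == 1:
        return '|' + dtype.str[1:]
    return byteorder + dtype.str[1:]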

martindurant commented

I would love to see this functionality in a blog article somewhere.

cgohlke commented Mar 17, 2021

One issue is that I don't think endianness in the .zarray metadata reflects the endianness on disk.

You are right. Fixed in v2021.3.17.
