Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. #4

Open
alxmrs opened this issue Jan 23, 2024 · 25 comments

Comments

@alxmrs
Copy link
Owner

alxmrs commented Jan 23, 2024

https://arrow.apache.org/docs/python/filesystems.html

Like Kerchunk, it should be possible to create a virtual parquet dataset from an existing Zarr store. Here, it would also provide a table-like view of a dense array — like the rest of qarray.

Ideally, this could be persisted in a file system without creating extra space (just metadata). Then, one could use tools like Pandas, DuckDB, Apache Arrow, etc., to query raster data as tables, maybe even outside of Python.

My bet here is that the LOC to prototype this is going to be small. Getting it performant is another matter. I wonder if this would be a justification for BigQuery or other similarly high powered query engine? BQ reads parquet!

@alxmrs
Copy link
Owner Author

alxmrs commented Jan 23, 2024

cc: @cisaacstern

@alxmrs
Copy link
Owner Author

alxmrs commented Mar 17, 2024

I’ve been doing some background reading on parquet. Here are some useful things I’ve learned:

  • Arrow is the in memory version of the file format.
  • The project is 11 or so years old and comes from the Hadoop ecosystem (i.e. is well battle tested)
  • A core feature is column oriented chunks. Like, the format groups rows together abstractly but puts each column of a set of rows in a logical, if not physical grouping (column chunks can be one or more file).
    • It seems like Zarr chunks would map on to parquet well IFF they can be interpreted as column chunks. If I have the right mental model, I think unraveled Zarr would be a compatible with parquet.
  • Martin’s fastparquet is a good reference to understand the file format nitty-gritty: https://github.com/dask/fastparquet

@alxmrs
Copy link
Owner Author

alxmrs commented Mar 17, 2024

The thrift definition is especially helpful to read.

https://github.com/dask/fastparquet/blob/main/fastparquet/parquet.thrift

Some initial thoughts:

  • is it possible that a Zarr chunk in a bucket could be the filepath of a column chunk?
  • If we can make calculating the metadata optional, maybe we could get slower, but readable zarrquet.

@alxmrs
Copy link
Owner Author

alxmrs commented Mar 17, 2024

Reading the Data Page V1 algorithm in these sources (https://github.com/dask/fastparquet/blob/main/fastparquet/core.py), I think most Zarr would be represented in a Plain Thrift encoding.

@alxmrs alxmrs changed the title Create a referenced based filesystem to read virtual parquet from Zarr. Zarrquet: Create a referenced based filesystem to read virtual parquet from Zarr. Mar 17, 2024
@alxmrs
Copy link
Owner Author

alxmrs commented Mar 18, 2024

I've just started chatting with @martindurant in a side channel (Martin, I'm happy to bring the discussion here). He has a few thoughts on the matter:

  • First, he's been interested in this kind of feature for a long time (Contributing a CSV module [RE: dask dataframe read_csv] fsspec/kerchunk#66). This doesn't surprise me at all :)
  • Feather is probably a better fit for mapping to Zarr than parquet, but parquet is more ubiquitous for representing tables
  • It is quite possible to represent Zarr as parquet, say, without nested columns or other corner case features.

I'm really happy to have a vote of confidence on this approach from a domain expert!

@alxmrs
Copy link
Owner Author

alxmrs commented Mar 27, 2024

I've discussed how to surface Zarrquet to databases with Martin a bit more. Here's the best path forward from the discussion:

  • Create a PyArrow or fsspec FileSystem to read Zarrquet in Python
    • Test it on DuckDB's Python interface
  • For all other DBs, we'll use this Python client as an adaptor to a Parquet file interface in a proxy server.
    • TBD if this would only act as a local server or something friendly to distributed reads.
    • A good examples to follow (thanks Martin): https://github.com/bluesky/tiled, WebHDFS.

@jatorre
Copy link

jatorre commented Apr 9, 2024

So, hi all, let me jump in here, because I think we are going on the same direction...

In Zarrquet we will store a chunk as a row, and bands as columns with xarrays data encoded. So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

Or in a virtual way, you are thinking on creating a "virtual" parquet file with the contents of a geozarr. And then BigQuery using a external table (parquet virtual tables), you end up creating a Virtual BigQuery table that is a connected to a Virtual Parquet File that is backed up by Geozarr stored in Cloud Storage?

@martindurant
Copy link

Yes, this is totally doable, to make a python file-like where the buffers are fetched from zarr chunks, and the parquet thrift metadata are made dynamically at runtime (or, optionally, stored elsewhere, which might be the easier way to start).

BigQuery using a external table

How does one persuade bigquery to read from a python file? We can get pyarrow or even duckdb to do this, but ...

xarrays data encoded.

Meaning simple things like gzip/zstd compression?

@martindurant
Copy link

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

@alxmrs
Copy link
Owner Author

alxmrs commented Apr 9, 2024

you end up creating a Virtual BigQuery table that is a connected to a Virtual Parquet File that is backed up by Geozarr stored in Cloud Storage?

Yes, this is the end game.

Maybe you are thinking to host an HTTP server which pipes through virtual parquet bytes?

Yes — and the big unknown for me is the cloud architecture question. How do you build a proxy GCS bucket? I was imagining an elastic load balancer would play a key role.

@jatorre
Copy link

jatorre commented Apr 9, 2024

I am thinking less virtual in a way, what about the other no dynamic way.

In Zarrquet we will store a chunk as a row, and bands as columns with xarrays data encoded. So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

As a static file.

@alxmrs
Copy link
Owner Author

alxmrs commented Apr 10, 2024

This idea has a lot of merit. We did something similar over here: https://github.com/google/weather-tools/blob/main/weather_mv/README.md

The best place for the static approach (copying Zarr into Parquet) would be as an output sink for weather-mv.

In fact, this repo was inspired by this weather tool. The goal here is to find a way to provide a similar experience to the one offered by weather-mv bq without doubling storage or waiting for a large ETL job to finish.

@alxmrs
Copy link
Owner Author

alxmrs commented Apr 10, 2024

So the schema on Parquet is kind of Zarr compliance, and the content of the bands is zarrays.

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

I have began thinking this is possible! It looks like there is room in Parquet’s thrift format to point to Zarr array blobs as the backing data.

I have some notes about this around here: #4 (comment)

@martindurant
Copy link

Wait a second! I re-read this — I think I may have misunderstood what you meant. Are you saying that we can accomplish the goal of no-copy Zarr/Parquet with a static file?

You can store the thrift serialised objects in file(s) rather than make them on the fly; but no, parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

@martindurant
Copy link

Also, I was chatting about this yesterday: it is totally possible to store N-D arrays in parquet despite its columnar nature: a 3D array would be list(list(number)), and the required rep/def levels would be easy to generate, as they would be regularly repeating. In V2 pages, the buffer is compressed separately from the rep/defs, so streaming data from chunks would work fine. The question is, do any parquet frameworks expect this sort of data?

@martindurant
Copy link

Also work noting (sorry for the spamming): fastparquet allows you to directly instantiate and serialised thrift objects. I don't believe arrow has any such ability. python-thrift also allows this, but is much slower.

@jatorre
Copy link

jatorre commented Apr 10, 2024 via email

@martindurant
Copy link

I am a bit lost here since I thought we would be able to store the arrays
as arrow.

What do you mean by "arrow" here? Parquet is not arrow; feather2 essentially is.

Not the same but geoarrow is doing this for vector data. I assumed using
nested arrays in arrow was going to be no problem.

Yes, the parquet format can store nested data, which would include geo structures (like list of points -> polygon, for each row). But this is presumably not what you have in zarr, right? Zarr does not have a concept of variable-length structures in each array position, except for string, JSON and pickle encoding (i.e., python objects, not what you want)

So: please show me some example zarr data you'd like to view, and I'll see if I can make a python file-like to hand to pyarrow. But I'll be using fastparquet to make the thrift, since I don't believe there's any other way.

@alxmrs
Copy link
Owner Author

alxmrs commented Apr 13, 2024

parquet does not let you point to binary buffers in some other file. It does in theory allow for each "column chunk" (thrift header + repetition/definition + buffer) to be in a different file, but in practice every parquet file except _metadata has a complete thrift footer and whole row groups.

Thanks Martin. That saves me from having to discover this first hand! Do you know of any format that would be a good fit, that allows for extra file storage? Feather, Avro, ORC, maybe?

The question is, do any parquet frameworks expect this sort of data?

Yes, this is really interesting! I agree, though, I don't see a use case for it for me in the near term.

I assumed using nested arrays in arrow was going to be no problem.

In a larger context, I see this work as largely experimental. It very well could not pan out as I find more information! Take these plans with a grain of salt.

@martindurant
Copy link

Do you know of any format that would be a good fit, that allows for extra file storage?

Sorry, no

@alxmrs
Copy link
Owner Author

alxmrs commented May 26, 2024

Given zarr-developers/zarr-specs#287, it looks like future versions of Zarr will allow an fsspec interface beyond Python. It seems like a proxy server won’t be required.

@alxmrs
Copy link
Owner Author

alxmrs commented Jun 27, 2024

Pathway to Postgres integration: https://github.com/paradedb/paradedb/tree/dev/pg_lakehouse#overview

@martindurant
Copy link

You want postgres to query zarr data chunks with a parquet dynamic ephemeral intermediate representation? Actually, teaching duckDB about zarr might be simpler :) .

@alxmrs
Copy link
Owner Author

alxmrs commented Jul 17, 2024

@TomNicholas I just saw your tweet related to “headless data” and Pangeo. https://x.com/TEGNicholasCode/status/1813299299201294583

I wanted to reply: I think something like “Zarrquet” could be a good fit for the SQL interoperability box for pangeo.

@TomNicholas
Copy link

Thanks for tagging me @alxmrs - I think this also relates closely to something @manzt and I were talking about at SciPy. Basically using SQL tools to analyze Zarr stores on a chunk-level

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants