Also support BlockArrays #16

Open
Luapulu opened this issue Nov 8, 2020 · 9 comments

@Luapulu

Luapulu commented Nov 8, 2020

In light of this Discourse discussion, could we also support BlockArrays in ArrayInterface.jl? I have time the weekend after next to take a stab at implementing the changes suggested in that post.

Ideally, this would make efficient BlockArrays possible through traits, rather than requiring a subtype of AbstractBlockArray. Note that I'm not familiar with the details here at all. Rerouting reduce operations, broadcasting, etc. through some set of traits might already be basically implemented or trivially easy to add. Or it might not even make sense.

tagging @chriselrod

@Tokazama
Member

Tokazama commented Nov 8, 2020

It's probably time for packages like BlockArrays.jl to start depending on ArrayInterface.jl, instead of us defining support for them here.

@chriselrod
Contributor

I'll generalize contiguous_axis and contiguous_batch_size by letting them return tuples to represent block arrays.

@chriselrod
Contributor

chriselrod commented Nov 8, 2020

The more interesting change will be having this work with ArrayInterface.getindex and ArrayInterface.setindex!. I haven't yet checked whether contiguous_axis and contiguous_batch_size are actually supported.

This will be a breaking change, but I don't believe anyone else has used them yet. (I'll still bump the major version number; I'm just saying that it won't cause problems for anyone using ArrayInterface, AFAIK.)

@Tokazama
Member

Tokazama commented Nov 8, 2020

The last thing I need in order to get all of the stuff in "stridelayout.jl" to work with the stuff in "indexing.jl" is replacing this internal method.

@chriselrod
Contributor

I need to walk back my earlier comment. I'd want equally sized blocks.
I think BlockArray(rand(4, 4), [2,2], [1,1,2]) is out of scope, at least for what contiguous_axis and contiguous_batch_size are meant for.

@Luapulu
Author

Luapulu commented Nov 9, 2020

So, would the block size have to divide the array size? Or could the last blocks along each axis be shorter? The second option would be needed for chunked data on disk, for example.

@chriselrod
Contributor

The last block can be shorter.
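
As a rough illustration of "equally sized blocks, except that the last one may be shorter", here is a minimal sketch. The helper name `block_ranges` is hypothetical and is not part of ArrayInterface.jl or BlockArrays.jl.

```julia
# Hypothetical helper: split an axis of length `n` into blocks of size `b`.
# Every block has length `b`, except possibly the last one, which is shorter
# when `b` does not divide `n`.
block_ranges(n::Integer, b::Integer) = [i:min(i + b - 1, n) for i in 1:b:n]

ranges = block_ranges(10, 4)   # 1:4, 5:8, 9:10
```

Applying this helper to each axis separately gives a regular grid of blocks in which only the trailing blocks along each axis are short.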

@chriselrod
Contributor

chriselrod commented Nov 9, 2020

But I also need to ask to clarify -- what is the actual memory layout you have in mind?

Do you only intend the array to be iterated in a certain way, or do you intend the memory layout to be in blocks?

@Luapulu
Author

Luapulu commented Nov 9, 2020

So, for my use case, namely HDF5Arrays, I have both chunked and unchunked datasets. Chunked datasets are stored in separate blocks on disk, each of which is internally contiguous. Unchunked datasets are stored in a fully contiguous manner, but it may still be worthwhile to access these chunk by chunk, to avoid many single disk read operations in favour of fewer, larger reads.
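
The chunk-by-chunk access pattern described above might look roughly like the following sketch. An in-memory matrix stands in for the dataset on disk, and `axis_blocks`, `chunked_sum`, and the chunk size are assumptions made for illustration, not anything from HDF5.jl.

```julia
# Stand-in for a dataset on disk; a real HDF5 dataset would be read lazily.
data = reshape(collect(1.0:48.0), 6, 8)
chunk = (4, 3)   # assumed chunk size; trailing chunks along each axis are shorter

# Ranges covering one axis of length `n` in blocks of size `b`.
axis_blocks(n, b) = [i:min(i + b - 1, n) for i in 1:b:n]

# Reduce over the array one chunk at a time, counting the simulated reads.
function chunked_sum(A::AbstractMatrix, chunk::NTuple{2,Int})
    total = zero(eltype(A))
    nreads = 0
    for cols in axis_blocks(size(A, 2), chunk[2]),
        rows in axis_blocks(size(A, 1), chunk[1])
        block = A[rows, cols]   # one larger "read" instead of many scalar reads
        total += sum(block)
        nreads += 1
    end
    return total, nreads
end

total, nreads = chunked_sum(data, chunk)   # 6 reads instead of 48
```

The same loop works whether or not the dataset is physically chunked; for an unchunked dataset the chunk size is simply a tuning parameter.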

There’s one more wrinkle: HDF5 is actually row-major, but HDF5.jl reverses the order of the dimensions when reading and writing, so that a Julia array in memory and an HDF5 dataset on disk have the same memory layout.
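
The effect of reversing the dimension order can be seen with plain Julia arrays. This is only an illustration of the general row-major versus column-major relationship, not a description of HDF5.jl's internals; the buffer and shape are made up.

```julia
# Row-major buffer for the 2×3 matrix [1 2 3; 4 5 6], as a C library would store it.
buf = [1, 2, 3, 4, 5, 6]
c_dims = (2, 3)

# Julia is column-major, so the same memory is interpreted with reversed dimensions.
J = reshape(buf, reverse(c_dims))   # 3×2, shares memory with `buf`

# Element (i, j) of the row-major matrix is J[j, i] (both 1-based).
c_13 = J[3, 1]   # row 1, column 3 of the row-major matrix
c_21 = J[1, 2]   # row 2, column 1 of the row-major matrix
```

No data is moved; only the order in which the dimensions are listed changes.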

One more issue: caching. HDF5 internally has a chunk cache, where recently accessed chunks are cached, allowing for faster reads if you keep working with the same chunks. I may also do some benchmarking and find that an explicit in-memory buffer in Julia is worth it. The buffer may be the size of one chunk, or it may be larger. In the case of an unchunked dataset on disk, there wouldn’t be any chunks, of course, so the size of the buffer could be anything.
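
A single-chunk read buffer of the kind mentioned above could be sketched as follows. Everything here (`ChunkBuffer`, `read_chunk`, `getelement!`, the 1-D layout) is hypothetical and only meant to show the idea; it is not HDF5's chunk cache.

```julia
# Hypothetical one-chunk buffer over a 1-D source.
mutable struct ChunkBuffer{T}
    chunk::Int         # chunk length
    index::Int         # which chunk is currently buffered (0 = none)
    data::Vector{T}    # contents of the buffered chunk
    misses::Int        # number of simulated disk reads
end
ChunkBuffer{T}(chunk::Int) where {T} = ChunkBuffer{T}(chunk, 0, T[], 0)

# Stand-in for a disk read of chunk `k`; the last chunk may be shorter.
read_chunk(src, k, chunk) = src[((k - 1) * chunk + 1):min(k * chunk, length(src))]

function getelement!(buf::ChunkBuffer, src, i::Int)
    k = cld(i, buf.chunk)            # chunk containing element i
    if buf.index != k                # miss: refill the buffer with one larger read
        buf.data = read_chunk(src, k, buf.chunk)
        buf.index = k
        buf.misses += 1
    end
    return buf.data[i - (k - 1) * buf.chunk]
end

source = collect(1:10)               # pretend this lives on disk
buf = ChunkBuffer{Int}(4)
vals = [getelement!(buf, source, i) for i in 1:10]
```

With sequential access, the ten element reads above trigger only three chunk reads; a larger buffer would trade memory for fewer misses under less regular access patterns.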

Footnote here: this is why I want to avoid subtyping. Ideally, I’d like one HDF5Array supertype with a number of different implementations as subtypes. But subtyping an AbstractBlockArray would force non-chunked arrays to also be BlockArrays. Sure, there are workarounds: I can avoid a supertype and just have a Union of the different implemented types and call it AnyHDF5Array, or I can subtype a BlockArray and simply have unchunked arrays consist of a single block, effectively making them non-blocked.
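
For reference, the trait-based alternative to subtyping could look something like this sketch, which uses the usual Holy-trait pattern. All names (`BlockStyle`, `Blocked`, `NotBlocked`, `AbstractHDF5Array`, `ChunkedArray`, `PlainArray`, `blocksize`) are hypothetical and are not taken from ArrayInterface.jl, BlockArrays.jl, or HDF5.jl.

```julia
# Hypothetical trait describing whether an array type is stored in blocks.
abstract type BlockStyle end
struct Blocked <: BlockStyle end
struct NotBlocked <: BlockStyle end

# Default: arrays are not blocked.
BlockStyle(::Type{<:AbstractArray}) = NotBlocked()

# Two implementations sharing one supertype; neither subtypes a block-array type.
abstract type AbstractHDF5Array{T,N} <: AbstractArray{T,N} end

struct ChunkedArray{T,N} <: AbstractHDF5Array{T,N}
    data::Array{T,N}
    chunk::NTuple{N,Int}
end

struct PlainArray{T,N} <: AbstractHDF5Array{T,N}
    data::Array{T,N}
end

Base.size(A::AbstractHDF5Array) = size(A.data)
Base.getindex(A::AbstractHDF5Array, I::Int...) = A.data[I...]

# Only the chunked implementation opts in to the trait.
BlockStyle(::Type{<:ChunkedArray}) = Blocked()

# Dispatch on the trait rather than on the type hierarchy.
blocksize(A::AbstractArray) = blocksize(BlockStyle(typeof(A)), A)
blocksize(::Blocked, A) = A.chunk
blocksize(::NotBlocked, A) = size(A)   # a single block covering the whole array

c = ChunkedArray(zeros(6, 8), (4, 3))
p = PlainArray(zeros(6, 8))
```

Here `blocksize(c)` uses the stored chunk size, while `blocksize(p)` and `blocksize` of any ordinary `Array` fall back to the full array size, so generic code can treat both cases uniformly without a shared block-array supertype.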

@ChrisRackauckas ChrisRackauckas transferred this issue from JuliaArrays/ArrayInterface.jl Feb 18, 2023