[r] Port blockwise iterator/reader to R #2152

mojaveazure · 2024-02-17T01:38:47Z

Implement the blockwise iterator and reader for the R API

This PR parallels #1792; it implements new classes for blockwise iteration through a SOMA sparse nd-array. Blockwise iteration is implemented through SOMASparseNDArrayRead$blockwise() (paralleling the Python implementation) and enabled for Arrow tables ($blockwise()$tables()) and COO sparse matrices ($blockwise()$sparse_matrix()).

New classes:

CoordsStrider: new class to iterate through coordinates, similar to Python's _coords_strider
SOMASparseNDArrayReadBase: base class for sparse array reads
SOMASparseNDArrayBlockwiseRead: new reader class for blockwise iterated reads
BlockwiseReadIterBase: base class for blockwise iteration
BlockwiseTableReadIter: blockwise iterator returning Arrow tables
BlockwiseSparseReadIter: blockwise iterator returning sparse matrices

New SOMA methods:

SOMASparseNDArrayRead$blockwise(): perform a blockwise read

resolves #1853
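To illustrate what a coordinate strider does, here is a minimal base-R sketch. It is not the CoordsStrider class added in this PR: `make_strider()` and its arguments are hypothetical, and only the method names `has_next()` and `next_element()` are borrowed from the commit list. It walks an inclusive coordinate range in chunks of a fixed stride.

```r
# Hypothetical helper, NOT the CoordsStrider class from this PR:
# a closure that walks the inclusive range [start, end] in chunks of `stride`.
make_strider <- function(start, end, stride) {
  stopifnot(
    "'start' must not exceed 'end'" = start <= end,
    "'stride' must be positive" = stride > 0
  )
  pos <- start
  list(
    has_next = function() pos <= end,
    next_element = function() {
      if (pos > end) stop("strider exhausted")
      # The last chunk is truncated at `end`
      chunk <- seq(pos, min(pos + stride - 1, end))
      pos <<- pos + stride
      chunk
    }
  )
}

strider <- make_strider(start = 0, end = 9, stride = 4)
chunks <- list()
while (strider$has_next()) {
  chunks[[length(chunks) + 1L]] <- strider$next_element()
}
# chunks now holds three coordinate vectors: 0..3, 4..7 and 8..9
```

A blockwise reader would then issue one read per chunk, restricted to those coordinates on the strided axis.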

codecov · 2024-02-17T01:48:40Z

Codecov Report

Merging #2152 (a1f726d) into main (7b7f81e) will increase coverage by 4.26%.
Report is 1 commit behind head on main.
The diff coverage is 38.50%.

❗ Current head a1f726d differs from pull request most recent head a2778d5. Consider uploading reports for the commit a2778d5 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2152      +/-   ##
==========================================
+ Coverage   65.46%   69.73%   +4.26%     
==========================================
  Files         143       55      -88     
  Lines       12805     4794    -8011     
  Branches      510        0     -510     
==========================================
- Hits         8383     3343    -5040     
+ Misses       4334     1451    -2883     
+ Partials       88        0      -88

Flag	Coverage Δ
libtiledbsoma	`?`
python	`?`
r	`69.73% <38.50%> (-2.17%)`	⬇️


Components	Coverage Δ
python_api	`∅ <ø> (∅)`
libtiledbsoma	`∅ <ø> (∅)`

eddelbuettel · 2024-02-27T19:25:12Z

apis/r/R/BlockwiseIter.R

-      #  return(NULL)
-      #}
+      message("blockwise read next")
+      if (is.null(private$soma_reader_pointer)) {


Nice catch.

apis/r/R/BlockwiseIter.R

apis/r/R/utils-readerTransformers.R

aaronwolen · 2024-03-01T00:00:52Z

[sc-42211]

shortcut-integration · 2024-03-01T00:00:55Z

This pull request has been linked to Shortcut Story #42211: [r] Port blockwise sparse iterator from Python to R.

mlin · 2024-03-08T08:42:46Z

apis/r/R/SOMASparseNDArrayRead.R

+        "'size' must be a single integer value" = is.null(size) ||
+          rlang::is_integerish(size, 1L, finite = TRUE) ||
+          (inherits(size, 'integer64') && length(size) == 1L && is.finite(size)),
+        "'reindex_disable_on_axis' must be  avector of integers" = is.null(reindex_disable_on_axis) ||


Suggested change

-        "'reindex_disable_on_axis' must be avector of integers" = is.null(reindex_disable_on_axis) ||
+        "'reindex_disable_on_axis' must be a vector of integers" = is.null(reindex_disable_on_axis) ||

Thanks @mlin, corrected in 839411a
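The diff above relies on named conditions in stopifnot() (available since R 4.0.0), where the name becomes the error message. A minimal sketch of that pattern follows. `validate_size()` is a hypothetical function, and the rlang::is_integerish() call from the diff is replaced by a base-R whole-number check so the example has no dependencies.

```r
# Hypothetical validator; base R only, so rlang::is_integerish() from the diff
# is replaced here by an explicit whole-number check.
validate_size <- function(size = NULL) {
  stopifnot(
    "'size' must be a single integer value" = is.null(size) ||
      (is.numeric(size) && length(size) == 1L &&
        is.finite(size) && size == round(size))
  )
  invisible(size)
}

validate_size(NULL)  # passes: NULL is explicitly allowed
validate_size(100)   # passes: a single whole number

# A length-2 vector fails, and the condition's name is used as the message
msg <- tryCatch(
  validate_size(c(1, 2)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

Because the message is the name of the failing expression, a typo in it (as in the suggested change) is user-visible.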

mlin · 2024-03-08T09:28:03Z

@mojaveazure @eddelbuettel Thanks for all the work on this! One high-level question at this stage:

The Python .blockwise().scipy() iterator returns a csr_matrix (when stepping by rows) or a csc_matrix (when stepping by columns, which is less common), whereas here it looks like we're always returning a COO (dgTMatrix). I personally don't have a strong opinion that the R version needs to reflect that exactly, but I thought we should at least discuss it here.

In Python (scipy) coo_matrix doesn't support all the operations that csr_matrix does, vector math ops in particular; so the latter is usually a little more convenient for the user.

However, also in Python, the internal representation of csr_matrix unfortunately uses memory proportional to the largest row index, not the number of actual rows. The result is that the reindexing feature is practically necessary for large datasets, to keep the memory usage of the returned csr_matrix objects under control. The reindexing is also practically helpful just to know what you're getting in each block.

I emphasized "in Python" because of course I don't know if we're working under the same design constraints in R.
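For comparison on the R side, the row-compressed class in the Matrix package (dgRMatrix) stores one row pointer per row plus one, so its pointer vector also grows with the declared number of rows rather than with the number of populated rows. The sketch below uses made-up coordinates and assumes Matrix >= 1.3-0 for the `repr` argument of sparseMatrix(). It only compares pointer lengths and makes no claim about overall memory use in either language.

```r
library(Matrix)

# Two non-zero entries located in rows 999,999 and 1,000,000 (1-based)
i <- c(999999, 1000000)
j <- c(1, 2)
x <- c(1.5, 2.5)

# Row-compressed matrix using the original, global row coordinates
global_csr <- sparseMatrix(
  i = i, j = j, x = x,
  dims = c(1000000L, 10L), repr = "R"
)

# The same entries after mapping the rows to block-local positions 1 and 2
local_csr <- sparseMatrix(
  i = match(i, sort(unique(i))), j = j, x = x,
  dims = c(2L, 10L), repr = "R"
)

length(global_csr@p)  # one pointer per row, plus one
length(local_csr@p)
```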

eddelbuettel · 2024-03-08T12:22:58Z

@mlin There is a (currently unused) argument repr for the sparse matrix layout. We use 'T' (Matrix code for COO, internally called 'dgTMatrix') as it is the format on disc. The very simple change

modified   apis/r/R/BlockwiseIter.R
@@ -193,7 +193,7 @@ BlockwiseSparseReadIter <- R6::R6Class(
       coords,
       axis,
       ...,
-      repr = "T",
+      repr = c("T","R","C"),
       reindex_disable_on_axis = NULL
     ) {
       super$initialize(

also permits returning a row- or column-compressed variant (dgRMatrix or dgCMatrix, respectively).

mojaveazure · 2024-03-08T15:01:06Z

@mlin I would vote for only returning a COO matrix. The existing SparseReadIter class hardcodes a return representation of COO, and the current blockwise behavior is in line with that: https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/r/R/SparseReadIter.R#L20-L31

I can see a case for removing the repr parameter from the blockwise side, especially since the existing SparseReadIter does not have a repr parameter, but until we get alternative representations flowing through the existing SparseReadIter I think the blockwise iterator should only return COO.

mojaveazure · 2024-03-08T15:04:20Z

Also, @eddelbuettel's fix isn't as simple as expanding the allowed values for repr; it needs to be plumbed through and constantly checked for (Matrix has a bad habit of silently changing the representations based on what it deems most efficient; for example, concat() uses + to grow the resulting matrix, and row-based growth often results in CSC/CsparseMatrix getting cast to CSR/RsparseMatrix)
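The "constantly checked for" concern can be sketched as a small guard that coerces a result back to the requested layout after each operation. `ensure_repr()` is a hypothetical helper, not part of this PR. The example prints whatever class Matrix chooses for the sum rather than asserting it, since that may vary across Matrix versions.

```r
library(Matrix)

# Hypothetical helper: coerce `x` back to the requested layout if an
# operation changed it. "T", "C" and "R" follow the Matrix naming used above.
ensure_repr <- function(x, repr = c("T", "C", "R")) {
  repr <- match.arg(repr)
  target <- paste0(repr, "sparseMatrix")  # e.g. "TsparseMatrix"
  if (!is(x, target)) {
    x <- as(x, target)
  }
  x
}

coo <- sparseMatrix(
  i = c(1, 3), j = c(2, 1), x = c(1, 2),
  dims = c(3L, 3L), repr = "T"
)

summed <- coo + coo    # arithmetic may return a different layout
print(class(summed))   # inspect what Matrix chose
restored <- ensure_repr(summed, "T")
print(class(restored))
```

Each such coercion costs a conversion, which is part of why threading `repr` through every code path is more work than widening the argument's allowed values.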

eddelbuettel · 2024-03-08T15:24:46Z

(Well it passed the unit test where we do $read_next()$concat() -- that appears to wire through.)

But thumbs up for 'simpler is better'. COO seems fine as default.

mlin · 2024-03-08T19:59:48Z

@mojaveazure @eddelbuettel Ok, no objection from me to COO-only in principle, subject to:

What are we thinking about reindexing? As I mentioned, in Python it's practically essential, albeit because of an implementation detail in csr_matrix.

Besides that peculiarity, it's still useful for the iterator to tell you the range of the major axis (the one being strided) you're getting in each block. Reindexing is one way, though not the only way, of providing that info. Without it, imagine getting a sparse matrix with a huge shape where you know only one small block/stripe is populated, but you don't know where that is.

Then the minor axis: there are use cases where reindexing it is helpful and others where it isn't. I'm less concerned about that.
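As a rough illustration of major-axis reindexing, the base-R sketch below maps the global coordinates in one block to block-local positions. The coordinate values are made up, the `soma_data` and `local_dim_0` column names are assumptions for the example, and this is not the re-indexer that the later follow-up PR connects to the iterator.

```r
# Global COO coordinates of one block (0-based); values are made up
block <- data.frame(
  soma_dim_0 = c(5000, 5000, 5002, 5003),
  soma_dim_1 = c(7, 9, 7, 1),
  soma_data  = c(1.0, 2.0, 3.0, 4.0)
)

# Major-axis coordinates covered by this block (what the iterator strided over)
major_coords <- 5000:5003

# Reindex: 0-based position of each global coordinate within the block
block$local_dim_0 <- match(block$soma_dim_0, major_coords) - 1L

print(block)
```

With `major_coords` in hand, the caller knows both which stripe of the array the block covers and how to map local rows back to global ones (`major_coords[local + 1]`).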

mojaveazure · 2024-03-08T21:48:25Z

Reindexing would be useful, and essential to offering CSC/CSR output in R as well. However, that has yet to be ported to R, so I don't think we can offer that now.

As for knowing the range of the major axis, that poses a different problem in R, and one I hadn't thought of. R does not allow tuple outputs, and has no native unpacking like Python. The ways around this that I can think of are:

1. returning a list with the values in one entry and the major-axis indices in another (list(block = read_next(), indices = indices, axis = axis)); this would change the return type and make it harder to move code from the existing iterators to the blockwise iterator
2. storing the indices as an attribute on the return values; this would allow the return type of the blockwise iterators to be the same as the existing iterators

block <- read_next()
attr(block, "indices") <- indices
attr(block, "axis") <- axis

3. offering a mode to extract just the indices returned

block <- read_next()
if (axis == 0) {
  return(block[indices, ])
} else {
  return(block[, indices])
}

As for minor-axis, this PR currently doesn't do anything there as reindexing is not a part of this PR
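The trade-off between the list and attribute approaches can be tried out in base R. In this sketch `read_block()` is a hypothetical stand-in for an iterator's read_next(), and a plain matrix stands in for the Arrow table or sparse matrix the real iterators return.

```r
# Hypothetical stand-in for an iterator's read_next(); returns a small matrix
read_block <- function() matrix(1:6, nrow = 2)

# List approach: wrap values and metadata together (changes the return type)
as_list <- list(block = read_block(), indices = c(10L, 11L), axis = 0L)

# Attribute approach: keep the return type and attach the metadata
with_attrs <- read_block()
attr(with_attrs, "indices") <- c(10L, 11L)
attr(with_attrs, "axis") <- 0L

is.matrix(with_attrs)        # still a matrix
attr(with_attrs, "indices")  # metadata is retrievable

# Caveat: subsetting drops custom attributes, so the metadata is easy to lose
attr(with_attrs[1, , drop = FALSE], "indices")
```

The attribute approach keeps existing downstream code working, at the cost of metadata that does not survive common operations such as subsetting.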

aaronwolen · 2024-03-09T14:18:26Z

As we discussed, overall this looks great!

I do think we need to think carefully about @mlin's comment:

...it's still useful for the iterator to tell you the range of the major axis (the one being strided) you're getting in each block.

and your suggestion to return a list with the values might be the way to go. Perhaps @bkmartinjr and/or @pablo-gar could weigh in since this is used in the cellxgene-census package.

eddelbuettel · 2024-03-09T15:52:01Z

The single-return-object constraint is indeed real. In other (simpler) contexts I have often used the list() method but its lack of elegance is really stark. Given how refined a class structure we built (see below), it seems options 2 or 3 above are more natural?

…Iter$concat()` Plumb through `BlockwiseTableIter$conat()` and `BlockwiseTableIter$private$soma_reader_transform()` Slight rejiggering of `read_next()` to avoid multiple `$read_complete()` checks Improve `BlockwiseReadIterBase$read_next()` checks


eddelbuettel · 2024-04-03T22:20:41Z

@mojaveazure Thanks for the rebase, that was on my TODO as well but it has been a busy day. Looks like we inherited some good state from main which is nice.


johnkerl · 2024-04-04T14:21:42Z

Regarding #2152 (review), we were waiting on:

Having the reindexer in this PR, or another;
Redness of CI

As per discussion above we decided collectively that the reindexer will be a follow-on PR, and the CI issue has been resolved via #2363 which has been merged to main and which this current PR is now rebased on top of.

So I believe we are good to merge this PR.

Connect the re-indexer to the blockwise iterator, allowing reads to be re-indexed on-the-fly.

This PR parallels #1792 and completes #2152 and #2637; in addition, it provides new shorthand for `reindex_disable_on_axis`:
- `TRUE`: disable re-indexing on all axes
- `FALSE`: re-index on all axes
- `NA`: re-index only on major axis, disable re-indexing on all minor axes (default)

`BlockwiseTableReadIter$concat()` and `BlockwiseSparseReadIter$concat()` are disabled when re-indexing is requested (paralleling Python)

`BlockwiseSparseReadIter` now accepts `repr = "R"` or `repr = "C"` under certain circumstances:
- axis 0 (`soma_dim_0`) must be re-indexed to allow `repr = "R"`
- axis 1 (`soma_dim_1`) must be re-indexed to allow `repr = "C"`

`repr` of `"T"` is allowed in all circumstances and continues to be the default

Two new fields are available to blockwise iterators:
- `$axes_to_reindex`: a vector of minor axes slated to be re-indexed
- `$reindexable`: status indicator stating if _any_ axis (major or minor) is slated to be re-indexed

resolves #2671

Co-authored-by: Paul Hoffman <mojaveazure@users.noreply.github.com>
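The shorthand and the `repr` constraints described in this commit message can be sketched as two small base-R functions. Both `resolve_disabled_axes()` and `repr_allowed()` are hypothetical names, not the package's implementation, and the sketch assumes that `NA` means "re-index the major axis only".

```r
# Hypothetical helpers; names and return values are illustrative only.
# Returns the axes (0-based) on which re-indexing is disabled.
resolve_disabled_axes <- function(reindex_disable_on_axis = NA,
                                  axis = 0L, ndim = 2L) {
  all_axes <- seq_len(ndim) - 1L
  if (isTRUE(reindex_disable_on_axis)) {
    return(all_axes)                   # TRUE: disable everywhere
  }
  if (isFALSE(reindex_disable_on_axis)) {
    return(integer(0))                 # FALSE: re-index every axis
  }
  if (length(reindex_disable_on_axis) == 1L && is.na(reindex_disable_on_axis)) {
    return(setdiff(all_axes, axis))    # NA (assumed): only the major axis is re-indexed
  }
  as.integer(reindex_disable_on_axis)  # explicit vector of axes
}

# "R" needs axis 0 re-indexed, "C" needs axis 1 re-indexed, "T" is always allowed
repr_allowed <- function(repr, disabled_axes) {
  switch(repr,
    T = TRUE,
    R = !(0L %in% disabled_axes),
    C = !(1L %in% disabled_axes),
    stop("unknown repr: ", repr)
  )
}

disabled <- resolve_disabled_axes(NA, axis = 0L)
repr_allowed("R", disabled)  # axis 0 is re-indexed under the default
repr_allowed("C", disabled)  # axis 1 is not
```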

mojaveazure added the r-api label Feb 17, 2024

mojaveazure assigned eddelbuettel and mojaveazure Feb 17, 2024

mojaveazure force-pushed the ph/feat/blockwise-reader branch from b019048 to 0a4cd6c Compare February 19, 2024 23:24

johnkerl changed the title ~~[r] [WIP] Blockwise Reader~~ [r] [WIP] Blockwise reader Feb 21, 2024

eddelbuettel reviewed Feb 27, 2024

apis/r/R/BlockwiseIter.R (outdated)

eddelbuettel reviewed Feb 27, 2024

apis/r/R/utils-readerTransformers.R (outdated)

mojaveazure force-pushed the ph/feat/blockwise-reader branch from 0a5ac3b to 15f50f9 Compare March 1, 2024 23:05

mojaveazure changed the title ~~[r] [WIP] Blockwise reader~~ [r] Port blockwise iterator/reader to R Mar 4, 2024

mojaveazure marked this pull request as ready for review March 4, 2024 21:10

mojaveazure requested review from mlin, aaronwolen, johnkerl and bkmartinjr March 4, 2024 21:11

mojaveazure force-pushed the ph/feat/blockwise-reader branch from a1b24a5 to f975c61 Compare March 7, 2024 22:22

mlin reviewed Mar 8, 2024


mojaveazure requested a review from pablo-gar March 11, 2024 18:58

mojaveazure and others added 17 commits April 3, 2024 16:12

Use new utility function shared w/ blockwise table iter

bfacfd0

Update docs

ba5c250

Fix typo

63b37da

Amazingly, I can't spell 'array' properly 🤦

Move $reset() and $set_dim_points() to `BlockwiseReadIterBase$private`

5e5ea8f

Update docs

Add $length() method for CoordsStrider

22b2286

sparse (via new length()) and concat for blockwise read iters

bb90d34

Move iterators and itertools to Suggests

50fc01b

Delay registration of `nextElem.CoordsStrider()` and `hasNext.CoordsStrider()`

New tests

535737b

Update docs

f6e37e5

Switch to snake-case for CoordsStrider$has_next() and `CoordsStrider$next_element()`

03b6f63

Make tests more explicit with expect_s3_class()/expect_s4_class()

664d5e0

Clean up soma_array_to_sparse_matrix_concat()

a276084

Have `SparseReadIter$concat()` use new helper function

Fix typo

57dd698

Add error messages to stopifnot() calls

f801834

Correct typo spotted by @mlin

143f225

Fix rebase errors

a1f726d

mojaveazure force-pushed the ph/feat/blockwise-reader branch from f4d27c1 to a1f726d Compare April 3, 2024 20:19

johnkerl self-requested a review April 3, 2024 20:23

johnkerl approved these changes Apr 4, 2024


Update changelog

a2778d5

Bump develop version [ci skip]

mojaveazure merged commit d34a341 into main Apr 4, 2024

mojaveazure deleted the ph/feat/blockwise-reader branch April 4, 2024 14:22

mojaveazure mentioned this pull request Jun 14, 2024

[r] Connect re-indexer to blockwise iterator #2742

Merged
