Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[r] Port blockwise iterator/reader to R #2152

Merged
merged 35 commits into from
Apr 4, 2024
Merged

Conversation

mojaveazure
Copy link
Member

@mojaveazure mojaveazure commented Feb 17, 2024

Implement the blockwise iterator and reader for the R API

This PR parallels #1792; it implements new classes for blockwise iteration through a SOMA sparse nd-array. Blockwise iteration is implemented through SOMASparseNDArrayRead$blockwise() (paralleling the Python implenetation) and enabled for Arrow tables ($blockwise()$tables()) and COO sparse matrices ($blockwise()$sparse_matrix())

New classes:

  • CoordsStrider: new class to iterate through coordinate similar to Python's _coords_strider
  • SOMASparseNDArrayReadBase: base class for sparse array reads
  • SOMASparseNDArrayBlockwiseRead: new reader class for blockwise iterated reads
  • BlockwiseReadIterBase: base class for blockwise iteration
  • BlockwiseTableReadIter: blockwise iterator returning Arrow tables
  • BlockwiseSparseReadIter: blockwise iterator returning sparse matrices

New SOMA methods:

  • SOMASparseNDArrayRead$blockwse(): perform a blockwise read

resolves #1853

Copy link

codecov bot commented Feb 17, 2024

Codecov Report

Merging #2152 (a1f726d) into main (7b7f81e) will increase coverage by 4.26%.
Report is 1 commits behind head on main.
The diff coverage is 38.50%.

❗ Current head a1f726d differs from pull request most recent head a2778d5. Consider uploading reports for the commit a2778d5 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2152      +/-   ##
==========================================
+ Coverage   65.46%   69.73%   +4.26%     
==========================================
  Files         143       55      -88     
  Lines       12805     4794    -8011     
  Branches      510        0     -510     
==========================================
- Hits         8383     3343    -5040     
+ Misses       4334     1451    -2883     
+ Partials       88        0      -88     
Flag Coverage Δ
libtiledbsoma ?
python ?
r 69.73% <38.50%> (-2.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api ∅ <ø> (∅)
libtiledbsoma ∅ <ø> (∅)

@johnkerl johnkerl changed the title [r] [WIP] Blockwise Reader [r] [WIP] Blockwise reader Feb 21, 2024
# return(NULL)
#}
message("blockwise read next")
if (is.null(private$soma_reader_pointer)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch.

@aaronwolen
Copy link
Member

[sc-42211]

Copy link

This pull request has been linked to Shortcut Story #42211: [r] Port blockwise sparse iterator from Python to R.

@mojaveazure mojaveazure changed the title [r] [WIP] Blockwise reader [r] Port blockwise iterator/reader to R Mar 4, 2024
@mojaveazure mojaveazure marked this pull request as ready for review March 4, 2024 21:10
"'size' must be a single integer value" = is.null(size) ||
rlang::is_integerish(size, 1L, finite = TRUE) ||
(inherits(size, 'integer64') && length(size) == 1L && is.finite(size)),
"'reindex_disable_on_axis' must be avector of integers" = is.null(reindex_disable_on_axis) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"'reindex_disable_on_axis' must be avector of integers" = is.null(reindex_disable_on_axis) ||
"'reindex_disable_on_axis' must be a vector of integers" = is.null(reindex_disable_on_axis) ||

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mlin, corrected in 839411a

@mlin
Copy link
Member

mlin commented Mar 8, 2024

@mojaveazure @eddelbuettel Thanks for all the work on this! One high-level question at this stage:

The Python .blockwise().scipy() iterator returns a csr_matrix (when stepping by the rows) or a csc_matrix (stepping by the columns, less common) where here it looks like we're always returning a COO (dgTmatrix). I personally don't have a strong opinion that the R version needs to reflect that exactly; but I thought we should at least discuss it here.

In Python (scipy) coo_matrix doesn't support all the operations that csr_matrix does, vector math ops in particular; so the latter is usually a little more convenient for the user.

However, also in Python, the internal representation of csr_matrix unfortunately uses memory proportional to the largest row index, not the number of actual rows. The result is that the reindexing feature is practically necessary for large datasets, to keep the memory usage of the returned csr_matrix objects under control. The reindexing is also practically helpful just to know what you're getting in each block.

I emphasized in Python because of course I don't know if we're working under the same design constraints in R.

@eddelbuettel
Copy link
Contributor

@mlin There is a (currently unused) argument repr for the sparse matrix layout. We use 'T' (Matrix code for COO, internally called 'dgTMatrix') as it is the format on disc. The very simple change

modified   apis/r/R/BlockwiseIter.R
@@ -193,7 +193,7 @@ BlockwiseSparseReadIter <- R6::R6Class(
       coords,
       axis,
       ...,
-      repr = "T",
+      repr = c("T","R","C"),
       reindex_disable_on_axis = NULL
     ) {
       super$initialize(

permits return also of, respectively, a row- or column-compressed variant (dgRMatrix or dgCMatrix) corresponding to those compression formats.

@mojaveazure
Copy link
Member Author

@mlin I would vote for only returning a COO matrix. The existing SparseReadIter class hardcodes a return representation of COO and the current blockwise behavior is in-line with that https://github.com/single-cell-data/TileDB-SOMA/blob/main/apis/r/R/SparseReadIter.R#L20-L31

I can see a case for removing the repr parameter from the blockwise side, espeically since the existing SparseReadIter does not have a repr parameter, but until we get alternative representations flowing through the existing SparseReadIter I think the blockwise should only return COO

@mojaveazure
Copy link
Member Author

Also, @eddelbuettel's fix isn't as simple as expanding the allowed values to repr; it needs to be plumbed through and constantly checked for (Matrix has a bad habit of silently changing the representations based on what it deems most efficient; for example, conat() uses + to grow the resulting matrix, and row-based growth often results in CSC/CsparseMatrix getting cast to CSR/RsparseMatrix)

@eddelbuettel
Copy link
Contributor

(Well it passed the unit test where we do $read_next()$concat() -- that appears to wire through.)

But thumbs for 'simpler is better'. COO seems fine as default.

@mlin
Copy link
Member

mlin commented Mar 8, 2024

@mojaveazure @eddelbuettel Ok, no objection from me to COO-only in principle, subject to:

What are we thinking about reindexing? As I mentioned, in Python it's practically essential, albeit because of an implementation detail in csr_matrix.

Besides that peculiarity, it's still useful for the iterator to tell you the range of the major axis (the one being strided) you're getting in each block. Reindexing is one, not the only, way of providing that info. Without, imagine getting the sparse matrix with a huge shape, and you know only one small block/stripe in it is populated, but you don't know where that is.

Then the minor axis- there are some use cases for which it is and isn't helpful to reindex that too. I'm less concerned about that.

@mojaveazure
Copy link
Member Author

Reindexing would be useful, and essential to offering CSC/CSR output in R as well. However, that has yet to be ported to R, so I don't think we can offer that now.

As for knowing the range of the major axis, that poses different problem in R, and one I hadn't thought of. R does not allow tuple outputs, and has no native unpacking like Python. The ways around this that I can think of are:

  • returning a list with the values in one entry and the major-axis indices in another (list(block = read_next(), indices = indices, axis = axis)); this would change the return type and make it harder to move code from the existing iterators to the blockwise iterator
  • store the indices as an attribute on the return values; this would allow the return type of the blockwise iterators to be the same as the existing iterators
block <- read_next()
attr(block, "indices") <- indices
attr(block, "axis") <- axis
  • offer a mode to extract just the indices returned
block <- read_next()
if (axis == 0) {
  return(block[indices, ])
else {
  return(block[, indices])
}

As for minor-axis, this PR currently doesn't do anything there as reindexing is not a part of this PR

@aaronwolen
Copy link
Member

As we discussed, overall this looks great!

I do think we need to think carefully about @mlin's comment:

...it's still useful for the iterator to tell you the range of the major axis (the one being strided) you're getting in each block.

and your suggestion to return a list with the values might be the way to go. Perhaps @bkmartinjr and/or @pablo-gar could weigh in since this is used in the cellxgene-census package.

@eddelbuettel
Copy link
Contributor

The single-return-object constraint is indeed real. In other (simpler) contexts I have often used the list() method but its lack of elegance is really stark. Given how refined a class structure we built (see below), it seems options 2 or 3 above are more natural?

tldsm

@johnkerl johnkerl self-requested a review April 3, 2024 20:23
@eddelbuettel
Copy link
Contributor

eddelbuettel commented Apr 3, 2024

@mojaveazure Thanks for the rebase, that was on my TODO as well but it has been a busy day. Looks like we inherited some good state from main which is nice.

Bump develop version

[ci skip]
@johnkerl
Copy link
Member

johnkerl commented Apr 4, 2024

Regarding
#2152 (review)
we were waiting on:

  • Having the reindexer in this PR, or another;
  • Redness of CI

As per discussion above we decided collectively that the reindexer will be a follow-on PR, and the CI issue has been resolved via #2363 which has been merged to main and which this current PR is now rebased on top of.

So I believe we are good to merge this PR.

@mojaveazure mojaveazure merged commit d34a341 into main Apr 4, 2024
@mojaveazure mojaveazure deleted the ph/feat/blockwise-reader branch April 4, 2024 14:22
mojaveazure added a commit that referenced this pull request Jun 14, 2024
Connect the re-indexer to the blockwise iterator, allowing reads to be re-indexed on-the-fly. This PR parallels #1792 and completes #2152 and #2637; in addition, provides new shorthand for `reindex_disable_on_axis`:
 - `TRUE`: disable re-indexing on all axes
 - `FALSE: re-index on all axes
 - `NA`: re-index only on major axis, disable re-indexing on all axes (default)

`BlockwiseTableReadIter$concat()` and `BlockwiseSparseReadIter$concat()` are disabled when re-indexing is requested (paralleling Python)

`BlockwiseSparseReadIter` now accepts `repr = "R"` or `repr = "C"` under certain circumstances:
 - axis 0 (`soma_dim_0`) must be re-indexed to allow `repr = "R"`
 - axis 1 (`soma_dim_1`) must be re-indexed to allow `repr = "C"`

`repr` of `"T"` is allowed in all circumstances and continues to be the default

Two new fields are available to blockwise iterators:
 - `$axes_to_reindex`: a vector of minor axes slated to be re-indexed
 - `$reindexable`: status indicator stating if _any_ axis (major or minor) is slated to be re-indexed

resolves #2671
johnkerl pushed a commit that referenced this pull request Jun 17, 2024
Connect the re-indexer to the blockwise iterator, allowing reads to be re-indexed on-the-fly. This PR parallels #1792 and completes #2152 and #2637; in addition, provides new shorthand for `reindex_disable_on_axis`:
 - `TRUE`: disable re-indexing on all axes
 - `FALSE: re-index on all axes
 - `NA`: re-index only on major axis, disable re-indexing on all axes (default)

`BlockwiseTableReadIter$concat()` and `BlockwiseSparseReadIter$concat()` are disabled when re-indexing is requested (paralleling Python)

`BlockwiseSparseReadIter` now accepts `repr = "R"` or `repr = "C"` under certain circumstances:
 - axis 0 (`soma_dim_0`) must be re-indexed to allow `repr = "R"`
 - axis 1 (`soma_dim_1`) must be re-indexed to allow `repr = "C"`

`repr` of `"T"` is allowed in all circumstances and continues to be the default

Two new fields are available to blockwise iterators:
 - `$axes_to_reindex`: a vector of minor axes slated to be re-indexed
 - `$reindexable`: status indicator stating if _any_ axis (major or minor) is slated to be re-indexed

resolves #2671
github-actions bot pushed a commit that referenced this pull request Jun 17, 2024
Connect the re-indexer to the blockwise iterator, allowing reads to be re-indexed on-the-fly. This PR parallels #1792 and completes #2152 and #2637; in addition, provides new shorthand for `reindex_disable_on_axis`:
 - `TRUE`: disable re-indexing on all axes
 - `FALSE: re-index on all axes
 - `NA`: re-index only on major axis, disable re-indexing on all axes (default)

`BlockwiseTableReadIter$concat()` and `BlockwiseSparseReadIter$concat()` are disabled when re-indexing is requested (paralleling Python)

`BlockwiseSparseReadIter` now accepts `repr = "R"` or `repr = "C"` under certain circumstances:
 - axis 0 (`soma_dim_0`) must be re-indexed to allow `repr = "R"`
 - axis 1 (`soma_dim_1`) must be re-indexed to allow `repr = "C"`

`repr` of `"T"` is allowed in all circumstances and continues to be the default

Two new fields are available to blockwise iterators:
 - `$axes_to_reindex`: a vector of minor axes slated to be re-indexed
 - `$reindexable`: status indicator stating if _any_ axis (major or minor) is slated to be re-indexed

resolves #2671
johnkerl pushed a commit that referenced this pull request Jun 17, 2024
Connect the re-indexer to the blockwise iterator, allowing reads to be re-indexed on-the-fly. This PR parallels #1792 and completes #2152 and #2637; in addition, provides new shorthand for `reindex_disable_on_axis`:
 - `TRUE`: disable re-indexing on all axes
 - `FALSE: re-index on all axes
 - `NA`: re-index only on major axis, disable re-indexing on all axes (default)

`BlockwiseTableReadIter$concat()` and `BlockwiseSparseReadIter$concat()` are disabled when re-indexing is requested (paralleling Python)

`BlockwiseSparseReadIter` now accepts `repr = "R"` or `repr = "C"` under certain circumstances:
 - axis 0 (`soma_dim_0`) must be re-indexed to allow `repr = "R"`
 - axis 1 (`soma_dim_1`) must be re-indexed to allow `repr = "C"`

`repr` of `"T"` is allowed in all circumstances and continues to be the default

Two new fields are available to blockwise iterators:
 - `$axes_to_reindex`: a vector of minor axes slated to be re-indexed
 - `$reindexable`: status indicator stating if _any_ axis (major or minor) is slated to be re-indexed

resolves #2671

Co-authored-by: Paul Hoffman <mojaveazure@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[r] Port blockwise sparse iterator from Python to R
7 participants