[c++/python] Add `SOMAArray::stats` to get query stats #1927

nguyenv · 2023-11-20T18:28:37Z

Issue and/or context:

Requested by @johnkerl. Context: #1932.

Changes:

Addition of SOMAArray:;stats which calls tiledb::Query::stats to get stats of query run
Add SOMAArray.stats Python binding

johnkerl

Marking as 'request changes' solely to put a 'pause' on this while discussing with @KiterLuc

My concerns -- not just about this PR; more to make sure this is really what the limitations are with core and that we can't do anything about it in TileDB-SOMA:

My mental model of how stats are supposed to work is
- (a) stats_enable
- (b) do something
- (c) do something else
- (d,e,f) make all sorts of nested whatever
- (g) stats_dump
- This produces one dump for whatever core accesses were done through all of points b-f

What appears to be the case now is this is broken, in the core API, in three ways:

The final stats_dump does not include query->stats() output for various instances of query -- some things appear in the final dump, others in the "side data" from a single query
We have to do exp.ms["RNA"].X["data"].stats() separately
Even then: on this PR as-is, we get the results of one query -- so in the case of X.read().concat() that takes three paginations, the stats will only have the results of the last query

If the core stats API is really such that we have no choice in TileDB-SOMA, we need a "caveat emptor" somewhere that stats calls will need to be inserted inside each iteration of X.read(), in particular, code will need to be modified in order to be profiled correctly.

eddelbuettel · 2023-11-20T18:50:18Z

At the risk of getting this wrong, there are actually (AFAIK) two APIs. The one you describe where we globally enable and collect. But there is also a second one, just on the query engine. TileDB-R calls the 'getter' if a toggle is on, and this functions independently of the global state. Mini example below the fold. And it seems to me that the PR hits the same accessor and should therefor work the same way.

> uri <- "/tmp/tiledb/penguins"
> res <- tiledb_array(uri, return_as="data.table", query_statistics=TRUE)[]    # ask for query stats
> stats <- attr(res, "query_statistics")      # we return JSON string as attribute of the return object
> str( jsonlite::fromJSON(stats) )
List of 2
 $ timers  :List of 36
  ..$ Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum        : num 0.000301
  ..$ Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg        : num 1e-04
  ..$ Context.StorageManager.Query.Reader.unfilter_attr_tiles.sum         : num 0.00264
  ..$ Context.StorageManager.Query.Reader.unfilter_attr_tiles.avg         : num 0.00033
  ..$ Context.StorageManager.Query.Reader.tile_offset_sizes.sum           : num 4.41e-06
  ..$ Context.StorageManager.Query.Reader.tile_offset_sizes.avg           : num 4.41e-06
  ..$ Context.StorageManager.Query.Reader.resize_fixed_results_to_copy.sum: num 1.85e-06
  ..$ Context.StorageManager.Query.Reader.resize_fixed_results_to_copy.avg: num 1.85e-06
  ..$ Context.StorageManager.Query.Reader.read_tiles.sum                  : num 0.000289
  ..$ Context.StorageManager.Query.Reader.read_tiles.avg                  : num 7.23e-05
  ..$ Context.StorageManager.Query.Reader.read_coordinate_tiles.sum       : num 0.000115
  ..$ Context.StorageManager.Query.Reader.read_coordinate_tiles.avg       : num 5.74e-05
  ..$ Context.StorageManager.Query.Reader.read_attribute_tiles.sum        : num 0.000183
  ..$ Context.StorageManager.Query.Reader.read_attribute_tiles.avg        : num 9.17e-05
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_coords.sum    : num 0.00045
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_coords.avg    : num 0.00045
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_attributes.sum: num 0.00284
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_attributes.avg: num 0.00284
  ..$ Context.StorageManager.Query.Reader.process_tiles.sum               : num 0.00323
  ..$ Context.StorageManager.Query.Reader.process_tiles.avg               : num 0.00323
  ..$ Context.StorageManager.Query.Reader.load_tile_var_sizes.sum         : num 3.21e-06
  ..$ Context.StorageManager.Query.Reader.load_tile_var_sizes.avg         : num 3.21e-06
  ..$ Context.StorageManager.Query.Reader.load_tile_offsets.sum           : num 0.000583
  ..$ Context.StorageManager.Query.Reader.load_tile_offsets.avg           : num 0.000194
  ..$ Context.StorageManager.Query.Reader.load_initial_data.sum           : num 0.000266
  ..$ Context.StorageManager.Query.Reader.load_initial_data.avg           : num 0.000266
  ..$ Context.StorageManager.Query.Reader.dowork.sum                      : num 0.00469
  ..$ Context.StorageManager.Query.Reader.dowork.avg                      : num 0.00469
  ..$ Context.StorageManager.Query.Reader.create_result_tiles.sum         : num 3.55e-06
  ..$ Context.StorageManager.Query.Reader.create_result_tiles.avg         : num 3.55e-06
  ..$ Context.StorageManager.Query.Reader.copy_fixed_data_tiles.sum       : num 0.000338
  ..$ Context.StorageManager.Query.Reader.copy_fixed_data_tiles.avg       : num 3.76e-05
  ..$ Context.StorageManager.Query.Reader.compute_tile_bitmaps.sum        : num 8.19e-05
  ..$ Context.StorageManager.Query.Reader.compute_tile_bitmaps.avg        : num 8.19e-05
  ..$ Context.StorageManager.Query.Reader.apply_query_condition.sum       : num 5.58e-07
  ..$ Context.StorageManager.Query.Reader.apply_query_condition.avg       : num 5.58e-07
 $ counters:List of 12
  ..$ Context.StorageManager.Query.Reader.result_num              : int 344
  ..$ Context.StorageManager.Query.Reader.read_unfiltered_byte_num: int 16856
  ..$ Context.StorageManager.Query.Reader.num_tiles_read          : int 9
  ..$ Context.StorageManager.Query.Reader.loop_num                : int 1
  ..$ Context.StorageManager.Query.Reader.internal_loop_num       : int 1
  ..$ Context.StorageManager.Query.Reader.ignored_tiles           : int 0
  ..$ Context.StorageManager.Query.Reader.dim_num                 : int 1
  ..$ Context.StorageManager.Query.Reader.dim_fixed_num           : int 1
  ..$ Context.StorageManager.Query.Reader.cell_num                : int 4128
  ..$ Context.StorageManager.Query.Reader.attr_num                : int 8
  ..$ Context.StorageManager.Query.Reader.attr_nullable_num       : int 5
  ..$ Context.StorageManager.Query.Reader.attr_fixed_num          : int 8
>

johnkerl · 2023-11-20T18:54:04Z

Thanks @eddelbuettel

We're agreed there are two APIs, one on the global state, and the other on the query

Above, I enumerated three concerns with that status quo (a status quo which I don't much like)

My articulated first choice/hope is that we can find a way to simply get all the stats through one interface

My articulated second choice/hope is that we need to communicate to all users of this that there are two paths now, and furthermore, that .concat() queries will need to be unrolled with stats() calls inserted within each loop iteration in order to get non-misleading results

nguyenv · 2023-11-20T19:07:44Z

I will hold off on making additional changes for now.

codecov-commenter · 2023-11-20T19:14:12Z

Codecov Report

Merging #1927 (8cee72f) into main (1ebc379) will not change coverage.
The diff coverage is n/a.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #1927   +/-   ##
=======================================
  Coverage   89.76%   89.76%           
=======================================
  Files          34       34           
  Lines        3597     3597           
=======================================
  Hits         3229     3229           
  Misses        368      368

Flag	Coverage Δ
python	`89.76% <ø> (ø)`

Flags with carried forward coverage won't be shown. Click here to find out more.

Components	Coverage Δ
python_api	`89.76% <ø> (ø)`
libtiledbsoma	`∅ <ø> (∅)`

johnkerl · 2023-11-21T22:18:44Z

I will hold off on making additional changes for now.

@nguyenv I think what we talked about in Slack will work, namely:

Take what you have now
Wire it up through read methods in the DataFrame, SparseNDArray, and DenseNDArray classes' read methods as an optional, backward-compatible argument
This will automatically solve the .concat problem: anyone who wants per-query stats will need to access them on each read(), which is a bit of a limitation, but one we're stuck with, and the beauty of this is that this naturally forces the caller to get stats on each read() iteration -- there's zero risk of a 3-page .read().concat() returning only the stats for the 3rd page
The only open question is how the user will get the stats-string
- Option 1: they don't -- we unconditionally print it out to stdout or stderr (hard-coded)
  - Not great, but, simple API
- Option 2: optarg to read takes a stream -- e.g. read(..., stats_stream=os.stdout, ...) or read(..., stats_stream=handle_they_have_opened, ...)
- Option 3: optarg to read is read(..., provided_stats=True) and the caller will need to do string = X.get_stats() after every .read()`
- Other options I'm missing -- ? 🤔

Also: this is a non-urgent PR. However, tiledb stats are basic to have, and important for supportability, so at some point we need to figure out a way to let people access query stats.

nguyenv · 2023-11-21T22:29:09Z

I like option 3 because it's most similar to how the global stats work where first you enable stats and then get the stats in another call.

eddelbuettel · 2023-11-21T22:31:32Z

Had the some instinctive reaction yesterday when that came up: one query statistic dump per read attempt during the iterations, resulting in one json string per read attempt. If the user is iterating over reads, the burden may be in the reader to fetch the individual strings which should work both R and Python. And just how concat() builds up the overall result data structure it should be able to construct a vector (or list) of json strings.

johnkerl · 2023-11-23T00:58:05Z

@nguyenv new info on #1932 ... I think there's a straight-up bug here (I don't know where) and we should not be having to create a separate per-query stats accessor. Details on #1932 but TL;DR query stats do appear in global stats for the same data & same query in TileDB-Py and TileDB-SOMA-R, just not in TileDB-SOMA-Py ...

nguyenv · 2023-12-04T15:51:06Z

Some observations:

If we open a SOMAArray and take global stats, then we get the query stats:

    clib.Stats.enable()
    sr = clib.SOMAArray(tmp_path.as_posix())
    sr.read_next()
    print(clib.Stats.dump())

...
      "Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum": 0.0030568,
      "Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg": 0.0007642,
      "Context.StorageManager.Query.Reader.tile_offset_sizes.sum": 8.67e-05,
...

If we open a soma.DataFrame and take global stats, then we don't get the query stats:

    soma.pytiledbsoma.Stats.enable()
    with soma.DataFrame.open(tmp_path.as_posix()) as sidf:
        sidf.read().concat()
    print(soma.pytiledbsoma.Stats.dump())

This also does not get us the query stats (where sidf._soma_reader() is a clib.SOMAArray):

    with soma.DataFrame.open(tmp_path.as_posix()) as sidf:
        sidf._soma_reader().read_next()

[c++/python] Add SOMAArray::stats to get query stats

ab5712b

johnkerl requested changes Nov 20, 2023

View reviewed changes

Formatting

8cee72f

johnkerl self-requested a review November 21, 2023 22:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[c++/python] Add `SOMAArray::stats` to get query stats #1927

[c++/python] Add `SOMAArray::stats` to get query stats #1927

nguyenv commented Nov 20, 2023 •

edited by johnkerl

Loading

johnkerl left a comment •

edited

Loading

eddelbuettel commented Nov 20, 2023 •

edited

Loading

johnkerl commented Nov 20, 2023 •

edited

Loading

nguyenv commented Nov 20, 2023

codecov-commenter commented Nov 20, 2023 •

edited

Loading

johnkerl commented Nov 21, 2023 •

edited

Loading

nguyenv commented Nov 21, 2023

eddelbuettel commented Nov 21, 2023

johnkerl commented Nov 23, 2023

nguyenv commented Dec 4, 2023 •

edited

Loading

[c++/python] Add SOMAArray::stats to get query stats #1927

Are you sure you want to change the base?

[c++/python] Add SOMAArray::stats to get query stats #1927

Conversation

nguyenv commented Nov 20, 2023 • edited by johnkerl Loading

johnkerl left a comment • edited Loading

Choose a reason for hiding this comment

eddelbuettel commented Nov 20, 2023 • edited Loading

johnkerl commented Nov 20, 2023 • edited Loading

nguyenv commented Nov 20, 2023

codecov-commenter commented Nov 20, 2023 • edited Loading

Codecov Report

johnkerl commented Nov 21, 2023 • edited Loading

nguyenv commented Nov 21, 2023

eddelbuettel commented Nov 21, 2023

johnkerl commented Nov 23, 2023

nguyenv commented Dec 4, 2023 • edited Loading

[c++/python] Add `SOMAArray::stats` to get query stats #1927

[c++/python] Add `SOMAArray::stats` to get query stats #1927

nguyenv commented Nov 20, 2023 •

edited by johnkerl

Loading

johnkerl left a comment •

edited

Loading

eddelbuettel commented Nov 20, 2023 •

edited

Loading

johnkerl commented Nov 20, 2023 •

edited

Loading

codecov-commenter commented Nov 20, 2023 •

edited

Loading

johnkerl commented Nov 21, 2023 •

edited

Loading

nguyenv commented Dec 4, 2023 •

edited

Loading