Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[c++/python] Add SOMAArray::stats to get query stats #1927

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

nguyenv
Copy link
Member

@nguyenv nguyenv commented Nov 20, 2023

Issue and/or context:

Requested by @johnkerl. Context: #1932.

Changes:

  • Addition of SOMAArray:;stats which calls tiledb::Query::stats to get stats of query run
  • Add SOMAArray.stats Python binding

Copy link
Member

@johnkerl johnkerl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking as 'request changes' solely to put a 'pause' on this while discussing with @KiterLuc

My concerns -- not just about this PR; more to make sure this is really what the limitations are with core and that we can't do anything about it in TileDB-SOMA:

  • My mental model of how stats are supposed to work is
    • (a) stats_enable
    • (b) do something
    • (c) do something else
    • (d,e,f) make all sorts of nested whatever
    • (g) stats_dump
    • This produces one dump for whatever core accesses were done through all of points b-f

What appears to be the case now is this is broken, in the core API, in three ways:

  • The final stats_dump does not include query->stats() output for various instances of query -- some things appear in the final dump, others in the "side data" from a single query
  • We have to do exp.ms["RNA"].X["data"].stats() separately
  • Even then: on this PR as-is, we get the results of one query -- so in the case of X.read().concat() that takes three paginations, the stats will only have the results of the last query

If the core stats API is really such that we have no choice in TileDB-SOMA, we need a "caveat emptor" somewhere that stats calls will need to be inserted inside each iteration of X.read(), in particular, code will need to be modified in order to be profiled correctly.

@eddelbuettel
Copy link
Member

eddelbuettel commented Nov 20, 2023

At the risk of getting this wrong, there are actually (AFAIK) two APIs. The one you describe where we globally enable and collect. But there is also a second one, just on the query engine. TileDB-R calls the 'getter' if a toggle is on, and this functions independently of the global state. Mini example below the fold. And it seems to me that the PR hits the same accessor and should therefor work the same way.

> uri <- "/tmp/tiledb/penguins"
> res <- tiledb_array(uri, return_as="data.table", query_statistics=TRUE)[]    # ask for query stats
> stats <- attr(res, "query_statistics")      # we return JSON string as attribute of the return object
> str( jsonlite::fromJSON(stats) )
List of 2
 $ timers  :List of 36
  ..$ Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum        : num 0.000301
  ..$ Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg        : num 1e-04
  ..$ Context.StorageManager.Query.Reader.unfilter_attr_tiles.sum         : num 0.00264
  ..$ Context.StorageManager.Query.Reader.unfilter_attr_tiles.avg         : num 0.00033
  ..$ Context.StorageManager.Query.Reader.tile_offset_sizes.sum           : num 4.41e-06
  ..$ Context.StorageManager.Query.Reader.tile_offset_sizes.avg           : num 4.41e-06
  ..$ Context.StorageManager.Query.Reader.resize_fixed_results_to_copy.sum: num 1.85e-06
  ..$ Context.StorageManager.Query.Reader.resize_fixed_results_to_copy.avg: num 1.85e-06
  ..$ Context.StorageManager.Query.Reader.read_tiles.sum                  : num 0.000289
  ..$ Context.StorageManager.Query.Reader.read_tiles.avg                  : num 7.23e-05
  ..$ Context.StorageManager.Query.Reader.read_coordinate_tiles.sum       : num 0.000115
  ..$ Context.StorageManager.Query.Reader.read_coordinate_tiles.avg       : num 5.74e-05
  ..$ Context.StorageManager.Query.Reader.read_attribute_tiles.sum        : num 0.000183
  ..$ Context.StorageManager.Query.Reader.read_attribute_tiles.avg        : num 9.17e-05
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_coords.sum    : num 0.00045
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_coords.avg    : num 0.00045
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_attributes.sum: num 0.00284
  ..$ Context.StorageManager.Query.Reader.read_and_unfilter_attributes.avg: num 0.00284
  ..$ Context.StorageManager.Query.Reader.process_tiles.sum               : num 0.00323
  ..$ Context.StorageManager.Query.Reader.process_tiles.avg               : num 0.00323
  ..$ Context.StorageManager.Query.Reader.load_tile_var_sizes.sum         : num 3.21e-06
  ..$ Context.StorageManager.Query.Reader.load_tile_var_sizes.avg         : num 3.21e-06
  ..$ Context.StorageManager.Query.Reader.load_tile_offsets.sum           : num 0.000583
  ..$ Context.StorageManager.Query.Reader.load_tile_offsets.avg           : num 0.000194
  ..$ Context.StorageManager.Query.Reader.load_initial_data.sum           : num 0.000266
  ..$ Context.StorageManager.Query.Reader.load_initial_data.avg           : num 0.000266
  ..$ Context.StorageManager.Query.Reader.dowork.sum                      : num 0.00469
  ..$ Context.StorageManager.Query.Reader.dowork.avg                      : num 0.00469
  ..$ Context.StorageManager.Query.Reader.create_result_tiles.sum         : num 3.55e-06
  ..$ Context.StorageManager.Query.Reader.create_result_tiles.avg         : num 3.55e-06
  ..$ Context.StorageManager.Query.Reader.copy_fixed_data_tiles.sum       : num 0.000338
  ..$ Context.StorageManager.Query.Reader.copy_fixed_data_tiles.avg       : num 3.76e-05
  ..$ Context.StorageManager.Query.Reader.compute_tile_bitmaps.sum        : num 8.19e-05
  ..$ Context.StorageManager.Query.Reader.compute_tile_bitmaps.avg        : num 8.19e-05
  ..$ Context.StorageManager.Query.Reader.apply_query_condition.sum       : num 5.58e-07
  ..$ Context.StorageManager.Query.Reader.apply_query_condition.avg       : num 5.58e-07
 $ counters:List of 12
  ..$ Context.StorageManager.Query.Reader.result_num              : int 344
  ..$ Context.StorageManager.Query.Reader.read_unfiltered_byte_num: int 16856
  ..$ Context.StorageManager.Query.Reader.num_tiles_read          : int 9
  ..$ Context.StorageManager.Query.Reader.loop_num                : int 1
  ..$ Context.StorageManager.Query.Reader.internal_loop_num       : int 1
  ..$ Context.StorageManager.Query.Reader.ignored_tiles           : int 0
  ..$ Context.StorageManager.Query.Reader.dim_num                 : int 1
  ..$ Context.StorageManager.Query.Reader.dim_fixed_num           : int 1
  ..$ Context.StorageManager.Query.Reader.cell_num                : int 4128
  ..$ Context.StorageManager.Query.Reader.attr_num                : int 8
  ..$ Context.StorageManager.Query.Reader.attr_nullable_num       : int 5
  ..$ Context.StorageManager.Query.Reader.attr_fixed_num          : int 8
> 

@johnkerl
Copy link
Member

johnkerl commented Nov 20, 2023

Thanks @eddelbuettel

We're agreed there are two APIs, one on the global state, and the other on the query

Above, I enumerated three concerns with that status quo (a status quo which I don't much like)

My articulated first choice/hope is that we can find a way to simply get all the stats through one interface

My articulated second choice/hope is that we need to communicate to all users of this that there are two paths now, and furthermore, that .concat() queries will need to be unrolled with stats() calls inserted within each loop iteration in order to get non-misleading results

@nguyenv
Copy link
Member Author

nguyenv commented Nov 20, 2023

I will hold off on making additional changes for now.

@codecov-commenter
Copy link

codecov-commenter commented Nov 20, 2023

Codecov Report

Merging #1927 (8cee72f) into main (1ebc379) will not change coverage.
The diff coverage is n/a.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1927   +/-   ##
=======================================
  Coverage   89.76%   89.76%           
=======================================
  Files          34       34           
  Lines        3597     3597           
=======================================
  Hits         3229     3229           
  Misses        368      368           
Flag Coverage Δ
python 89.76% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
python_api 89.76% <ø> (ø)
libtiledbsoma ∅ <ø> (∅)

@johnkerl johnkerl self-requested a review November 21, 2023 22:12
@johnkerl
Copy link
Member

johnkerl commented Nov 21, 2023

I will hold off on making additional changes for now.

@nguyenv I think what we talked about in Slack will work, namely:

  • Take what you have now
  • Wire it up through read methods in the DataFrame, SparseNDArray, and DenseNDArray classes' read methods as an optional, backward-compatible argument
  • This will automatically solve the .concat problem: anyone who wants per-query stats will need to access them on each read(), which is a bit of a limitation, but one we're stuck with, and the beauty of this is that this naturally forces the caller to get stats on each read() iteration -- there's zero risk of a 3-page .read().concat() returning only the stats for the 3rd page
  • The only open question is how the user will get the stats-string
    • Option 1: they don't -- we unconditionally print it out to stdout or stderr (hard-coded)
      • Not great, but, simple API
    • Option 2: optarg to read takes a stream -- e.g. read(..., stats_stream=os.stdout, ...) or read(..., stats_stream=handle_they_have_opened, ...)
    • Option 3: optarg to read is read(..., provided_stats=True) and the caller will need to do string = X.get_stats() after every .read()`
    • Other options I'm missing -- ? 🤔

Also: this is a non-urgent PR. However, tiledb stats are basic to have, and important for supportability, so at some point we need to figure out a way to let people access query stats.

@nguyenv
Copy link
Member Author

nguyenv commented Nov 21, 2023

I like option 3 because it's most similar to how the global stats work where first you enable stats and then get the stats in another call.

@eddelbuettel
Copy link
Member

Had the some instinctive reaction yesterday when that came up: one query statistic dump per read attempt during the iterations, resulting in one json string per read attempt. If the user is iterating over reads, the burden may be in the reader to fetch the individual strings which should work both R and Python. And just how concat() builds up the overall result data structure it should be able to construct a vector (or list) of json strings.

@johnkerl
Copy link
Member

@nguyenv new info on #1932 ... I think there's a straight-up bug here (I don't know where) and we should not be having to create a separate per-query stats accessor. Details on #1932 but TL;DR query stats do appear in global stats for the same data & same query in TileDB-Py and TileDB-SOMA-R, just not in TileDB-SOMA-Py ...

@nguyenv
Copy link
Member Author

nguyenv commented Dec 4, 2023

Some observations:

If we open a SOMAArray and take global stats, then we get the query stats:

    clib.Stats.enable()
    sr = clib.SOMAArray(tmp_path.as_posix())
    sr.read_next()
    print(clib.Stats.dump())
...
      "Context.StorageManager.Query.Reader.unfilter_coord_tiles.sum": 0.0030568,
      "Context.StorageManager.Query.Reader.unfilter_coord_tiles.avg": 0.0007642,
      "Context.StorageManager.Query.Reader.tile_offset_sizes.sum": 8.67e-05,
...

If we open a soma.DataFrame and take global stats, then we don't get the query stats:

    soma.pytiledbsoma.Stats.enable()
    with soma.DataFrame.open(tmp_path.as_posix()) as sidf:
        sidf.read().concat()
    print(soma.pytiledbsoma.Stats.dump())

This also does not get us the query stats (where sidf._soma_reader() is a clib.SOMAArray):

    with soma.DataFrame.open(tmp_path.as_posix()) as sidf:
        sidf._soma_reader().read_next()

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants