Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC on soma_joinid: identifier or index? #164

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mlin
Copy link
Member

@mlin mlin commented May 10, 2023


Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.

**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.
**However**, beyond this primary definition, `soma_joinid` is also used as the row and column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If X is sparse then row and column numbers are not necessarily indices.

Only if it is dense do they necessarily become indices.


> Every `SOMADataFrame` must contain a column called `soma_joinid`, of type `int64` and domain `[0, 2^63-1]`. The `soma_joinid` column contains a unique value for each row in the `SOMADataFrame`, and intended to act as a join key for other objects, such as `SOMASparseNDArray`.

Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the definition of index confusing - there is nothing about an index that requires it to be contiguous, starting at zero/one, etc.

I think you want to separate the concept of an "index" (aka key) from "offset" (in Pandas terminology, you could have a string index, but you always have an integer iloc (offset)).

Using index in this way confuses the DB organization with the in-memory representation.


Notice this defines `soma_joinid` merely as an *identifier*; not necessarily starting at zero or one, nor forming a contiguous range, nor even ascending with the data frame's row order. Those are properties associated with an *index*. Though the field is often populated like an index in practice, this need not be so. For example, we might create a `SOMAExperiment` by subsetting the variables of a larger existing one, preserving the original `soma_joinid` in the `var` data frame. Or, observables might be withdrawn from version to version of a `SOMAExperiment`, keeping the `soma_joinid` of the remaining entries in the `obs` data frame.

**However**, beyond this primary definition, `soma_joinid` is also used as the row & column numbers of any `X` matrix in a `SOMAExperiment`. It clearly *is* an index in this secondary role, with the constraint that the row & column numbers correspond to the `soma_joinid` values in the `obs` and `var` dataframes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree with this. For sparse arrays (dataframe, sparsendarray), soma_dim_N is not an index, in the way you define it above (i.e. a physical offset). It is an identifier, just like the soma_joinid. The only constraints on soma_joinid/soma_dim_N are that they are uint64. They can be in any order, be non-contiguous, and span any range.

Common usage will use a contiguous range, but the spec absolutely (and intentionally) doesn't require that.

The one nuance here -- if you have a sparse X in an Experiment, you will only have implicit zeros where there is a defined soma_joinid pair in the obs/var DataFrame. In other words, any soma_dim_N not in the sparse array does not exist in the X index domain unless the joinid is defined in the corresponding obs/var dataframe.

Dense is another matter, and still has unresolved design issues due to the fact that we allow "sparse" joinid domains in the obs/var dataframes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bkmartinjr @johnkerl We need to find the precise terminology to convey the relationship between soma_joinid (not an index) with the row/column numbers of a matrix (obviously indexed sequences; whether the matrix is represented sparsely is an implementation detail, in my view). The original draft is too strong in equating them, admitted, but we need to convey that the treatment of soma_joinid as matrix row/column numbers in a certain context imposes certain constraints on its initial, broad definition -- that is probably the most important point of the whole essay.


## 2. Dense array indexing

The definition of `soma_joinid` as an arbitrary integer identifier is problematic for constructing dense matrices indexed by this value, since the values needn't start at zero/one and may be much larger than the actual number of entries. [The role of dense matrices remains unresolved](https://github.com/single-cell-data/TileDB-SOMA/issues/1245) as of this writing, and sparse matrices have been suitable for our initial applications. To fully support dense representations in the future, we may need to introduce another column in the `obs`/`var` data frames that is explicitly the index (row number), and use that for dense indexing instead of `soma_joinid`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is also problematic for other reasons (see above). By not requiring the axis dataframes to define contiguous joinid domains, we implicitly make them incompatible with dense matrices

@bkmartinjr
Copy link
Member

ps. @mlin - re-reading my comments, this is complex enough topic that it may warrant a real-time conversation. I'm available if you think this would be useful.

@thetorpedodog
Copy link
Contributor

two non–content-related notes from me (these apply to both open RFCs right now so I am pasting this note in both places):

  1. Rebase so that you get the pre-commit format verification thingy
  2. (Optional) Consider formatting to one sentence per line. I did this in 3b5891c, but I didn’t formally write it down, so it’s not a hard Rule but it is a format I have found useful for editing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants