Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: [sc-57786] [rs] minimize set of functions which need to link with core tiledb to run #179

Merged
merged 36 commits into from
Oct 21, 2024

Conversation

rroelke
Copy link
Collaborator

@rroelke rroelke commented Oct 15, 2024

Story details: https://app.shortcut.com/tiledb-inc/story/57786

This pull request is a rather large "Part 1" of the linked story, whose goal is to restructure our crates in tiledb-rs so that the only code which requires linking with core tiledb is the code which actually calls the C APIs. Doing this will simplify the build process for users of the tiledb-rs crates which do not need to use the C APIs directly, such as a client which sends RPCs to a server in tiledb-cloud.

We have written handles to tiledb schema objects: struct Attribute, struct Dimension, struct Domain, struct Schema, and so on. When we turned our eye towards property-based testing, we concluded that we needed to write structures whose fields duplicate the fields of the objects inside of the core library (whereas the handles just provide methods to get or set them). But over time we have come to use these secondary structures more and more, as they are often more user-friendly. The goal of this "Part 1" pull request is to move these secondary structures to their own crate, which is tentatively named tiledb-serde in the diff.

Unfortunately this goal requires moving lots and lots of other pieces.

The new tiledb crates:

  • tiledb-common: contains enums and other utility pieces, most notably including Datatype, FilterData, and Range.
  • tiledb-serde: contains the secondary structures described above. The name comes from the idea that these structures can be used to derive Desesrialize and Serialize for the API handles. I think the name should be changed but I wanted to get things compiling and running again before taking that step.
  • tiledb-api: renamed from the previous monolithic tiledb.

Each of the above crates has features serde and proptest-strategies. The serde feature enables derived Deserialize and Serialize impls, but for tiledb-api this also adds tiledb-serde as a dependency. The proptest-strategies feature similarly defines Arbitrary impls (for the most part) and is used as a dev-depedency by the tiledb-api crate.

There is a bit of thiserror-ization as entities which previously used TileDBError were moved into a new crate which does not have TileDBError. But the tiledb-api crate itself still uses the large TileDBError enum.

Additionally, this pull request adds a new top-level directory to the workspace, test-utils. This moves the various disjoint items from the previous tiledb-test-utils crate into their own crates. All of the Cells testing code is also moved into a new crate cells in this directory. This was mostly done to split out strategy-ext but the rest followed from there.

For the most part, code has just moved around, without any changes, but there are some exceptions:

  • One difficulty with the crate splitting is our use of the PhysicalType trait. Because this trait is now defined in a different crate than tiledb-api, we cannot provide generic impls on PhysicalType for some of the query stuff. A new trait CellValue is introduced to work around this.
  • Similarly the cells crate cannot have its attach_read and attach_write methods anymore since those use structures from tiledb-api, which uses on cells as a test dependency. To resolve this the tiledb-api crate declares traits ToReadQuery and ToWriteQuery and provides impls for the items from cells.
  • We previously used FilterListData to represent a filter list handle for proptests; this has been replaced with Vec<FilterData>.

Copy link
Collaborator Author

@rroelke rroelke Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cells crate contains the structure definitions and trait impls of code which used to live in api/src/query/strategy.rs and api/src/query/write/strategy.rs. This code has been exported and is used in tables tests. This crate depends on tiledb-serde because it uses the schema to make some decisions about values and etc.

I don't think there is any new code in the cells crate itself, but splitting this out is what has led to the ToReadQuery and ToWriteQuery traits in tiledb-api.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new crate proptest-config just contains the configuration settings for proptest that we previously had. I moved it to its own crate because cells and tiledb-serde both use it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had some signal stuff in tiledb-test-utils, this crate contains that code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiledb-test-utils previously had a bunch of generic strategy extensions which now live in this new crate.

Copy link
Collaborator Author

@rroelke rroelke Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This crate is for the URI generator stuff which previously lived in tiledb-test-utils.

Initially I had split all of the tiledb-test-utils modules into separate features rather than separate crates; but this was sort of gnarly for dependencies, which is when I moved them all to their own crates.

No new code in this crate, just moved around.

@@ -473,330 +476,18 @@ impl Builder {
}
}

/// Trait for data which can be used as a fill value for an attribute.
pub trait IntoFillValue {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was moved into tiledb-common because of the impls which build upon PhysicalType. I did this before adding the CellValue trait; that would probably be more appropriate to use and we could move this trait back into tiledb-api.

}
}

impl TryFrom<&Attribute> for AttributeData {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conversions between AttributeData and Attribute now live in the serde module, i.e. tiledb/api/src/array/attribute/serde.rs.

This is the case for all the other ThingData structs as well

@@ -170,275 +158,16 @@ impl PartialEq<Dimension> for Dimension {
}
}

#[derive(Clone, Debug, Deserialize, OptionSubset, PartialEq, Serialize)]
pub enum DimensionConstraints {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is now in tiledb-common though perhaps it should go into tiledb-serde.

let c_name = cstring!(name);
let c_domain = constraints.domain_ptr();
let c_extent = constraints.extent_ptr();
let c_domain = dimension_constraints_go!(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was the body of DimensionConstraints::domain_ptr.
It didn't really feel right to have an otherwise Rust-only crate provide methods which deal with raw pointers.

#[test]
fn subarray_strategy_dense() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test doesn't actually use any tiledb APIs so it now lives in serde/src/array/dimension/strategy.rs

};

#[derive(Clone, Debug, PartialEq)]
pub enum Mode {
Copy link
Collaborator Author

@rroelke rroelke Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the enums have been moved to tiledb-common.

Except for Encryption which is not used by anything else yet. Though it does seem like something we might want to put in an RPC, so it probably should be moved there.


/// Represents tiledb (`Datatype`, `CellValNum`) compatibility for an arrow `DataType`.
#[derive(Clone, Debug, Eq, PartialEq)]
pub enum DatatypeToArrowResult {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this is moved to tiledb-common/src/datatype/arrow.rs


use crate::error::DatatypeErrorKind;
use crate::Result as TileDBResult;
impl ToStringCore for Datatype {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The enum is moved to tiledb-common/src/datatype/mod.rs.

What's here now is just trait impls for new traits ToStringCore and FromStringCore which call the tiledb core string functions.

Since these require linking with libtiledb, tiledb-common specifically does not want them. Hence Debug is derived and Display just uses that.

@@ -26,107 +29,6 @@ impl Drop for RawError {
}
}

#[derive(Clone, Debug)]
pub enum DatatypeErrorKind {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some variants are moved to tiledb_common::datatype::Error and others are now just Error variants.

#[derive(
Clone, Debug, Default, Deserialize, OptionSubset, PartialEq, Serialize,
)]
pub struct FilterListData(Vec<FilterData>);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In contrast to other items which are moved somewhere else, this one I just deleted and replaced uses of it with Vec<FilterData>.

#[derive(
Copy, Clone, Debug, Deserialize, Eq, OptionSubset, PartialEq, Serialize,
)]
pub enum CompressionType {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to tiledb_common::filter

use util::option::OptionSubset;

#[derive(Clone, Debug, Deserialize, OptionSubset, PartialEq, Serialize)]
pub enum Value {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved to tiledb_common. This module now just has functions for FFI.

@rroelke rroelke requested a review from davisp October 15, 2024 16:36
Copy link
Collaborator

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. I totes forgot that I've got a dinner thing tonight so I didn't get as far into this as I'd like.

After my skimming so far, about the only thing I wish were different is the proptest strategies in the pod crate being pushed into their own. Granted I'm well aware of why that didn't and might not ever happen so its not really important. And its not like this isn't an improvement anyway.

Also heads up that there's a bunch of modules in the API crate all called serde that look like they should have been renamed to match the s/serde/pod/ rename you did earlier.

Other than that, this all looks pretty good. But given how large it is I am gonna give it another look over tomorrow before I approve. Today I mostly just looked at the new structure and spot checked that things pretty much ended up where I expected and so far. Tomorrow I'll give a closer skim on the actual diff before +1'ing.

Good work though! This will be a nice clean up of some of our initial decisions made while earlier in our Rust learning adventure.

pub nullability: Option<bool>,
}

#[cfg(any(test, feature = "proptest-strategies"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason why you don't just move this impl to the strategy module?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a stylistic choice to have the impl in the same file that the struct is defined in.
I have a weak preference for this because it means only searching one file for the impls. Though admittendly one should probably build rustdocs for exploring the impls.

@rroelke
Copy link
Collaborator Author

rroelke commented Oct 18, 2024

Also heads up that there's a bunch of modules in the API crate all called serde that look like they should have been renamed to match the s/serde/pod/ rename you did earlier.

Good catch, fixed!

Copy link
Collaborator

@davisp davisp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Big change but everything looks good to me. The only thing I spotted that had me wondering are the extra features that don't feel very feature level. But those are easy enough to undo later that I'm not gonna request any changes there.

Other than that, all looked good. The error handling cleanup was quite nice to see throughout. I left a few notes, but I think more than half were basically just notes to myself.

+1

@@ -303,6 +291,19 @@ impl PartialEq<Attribute> for Attribute {
}
}

#[cfg(any(test, feature = "pod"))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems slightly odd to have the pod crate as an optional dependency here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well... it definiitely makes use of the API way easier but it's not necessary to use it. If it's a feature than there is a leaner option to use tiledb as a dependency.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me it was just that it seems like the pod definitions are. the way to interact with the bindings, so making it a feature that requires opting into didn't feel quite right. And this isn't super duper well thought out. I was just thinking "How often are we going to not want those pod APIs?" and I'm reasonably sure "Almost never, and the cost of having them and not using them would likely be miniscule."

Also its fine for now. We can always go back and change things up if we decide that it really is unnecessary for some particular use case where it matters.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed we could - and everyone loves PRs with lots of red

}
}

impl TryFrom<Attribute> for AttributeData {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this was just moved from wherever it was before, but seeing a try_from for both a ref and consuming seems odd. I can't quite convince myself this version would never be needed, but it sure seems unlikely.

Not asking for a change here. I'm just noting something I saw for future me writing trait implementations.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really remember why I added this the first time, yeah. Probably trying to do something mildly clever with iterators.

#[cfg(any(test, feature = "proptest-strategies"))]
pub mod strategy;
#[cfg(any(test, feature = "pod"))]
pub mod pod;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not sure on this pod as feature thing. Also, we should probably standardize that all module definitions (that aren't tests) should go at the top of the file. Finding the code/test transition isn't the easiest thing to navigate to just to check if there's anything interesting going on. Don't worry about that as part of this PR. Just another thing I've noticed while reading through.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I think we have some stylization to do

  • modules at the top
  • use declaration un-grouping

tiledb/api/src/filter/ftype.rs Outdated Show resolved Hide resolved
serde_json = { workspace = true, optional = true }
thiserror = { workspace = true }
tiledb-proc-macro = { workspace = true, optional = true }
tiledb-sys = { workspace = true }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm reading this PR pretty quickly, so I may have missed an actual ffi call, but I would only expect (and have only seen) references to various constants. We could pretty easily split those off into a tiledb-sys-defs crate to avoid this crate from requiring a link to libtiledb.so. Not needed for this PR, just a thought I had keeping in mind part of the motivation for this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Each time I add a new crate I have to add a new build.rs file for that crate. Isn't that what sets the linker path, not the presence of the sys crate?

That said I do think it might be nice to split out the typedefs and constants for sys

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's two halves: The dependency on tiledb-sys introduces the requirement that libtiledb.so is linked to the final executable. The build.rs in non-tiledb-sys crates tells the linker where to find it on the current system (assuming its not in a system standard path).

So basically, if we split out tiledb-sys-defs that contains just the constant definitions, then anything that doesn't depend on tiledb-sys (directly or indirectly) will no longer need to provide the build.rs.

Or in other words, the build.rs requirement is a symptom of what's actually linking to libtiledb.so, when we remove that need we should be able to remove a lot of the build.rs scripts from the crates that no longer need it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I see that more clearly now from looking at sys/build.rs. Definitely will follow up with that in a subsequent PR. Thanks!

Copy, Clone, Debug, Deserialize, Eq, OptionSubset, PartialEq, Serialize,
)]
#[derive(Copy, Clone, Debug, Eq, PartialEq)]
#[cfg_attr(feature = "option-subset", derive(OptionSubset))]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another feature that I'm not 100% sold on. Easy to change down the line, so not requesting a change. Just seems a bit "feature happy" is all.

tiledb/common/src/query/read.rs Outdated Show resolved Hide resolved
serde_json = { workspace = true, optional = true }
strategy-ext = { workspace = true, optional = true }
thiserror = { workspace = true }
tiledb-common = { workspace = true }
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note to self that this hasn't removed the libtiledb.so linkage since we've got this transitive dependency. Though it looks pretty straightforward to cut that tie in a follow up PR that I assume is probably already half written waiting for me to approve this one. So on with the show!

tiledb/pod/src/lib.rs Outdated Show resolved Hide resolved
- move cfg impls into strategy.rs
- remove ToStringCore and FromStringCore
- fix 'cargo check --tests' from API crate
- Fix comment typo
- Remove empty pod mod query
@rroelke rroelke merged commit 363ad94 into main Oct 21, 2024
3 checks passed
@rroelke rroelke deleted the rr/sc-57786-api-crate-split branch October 21, 2024 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants