Skip to content

Commit

Permalink
Update nanoarrow and expand use (#663)
Browse files Browse the repository at this point in the history
* Update nanoarrow vendored files to nanoarrow 0.4.0

* Add nanoarrow.hpp

* Refactor arrow table export

valgrind clean when no dictionaries are used

* Remove some commented-out code

* Add nanoarrow to Imports

* Set ordered as needed

* Use a cutesy typedef nanoarrowXPtr to not have to use SEXP

* Also import from nanoarrow

* Remove one (interim) library call we no longer need

* Review comments leading to dead commented code removal [ci skip]

* Add NEWS entry and roll micro release [ci skip]
  • Loading branch information
eddelbuettel authored Feb 22, 2024
1 parent e7cf0e5 commit f6d12eb
Show file tree
Hide file tree
Showing 13 changed files with 1,033 additions and 310 deletions.
6 changes: 3 additions & 3 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
Package: tiledb
Type: Package
Version: 0.24.0.2
Version: 0.24.0.3
Title: Modern Database Engine for Multi-Modal Data via Sparse and Dense Multidimensional Arrays
Authors@R: c(person("TileDB, Inc.", role = c("aut", "cph")),
person("Dirk", "Eddelbuettel", email = "dirk@tiledb.com", role = "cre"))
Expand All @@ -25,8 +25,8 @@ SystemRequirements: A C++17 compiler is required, and for macOS
available at GitHub and are used if no TileDB installation is
detected, and no other option to build or download was specified by
the user.
Imports: methods, Rcpp (>= 1.0.8), nanotime, spdl
LinkingTo: Rcpp, RcppInt64
Imports: methods, Rcpp (>= 1.0.8), nanotime, spdl, nanoarrow
LinkingTo: Rcpp, RcppInt64, nanoarrow
Suggests: tinytest, simplermarkdown, curl, bit64, Matrix, palmerpenguins, nycflights13, data.table, tibble, arrow
VignetteBuilder: simplermarkdown
Roxygen: list(markdown = TRUE)
Expand Down
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -386,6 +386,7 @@ importFrom(methods,setGeneric)
importFrom(methods,setMethod)
importFrom(methods,slot)
importFrom(methods,validObject)
importFrom(nanoarrow,as_nanoarrow_array)
importFrom(spdl,set_level)
importFrom(stats,na.omit)
importFrom(utils,head)
Expand Down
6 changes: 5 additions & 1 deletion NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,13 @@

* This release of the R package builds against [TileDB 2.20.1](https://github.com/TileDB-Inc/TileDB/releases/tag/2.20.1), and has also been tested against earlier releases as well as the development version (#661)

## Improvements

* The vendored [nanoarrow](https://github.com/apache/arrow-nanoarrow) sources have been update to release 0.4.0, and use of its facilities has been extended (#663)

## Bug Fixes

* The `tiledb_get_query_range_var()` accessor now correctly calls the range getter for variable-sized dimensions. (#662)
* The `tiledb_get_query_range_var()` accessor now correctly calls the range getter for variable-sized dimensions (#662)


# tiledb 0.24.0
Expand Down
18 changes: 1 addition & 17 deletions R/ArrowIO.R
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# MIT License
#
# Copyright (c) 2017-2023 TileDB Inc.
# Copyright (c) 2017-2024 TileDB Inc.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
Expand Down Expand Up @@ -85,22 +85,6 @@ tiledb_arrow_schema_del <- function(ptr) {
.delete_arrow_schema_from_xptr(ptr)
}

##' @noRd
.check_arrow_pointers <- function(arrlst) {
stopifnot("First argument must be an external pointer to ArrowArray" = check_arrow_array_tag(arrlst[[1]]),
"Second argument must be an external pointer to ArrowSchema" = check_arrow_schema_tag(arrlst[[2]]))
}

##' @noRd
.as_arrow_table <- function(arrlst) {
.check_arrow_pointers(arrlst)
if (!requireNamespace("arrow", quietly=TRUE)) {
stop("This functionality requires the 'arrow' package to be installed.", call. = FALSE)
} else {
arrow::as_arrow_table(arrow::RecordBatch$import_from_c(arrlst[[1]], arrlst[[2]]))
}
}

##' @noRd
.tiledb_set_arrow_config <- function(ctx = tiledb_get_context()) {
cfg <- tiledb_config() # for var-num columns such as char we need these
Expand Down
10 changes: 8 additions & 2 deletions R/TileDBArray.R
Original file line number Diff line number Diff line change
Expand Up @@ -508,6 +508,7 @@ setValidity("tiledb_array", function(object) {
#' @param drop Optional logical switch to drop dimensions, default FALSE, currently unused.
#' @return The resulting elements in the selected format
#' @import nanotime
#' @importFrom nanoarrow as_nanoarrow_array
#' @aliases [,tiledb_array
#' @aliases [,tiledb_array-method
#' @aliases [,tiledb_array,ANY,tiledb_array-method
Expand Down Expand Up @@ -552,6 +553,9 @@ setMethod("[", "tiledb_array",
x@return_as, "' to be installed.", call. = FALSE)

use_arrow <- x@return_as == "arrow"
if (use_arrow) {
suppressMessages(do.call(rawToChar(as.raw(c(0x72, 0x65, 0x71, 0x75, 0x69, 0x72, 0x65))), list("nanoarrow")))
}

dims <- tiledb::dimensions(dom)
ndims <- length(dims)
Expand Down Expand Up @@ -941,8 +945,10 @@ setMethod("[", "tiledb_array",
if (status != "COMPLETE") spdl::debug("['['] query returned '{}'.", status)

if (use_arrow) {
rl <- libtiledb_to_arrow(abptr, qryptr, dictionaries)
at <- .as_arrow_table(rl)
## rl <- libtiledb_to_arrow(abptr, qryptr, dictionaries)
## at <- .as_arrow_table(rl)
na <- libtiledb_to_arrow(abptr, qryptr, dictionaries)
at <- arrow::as_arrow_table(na)

## special case from schema evolution could have added twice so correcting
for (n in colnames(at)) {
Expand Down
2 changes: 2 additions & 0 deletions inst/include/tiledb.h
Original file line number Diff line number Diff line change
Expand Up @@ -69,6 +69,8 @@ typedef struct query_buffer query_buf_t;
// map from buffer names to shared_ptr to column_buffer
typedef std::unordered_map<std::string, std::shared_ptr<tiledb::ColumnBuffer>> map_to_col_buf_t;

// some lipstick on the pig that is a SEXP -- allow the nanoarrow ArrowArray XPtr be typedef'ed
typedef SEXP nanoarrowXPtr;

// C++ compiler complains about missing delete functionality when we use tiledb_vfs_fh_t directly
struct vfs_fh {
Expand Down
5 changes: 2 additions & 3 deletions inst/tinytest/test_timetravel.R
Original file line number Diff line number Diff line change
Expand Up @@ -56,9 +56,8 @@ while (deltat < 30) {
isTRUE(all.equal(NROW(res5), 20)) && # expects 2 groups, 2 and 3, with 20 obs
isTRUE(all.equal(min(res5$grp), 2)) &&
isTRUE(all.equal(max(res5$grp), 3))) {
if (Sys.getenv("CI") != "") message("Success with gap time of ", deltat)
success <- TRUE
break
success <- TRUE
break
}
deltat <- deltat * 5
}
Expand Down
2 changes: 1 addition & 1 deletion src/RcppExports.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -115,7 +115,7 @@ BEGIN_RCPP
END_RCPP
}
// libtiledb_to_arrow
Rcpp::List libtiledb_to_arrow(Rcpp::XPtr<tiledb::ArrayBuffers> ab, Rcpp::XPtr<tiledb::Query> qry, Rcpp::List dicts);
nanoarrowXPtr libtiledb_to_arrow(Rcpp::XPtr<tiledb::ArrayBuffers> ab, Rcpp::XPtr<tiledb::Query> qry, Rcpp::List dicts);
RcppExport SEXP _tiledb_libtiledb_to_arrow(SEXP abSEXP, SEXP qrySEXP, SEXP dictsSEXP) {
BEGIN_RCPP
Rcpp::RObject rcpp_result_gen;
Expand Down
130 changes: 69 additions & 61 deletions src/arrowio.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,8 @@

#include "libtiledb.h"
#include "tiledb_version.h"
#include <nanoarrow.h> // for C interface to Arrow
#include "nanoarrow/r.h"
//#include <nanoarrow.h> // for C interface to Arrow

//#include <tiledb/arrowio>
#include "tiledb_arrowio.h"
Expand Down Expand Up @@ -253,7 +254,7 @@ inline void registerXptrFinalizer(SEXP s, R_CFinalizer_t f, bool onexit = true)
R_RegisterCFinalizerEx(s, f, onexit ? TRUE : FALSE);
}
extern "C" {
void ArrowArrayRelease(struct ArrowArray *array); // made non-static in nanoarrow.c
void ArrowArrayReleaseInternal(struct ArrowArray *array); // made non-static in nanoarrow.c
ArrowErrorCode ArrowArraySetStorageType(struct ArrowArray* array, // ditto
enum ArrowType storage_type);
ArrowErrorCode localArrowSchemaSetType(struct ArrowSchema* schema, enum ArrowType type);
Expand Down Expand Up @@ -303,7 +304,7 @@ Rcpp::XPtr<ArrowArray> array_setup_struct(Rcpp::XPtr<ArrowArray> arrxp, int64_t
array->buffers = NULL;
array->children = NULL;
array->dictionary = NULL;
array->release = &ArrowArrayRelease;
array->release = &ArrowArrayReleaseInternal;
array->private_data = NULL;

auto private_data = (struct ArrowArrayPrivateData*) ArrowMalloc(sizeof(struct ArrowArrayPrivateData));
Expand Down Expand Up @@ -353,26 +354,44 @@ Rcpp::XPtr<ArrowArray> array_setup_struct(Rcpp::XPtr<ArrowArray> arrxp, int64_t
return arrxp;
}

inline void exitIfError(const ArrowErrorCode ec, const std::string& msg) {
if (ec != NANOARROW_OK) Rcpp::stop(msg);
}

// Attaches a schema to an array external pointer. The nanoarrow R package
// attempts to do this whenever possible to avoid misinterpreting arrays.
void array_xptr_set_schema(SEXP array_xptr, SEXP schema_xptr) {
R_SetExternalPtrTag(array_xptr, schema_xptr);
}

// was: Rcpp::List
// [[Rcpp::export]]
Rcpp::List libtiledb_to_arrow(Rcpp::XPtr<tiledb::ArrayBuffers> ab,
Rcpp::XPtr<tiledb::Query> qry,
Rcpp::List dicts) {
nanoarrowXPtr libtiledb_to_arrow(Rcpp::XPtr<tiledb::ArrayBuffers> ab,
Rcpp::XPtr<tiledb::Query> qry,
Rcpp::List dicts) {
check_xptr_tag<tiledb::ArrayBuffers>(ab);
check_xptr_tag<tiledb::Query>(qry);
std::vector<std::string> names = ab->names();
auto ncol = names.size();
std::vector<std::string> dictnames = dicts.names();
Rcpp::XPtr<ArrowSchema> schemaxp = schema_owning_xptr();
Rcpp::XPtr<ArrowArray> arrayxp = array_owning_xptr();
schemaxp = schema_setup_struct(schemaxp, ncol);
arrayxp = array_setup_struct(arrayxp, ncol);

arrayxp->length = 0;
// Schema first
auto schemaxp = nanoarrow_schema_owning_xptr();
auto sch = nanoarrow_output_schema_from_xptr(schemaxp);
exitIfError(ArrowSchemaInitFromType(sch, NANOARROW_TYPE_STRUCT), "Bad schema init");
exitIfError(ArrowSchemaSetName(sch, ""), "Bad schema name");
exitIfError(ArrowSchemaAllocateChildren(sch, ncol), "Bad schema children alloc");

// Array second
auto arrayxp = nanoarrow_array_owning_xptr();
auto arr = nanoarrow_output_array_from_xptr(arrayxp);
exitIfError(ArrowArrayInitFromType(arr, NANOARROW_TYPE_STRUCT), "Bad array init");
exitIfError(ArrowArrayAllocateChildren(arr, ncol), "Bad array children alloc");

struct ArrowError ec;

arr->length = 0;
for (size_t i=0; i<ncol; i++) {
// this allocates, and properly wraps as external pointers controlling lifetime
Rcpp::XPtr<ArrowSchema> chldschemaxp = schema_owning_xptr();
Rcpp::XPtr<ArrowArray> chldarrayxp = array_owning_xptr();
bool is_factor = dicts[i] != R_NilValue;
bool is_ordered = false;
if (is_factor) {
Expand All @@ -390,62 +409,51 @@ Rcpp::List libtiledb_to_arrow(Rcpp::XPtr<tiledb::ArrayBuffers> ab,

spdl::info(tfm::format("[libtiledb_to_arrow] Incoming name %s length %d",
std::string(pp.second->name), pp.first->length));
memcpy((void*) chldschemaxp, pp.second.get(), sizeof(ArrowSchema));
memcpy((void*) chldarrayxp, pp.first.get(), sizeof(ArrowArray));
if (is_factor) {
// this could be rewritten if we generalized ColumnBuffer to allow passing of
memcpy((void*) sch->children[i], pp.second.get(), sizeof(ArrowSchema));
memcpy((void*) arr->children[i], pp.first.get(), sizeof(ArrowArray));
if (is_factor) { // create an arrow array of type string with the labels
// this could be rewritten if we generalized ColumnBuffer to allow passing of strings
std::vector<std::string> svec = Rcpp::as<std::vector<std::string>>(dicts[i]);
Rcpp::XPtr<ArrowSchema> dschxp = schema_owning_xptr();
Rcpp::XPtr<ArrowArray> darrxp = array_owning_xptr();
dschxp = schema_setup_struct(dschxp, 0);
darrxp = array_setup_struct(darrxp, 0);

dschxp->format = "u";
dschxp->flags |= ARROW_FLAG_NULLABLE;
auto darrxp = nanoarrow_array_owning_xptr();
auto darr = nanoarrow_output_array_from_xptr(darrxp);
exitIfError(ArrowArrayInitFromType(darr, NANOARROW_TYPE_STRING), "Bad string array init");
exitIfError(ArrowArrayStartAppending(darr), "Bad string array append init");
auto dschxp = nanoarrow_schema_owning_xptr();
auto dsch = nanoarrow_output_schema_from_xptr(dschxp);
exitIfError(ArrowSchemaInitFromType(dsch, NANOARROW_TYPE_STRING), "Bad string schema init");
exitIfError(ArrowSchemaSetName(dsch, ""), "Bad string schema name");
if (is_ordered) {
dschxp->flags |= ARROW_FLAG_DICTIONARY_ORDERED; // this line appears ignore
chldschemaxp->flags |= ARROW_FLAG_DICTIONARY_ORDERED; // this one matters more
dsch->flags |= ARROW_FLAG_DICTIONARY_ORDERED; // this line appears ignore
sch->children[i]->flags |= ARROW_FLAG_DICTIONARY_ORDERED; // this one matters more
}
darrxp->length = svec.size();
darrxp->null_count = 0;
darrxp->n_buffers = 3; // we always have three for dictionairies
darrxp->buffers = (const void**)malloc(sizeof(void*) * darrxp->n_buffers);
darrxp->buffers[0] = nullptr; // validity

size_t nv = svec.size();
std::string str = "";
std::vector<int32_t> offsets(nv+1);
int32_t cumlen = 0;
for (size_t i = 0; i < nv; i++) {
std::string s = svec[i];
offsets[i] = cumlen;
str += s;
cumlen += s.length();
for (auto str: svec) {
ArrowStringView asv = {str.data(), static_cast<int64_t>(str.size())};
exitIfError(ArrowArrayAppendString(darr, asv), "Bad string append");
}
offsets[nv] = cumlen;
darrxp->buffers[2] = (const char*)malloc(sizeof(char) * cumlen);
std::memcpy((void*) darrxp->buffers[2], str.data(), (sizeof(char) * cumlen));
darrxp->buffers[1] = (const char*)malloc(sizeof(int32_t) * (nv + 1));
std::memcpy((void*) darrxp->buffers[1], offsets.data(), (sizeof(int32_t) * (nv + 1)));

spdl::debug(tfm::format("[libtiledb_to_arrow] dict %s fmt %s -- len %d nbuf %d str %s",
names[i], dschxp->format, darrxp->length, darrxp->n_buffers, str));
chldschemaxp->dictionary = dschxp;
chldarrayxp->dictionary = darrxp;
}
if (NANOARROW_OK != ArrowArrayFinishBuildingDefault(darr, &ec))
Rcpp::stop(ec.message);

schemaxp->children[i] = chldschemaxp;
arrayxp->children[i] = chldarrayxp;
spdl::debug(tfm::format("[libtiledb_to_arrow] dict %s fmt %s -- len %d nbuf %d",
names[i], dsch->format, darr->length, darr->n_buffers));
sch->children[i]->dictionary = dsch;
arr->children[i]->dictionary = darr;
}

if (pp.first->length > arrayxp->length) {
spdl::debug(tfm::format("[libtiledb_to_arrow] Setting array length to %d", pp.first->length));
arrayxp->length = pp.first->length;
if (pp.first->length > arr->length) {
spdl::debug(tfm::format("[libtiledb_to_arrow] Setting array length to %d", pp.first->length));
arr->length = pp.first->length;
}
}
Rcpp::List as = Rcpp::List::create(Rcpp::Named("array_data") = arrayxp,
Rcpp::Named("schema") = schemaxp);
spdl::info("[libtiledb_to_arrow] After children loop");
//if (NANOARROW_OK != ArrowArrayFinishBuildingDefault(arr, &ec))
// Rcpp::stop(ec.message);
spdl::info("[libtiledb_to_arrow] ArrowArrayFinishBuildingDefault");

// Nanoarrow special: stick schema into xptr tag to return single SEXP
array_xptr_set_schema(arrayxp, schemaxp); // embed schema in array

spdl::trace("[libtiledb_to_arrow] returning from libtiledb_to_arrow");
return as;
return arrayxp;
}


Expand Down
1 change: 0 additions & 1 deletion src/libtiledb.h
Original file line number Diff line number Diff line change
Expand Up @@ -112,7 +112,6 @@ const tiledb_xptr_object tiledb_xptr_map_to_col_buf_t { 230 };
const tiledb_xptr_object tiledb_arrow_array_t { 300 };
const tiledb_xptr_object tiledb_arrow_schema_t { 310 };


// templated checkers for external pointer tags
template <typename T> const int32_t XPtrTagType = tiledb_xptr_default; // clang++ wants a value
template <> inline const int32_t XPtrTagType<tiledb::Array> = tiledb_xptr_object_array;
Expand Down
Loading

0 comments on commit f6d12eb

Please sign in to comment.