Consider allowing 1 dimensional `[` to drop the geometry column #2131
Spatial data is not general data, and retaining the geometry column is essential if the data are still to be considered spatial. Re-ordering rows would be a key case here. So users of spatial data would have to choose positively to discard the geometry column first. This is much more sensible than, say, losing track of which census tracts are which. Obliging users to first use
An alternative that might also work is for us to implement our own
Perhaps. https://r-spatial.org/book/07-Introsf.html#subsetting is being published next month and the bookdown has been live for a long time, so it is too late to even try to adapt. @edzer, you do the heavy lifting; what do you think?
I can see your point, but think that the sticky geometry in
I think the suggestion is not so much that users should use

That said, there are also good reasons for the current design choice, as you mentioned, and the ship sailed a long time ago. So I'm wondering if sf could use the same approach as data.table to enable and disable the custom
could you point to an example where & how this is done?
And can users be fully shielded from the unintended effects of the loading of packages by other packages? That is, can they use
data.table calls it CEDTA: https://github.com/Rdatatable/data.table/blob/master/R/cedta.R

The gist is something like this (untested, hopefully no typos):

```r
# Known namespaces that need sticky columns. This is set to the
# packages that are already depending on sf-awareness at the time
# the feature was introduced. New packages should use `make_sf_aware()`.
known_sf_aware <- c(".globalenv", "foo", "bar")

# Register environment as sf-aware. We use a lexical flag instead of
# inspecting the name of the calling namespace and adding it to
# `the$sf_aware` because the flag is more flexible. It makes it
# possible for arbitrary environments to opt into (or disable) the sf
# API like this:
#
# ```
# local({ make_sf_aware(); sf[...] })
# ```
#
# Packages should call it like this in their `.onLoad()` hook (and
# thus before their namespace is sealed):
#
# ```
# make_sf_aware(topenv(environment()))
# ```
#' @export
make_sf_aware <- function(env = parent.frame(), value = TRUE) {
  env$.__sf_aware__. <- value
}

is_sf_aware <- function(env = parent.frame(2)) {
  top <- topenv(env)

  # Check for overrides
  top_name <- env_name(top)
  if (!is.null(top_name) && top_name %in% known_sf_aware) {
    return(TRUE)
  }

  # Now check for the lexical flag. This would need to be rewritten
  # without rlang as a loop over parents that stops at `topenv()`.
  flag <- rlang::env_get(
    env,
    ".__sf_aware__.",
    default = FALSE,
    inherit = TRUE,
    last = top
  )

  stopifnot(
    is.logical(flag) && length(flag) == 1 && !is.na(flag)
  )

  flag
}

env_name <- function(env) {
  ns <- topenv(env)

  if (isNamespace(ns)) {
    return(getNamespaceName(ns))
  }
  if (identical(ns, globalenv())) {
    return(".globalenv")
  }

  NULL
}
```

Then you call
yup, the effect is entirely lexical because we're looking in the parent frame to determine sf-awareness. It's bounded by the top env, which is either a namespace or the global env.
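For illustration, the lookup described above (walk up the enclosing environments, stop at the top-level environment) could be sketched in base R roughly as follows. This is only a sketch under assumptions: `lookup_sf_aware()` is a hypothetical stand-in for the `rlang::env_get()` call in the gist, not something sf or rlang provides, and the example assumes the global environment does not itself carry the flag.

```r
flag_name <- ".__sf_aware__."

# Hypothetical base-R stand-in for the rlang::env_get() lookup:
# search `env` and its enclosing environments for the flag, but never
# look beyond topenv(env) (a namespace or the global environment).
lookup_sf_aware <- function(env, default = FALSE) {
  top <- topenv(env)
  repeat {
    if (exists(flag_name, envir = env, inherits = FALSE)) {
      return(get(flag_name, envir = env, inherits = FALSE))
    }
    # Stop once the top-level environment has been checked
    if (identical(env, top) || identical(env, emptyenv())) {
      return(default)
    }
    env <- parent.env(env)
  }
}

outer <- new.env(parent = globalenv())
inner <- new.env(parent = outer)
sibling <- new.env(parent = globalenv())

before <- lookup_sf_aware(inner)        # no flag set anywhere yet
assign(flag_name, TRUE, envir = outer)  # opt `outer` in
after <- lookup_sf_aware(inner)         # `inner` inherits the flag lexically
unrelated <- lookup_sf_aware(sibling)   # `sibling` is not affected
```

The point of the sketch is that opting in is scoped to the environment that sets the flag and whatever is nested inside it; code living elsewhere keeps the default.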
I would absolutely advise keeping this on the opt-out of
The burden on the sf maintainer is only at the time of implementing the feature. New revdeps that need sf-awareness would call

I'd argue that there is a burden either way. sf implements the data frame interface and so ends up being passed to all sorts of packages, including pure modelling and data-reshaping packages. From that point of view it makes sense to make the behaviour opt-in.

We could also make the mechanism a little more complex and read the DESCRIPTION file to detect if sf is in Imports or Depends. If that's the case, this could default to sf-awareness. This would need to be cached to avoid the overhead but it's certainly feasible.
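To make the DESCRIPTION idea a bit more concrete, here is a rough sketch of a cached check. Everything in it is an assumption for illustration: `declares_dependency()`, `parse_dep_field()` and `dep_cache` are hypothetical names, and the only real API used is `utils::packageDescription()`.

```r
# Cache of results so each DESCRIPTION file is read at most once per
# (package, dependency) pair.
dep_cache <- new.env(parent = emptyenv())

# Turn a DESCRIPTION field such as "sf (>= 1.0), dplyr" into package names.
parse_dep_field <- function(field) {
  if (is.null(field) || is.na(field)) {
    return(character())
  }
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  # Drop version requirements like "(>= 1.0)" and surrounding whitespace
  parts <- trimws(sub("\\(.*\\)", "", parts))
  parts[nzchar(parts)]
}

# Hypothetical helper: does `pkg` list `dep` in Imports or Depends?
declares_dependency <- function(pkg, dep = "sf") {
  key <- paste(pkg, dep, sep = "::")
  cached <- dep_cache[[key]]
  if (!is.null(cached)) {
    return(cached)
  }
  desc <- suppressWarnings(
    utils::packageDescription(pkg, fields = c("Imports", "Depends"))
  )
  result <- is.list(desc) &&
    dep %in% c(
      parse_dep_field(desc[["Imports"]]),
      parse_dep_field(desc[["Depends"]])
    )
  dep_cache[[key]] <- result
  result
}

parsed <- parse_dep_field("sf (>= 1.0), dplyr,\n    rlang")
missing_dep <- declares_dependency("stats", "notARealPackage")
```

A check like `declares_dependency(pkg, "sf")` could then supply the default for packages that never registered themselves, with the explicit flag still taking precedence.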
It really doesn't, cf. the comment above
Why should

Is there a path through the behaviour of
It'd be fine to have packages that have sf in
I think the main downside of this approach is discoverability. On the upside it should do the right thing in most cases if I'm not missing anything.
Sorry, I don't understand what you mean. I'll say that in general we avoid
It is not just important, it is essential that current behaviour is continued with no user and preferably no downstream maintainer intervention. Any change in behaviour is a show-stopper, especially as we try to migrate hundreds of packages (with often unresponsive maintainers) to use
I agree with @lionel- that this is worth exploring, and, when done right, has the potential to help a lot of users using
started reverse dependency checks...
I hope the auguries were propitious! Is this in a branch that I could install to test ASDAR and SDSR?
Here's a POC check:

```r
remotes::install_github("edzer/testingsf")
# Using github PAT from envvar GITHUB_PAT. Use `gitcreds::gitcreds_set()` and unset GITHUB_PAT in .Renviron (or elsewhere) if you want to use the more secure git credential store instead.
# Skipping install of 'testingsf' from a github remote, the SHA1 (5c9a2323) has not changed since last install.
# Use `force = TRUE` to force installation
Sys.setenv("ADD_SF_NAMESPACE"="true")
library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
demo(nc, ask = FALSE, echo = FALSE)
print(attr(nc, ".sf_namespace")) # check for https://github.com/r-spatial/sf/pull/2212
# function ()
# NULL
# <bytecode: 0x5b6178ccb2c0>
# <environment: namespace:sf>
testingsf::report(nc) # number of columns in nc[1]: 1 under branch sf_aware, 2 under branch main
# ncol x[1]: 1
```

The branch can be installed directly:

```r
remotes::install_github("r-spatial/sf", "sf_aware")
```
ASDAR OK. But ... SDSR ch 14:
So this breaks
is OK, the damage is inside
Fixed now by adding
I remain concerned about
SDSR Ch 14-17 now complete without error. Maybe the revdeps will show whether
Packages added (for now) to the
Methods added to make this work:
Still open issues
I think I've done about five revdep checks on 800+ packages now, just to discover that there are new packages depending on
@lionel- how do you suggest we go about this? I'm reluctant to add these secondary dependencies to the sf-aware list; can we also leave that to the packages, say that
Asking the question is answering it; this now allows e.g. package

```r
if (utils::packageVersion("sf") >= "1.0.17")
  sf::sf_make_aware_pkg("rsample")
```

to make sure that
In dplyr and in many other places in the tidyverse, we program with 1 dimensional calls to `[`, such as `df["x"]`, where we expect that the result has exactly 1 column, and should be named `x`.

In `?dplyr::dplyr_extending`, we discuss how this is one of the invariants that is required for compatibility with dplyr.

But sf doesn't do this, and instead retains the `geometry` column as a "sticky" column:

This has caused quite a bit of pain in dplyr over the years, and has recently also just bitten me again in hardhat, where I also use `df[cols]` as a way to select columns tidymodels/hardhat#228.

In dplyr, algorithms underlying functions like `arrange()` and `distinct()` use `df[i]` to first select the columns to order by or compute the unique values of, so retaining extra columns here can be particularly problematic. I know sf has methods for these two verbs to work around this, but I think those could be avoided entirely if the geometry column wasn't sticky here.

I think:

- Having `dplyr::select()` retain sticky columns makes for a great user experience
- Having `df[i]` retain sticky columns makes for a painful programming experience

This ^ is our general advice regarding sticky columns, and is how dplyr's grouped data frames work:

It is also how tsibble works with its index column.

Would you ever consider allowing `df[i]` to only return exactly the columns selected by `i`? If the geometry column isn't selected, then the appropriate behavior would be to return a bare data frame or bare tibble depending on the underlying data structure.
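As a rough illustration of why the invariant matters for programming, the toy sketch below contrasts the two behaviours without using sf at all. It rests on assumptions: `select_plain()` and `select_sticky()` are hypothetical helpers, and a character column merely stands in for a real geometry column.

```r
# Toy data: a character column stands in for a real geometry column.
df <- data.frame(
  x = c(1, 1, 2),
  geometry = c("POINT (0 0)", "POINT (1 1)", "POINT (2 2)"),
  stringsAsFactors = FALSE
)

# Plain data frame behaviour: exactly the requested column comes back.
select_plain <- function(data, cols) data[cols]

# Hypothetical sticky behaviour: the geometry column is always appended.
select_sticky <- function(data, cols, sticky = "geometry") {
  data[union(cols, sticky)]
}

plain <- select_plain(df, "x")
sticky <- select_sticky(df, "x")

names(plain)
names(sticky)

# A distinct()-like step built on df[cols] deduplicates on `x` alone in
# the plain case, but on `x` plus geometry in the sticky case.
n_plain <- nrow(unique(plain))
n_sticky <- nrow(unique(sticky))
```

With the sticky variant the deduplication step silently keeps every row, because each row's geometry differs. Generic code that relies on `df[cols]` can run into the same kind of surprise.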