Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider allowing 1 dimensional [ to drop the geometry column #2131

Open
DavisVaughan opened this issue Mar 23, 2023 · 25 comments
Open

Consider allowing 1 dimensional [ to drop the geometry column #2131

DavisVaughan opened this issue Mar 23, 2023 · 25 comments
Labels
feature a feature request or enhancement

Comments

@DavisVaughan
Copy link
Contributor

In dplyr and in many other places in the tidyverse, we program with 1 dimensional calls to [, such as df["x"], where we expect that the result has exactly 1 column, and should be named x.

In ?dplyr::dplyr_extending, we discuss how this is one of the invariants that is required for compatibility with dplyr.

But sf doesn't do this, and instead retains the geometry column as a "sticky" column:

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.0, PROJ 9.1.1; sf_use_s2() is TRUE

nrows <- 10
geometry = st_sfc(lapply(1:nrows, function(x) st_geometrycollection()))
df <- st_sf(id = 1:nrows, geometry = geometry)

df
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension:     XY
#> Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:           NA
#>    id                 geometry
#> 1   1 GEOMETRYCOLLECTION EMPTY
#> 2   2 GEOMETRYCOLLECTION EMPTY
#> 3   3 GEOMETRYCOLLECTION EMPTY
#> 4   4 GEOMETRYCOLLECTION EMPTY
#> 5   5 GEOMETRYCOLLECTION EMPTY
#> 6   6 GEOMETRYCOLLECTION EMPTY
#> 7   7 GEOMETRYCOLLECTION EMPTY
#> 8   8 GEOMETRYCOLLECTION EMPTY
#> 9   9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY

# `geometry` sticks around
df["id"]
#> Simple feature collection with 10 features and 1 field (with 10 geometries empty)
#> Geometry type: GEOMETRYCOLLECTION
#> Dimension:     XY
#> Bounding box:  xmin: NA ymin: NA xmax: NA ymax: NA
#> CRS:           NA
#>    id                 geometry
#> 1   1 GEOMETRYCOLLECTION EMPTY
#> 2   2 GEOMETRYCOLLECTION EMPTY
#> 3   3 GEOMETRYCOLLECTION EMPTY
#> 4   4 GEOMETRYCOLLECTION EMPTY
#> 5   5 GEOMETRYCOLLECTION EMPTY
#> 6   6 GEOMETRYCOLLECTION EMPTY
#> 7   7 GEOMETRYCOLLECTION EMPTY
#> 8   8 GEOMETRYCOLLECTION EMPTY
#> 9   9 GEOMETRYCOLLECTION EMPTY
#> 10 10 GEOMETRYCOLLECTION EMPTY

This has caused quite a bit of pain in dplyr over the years, and has recently also just bitten me again in hardhat, where I also use df[cols] as a way to select columns tidymodels/hardhat#228.

In dplyr, algorithms underlying functions like arrange() and distinct() use df[i] to first select the columns to order by or compute the unique values of, so retaining extra columns here can be particularly problematic. I know sf has methods for these two verbs to work around this, but I think those could be avoided entirely if the geometry column wasn't sticky here.

I think:

  • Having dplyr::select() retain sticky columns makes for a great user experience
  • Having df[i] retain sticky columns makes for a painful programming experience

This ^ is our general advice regarding sticky columns, and is how dplyr's grouped data frames work:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(g = 1, x = 2)
df <- group_by(df, g)

select(df, x)
#> Adding missing grouping variables: `g`
#> # A tibble: 1 × 2
#> # Groups:   g [1]
#>       g     x
#>   <dbl> <dbl>
#> 1     1     2

# Returns a bare tibble, not a grouped data frame
df["x"]
#> # A tibble: 1 × 1
#>       x
#>   <dbl>
#> 1     2

It is also how tsibble works with its index column.

Would you ever consider allowing df[i] to only return exactly the columns selected by i? If the geometry column isn't selected, then the appropriate behavior would be to return a bare data frame or bare tibble depending on the underlying data structure.

@rsbivand
Copy link
Member

Spatial data is not general data, and retaining the geometry column is essential if the data are still to be considered spatial. Re-ordering rows would be a key case here. So users of spatial data would have to choose positively to discard the geometry column first. This is much more sensible than say losing track of which census tracts are which. Obliging users to first use dplyr, then use dplyr::select() to do what is correct with spatial data, rather than using the general base-R selection by [, seems unduly invasive.

@DavisVaughan
Copy link
Contributor Author

DavisVaughan commented Mar 23, 2023

An alternative that might also work for us is for us to implement our own df_select(x, cols) in a low level package that can be used for column selection and would always return a bare data frame. In dplyr we could use that and then call dplyr::dplyr_reconstruct() after all the manipulation is done to attempt to restore back to an sf object at the end, if possible.

@rsbivand
Copy link
Member

Perhaps. https://r-spatial.org/book/07-Introsf.html#subsetting is being published next month, the bookdown has been live for a long time, so it is too late to even try to adapt. @edzer you do the heavy lifting, what do you think?

@edzer
Copy link
Member

edzer commented Mar 23, 2023

what do you think?

I can see your point, but think that the sticky geometry in [ is the number one property that makes an object of class c("sf", "data.frame") different from a data.frame; if you don't want it, simply use data.frame's with geometry columns (and discover how hard life can be). It's been there from day one, now about 7 years ago. I also think that a significant part of the user base likes sf because it doesn't tie you up on using tidyverse. Also look carefully at the illustration on https://r-spatial.github.io/sf/ .

@lionel-
Copy link
Contributor

lionel- commented Mar 23, 2023

Obliging users to first use dplyr, then use dplyr::select() to do what is correct with spatial data, rather than using the general base-R selection by [, seems unduly invasive.

I think the suggestion is not so much that users should use dplyr::select(), but that they should use a different generic than [ to get sticky columns, so that [ may preserve important invariants of base R programming. This generic could have been provided by sf.

That said there are also good reasons for the current design choice as you mentioned and also the ship has sailed a long time ago. So I'm wondering if sf could use the same approach as data.table to enable and disable the custom [ interface in unsuspecting namespaces?

@edzer
Copy link
Member

edzer commented Mar 23, 2023

disable the custom [ interface in unsuspecting namespaces?

could you point to an example where & how this is done?

@rsbivand
Copy link
Member

And can users be fully shielded from the unintended effects of the loading of packages by other packages? That is, can they use [ as before without anything unanticipated occurring?

@lionel-
Copy link
Contributor

lionel- commented Mar 24, 2023

data.table calls it CEDTA: https://github.com/Rdatatable/data.table/blob/master/R/cedta.R

The gist is something like this (untested, hopefully no typos):

# Known namespaces that need sticky columns. This is set to the
# packages that are already depending on sf-awareness at the time
# the feature was introduced. New packages should use `make_sf_aware()`.
known_sf_aware <- c(".globalenv", "foo", "bar")

# Register environment as sf-aware. We use a lexical flag instead of
# inspecting the the name of the calling namespace and add it to
# `the$sf_aware` because the flag is more flexible. It makes it
# possible for arbitrary environments to opt into (or disable) the sf
# API like this:
#
# ```
# local({ make_sf_aware(); sf[...] })
# ```
#
# Packages should call it like this in their `.onLoad()` hook (and
# thus before their namespace is sealed):
#
# ```
# make_sf_aware(topenv(environment()))
# ```

#' @export
make_sf_aware <- function(env = parent.frame(), value = TRUE) {
  env$.__sf_aware__. <- value
}

is_sf_aware <- function(env = parent.frame(2)) {
  top <- topenv(env)

  # Check for overrides
  top_name <- env_name(top)
  if (!is.null(top_name) && top_name %in% known_sf_aware) {
    return(TRUE)
  }

  # Now check for the lexical flag. This would need to be rewritten
  # without rlang as a loop over parents that stops at `topenv()`.
  flag <- rlang::env_get(
    env,
    ".__sf_aware__.",
    default = FALSE,
    inherit = TRUE,
    last = top
  )

  stopifnot(
    is.logical(flag) && length(flag) == 1 && !is.na(flag)
  )

  flag
}

env_name <- function(env) {
  ns <- topenv(env)

  if (isNamespace(ns)) {
    return(getNamespaceName(ns))
  }
  if (identical(ns, globalenv())) {
    return(".globalenv")
  }

  NULL
}

Then you call is_sf_aware() in your methods. If not aware, call the next method and return the result.

And can users be fully shielded from the unintended effects of the loading of packages by other packages? That is, can they use [ as before without anything unanticipated occurring?

yup, the effect is entirely lexical because we're looking in the parent frame to determine sf-awareness. It's bounded by the top env, which is either a namespace or the global env.

@rsbivand
Copy link
Member

rsbivand commented Mar 24, 2023

I would absolutely advise keeping this on the opt-out of sf behaviour side, not the opt-in side. In this case, you appear to require all packages in any way depending on sf to add code (and add it all together) in order to choose the right/legacy behaviour for spatial data, placing a burden on the sf maintainer to keep all revdeps compliant, crucially all new ones. What will happen in existing workflows is unknown, as the sf namespace may be loaded later (Suggests:). This also seems to require the import of the rlang namespace, which is not acceptable in general (sf only Suggests: rlang).

@lionel-
Copy link
Contributor

lionel- commented Mar 24, 2023

, placing a burden on the sf maintainer to keep all revdeps compliant, crucially all new ones.

The burden on the sf maintainer is only at the time of implementing the feature. New revdeps that need sf-awareness would call make_sf_aware() as I mentioned in the comment. This setup works well in the data.table community so I'd recommend to at least consider it.

I'd argue that there is a burden either way. sf implements the data frame interface and so ends up being passed to all sorts of packages, including pure modelling and data-reshaping packages. From that point of view it makes sense to make the behaviour opt-in.

We could also make the mechanism a little more complex and read the DESCRIPTION file to detect if sf is in Imports or Depends. If that's the case, this could default to sf-awareness. This would need to be cached to avoid the overhead but it's certainly feasible.

This also seems to require the import of the rlang namespace, which is not acceptable in general (sf only Suggests: rlang).

It really doesn't, cf the comment above rlang::env_get(). I guess I should have taken the time to write the for loop instead of relying on a comment...

@rsbivand
Copy link
Member

Why should sf's downstream consumers need to do anything? I much prefer any consumers that need to remove awareness, to do that themselves locally. In many cases, sf is Suggests:, not Depends: or Imports:. As we try to migrate a user base of > 1000 packages from rgdal and rgeos by mid 2023, we are using sf as Suggests: in sp (June 2023) to access PROJ/GEOS/GDAL. The behaviour of sf has to stay as stable as possible at least until end 2024. Requiring many hundreds of packages to adapt is hard enough without getting them to opt into something which for most ecologists etc. will be incomprehensible.

Is there a path through the behaviour of sf objects when [..., drop=TRUE] on the dplyr side when one is completely sure that the geometry column must be discarded?

@lionel-
Copy link
Contributor

lionel- commented Mar 24, 2023

It'd be fine to have packages that have sf in Suggests default to sf-awareness too if that's important for backward compatibility. I understand that you don't want to break existing packages, that's a very reasonable position. So in that scheme we'd have:

  • Packages that don't depend on sf in any way always use the df interface.
  • Packages that depend on sf in any way default to the sf interface. They opt out by calling sf::make_sf_aware(ns, value = FALSE), which I guess would become sf::make_sf_unaware(ns).

I think the main downside of this approach is discoverability. On the upside it should do the right thing in most cases if I'm not missing anything.

Is there a path through the behaviour of sf objects when [..., drop=TRUE] on the dplyr side when one is completely sure that the geometry column must be discarded?

Sorry I don't understand what you mean. I'll say that in general we avoid drop = TRUE when working with data frames in order to preserve the length invariants. Also this discussion is not about dplyr or tidyr where we're happy to use extra precautions, but about how subclasses of data.frame with special interfaces generally interact with other packages in the ecosystem.

@rsbivand
Copy link
Member

It is not just important, it is essential that current behaviour is continued with no user and preferably no downstream maintainer intervention. Any change in behaviour is a show-stopper, especially as we try to migrate hundreds of packages (with often unresponsive maintainers) to use sf or terra from rgeos and rgdal in a very tight time-frame. This is not the right time to introduce insidious breaking changes into the sf ecosystem. Fortunately terra does not touch any tidy packages and is S4, but tidyterra @dieghernan does.

@edzer
Copy link
Member

edzer commented Mar 26, 2023

I agree with @lionel- that this is worth exploring, and, when done right, has the potential to help a lot of users using sf objects in non-sf aware packages. Good to see that data.table has been there too. Right now, as @rsbivand mentioned, our focus is on the retiring of rgeos, rgdal and maptools and helping package developers using those migrating to sf. Unless someone else picks this up the end of this year will be an optimistic start date for working on this.

@edzer
Copy link
Member

edzer commented Jul 23, 2024

started reverse dependency checks...

@rsbivand
Copy link
Member

I hope the auguries were propitious! Is this in a branch that I could install to test ASDAR and SDSR?

@edzer
Copy link
Member

edzer commented Jul 23, 2024

Here's a POC check:

remotes::install_github("edzer/testingsf")
# Using github PAT from envvar GITHUB_PAT. Use `gitcreds::gitcreds_set()` and unset GITHUB_PAT in .Renviron (or elsewhere) if you want to use the more secure git credential store instead.
# Skipping install of 'testingsf' from a github remote, the SHA1 (5c9a2323) has not changed since last install.
#   Use `force = TRUE` to force installation
Sys.setenv("ADD_SF_NAMESPACE"="true")
library(sf)
# Linking to GEOS 3.12.1, GDAL 3.8.4, PROJ 9.3.1; sf_use_s2() is TRUE
demo(nc, ask = FALSE, echo = FALSE)
print(attr(nc, ".sf_namespace")) # check for https://github.com/r-spatial/sf/pull/2212
# function () 
# NULL
# <bytecode: 0x5b6178ccb2c0>
# <environment: namespace:sf>
testingsf::report(nc) # number of columns in nc[1]: 1 under branch sf_aware, 2 under branch main
# ncol x[1]: 1 

The branch can be installed directly:

remotes::install_github("r-spatial/sf", "sf_aware")

@rsbivand
Copy link
Member

ASDAR OK. But ...

SDSR ch 14:

library(sf)
data(pol_pres15, package = "spDataLarge")
pol_pres15 |> subset(select = c(TERYT, name, types)) |> head()
# Error in .subset2(x, i, exact = exact) : 
#  attempt to select less than one element in get1index
> o <- subset(pol_pres15, select = c(TERYT, name, types))
> head(o)
Error in .subset2(x, i, exact = exact) : 
  attempt to select less than one element in get1index
> str(o)
Classes 'sf' and 'data.frame':	2495 obs. of  3 variables:
 $ TERYT: chr  "020101" "020102" "020103" "020104" ...
 $ name : chr  "BOLESŁAWIEC" "BOLESŁAWIEC" "GROMADKA" "NOWOGRODZIEC" ...
 $ types: Factor w/ 4 levels "Rural","Urban",..: 2 1 1 3 1 1 2 2 2 2 ...
> sf:::is_sf_aware()
[1] TRUE

So this breaks base::subset, doesn't it?

@rsbivand
Copy link
Member

rsbivand commented Jul 23, 2024

> pol_pres15[1:2495, c(1, 4, 6), drop = FALSE]
Simple feature collection with 2495 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 171677.6 ymin: 133223.7 xmax: 861882.2 ymax: 774774
Projected CRS: ETRS89 / Poland CS92
First 10 features:
    TERYT                name       types                       geometry
1  020101         BOLESŁAWIEC       Urban MULTIPOLYGON (((261089.5 38...
2  020102         BOLESŁAWIEC       Rural MULTIPOLYGON (((254150 3837...
3  020103            GROMADKA       Rural MULTIPOLYGON (((275346 3846...
4  020104        NOWOGRODZIEC Urban/rural MULTIPOLYGON (((251769.8 37...
5  020105          OSIECZNICA       Rural MULTIPOLYGON (((263423.9 40...
6  020106 WARTA BOLESŁAWIECKA       Rural MULTIPOLYGON (((267030.7 38...
7  020201             BIELAWA       Urban MULTIPOLYGON (((332596.9 31...
8  020202         DZIERŻONIÓW       Urban MULTIPOLYGON (((334652.9 31...
9  020203            PIESZYCE       Urban MULTIPOLYGON (((331038.9 32...
10 020204        PIŁAWA GÓRNA       Urban MULTIPOLYGON (((343571.3 31...

is OK, the damage is inside base:::subset.data.frame.

edzer added a commit that referenced this issue Jul 23, 2024
@edzer
Copy link
Member

edzer commented Jul 23, 2024

Fixed now by adding base - there might be others, like utils & stats?

@rsbivand
Copy link
Member

I remain concerned about model.matrix etc., but that should drop geometry, so probably I worry too much. Will check that adding base lets ch 14-17 run to completion.

@rsbivand
Copy link
Member

SDSR Ch 14-17 now complete without error. Maybe the revdeps will show whether utils or stats are needed too?

@edzer edzer mentioned this issue Jul 27, 2024
5 tasks
edzer added a commit that referenced this issue Jul 28, 2024
edzer added a commit that referenced this issue Jul 28, 2024
edzer added a commit that referenced this issue Jul 30, 2024
@edzer
Copy link
Member

edzer commented Jul 30, 2024

Packages added (for now) to the known_sf_aware list because upstream package assume sf-aware behaviour:

  • stats, because of as.dist in spatialsample
  • base, because of subset()
  • utils because of head()
  • od, used by abstr
  • rsample, used by spatialsample

Methods added to make this work:

  • relocate.sf (generic in dplyr), used by ARUtools

Still open issues

  • gtfs2gps fails, but would pass if data.table is added to the sf-aware package list

edzer added a commit that referenced this issue Jul 31, 2024
@edzer
Copy link
Member

edzer commented Jul 31, 2024

I think I've done about five revdep checks on 800+ packages now, just to discover that there are new packages depending on sf arriving on CRAN on a daily basis. There are now three packages that assume sf-aware behaviour of [.sf inside packages they depend on, but that do not depend on sf:

  • abstr assumes od is sf-aware
  • spatialsample assumes rsample is sf-aware
  • gtfs2gps assumes data.table is sf-aware.

@lionel- how do you suggest we go about this? I'm reluctant to add these secondary dependencies to the sf-aware list; can we also leave that to the packages, say that abstr calls some function that makes [.sf in od behave sf-aware? I tried a function that modifies known_sf_aware inside sf, using <<-, but that seems not to be possible.

edzer added a commit that referenced this issue Jul 31, 2024
@edzer
Copy link
Member

edzer commented Jul 31, 2024

Asking the question is answering it; this now allows e.g. package spatialsample to define in .onLoad()

if (utils::packageVersion("sf") >= "1.0.17")
  sf::sf_make_aware_pkg("rsample")

to make sure that [.sf has sf semantics when called inside rsample (which does not depend on sf).

edzer added a commit that referenced this issue Oct 2, 2024
edzer added a commit that referenced this issue Oct 2, 2024
edzer added a commit that referenced this issue Oct 2, 2024
edzer added a commit that referenced this issue Oct 2, 2024
This reverts commit d7c8f20.
edzer added a commit that referenced this issue Oct 2, 2024
edzer added a commit that referenced this issue Oct 2, 2024
edzer added a commit that referenced this issue Oct 2, 2024
edzer added a commit that referenced this issue Oct 2, 2024
This reverts commit c6a2bf6.
edzer added a commit that referenced this issue Oct 2, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement
Projects
None yet
Development

No branches or pull requests

4 participants