-
Notifications
You must be signed in to change notification settings - Fork 301
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Writing GeoPackage files in a reproducible way #1618
Comments
According to GeoPackage 1.2.1 and 1.3 the default for the In R, for the lubridate::format_ISO8601(as.POSIXct(Sys.time()), precision = "ymdhms", usetz = TRUE)
#> [1] "2021-03-02T16:07:19+0100"
lubridate::format_ISO8601(as.POSIXct(Sys.Date()), precision = "ymdhms", usetz = TRUE)
#> [1] "2021-03-02T01:00:00+0100" Created on 2021-03-02 by the reprex package (v1.0.0) UPDATE:
Regarding the ISO 8601 time format, clearly the strict UTC format, represented by a 'Z' at the end, should be used. Requirement 15 of GeoPackage 1.2.1 (and 1.3):
This can probably be achieved by using the SQLite function |
A preliminary attempt is in https://github.com/florisvdh/sf/blob/set_timestamp_gpkg/R/manipulate_gpkg.R; some examples still to be added. And I did not yet test the function. No problem if sf would not be the right place for it. |
I'm hesitant in adopting driver-specific code. Is this maybe more something for https://github.com/r-spatial/sfdbi @etiennebr ? Or a package of its own? |
This is for the GDAL driver to do, possibly through options. https://gdal.org/drivers/vector/gpkg.html asserts 1.2 only. https://gdal.org/drivers/raster/gpkg.html isn't specific about versions, but mentions progress in GDAL 3.2 towards tile handling. I suggest those interesting fund a GDAL driver upgrade; these are potentially costly needs. If in the GDAL driver, every application benefits (terra, gdalcubes, rgdal, sf, etc., among other R packages too). |
I had looked at GDAL and it did not have such an option indeed. I just needed an approach right now, hence happy to share it, but I agree investment in GDAL is the better way to go! In the meantime I'm still available to contribute the function to another spatial package, and otherwise I'll distribute it separately, possibly in one of the (publicly available) packages of my organisation. Or as a small standalone package indeed. |
This can be done in
I wonder if |
Nice! The geopackage driver page does not list this variable among the dataset creation or layer options, so my guess is no. |
|
It's a global configuration option, so |
The possibility of using
That's an exact match with the original purpose. Regarding the precise date-time format for GPKG, without lubridate, my current take (here) is: format(timestamp, format = "%Y-%m-%dT%H:%M:%S.000Z", tz = "UTC") This approach works for both See the GPKG requirement referred above. So it requires more than just ISO 8601, notably milliseconds and 'Z' for UTC. I could rework the function to set a system variable and drop its dependency on |
@dbaston I expect that |
I'm guessing it would. Why not try and see? Then if it works, you could follow up OSGeo/gdal#3519 with a patch for the raster driver doc. |
FWIW: updated the Reprex with sf point layer (using date / using time) and with stars object, based on the function exampleslibrary(sf)
#> Linking to GEOS 3.9.0, GDAL 3.1.3, PROJ 7.2.1
library(openssl)
#> Linking to: OpenSSL 1.1.1 11 Sep 2018
md5sum <- function(x) paste(md5(file(x)))
# Using existing geopackage with vector layer:
filepath <- system.file("gpkg/b_pump.gpkg", package = "sf")
(md5_original <- md5sum(filepath))
#> [1] "18e20ba28fe1b85ed94f9d5debcb1771"
sf_layer <- read_sf(system.file("gpkg/b_pump.gpkg", package = "sf"))
# A rewrite changes the checksum:
filepath_notimeset <- file.path(tempdir(), "b_pump_notimeset.gpkg")
# write 1:
write_sf(sf_layer, dsn = filepath_notimeset)
(md5_notimeset1 <- md5sum(filepath_notimeset))
#> [1] "d63394fc36b6ffbff7f2d14690f89a11"
# write 2:
write_sf(sf_layer, dsn = filepath_notimeset, delete_dsn = TRUE)
(md5_notimeset2 <- md5sum(filepath_notimeset))
#> [1] "efaeb68623b009b49669ae977afd276c"
# compare:
md5_notimeset1 == md5_notimeset2
#> [1] FALSE
# Setting a fixed date
filepath_timeset <- file.path(tempdir(), "b_pump_timeset.gpkg")
(fixed_date <- as.POSIXct("2020-12-25"))
#> [1] "2020-12-25 CET"
set_timestamp_gpkg(fixed_date)
Sys.getenv("OGR_CURRENT_DATE")
#> [1] "2020-12-24T23:00:00.000Z"
# write 1 (date):
write_sf(sf_layer, dsn = filepath_timeset)
md5_timeset1 <- md5sum(filepath_timeset)
# write 2 (date):
write_sf(sf_layer, dsn = filepath_timeset, delete_dsn = TRUE)
md5_timeset2 <- md5sum(filepath_timeset)
# compare:
all.equal(md5_timeset1, md5_timeset2)
#> [1] TRUE
# Setting a fixed time
(fixed_time <- as.POSIXct("2020-12-25 12:00:00", tz = "CET"))
#> [1] "2020-12-25 12:00:00 CET"
set_timestamp_gpkg(fixed_time)
Sys.getenv("OGR_CURRENT_DATE")
#> [1] "2020-12-25T11:00:00.000Z"
# write 3 (time):
write_sf(sf_layer, dsn = filepath_timeset, delete_dsn = TRUE)
md5_timeset3 <- md5sum(filepath_timeset)
# write 4 (time):
write_sf(sf_layer, dsn = filepath_timeset, delete_dsn = TRUE)
md5_timeset4 <- md5sum(filepath_timeset)
# compare:
all.equal(md5_timeset3, md5_timeset4)
#> [1] TRUE
# Also works for GPKG with GDAL raster driver (accessed with stars):
suppressPackageStartupMessages({
library(stars)
library(dplyr)
})
filepath_stars <- file.path(tempdir(), "stars_2d.gpkg")
(fixed_time <- as.POSIXct("2010-02-14 12:00:00", tz = "CET"))
#> [1] "2010-02-14 12:00:00 CET"
set_timestamp_gpkg(fixed_time)
Sys.getenv("OGR_CURRENT_DATE")
#> [1] "2010-02-14T11:00:00.000Z"
stars_2d <-
system.file("tif/L7_ETMs.tif", package = "stars") %>%
read_stars() %>%
slice(band, 1)
# write 1:
stars_2d %>%
write_stars(filepath_stars, driver = "GPKG")
md5_stars1 <- md5sum(filepath_stars)
# write 2:
stars_2d %>%
write_stars(filepath_stars, driver = "GPKG")
md5_stars2 <- md5sum(filepath_stars)
# compare:
all.equal(md5_stars1, md5_stars2)
#> [1] TRUE Created on 2021-03-04 by the reprex package (v1.0.0) Session infosessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.0.4 (2021-02-15)
#> os Linux Mint 20
#> system x86_64, linux-gnu
#> ui X11
#> language nl_BE:nl
#> collate nl_BE.UTF-8
#> ctype nl_BE.UTF-8
#> tz Europe/Brussels
#> date 2021-03-04
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date lib source
#> abind * 1.4-5 2016-07-21 [1] CRAN (R 4.0.2)
#> askpass 1.1 2019-01-13 [1] CRAN (R 4.0.2)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2)
#> class 7.3-18 2021-01-24 [4] CRAN (R 4.0.3)
#> classInt 0.4-3 2020-04-07 [1] CRAN (R 4.0.2)
#> cli 2.3.0 2021-01-31 [1] CRAN (R 4.0.3)
#> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3)
#> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3)
#> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3)
#> dplyr * 1.0.4 2021-02-02 [1] CRAN (R 4.0.3)
#> e1071 1.7-4 2020-10-14 [1] CRAN (R 4.0.3)
#> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2)
#> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2)
#> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2)
#> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3)
#> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2)
#> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2)
#> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3)
#> KernSmooth 2.23-18 2020-10-29 [4] CRAN (R 4.0.3)
#> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3)
#> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4)
#> lwgeom 0.2-5 2020-06-12 [1] CRAN (R 4.0.2)
#> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3)
#> openssl * 1.4.3 2020-09-18 [1] CRAN (R 4.0.2)
#> pillar 1.4.7 2020-11-20 [1] CRAN (R 4.0.3)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2)
#> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3)
#> Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.0.3)
#> reprex 1.0.0 2021-01-27 [1] CRAN (R 4.0.3)
#> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3)
#> rmarkdown 2.6 2020-12-14 [1] CRAN (R 4.0.3)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3)
#> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2)
#> sf * 0.9-8 2021-03-04 [1] Github (florisvdh/sf@34434d1)
#> stars * 0.5-1 2021-01-25 [1] CRAN (R 4.0.3)
#> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2)
#> tibble 3.0.6 2021-01-29 [1] CRAN (R 4.0.3)
#> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2)
#> units 0.6-7 2020-06-13 [1] CRAN (R 4.0.2)
#> vctrs 0.3.6 2020-12-17 [1] CRAN (R 4.0.3)
#> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3)
#> xfun 0.21 2021-02-10 [1] CRAN (R 4.0.4)
#> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2)
#>
#> [1] /home/floris/lib/R/library
#> [2] /usr/local/lib/R/site-library
#> [3] /usr/lib/R/site-library
#> [4] /usr/lib/R/library |
Overview & some updates:
A thought: I think the above is rather to be seen independent of sf; can be used in various settings. But maybe an sf-specific |
Following helper functions to assist in writing reproducible GeoPackage files are now provided in an R package
Further ideas and contributions are welcome. Currently I have no further plans to expand the package, since it was only meant to make these specific functions available. It's not linked particularly to Related to |
OK, I added library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(RSQLite)
library(lubridate)
# Attaching package: ‘lubridate’
# The following objects are masked from ‘package:base’:
# date, intersect, setdiff, union
nc <- st_read(system.file('shape/nc.shp', package = 'sf'))
# Reading layer `nc' from data source `/home/edzer/R/x86_64-pc-linux-gnu-library/4.0/sf/shape/nc.shp' using driver `ESRI Shapefile'
# Simple feature collection with 100 features and 14 fields
# Geometry type: MULTIPOLYGON
# Dimension: XY
# Bounding box: xmin: -84.32385 ymin: 33.88199 xmax: -75.45698 ymax: 36.58965
# Geodetic CRS: NAD27
co = c(OGR_CURRENT_DATE = format_ISO8601(Sys.time() - years(20),
precision = 'ymdhms',
usetz = TRUE))
st_write(nc, '/tmp/nc.gpkg', append = FALSE, config_options = co)
# Deleting layer `nc' using driver `GPKG'
# Writing layer `nc' to data source `/tmp/nc.gpkg' using driver `GPKG'
# Writing 100 features with 14 fields and geometry type Multi Polygon.
con <- dbConnect(SQLite(), '/tmp/nc.gpkg')
dbGetQuery(con, 'SELECT last_change FROM gpkg_contents')$last_change
# [1] "2001-03-31T17:17:19+0200"
names(co)=NULL
st_write(nc, '/tmp/nc.gpkg', append = FALSE, config_options = co)
# Error in CPL_write_ogr(obj, dsn, layer, driver, as.character(dataset_options), :
# config_options should be a character vector with names, as in c(key="value")
# Calls: st_write -> st_write.sf -> CPL_write_ogr
# Execution halted |
Note that config options set once persist to subsequent calls: library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(RSQLite)
library(lubridate, warn.conflicts = FALSE)
nc <- st_read(system.file('shape/nc.shp', package = 'sf'), quiet = TRUE)
co = c(OGR_CURRENT_DATE = format_ISO8601(Sys.time() - years(20),
precision = 'ymdhms',
usetz = TRUE))
st_write(nc, '/tmp/nc.gpkg', append = FALSE, config_options = co, quiet = TRUE)
con <- dbConnect(SQLite(), '/tmp/nc.gpkg')
dbGetQuery(con, 'SELECT last_change FROM gpkg_contents')$last_change
# [1] "2001-03-31T17:23:27+0200"
st_write(nc, '/tmp/nc.gpkg', append = FALSE, quiet = TRUE)
con <- dbConnect(SQLite(), '/tmp/nc.gpkg')
dbGetQuery(con, 'SELECT last_change FROM gpkg_contents')$last_change
# [1] "2001-03-31T17:23:27+0200" |
That's great news - thanks a lot for adding this feature to In case of format(timestamp, format = "%Y-%m-%dT%H:%M:%S.000Z", tz = "UTC") (works for |
Any reason not to reset the config options after the call? If the options persist globally I think a separate |
I don't see a way to unset them in GDAL, there's no |
|
Ah, thanks that clears the case: from here: "This function can also be used to clear a setting by passing NULL as the value (note: passing NULL will not unset an existing environment variable; it will just unset a value previously set by CPLSetConfigOption())." So they need to be unset at the end of the function. |
That gives us: library(sf)
# Linking to GEOS 3.9.0, GDAL 3.2.1, PROJ 7.2.1
library(RSQLite)
library(lubridate, warn.conflicts = FALSE)
nc <- st_read(system.file('shape/nc.shp', package = 'sf'), quiet = TRUE)
co = c(OGR_CURRENT_DATE = format_ISO8601(Sys.time() - years(20),
precision = 'ymdhms',
usetz = TRUE))
st_write(nc, '/tmp/nc.gpkg', append = FALSE, config_options = co, quiet = TRUE)
con <- dbConnect(SQLite(), '/tmp/nc.gpkg')
dbGetQuery(con, 'SELECT last_change FROM gpkg_contents')$last_change
# [1] "2001-03-31T18:39:17+0200"
st_write(nc, '/tmp/nc.gpkg', append = FALSE, quiet = TRUE)
con <- dbConnect(SQLite(), '/tmp/nc.gpkg')
dbGetQuery(con, 'SELECT last_change FROM gpkg_contents')$last_change
# [1] "2021-03-31T16:39:17.815Z" |
I'm preparing an extra section in the |
Avoiding lubridate, see #1618 (comment) |
Also, in CPLGetConfigOption() it is explained that "If the given option was no defined with CPLSetConfigOption(), it tries to find it in environment variables" |
Hello, Has anyone approached the same problem (reproducible/consistent MD5 values for geopackages) in Python? Thanks in advance. |
Would the addition of a function
set_timestamp_gpkg(dsn, timestamp)
to sf be sensible? In thegpkg_contents
table of a GeoPackage thelast_change
column is always updated with a timestamp for each written layer; however this makes it difficult when the aim is file reproducibility. See also https://gis.stackexchange.com/a/385944/178290.Such function would be a separate helper function to reproducibly create GPKG files, allowing to achieve a constant file checksum as long as file contents (disregarding timestamp) don't change. It could be implemented by the user right after doing
st_write(driver = "GPKG")
, by setting an explicit (ISO 8601) timestamp, if reproducibility is important.A prototype script:
The text was updated successfully, but these errors were encountered: