Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Leverage on giscoR #264

Merged
merged 8 commits into from
Jul 31, 2023
Merged

Leverage on giscoR #264

merged 8 commits into from
Jul 31, 2023

Conversation

dieghernan
Copy link
Member

Hi @antagomir @pitkant

This PR is related with #230 and #263

Basically now get_eurostat_geospatial is a wrapper of giscoR::gisco_get_nuts. I skimmed the function as much as I can so issues on that function would be addressed now in giscoR, reducing the burden of maintenance by eurostat package.

I also reviewed the dependencies removing a bunch of them, updated the vignettes and the actions and deleted unused files in docs/vignettes/revdep.

I checked the package and revdeps and all it’s fine. Summary of changes:

  • get_eurostat_geospatial() now leverages on giscoR::gisco_get_nuts() for
    downloading geospatial data:
    • "spdf" output class soft-deprecated, it would return a sf object with a message.
    • make_valid parameter soft-deprecated.
    • Added ... to the function so additional parametes can be passed to giscoR::gisco_get_nuts().
    • Dataset eurostat_geodata_60_2016 updated.

@dieghernan dieghernan marked this pull request as ready for review July 2, 2023 09:24
@pitkant
Copy link
Member

pitkant commented Jul 31, 2023

Excellent work, thank you very much! I tested this version and everything seems to be working smoothly.

I was left thinking about issue #240 since now there's a difference between datasets read from the cache and datasets downloaded from GISCO. Like this:

Downloaded from GISCO by using resolution "20":

> sf <- get_eurostat_geospatial(
+ output_class = "sf",
+  resolution = "20",
+  nuts_level = "all"
+  )
Extracting data using giscoR package, please report issues on https://github.com/rOpenGov/giscoR/issues
Cache management as per giscoR. see 'giscoR::gisco_get_nuts()'
> sf
Simple feature collection with 2016 features and 10 fields
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: -63.08825 ymin: -21.39077 xmax: 55.83808 ymax: 71.11814
Geodetic CRS:  WGS 84
First 10 features:
                    NAME_LATN MOUNT_TYPE CNTR_CODE URBN_TYPE COAST_TYPE FID                  NUTS_NAME NUTS_ID LEVL_CODE                       geometry geo
1                     ESPAÑA           0        ES         0          0  ES                    ESPAÑA       ES         0 MULTIPOLYGON (((-7.03184 43...  ES
2             SUOMI / FINLAND          0        FI         0          0  FI            SUOMI / FINLAND      FI         0 MULTIPOLYGON (((28.92968 69...  FI
3                      ÍSLAND          0        IS         0          0  IS                     ÍSLAND      IS         0 POLYGON ((-22.0074 63.83599...  IS
4  REGIÃO AUTÓNOMA DOS AÇORES          0        PT         0          0 PT2 REGIÃO AUTÓNOMA DOS AÇORES     PT2         1 MULTIPOLYGON (((-25.6919 37... PT2
5                      FRANCE          0        FR         0          0  FR                     FRANCE      FR         0 MULTIPOLYGON (((2.60704 50....  FR
6                    HRVATSKA          0        HR         0          0  HR                   HRVATSKA      HR         0 MULTIPOLYGON (((16.59681 46...  HR
7                MAGYARORSZÁG          0        HU         0          0  HU               MAGYARORSZÁG      HU         0 POLYGON ((22.12108 48.37831...  HU
[...]

Read from cache by using resolution "60":

> sf <- get_eurostat_geospatial(
+ output_class = "sf",
+  resolution = "60",
+  nuts_level = "all"
+  )
Extracting data from eurostat::eurostat_geodata_60_2016
> sf
Simple feature collection with 2016 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -61.841 ymin: -21.376 xmax: 55.85 ymax: 71.178
Geodetic CRS:  WGS 84
First 10 features:
   LEVL_CODE NUTS_ID URBN_TYPE CNTR_CODE               NAME_LATN               NUTS_NAME MOUNT_TYPE COAST_TYPE FID geo                       geometry
1          0      AL         0        AL               SHQIPËRIA               SHQIPËRIA          0          0  AL  AL MULTIPOLYGON (((19.831 42.4...
2          0      AT         0        AT              ÖSTERREICH              ÖSTERREICH          0          0  AT  AT MULTIPOLYGON (((15.754 48.8...
3          0      BE         0        BE         BELGIQUE-BELGIË         BELGIQUE-BELGIË          0          0  BE  BE MULTIPOLYGON (((5.238 51.26...
4          0      BG         0        BG                BULGARIA                БЪЛГАРИЯ          0          0  BG  BG MULTIPOLYGON (((22.675 44.2...
5          0      CH         0        CH SCHWEIZ/SUISSE/SVIZZERA SCHWEIZ/SUISSE/SVIZZERA          0          3  CH  CH MULTIPOLYGON (((8.67 47.685...
6          0      CY         0        CY                  KYPROS                  ΚΥΠΡΟΣ          0          0  CY  CY MULTIPOLYGON (((34.633 35.8...
[...]

So the difference being mainly in the position of the "geo" column. I think having "geometry" column as the last column of the "sf" "data.frame" object would be desirable. Another question is whether it's actually needed, but then again someone needed it in the first place since it was added there. Does @muuankarski have any input on this?

@dieghernan
Copy link
Member Author

I wasn’t aware of #240, let me have a look

@dieghernan
Copy link
Member Author

Hi @pitkant

After 3a6e48f the output of the spatial objects are consistent in the number, name and position of the columns. If I check for all years:

years <- c('2003','2006','2010','2013','2016','2021')
allyears <- lapply(years, function(x){
  
  tb <- suppressMessages(eurostat::get_eurostat_geospatial(country = "LU",
                                          nuts_level = 0, resolution = 20,
                                          update_cache = TRUE))
                         
  

  # Prepare summary
  df <- data.frame(column = names(tb),
                   ncol = seq_len(ncol(tb)))
  df <- tidyr::pivot_wider(df, names_from = column, values_from = ncol)
  df$year <- x
  
  df
})

dplyr::bind_rows(allyears) |>
  dplyr::relocate(year, .before = 1) |>
  knitr::kable(caption = "Number indicates the position of the column in the object")
year id LEVL_CODE NUTS_ID CNTR_CODE NAME_LATN NUTS_NAME MOUNT_TYPE URBN_TYPE COAST_TYPE FID geo geometry
2003 1 2 3 4 5 6 7 8 9 10 11 12
2006 1 2 3 4 5 6 7 8 9 10 11 12
2010 1 2 3 4 5 6 7 8 9 10 11 12
2013 1 2 3 4 5 6 7 8 9 10 11 12
2016 1 2 3 4 5 6 7 8 9 10 11 12
2021 1 2 3 4 5 6 7 8 9 10 11 12

Number indicates the position of the column in the object

Created on 2023-07-31 with reprex v2.0.2

All the objects have the same columns in the same position. Note that, as mentioned in #240, for some years (i.e., from 2003 to 2013) there are columns that are not provided (i.e., MOUNT_TYPE and friends). In those cases, our output would return that columns but with NA , so it would be more consistent for users:

eurostat::get_eurostat_geospatial(year = 2010) |> 
  dplyr::glimpse()
#> Extracting data using giscoR package, please report issues on https://github.com/rOpenGov/giscoR/issues
#> Rows: 1,920
#> Columns: 12
#> $ id         <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ LEVL_CODE  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ NUTS_ID    <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ CNTR_CODE  <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT",…
#> $ NAME_LATN  <chr> "Mittelburgenland", "Nordburgenland", "Südburgenland", "Mos…
#> $ NUTS_NAME  <chr> "Mittelburgenland", "Nordburgenland", "Südburgenland", "Mos…
#> $ MOUNT_TYPE <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ URBN_TYPE  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ COAST_TYPE <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ FID        <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ geo        <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((16.648 47.4..., MULTIPOLYGON (…

Created on 2023-07-31 with reprex v2.0.2

@pitkant
Copy link
Member

pitkant commented Jul 31, 2023

Thank you for the fixes! The output looks nice and consistent now. I'm still not sure about adding duplicated columns in the dataset for supposedly easier joins, especially if geo and FID are both duplicates, but I guess it doesn't add too much overhead either. I added a note about geo column having the "Questioning" status

@dieghernan
Copy link
Member Author

Thanks @pitkant , I consider this PR ready now.

Comment on lines 214 to 216
if (!requireNamespace("sf")) {
message("'sf' package is required for geospatial functionalities")
return(invisible())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this PR was already merged I noticed a logical problem that originated here:

!requireNamespace("sf") check is not run if use_local == TRUE
This causes problems later on in rows 265:267 (see another comment)

Comment on lines 257 to 267
sf_col <- attr(shp, "sf_column")
# Remove geometry
shp <- sf::st_drop_geometry(shp)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though this PR was already merged I noticed a logical problem that originated here:

!requireNamespace("sf") check is not run if use_local == TRUE
If sf is not installed then sf::st_drop_geometry(shp) can't be run. Something like shp$geometry <- NULL could do the same in this situation, I guess?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants