Commit

Merge pull request #74 from minus34/202302
Updates for 202302 release
minus34 authored Feb 28, 2023
2 parents 7ed05a1 + 0675c16 commit c43517a
Showing 42 changed files with 982 additions and 171 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,7 @@
### February 2023 Release
- Postgres dump files are now built on Postgres 14, so Postgres 14+ is required to use them
- Docker images have been upgraded to Postgres 15

### August 2022 Release
- Docker images have been upgraded to Postgres 14

20 changes: 10 additions & 10 deletions README.md
@@ -7,7 +7,7 @@ Have a look at [these intro slides](https://minus34.com/opendata/intro-to-gnaf.p
### There are 4 options for loading the data
1. [Run](https://github.com/minus34/gnaf-loader#option-1---run-loadgnafpy) the load-gnaf Python script and build the database yourself in a single step
2. [Pull](https://github.com/minus34/gnaf-loader#option-2---run-the-database-in-a-docker-container) the database from Docker Hub and run it in a container
3. [Download](https://github.com/minus34/gnaf-loader#option-3---load-pg_dump-files) the GNAF and/or Admin Bdys Postgres dump files & restore them in your Postgres 13+ database
3. [Download](https://github.com/minus34/gnaf-loader#option-3---load-pg_dump-files) the GNAF and/or Admin Bdys Postgres dump files & restore them in your Postgres 14+ database
4. [Use or download](https://github.com/minus34/gnaf-loader#option-4---parquet-files-in-s3) Parquet Files in S3 for your data & analytics workflows; either in AWS or your own platform.

## Option 1 - Run load.gnaf.py
@@ -51,7 +51,7 @@ The behaviour of gnaf-loader can be controlled by specifying various command lin

#### Optional Arguments
* `--srid` Sets the coordinate system of the input data. Valid values are `4283` (the default: GDA94 lat/long) and `7844` (GDA2020 lat/long).
* `--geoscape-version` Geoscape version number in YYYYMM format. Defaults to current year and last release month. e.g. `202211`.
* `--geoscape-version` Geoscape version number in YYYYMM format. Defaults to current year and last release month. e.g. `202302`.
* `--raw-gnaf-schema` schema name to store raw GNAF tables in. Defaults to `raw_gnaf_<geoscape_version>`.
* `--raw-admin-schema` schema name to store raw admin boundary tables in. Defaults to `raw_admin_bdys_<geoscape_version>`.
* `--gnaf-schema` destination schema name to store final GNAF tables in. Defaults to `gnaf_<geoscape_version>`.
@@ -66,7 +66,7 @@ The behaviour of gnaf-loader can be controlled by specifying various command lin
* `--no-boundary-tag` DO NOT tag all addresses with some of the key admin boundary IDs for creating aggregates and choropleth maps.
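
The `--geoscape-version` default above ("current year and last release month") can be illustrated with a short Python sketch. It assumes releases land in February, May, August and November, which is only inferred from the version numbers and changelog entries in this diff. `RELEASE_MONTHS` and `default_geoscape_version` are hypothetical names, not taken from load-gnaf.py.

```python
from datetime import date

# Assumption: quarterly releases in Feb, May, Aug and Nov (inferred, not confirmed)
RELEASE_MONTHS = (2, 5, 8, 11)


def default_geoscape_version(today: date) -> str:
    """Return the most recent release month on or before `today` as YYYYMM."""
    past = [m for m in RELEASE_MONTHS if m <= today.month]
    if past:
        return f"{today.year}{max(past):02d}"
    # January: fall back to the last release of the previous year
    return f"{today.year - 1}{RELEASE_MONTHS[-1]:02d}"


print(default_geoscape_version(date(2023, 2, 28)))
```

Under these assumptions, a date in January resolves to the previous November's release.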

### Example Command Line Arguments
* Local Postgres server: `python load-gnaf.py --gnaf-tables-path="C:\temp\geoscape_202211\G-NAF" --admin-bdys-path="C:\temp\geoscape_202211\Administrative Boundaries"` Loads the GNAF tables to a Postgres server running locally. GNAF archives have been extracted to the folder `C:\temp\geoscape_202211\G-NAF`, and admin boundaries have been extracted to the `C:\temp\geoscape_202211\Administrative Boundaries` folder.
* Local Postgres server: `python load-gnaf.py --gnaf-tables-path="C:\temp\geoscape_202302\G-NAF" --admin-bdys-path="C:\temp\geoscape_202302\Administrative Boundaries"` Loads the GNAF tables to a Postgres server running locally. GNAF archives have been extracted to the folder `C:\temp\geoscape_202302\G-NAF`, and admin boundaries have been extracted to the `C:\temp\geoscape_202302\Administrative Boundaries` folder.
* Remote Postgres server: `python load-gnaf.py --gnaf-tables-path="\\svr\shared\gnaf" --local-server-dir="f:\shared\gnaf" --admin-bdys-path="c:\temp\unzipped\AdminBounds_ESRI"` Loads the GNAF tables which have been extracted to the shared folder `\\svr\shared\gnaf`. This shared folder corresponds to the local `f:\shared\gnaf` folder on the Postgres server. Admin boundaries have been extracted to the `c:\temp\unzipped\AdminBounds_ESRI` folder.
* Loading only selected states: `python load-gnaf.py --states VIC TAS NT ...` Loads only the data for Victoria, Tasmania and Northern Territory

@@ -110,12 +110,12 @@ Download Postgres dump files and restore them in your database.
Should take 15-60 minutes.

### Pre-requisites
- Postgres 13+ with PostGIS 3.0+
- A knowledge of [Postgres pg_restore parameters](https://www.postgresql.org/docs/13/app-pgrestore.html)
- Postgres 14+ with PostGIS 3.0+
- A knowledge of [Postgres pg_restore parameters](https://www.postgresql.org/docs/14/app-pgrestore.html)

### Process
1. Download the [GNAF dump file](https://minus34.com/opendata/geoscape-202211/gnaf-202211.dmp) or [GNAF GDA2020 dump file](https://minus34.com/opendata/geoscape-202211-gda2020/gnaf-202211.dmp) (~2.0Gb)
2. Download the [Admin Bdys dump file](https://minus34.com/opendata/geoscape-202211/admin-bdys-202211.dmp) or [Admin Bdys GDA2020 dump file](https://minus34.com/opendata/geoscape-202211-gda2020/admin-bdys-202211.dmp) (~2.8Gb)
1. Download the [GNAF dump file](https://minus34.com/opendata/geoscape-202302/gnaf-202302.dmp) or [GNAF GDA2020 dump file](https://minus34.com/opendata/geoscape-202302-gda2020/gnaf-202302.dmp) (~2.0Gb)
2. Download the [Admin Bdys dump file](https://minus34.com/opendata/geoscape-202302/admin-bdys-202302.dmp) or [Admin Bdys GDA2020 dump file](https://minus34.com/opendata/geoscape-202302-gda2020/admin-bdys-202302.dmp) (~2.8Gb)
3. Edit the _restore-gnaf-admin-bdys.bat_ or _.sh_ script in the supporting-files folder for your dump file names, database parameters and for the location of pg_restore
4. Run the script, come back in 15-60 minutes and enjoy!
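
As a rough sketch of what the restore step amounts to, the Python below builds a `pg_restore` command line per dump file. The flags mirror the `pg_restore` calls in the Dockerfile changed in this commit. The function name, the default connection values and the choice to print rather than execute are assumptions for illustration; this is not the contents of the _restore-gnaf-admin-bdys_ script.

```python
import shlex


def pg_restore_command(dump_file, database="postgres", host="localhost",
                       port=5432, user="postgres"):
    """Build a pg_restore argument list for a custom-format (-Fc) dump file."""
    return [
        "pg_restore", "-Fc",
        "-d", database,
        "-h", host,
        "-p", str(port),
        "-U", user,
        dump_file,
    ]


for dump in ("gnaf-202302.dmp", "admin-bdys-202302.dmp"):
    # Print the command instead of running it; pass the list to
    # subprocess.run(..., check=True) to actually perform the restore
    print(shlex.join(pg_restore_command(dump)))
```

Building the command as a list avoids shell quoting problems when file paths contain spaces.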

@@ -124,11 +124,11 @@ Parquet versions of all the tables are in a public S3 bucket for use directly in

Geometries are stored as Well Known Text (WKT) strings with WGS84 lat/long coordinates (SRID/EPSG:4326). They can be queried using spatial extensions to analytical platforms, such as [Apache Sedona](https://sedona.apache.org/) running on [Apache Spark](https://spark.apache.org/).

The files are here: `s3://minus34.com/opendata/geoscape-202211/parquet/` or `s3://minus34.com/opendata/geoscape-202211-gda2020/parquet/`
The files are here: `s3://minus34.com/opendata/geoscape-202302/parquet/` or `s3://minus34.com/opendata/geoscape-202302-gda2020/parquet/`

### AWS CLI Examples:
- List all datasets: `aws s3 ls s3://minus34.com/opendata/geoscape-202211/parquet/`
- Copy all datasets: `aws s3 sync s3://minus34.com/opendata/geoscape-202211/parquet/ <my-local-folder>`
- List all datasets: `aws s3 ls s3://minus34.com/opendata/geoscape-202302/parquet/`
- Copy all datasets: `aws s3 sync s3://minus34.com/opendata/geoscape-202302/parquet/ <my-local-folder>`
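
The two S3 locations above follow one naming pattern, which a small Python helper can make explicit. `parquet_prefix` is a hypothetical name used for illustration, and the pattern is only inferred from the two paths shown in this README.

```python
def parquet_prefix(version: str, gda2020: bool = False) -> str:
    """Return the S3 prefix for a release's Parquet files (pattern inferred from the README)."""
    suffix = "-gda2020" if gda2020 else ""
    return f"s3://minus34.com/opendata/geoscape-{version}{suffix}/parquet/"


print(parquet_prefix("202302"))
print(parquet_prefix("202302", gda2020=True))
```

The returned string can be passed straight to `aws s3 ls` or `aws s3 sync`.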

## DATA LICENSES

31 changes: 18 additions & 13 deletions docker/Dockerfile
@@ -1,6 +1,6 @@
FROM debian:buster-slim

ARG BASE_URL="https://minus34.com/opendata/geoscape-202211"
ARG BASE_URL="https://minus34.com/opendata/geoscape-202302"
ENV BASE_URL ${BASE_URL}

# Postgres user password - WARNING: change this to something a lot more secure
@@ -14,7 +14,7 @@ RUN apt-get update \
&& wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add - \
&& echo "deb http://apt.postgresql.org/pub/repos/apt/ buster-pgdg main" | sudo tee /etc/apt/sources.list.d/pgdg.list \
&& apt-get update \
&& apt-get install -y postgresql-14 postgresql-client-14 postgis postgresql-14-postgis-3 \
&& apt-get install -y postgresql-15 postgresql-client-15 postgis postgresql-15-postgis-3 \
&& apt-get autoremove -y --purge \
&& apt-get clean && rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*

@@ -24,21 +24,26 @@ RUN /etc/init.d/postgresql start \
&& sudo -u postgres psql -c "CREATE EXTENSION postgis;" \
&& /etc/init.d/postgresql stop

# enable external access to postgres - WARNING: these are insecure settings! Edit these to restrict access
RUN echo "host all all 0.0.0.0/0 md5" >> /etc/postgresql/15/main/pg_hba.conf
RUN echo "listen_addresses='*'" >> /etc/postgresql/15/main/postgresql.conf

# download and restore GNAF & Admin Boundary Postgres dump files
RUN mkdir -p /data \
&& cd /data \
&& wget --quiet ${BASE_URL}/gnaf-202211.dmp \
&& wget --quiet ${BASE_URL}/admin-bdys-202211.dmp \
&& /etc/init.d/postgresql start \
&& pg_restore -Fc -d postgres -h localhost -p 5432 -U postgres /data/gnaf-202211.dmp \
&& pg_restore -Fc -d postgres -h localhost -p 5432 -U postgres /data/admin-bdys-202211.dmp \
&& wget --quiet ${BASE_URL}/gnaf-202302.dmp \
&& wget --quiet ${BASE_URL}/admin-bdys-202302.dmp

RUN /etc/init.d/postgresql start \
&& pg_restore -Fc -d postgres -h localhost -p 5432 -U postgres /data/gnaf-202302.dmp \
&& /etc/init.d/postgresql stop \
&& rm /data/gnaf-202211.dmp \
&& rm /data/admin-bdys-202211.dmp
&& rm /data/gnaf-202302.dmp

RUN /etc/init.d/postgresql start \
&& pg_restore -Fc -d postgres -h localhost -p 5432 -U postgres /data/admin-bdys-202302.dmp \
&& /etc/init.d/postgresql stop \
&& rm /data/admin-bdys-202302.dmp

# enable external access to postgres - WARNING: these are insecure settings! Edit these to restrict access
RUN echo "host all all 0.0.0.0/0 md5" >> /etc/postgresql/14/main/pg_hba.conf
RUN echo "listen_addresses='*'" >> /etc/postgresql/14/main/postgresql.conf
EXPOSE 5432

# set user for postgres startup
@@ -48,4 +53,4 @@ USER postgres
# VOLUME ["/etc/postgresql", "/var/log/postgresql", "/var/lib/postgresql"]

# Start postgres when starting the container
CMD ["/usr/lib/postgresql/14/bin/postgres", "-D", "/var/lib/postgresql/14/main", "-c", "config_file=/etc/postgresql/14/main/postgresql.conf"]
CMD ["/usr/lib/postgresql/15/bin/postgres", "-D", "/var/lib/postgresql/15/main", "-c", "config_file=/etc/postgresql/15/main/postgresql.conf"]
2 changes: 1 addition & 1 deletion docker/xx_code_snippets.sh
@@ -2,7 +2,7 @@
cd /Users/$(whoami)/git/minus34/gnaf-loader/docker

# build gnaf loader image
docker build --squash --tag minus34/gnafloader:latest --tag minus34/gnafloader:202211 .
docker build --squash --tag minus34/gnafloader:latest --tag minus34/gnafloader:202302 .

# run gnaf loader container
docker run --name=gnafloader --publish=5433:5432 minus34/gnafloader:latest
2 changes: 1 addition & 1 deletion load-gnaf.py
@@ -214,7 +214,7 @@ def populate_raw_gnaf(pg_cur):
# load all PSV files using multiprocessing
geoscape.multiprocess_list("sql", sql_list, logger)

# fix missing geocodes (added due to missing data in 202211 release)
# fix missing geocodes (added due to missing data in 202302 release)
sql = geoscape.open_sql_file("01-04-raw-gnaf-fix-missing-geocodes.sql")
pg_cur.execute(sql)

2 changes: 1 addition & 1 deletion postgres-scripts/01-04-raw-gnaf-fix-missing-geocodes.sql
@@ -1,4 +1,4 @@
-- workaround for missing default coordinates - 202211 release issue
-- workaround for missing default coordinates - 202302 release issue
with missing as (
select address_detail_pid
from raw_gnaf.address_default_geocode
4 changes: 2 additions & 2 deletions postgres-scripts/02-02a-prep-admin-bdys-tables.sql
@@ -203,10 +203,10 @@ UPDATE admin_bdys.locality_bdys
;


-- -- add old locality_pids to unedited localities -- need to rollover old locality pids from GNAF 202211 release - not supplied in 202211 release
-- -- add old locality_pids to unedited localities -- need to rollover old locality pids from GNAF 202302 release - not supplied in 202302 release
-- UPDATE admin_bdys.locality_bdys as new
-- SET old_locality_pid = old.old_locality_pid
-- FROM admin_bdys_202211.locality_bdys AS old
-- FROM admin_bdys_202302.locality_bdys AS old
-- WHERE new.locality_pid = old.locality_pid;


2 changes: 1 addition & 1 deletion postgres-scripts/xx-04-02-manual-bdy-tags.sql
@@ -4,7 +4,7 @@


-- fix 35 boatsheds
update gnaf_202211.address_principal_admin_boundaries
update gnaf_202302.address_principal_admin_boundaries
set lga_pid = 'lgacbffb11990f2',
lga_name = 'Hobart City'
where locality_pid = 'loc0f7a581b85b7'
4 changes: 2 additions & 2 deletions postgres-scripts/xx-add-elevation-to-gnaf.sql
@@ -43,7 +43,7 @@ DROP TABLE IF EXISTS temp_gnaf_100m_points;
--
-- SELECT ST_Value(dem.rast, gnaf.geom) as elevation,
-- *
-- FROM gnaf_202211.address_principals as gnaf
-- INNER JOIN gnaf_202211.srtm_3s_dem as dem on ST_Intersects(gnaf.geom, dem.rast) limit 100;
-- FROM gnaf_202302.address_principals as gnaf
-- INNER JOIN gnaf_202302.srtm_3s_dem as dem on ST_Intersects(gnaf.geom, dem.rast) limit 100;


@@ -6,9 +6,9 @@ SELECT als.gnaf_pid, als.street_locality_pid, als.locality_pid, als.alias_princi
ST_MakePoint(als.longitude, als.latitude)::geography,
ST_MakePoint(gnaf.longitude, gnaf.latitude)::geography
) as distance
FROM gnaf_202211.address_aliases as als
INNER JOIN gnaf_202211.address_alias_lookup as lkp on als.gnaf_pid = lkp.alias_pid
INNER JOIN gnaf_202211.address_principals as gnaf on lkp.principal_pid = gnaf.gnaf_pid
FROM gnaf_202302.address_aliases as als
INNER JOIN gnaf_202302.address_alias_lookup as lkp on als.gnaf_pid = lkp.alias_pid
INNER JOIN gnaf_202302.address_principals as gnaf on lkp.principal_pid = gnaf.gnaf_pid
WHERE als.latitude <> gnaf.latitude
OR als.longitude <> gnaf.longitude
order by ST_Distance(
2 changes: 1 addition & 1 deletion postgres-scripts/xx-export-address-principals-to-csv.sql
@@ -6,5 +6,5 @@ COPY (
address, locality_name, postcode, state, locality_postcode, confidence,
legal_parcel_id, mb_2016_code, mb_2021_code, latitude, longitude,
geocode_type, reliability
FROM gnaf_202211.address_principals
FROM gnaf_202302.address_principals
) TO '/Users/hugh.saalmans/tmp/address_principals.psv' HEADER CSV;
12 changes: 6 additions & 6 deletions postgres-scripts/xx-get-population-per-gnafpid.sql
@@ -22,7 +22,7 @@
--WITH counts AS (
-- SELECT mb_2016_code,
-- count(*) AS address_count
-- FROM gnaf_202211.address_principals
-- FROM gnaf_202302.address_principals
-- GROUP BY mb_2016_code
--)
--UPDATE testing.mb_2016_counts AS mb
@@ -35,7 +35,7 @@
---- add geoms
--UPDATE testing.mb_2016_counts AS mb
-- SET geom = bdys.geom
-- FROM admin_bdys_202211.abs_2016_mb as bdys
-- FROM admin_bdys_202302.abs_2016_mb as bdys
-- WHERE mb.mb_2016_code = bdys.mb_16code::bigint;
--
--ANALYSE testing.mb_2016_counts;
@@ -58,7 +58,7 @@ SELECT gnaf.gnaf_pid,
mb.person,
mb.address_count,
gnaf.geom
FROM gnaf_202211.address_principals as gnaf
FROM gnaf_202302.address_principals as gnaf
INNER JOIN testing.mb_2016_counts AS mb on gnaf.mb_2016_code = mb.mb_2016_code
WHERE mb.address_count >= mb.dwelling
AND mb.dwelling > 0
@@ -92,7 +92,7 @@ SELECT gnaf.gnaf_pid,
mb.address_count,
gnaf.geom,
generate_series(1, ceiling(mb.dwelling::float / mb.address_count::float)::integer) as duplicate_number
FROM gnaf_202211.address_principals as gnaf
FROM gnaf_202302.address_principals as gnaf
INNER JOIN testing.mb_2016_counts AS mb on gnaf.mb_2016_code = mb.mb_2016_code
WHERE mb.address_count < mb.dwelling
AND address_count > 0
@@ -219,7 +219,7 @@ WITH adr AS (
mb.person,
mb.address_count,
gnaf.geom
FROM gnaf_202211.address_principals as gnaf
FROM gnaf_202302.address_principals as gnaf
INNER JOIN testing.mb_2016_counts AS mb on gnaf.mb_2016_code = mb.mb_2016_code
WHERE mb.address_count >= mb.person
AND mb.dwelling = 0
@@ -253,7 +253,7 @@ WITH adr AS (
mb.address_count,
gnaf.geom,
generate_series(1, ceiling(mb.person::float / mb.address_count::float)::integer) as duplicate_number
FROM gnaf_202211.address_principals as gnaf
FROM gnaf_202302.address_principals as gnaf
INNER JOIN testing.mb_2016_counts AS mb on gnaf.mb_2016_code = mb.mb_2016_code
WHERE mb.address_count < mb.person
AND mb.address_count > 0
10 changes: 5 additions & 5 deletions postgres-scripts/xx_calculate_partitions.sql
@@ -5,13 +5,13 @@ CREATE TABLE testing2.gnaf_partitions AS
WITH parts AS(
SELECT unnest((select array_agg(counter) from generate_series(1, 99, 1) AS counter)) as partition_id,
unnest((select array_agg(fraction) from generate_series(0.01, 0.99, 0.01) AS fraction)) as percentile,
unnest((select percentile_cont((select array_agg(s) from generate_series(0.01, 0.99, 0.01) as s)) WITHIN GROUP (ORDER BY longitude) FROM gnaf_202211.address_principals)) as longitude
unnest((select percentile_cont((select array_agg(s) from generate_series(0.01, 0.99, 0.01) as s)) WITHIN GROUP (ORDER BY longitude) FROM gnaf_202302.address_principals)) as longitude
), parts2 AS (
SELECT 0 AS partition_id, 0.0 AS percentile, min(longitude) - 0.0001 AS longitude FROM gnaf_202211.address_principals
SELECT 0 AS partition_id, 0.0 AS percentile, min(longitude) - 0.0001 AS longitude FROM gnaf_202302.address_principals
UNION ALL
SELECT * FROM parts
UNION ALL
SELECT 100 AS partition_id, 1.0 AS percentile, max(longitude) - 0.0001 AS longitude FROM gnaf_202211.address_principals
SELECT 100 AS partition_id, 1.0 AS percentile, max(longitude) - 0.0001 AS longitude FROM gnaf_202302.address_principals
)
SELECT partition_id,
percentile,
@@ -43,7 +43,7 @@ WITH merge AS (
name,
state,
st_intersection(bdy.geom, part.geom) AS geom
FROM admin_bdys_202211.commonwealth_electorates as bdy
FROM admin_bdys_202302.commonwealth_electorates as bdy
INNER JOIN testing2.gnaf_partitions as part ON st_intersects(bdy.geom, part.geom)
)
INSERT INTO testing2.commonwealth_electorates_partitioned (partition_id, ce_pid, name, state, geom)
@@ -65,4 +65,4 @@ commit;

select count(*) from testing2.commonwealth_electorates_partitioned;

select count(*) from admin_bdys_202211.commonwealth_electorates_analysis;
select count(*) from admin_bdys_202302.commonwealth_electorates_analysis;
4 changes: 2 additions & 2 deletions postgres-scripts/xx_qa_table_counts.sql
@@ -3,14 +3,14 @@ SELECT new.table_name,
new.aus - old.aus as difference,
new.aus as new_aus,
old.aus as old_aus
FROM gnaf_202211.qa as new
FROM gnaf_202302.qa as new
INNER JOIN gnaf_202102.qa as old ON new.table_name = old.table_name
;

SELECT new.table_name,
new.aus - old.aus as difference,
new.aus as new_aus,
old.aus as old_aus
FROM admin_bdys_202211.qa as new
FROM admin_bdys_202302.qa as new
INNER JOIN admin_bdys_202102.qa as old ON new.table_name = old.table_name
;
14 changes: 7 additions & 7 deletions postgres-scripts/xx_test_state_electorates.sql
@@ -2,27 +2,27 @@



DROP VIEW IF EXISTS raw_admin_bdys_202211.vw_tenp_state_electorates;
CREATE VIEW raw_admin_bdys_202211.vw_tenp_state_electorates AS
DROP VIEW IF EXISTS raw_admin_bdys_202302.vw_tenp_state_electorates;
CREATE VIEW raw_admin_bdys_202302.vw_tenp_state_electorates AS
SELECT dat.*,
aut.name,
bdy.se_ply_pid,
bdy.geom
FROM raw_admin_bdys_202211.aus_state_electoral as dat
INNER JOIN raw_admin_bdys_202211.aus_state_electoral_class_aut as aut on dat.secl_code = aut.code
INNER JOIN raw_admin_bdys_202211.aus_state_electoral_polygon as bdy on dat.se_pid = bdy.se_pid
FROM raw_admin_bdys_202302.aus_state_electoral as dat
INNER JOIN raw_admin_bdys_202302.aus_state_electoral_class_aut as aut on dat.secl_code = aut.code
INNER JOIN raw_admin_bdys_202302.aus_state_electoral_polygon as bdy on dat.se_pid = bdy.se_pid
-- where name = 'KEW'
;

select * from raw_admin_bdys_202211.vw_tenp_state_electorates
select * from raw_admin_bdys_202302.vw_tenp_state_electorates
where name = 'KEW'
order by se_pid,
dt_create
;



select * from raw_admin_bdys_202211.aus_state_electoral_polygon
select * from raw_admin_bdys_202302.aus_state_electoral_polygon
where se_pid = 'VIC292'
order by se_pid,
dt_create