This repo is part of the Statbot Swiss Project.
Have a look at this repo for the final results of the research project with ZHAW.
The project has the following parts:
- Statbot Postgres DB: A Postgres DB that is filled with open data from Swiss statistical offices (the Federal Office of Statistics and selected cantonal statistical offices)
- Statbot Training Data: Training and test data for the Statbot Chatbot, see below.
- Statbot Chatbot: An ML model for a chatbot that turns questions into SQL queries in order to derive answers to these questions by running the SQL queries on the Statbot Postgres DB. It uses the Statbot Training Data for its model training.
This repo takes care of the following two tasks:
- it provides functions and pipelines to fill and update the Statbot Postgres DB
- it provides questions and queries as Statbot Training Data for the Statbot Chatbot:
```mermaid
flowchart TD
  Fed[Federal Datasets] -->|Preprocess| Scripts[Statbot scripts]
  Scripts -->|Upload| DB(Statbot Postgres DB)
  Can[Cantonal Datasets] -->|Preprocess| Scripts
  User -->|Ask Questions| Chatbot(Statbot Swiss)
  Chatbot -->|Query| DB
```
The repo consists of the following parts:
- input datasets: contains the list of input datasets for the pipelines and also a list of Swiss administrative units that are linked to each dataset
- R package `statbotData`: the functions in the directory `R` make up an R package of reusable functions that are used throughout the pipelines and scripts
- pipelines per dataset: they contain scripts for downloading datasets from their origin and preprocessing them for the integration into the Statbot Postgres DB. They also include other files that relate to the dataset, such as the Statbot Training Data.
- scripts: the `scripts` directory contains sample scripts that can be used to do chores on the datasets and pipelines, such as uploading a dataset and its metadata to the Statbot Postgres DB
- app: `app.R` is a local Shiny app that gives an overview of the repo and its pipelines; this app is only meant to support the development process and is not deployed anywhere
The `data/const` directory consists of two files:
- `statbot_input_data.csv`: a csv file of all the datasets that have been uploaded to the Statbot Postgres DB
- `spatial_unit_postgres.csv`: a csv file of Swiss administrative units. It corresponds to the table `spatial_unit` in the Statbot Postgres DB.
The statbot input data `statbot_input_data.csv` describes datasets that are available as open data and have the following properties:
```mermaid
classDiagram
  Dataset <|-- PxCube
  Dataset <|-- CsvFile
  Dataset <|-- ExcelSheet
  Dataset <|-- LINDAS
  Dataset : data_indicator
  Dataset : status
  Dataset : name
  Dataset : format
  Dataset : lang
  Dataset : publisher
  Dataset : organization
  Dataset : +create_dataset()
  Dataset : +download_data()
  class PxCube{
    px_id
    access_url
    size
    encoding
  }
  class CsvFile{
    download_url
  }
  class ExcelSheet{
    download_url
  }
  class LINDAS{
    query_url
  }
```
Mandatory properties:
- `data_indicator`: an identifier for the dataset pipeline such as `A1`; this identifier also serves as the directory name for the pipeline files in the repo
- `status`: only the statuses `uploaded` or `remote` are relevant and indicate that the dataset is already in postgres; any other status can be chosen as needed. The status `uploaded` indicates that the dataset was uploaded to postgres via this `statbotData` repository. The status `remote` indicates that the dataset was uploaded to postgres previously.
- `name`: the table name in postgres for this dataset
- `format`: the format determines how the data is retrieved from its origin: `px`, `csv`, `xlsx` and `rdf` are the current options
- `lang`: the language of the dataset; currently this can be either `en`, `de` or `fr`
- `publisher`: the publisher of the dataset
- `organization`: the organization that the dataset belongs to
Optional properties for the `px` file format: for the `px` file format the following parameter needs to be specified:
- `px_id`: id of the px file

Additional optional parameters for this file format are:
- `size`: `large` triggers a different download method for large files
- `encoding`: `UTF-8` or `latin1`; for `large` `px` files the encoding needs to be specified
By default the R package BFS is used to retrieve the data. It gets the data via API calls in batches. In the case of a large dataset the API limit might be an obstacle. Therefore, if the dataset is classified as `large`, the package pxRRead is used for the download instead.
Properties for the data access:
- `access_url`: url to access the original dataset
- `query_url`: if the format is `rdf`, a url for the SPARQL endpoint is required
- `download_url`: for the file formats `csv` or `xlsx` the file download url is required
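To make these properties concrete, here is a hypothetical entry of `statbot_input_data.csv`, shown as an R tibble. All values are invented for illustration and do not describe a real dataset in the list:

```r
library(tibble)

# hypothetical row of statbot_input_data.csv (all values invented for illustration)
example_entry <- tibble(
  data_indicator = "A1",
  status         = "uploaded",
  name           = "example_dataset",
  format         = "px",
  lang           = "de",
  publisher      = "BFS",
  organization   = "Federal Office of Statistics",
  px_id          = "px-x-0000000000_000",          # placeholder id, only needed for format = "px"
  size           = NA,                             # "large" would switch to the pxRRead download
  encoding       = NA,                             # only needed for large px files
  access_url     = "https://example.org/dataset",  # url to the original dataset
  query_url      = NA,                             # only for format = "rdf" (SPARQL endpoint)
  download_url   = NA                              # only for format = "csv" or "xlsx"
)
```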
One goal of this repo is to fill up the Statbot Postgres DB. This database has the following tables:
```mermaid
erDiagram
  metadata_tables
  metadata_table_columns
  dataset_observation }o--|| spatial_unit : spatial_scope
```
dataset tables: each dataset is uploaded into exactly one table in the Statbot Postgres DB. All dataset tables contain these two mandatory columns:
- `year`: year of the observation
- `spatialunit_uid`: Swiss administrative unit that the observation refers to
Besides these dataset tables, there are extra tables.

spatial unit table: contains spatial units for all administrative units in Switzerland. `spatialunit_uid` is the primary key in that table. The `spatialunit_uid` column in the dataset tables serves as a foreign key to this table.
extra metadata tables: there are two extra metadata tables; their task is to provide additional metadata for the dataset tables:
- `metadata_table_columns`: contains descriptions of all table columns except `spatialunit_uid` and `year`
- `metadata_tables`: contains metadata on the dataset itself
`metadata_tables`: this metadata table has the following fields:
- `name`: the name of the table in postgres
- `language`: language of the dataset that was used for the postgres import
- `last_pipeline_run`: date when the dataset was last imported from its source
- `data_source_url`: access url of the dataset source
- `title`: title of the dataset in the language of the dataset
- `title_en`: title of the dataset in English
- `description`: description of the table in the language of the dataset
- `description_en`: description of the table in English
- `temporal_coverage`: the years covered by the dataset (from/to)
- `update_frequency`: update frequency as in the EU frequency vocabulary
- `spatial_coverage`: spatial coverage, for example canton or country
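As an illustration, a `metadata_tables` entry with these fields could look as follows, shown as an R tibble with invented values:

```r
library(tibble)

# hypothetical metadata_tables entry (all values invented for illustration)
tibble(
  name              = "example_dataset",
  language          = "de",
  last_pipeline_run = "2023-05-01",
  data_source_url   = "https://example.org/dataset",
  title             = "Beispieldatensatz",
  title_en          = "Example dataset",
  description       = "Ein Beispieldatensatz zur Illustration",
  description_en    = "An example dataset for illustration",
  temporal_coverage = "1971-2022",
  update_frequency  = "ANNUAL",    # a term from the EU frequency vocabulary
  spatial_coverage  = "country"
)
```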
`metadata_table_columns`: this metadata table has the following fields:
- `name`: name of the column in the table
- `table_name`: name of the postgres table the column belongs to
- `data_type`: either `numeric` or `categorical`
- `title`: title of the column in the dataset's language
- `title_en`: English translation of the column title
- `example_values`: three to five example values in this column
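Similarly, a `metadata_table_columns` entry could look like this (again an R tibble with invented values):

```r
library(tibble)

# hypothetical metadata_table_columns entry (all values invented for illustration)
tibble(
  name           = "anzahl_stimmberechtigte",
  table_name     = "example_dataset",
  data_type      = "numeric",
  title          = "Anzahl Stimmberechtigte",
  title_en       = "Number of eligible voters",
  example_values = "102345, 98012, 120456"
)
```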
The Statbot Training Data consists of questions in natural language and the corresponding SQL queries. The queries relate to a single input dataset that has been uploaded to the Statbot Postgres DB. The queries don't contain joins, except for one possible join on the `spatial_unit` table.

Below is an example of the Statbot Training Data:
```sql
-- Wieviele Abstimmungsvorlagen gab es im Jahr 2000?
SELECT COUNT(*) as anzahl_abstimmungsvorlagen_im_jahr_2000
FROM abstimmungsvorlagen_seit_1971 as T
JOIN spatial_unit as S on T.spatialunit_uid = S.spatialunit_uid
WHERE S.name_de='Schweiz' AND S.country=TRUE AND T.jahr=2000;

-- Wieviele Abstimmungsvorlagen gab es seit dem Jahr 1971?
SELECT COUNT(*) as anzahl_abstimmungsvorlagen
FROM abstimmungsvorlagen_seit_1971 as T
JOIN spatial_unit as S on T.spatialunit_uid = S.spatialunit_uid
WHERE S.name_de='Schweiz' AND S.country=TRUE;
```
The Chatbot is not yet publicly available and is currently under active development. More information on this part of the project will be added later.
The pipelines are intended to process the datasets that have been selected in the input dataset list `statbot_input_data.csv` and to prepare them for the Statbot Postgres DB. Besides the dataset preprocessing script, each pipeline directory also contains the training data as Statbot Training Data.
Each pipeline has a directory in `pipelines` by the name of its `data_indicator` (from the input data). It consists of the following files:
- `<data_indicator>.R`: a script for downloading the dataset, processing it and preparing it for the postgres export
- `metadata_table_columns.csv`: metadata for the postgres table, describing the fields besides `year` and `spatialunit_uid`
- `metadata_tables.csv`: metadata for the postgres table
- `queries.sql`: natural language questions and corresponding SQL queries to answer these questions. This is the Statbot Training Data
- `queries.log`: log of the latest run of the SQL queries, either locally on `ds$postgres_export` or remotely on the postgres database
- `sample.csv`: sample of `ds$postgres_export` as a csv file
To work on a pipeline, open the file `<data_indicator>.R` and run the pipeline, adjusting it where needed. Once you are happy with `ds$postgres_export`, you can open a script in the `scripts` directory to upload the dataset to postgres.
The `queries.sql` files contain the Statbot Training Data for each dataset. See the section on the Statbot Training Data above for a detailed description of these files.
The `queries.log` files are the output of a run of `statbotData::testrun_queries(ds)` on the queries in the `queries.sql` file of the dataset. A timestamp indicates the date and time of the run. The log also states whether the queries were run on the remote postgres instance (`REMOTE`) or on the local `ds$postgres_export` data frame.
The `metadata_tables.csv` files contain metadata for a table in postgres. They are loaded into the postgres metadata table `metadata_tables`, see the section on the Statbot Postgres DB above.

The `metadata_table_columns.csv` files contain metadata for the table columns in postgres. They are loaded into the postgres metadata table `metadata_table_columns`, see the section on the Statbot Postgres DB above for details.
The code for these chores is always the same and consists of using the appropriate functions of the package `statbotData`. It does not depend on the specific pipeline or dataset, but on the situation and the status of the dataset. Therefore this code has been extracted from the pipelines and put into a directory of its own, with a selection of sample scripts that relate to certain scenarios:
- `scripts/upload_new_dataset.R`: for uploading a new dataset with its metadata
- `scripts/dataset_in_postgres.R`: for test-running queries for a dataset as part of the Statbot Training Data and checking its metadata
- `scripts/update_existing_dataset.R`: for updating a dataset and its metadata on postgres in case the dataset has changed at its origin (a new release of the dataset might have been published with an additional year of statistics added, for example)
- `scripts/operate_on_db.R`: for checking on the Statbot Postgres DB: listing its tables and checking on the metadata for the tables
In order to run the functions, pipelines and scripts in this repo, access to the Statbot Postgres DB is needed. This access is set up via an environment file:

```sh
cp sample.env .env
```

Afterwards you need to fill in your credentials in the `.env` file. Otherwise all functions that need postgres access won't work.
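Once the credentials are in place, a quick way to check the connection is to call one of the listing functions described below. This is only a sketch and assumes the function can be called without arguments; see the package man pages for the exact signature:

```r
library(statbotData)

# should list the tables in the Statbot Postgres DB if the credentials in .env are valid
statbotData::list_tables_in_statbot_db()
```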
The `statbotData` package consists of reusable functions that are used in the pipelines, the app and the scripts. There are man pages for the package that explain the usage and arguments of these functions in more detail. This section just provides a short overview of these functions.
- `ds <- create_dataset(<data_indicator>)`: this function sets up the dataset class whenever work on a dataset should be done. This must always be done first.
- `download_data`: this function downloads the data for a dataset from its source. The chosen download method depends on the properties of the dataset in `statbot_input_data.csv`, see the section on the input data above.
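A minimal sketch of how these two functions are typically combined at the start of a pipeline. The `data_indicator` `"A1"` is used only as a placeholder, and it is assumed here that `download_data` takes the dataset object and returns it updated; the exact signatures are documented in the package man pages:

```r
library(statbotData)

# set up the dataset object for a pipeline; "A1" stands for a data_indicator
# from statbot_input_data.csv and is used here only as a placeholder
ds <- create_dataset("A1")

# download the data from its source; the download method is chosen based on the
# dataset's properties (format, size, ...) in statbot_input_data.csv
ds <- download_data(ds)
```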
`ds$postgres_export` is the processed dataset that can be uploaded to the Statbot Postgres DB. Building it is the goal of each of the pipelines. Certain functions in `statbotData` perform operations on this tibble in case the dataset is not yet in the status `uploaded` or `remote`:
- `statbotData::dataset_sample`: this function generates a sample of the dataset from `ds$postgres_export`
- `statbotData::testrun_queries`: this runs the queries from the file `queries.sql` in the pipeline directory `ds$dir` and writes a logfile `queries.log` into the same directory
- `statbotData::generate_metadata_templates`: generates templates for the metadata files that are needed to upload metadata on the dataset to postgres
- `statbotData::update_pipeline_last_run_date`: updates the `pipeline_last_run` property in the `metadata_tables.csv` file; this should be done whenever the dataset has been rebuilt from its origin and is uploaded to postgres
- `statbotData::read_metadata_tables_from_file`: checks the metadata and reads them from file. This should be done before the metadata is uploaded to postgres
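A hedged sketch of these local operations on a dataset object `ds` before an upload; the function names are from the list above, the argument conventions are assumed:

```r
# local checks on ds$postgres_export before uploading the dataset to postgres
statbotData::dataset_sample(ds)                   # write sample.csv into the pipeline directory
statbotData::testrun_queries(ds)                  # run queries.sql locally and write queries.log
statbotData::generate_metadata_templates(ds)      # create metadata csv templates if they are missing
statbotData::update_pipeline_last_run_date(ds)    # record the pipeline run date in the metadata on file
statbotData::read_metadata_tables_from_file(ds)   # check the metadata before it is uploaded
```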
- `statbotData::create_postgres_table`: this uploads the data frame `ds$postgres_export` to postgres. `ds$postgres_export` is built by the pipelines for the dataset
- `statbotData::update_metadata_in_postgres`: this uploads the metadata on file to postgres. Make sure the metadata is not in postgres yet before this function is run for a dataset; otherwise delete the metadata for the table first
- `statbotData::testrun_queries`: this runs the queries on postgres once the dataset status has been set to `uploaded` in the statbot input data
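A sketch of the upload steps, again assuming the functions operate on the dataset object `ds`:

```r
# upload the processed dataset and its metadata to the Statbot Postgres DB
statbotData::create_postgres_table(ds)          # upload ds$postgres_export as a postgres table
statbotData::update_metadata_in_postgres(ds)    # upload the metadata from file

# after setting the dataset status to "uploaded" in statbot_input_data.csv:
statbotData::testrun_queries(ds)                # now runs the queries on postgres
```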
- `statbotData::delete_table_in_postgres`: delete a table in postgres
- `statbotData::delete_metadata_from_postgres`: delete the metadata for a table in postgres
- `statbotData::list_tables_in_statbot_db`: lists all tables in postgres
- `statbotData::list_tables_with_metadata_in_statbot_db`: lists all tables that have metadata in postgres
- `statbotData::load_dataset_list`: lists the statbot input data
- `statbotData::map_ds_spatial_units`: maps cantons and country
- `statbotData::map_ds_municipalities`: maps municipalities
- `statbotData::map_ds_residential_areas`: maps residential areas
The repo also contains an app, `app.R`. It can be started from RStudio and provides an overview of all elements in the repo (see the sketch after this list):
- the status of the pipelines: whether a dataset has already been uploaded to the Statbot Postgres DB
- the list of the statbot input datasets
- the queries as Statbot Training Data
- samples of the datasets
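One way to start the app from an R session (a sketch, assuming a standard Shiny setup):

```r
# start the local overview app; it only supports development and is not deployed anywhere
shiny::runApp("app.R")
```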