5. Data

Dénes Csala edited this page Jun 1, 2021 · 8 revisions

Standard

📊 We maintain a data repository, updated daily, that contains the data displayed on the site in a standardized, TIDY format. That means that every data point is a row (line) and every data feature is a column. The first column is called the index, and it is typically the column that gives each data point its unique identifier. pandas can assign this column to its index upon load (for example through the index_col parameter of read_csv), whereas the standard CSV format has no notion of an index. For time series data, the index column is usually a date, which makes pandas treat the data series as a time series.

| index | feature1 | feature2 |
| --- | --- | --- |
| data1.index | data1.feature1 | data1.feature2 |
| data2.index | data2.feature1 | data2.feature2 |
| ... | ... | ... |
| data42.index | data42.feature1 | data42.feature2 |
| ... | ... | ... |
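As a minimal sketch of the index behaviour described above, the snippet below loads a small CSV twice with pandas. The CSV content and the column names are hypothetical placeholders, not rows from the actual repository.

```python
import io
import pandas as pd

# Hypothetical CSV content following the layout above (not real data).
csv_text = """date,feature1,feature2
2021-05-30,1,a
2021-05-31,2,b
2021-06-01,3,c
"""

# Without index_col, pandas adds its own 0..m-1 range index
# and keeps "date" as an ordinary column.
plain = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 and parse_dates=True, the first column becomes
# a DatetimeIndex, so pandas treats the table as a time series.
ts = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(plain.index).__name__)
print(type(ts.index).__name__)
```

Keeping the date in the index is what enables date-based selection and resampling later on.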

During the data transformation and normalization process, the objective is to minimize the number of data columns. This means that this format ...

| Country | 2019 | 2020 | 2021 |
| --- | --- | --- | --- |
| Austria | 42 | 13 | 69 |
| Belgium | 75 | 12 | 77 |

... should be converted to this:

| Country | Year | Value |
| --- | --- | --- |
| Austria | 2019 | 42 |
| Austria | 2020 | 13 |
| Austria | 2021 | 69 |
| Belgium | 2019 | 75 |
| Belgium | 2020 | 12 |
| Belgium | 2021 | 77 |

This operation is typically called a stack (or melt) in pandas and an unpivot in Excel/Power BI.
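The conversion of the Country/Year example can be sketched in pandas as follows. This uses `DataFrame.melt`, which produces the same long layout as stacking; the final sort is only there to match the row order shown in the table.

```python
import pandas as pd

# The wide table from the example: one column per year.
wide = pd.DataFrame({
    "Country": ["Austria", "Belgium"],
    "2019": [42, 75],
    "2020": [13, 12],
    "2021": [69, 77],
})

# Turn the year columns into (Year, Value) pairs, one row per data point.
long = wide.melt(id_vars="Country", var_name="Year", value_name="Value")

# melt goes column by column, so sort to get all years of a country together.
long = long.sort_values(["Country", "Year"]).reset_index(drop=True)

print(long)
```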
Then, the following hold true:

  • Every row (line) contains a unique data point

  • Each data point is n-dimensional (caution! see below), where n equals the number of columns, i.e. each data point has n features.

  • The dataset has m elements, where m equals the number of rows

  • Likewise, the dataset can be represented as an m by n matrix (m rows, n columns)

  • Column headers are called features; other names in use are headers, (data) attributes or even (data) properties. The last of these comes from the fact that when the data is not in a table format, it is often in a standardized JSON format, like this:

    [
      {"index":data1.index,"feature1":data1.feature1,"feature2":data1.feature2},
      {"index":data2.index,"feature1":data2.feature1,"feature2":data2.feature2},
      ...,
      {"index":data42.index,"feature1":data42.feature1,"feature2":data42.feature2},
      ...
    ]
    • In JSON/JavaScript lingo, this would be called a JavaScript Object Array, where index, feature1 and feature2 are called properties.
    • In python, this would be called a list of dictionaries, where index, feature1 and feature2 are called keys.
    • In both cases, data1.index, data1.feature1, ... are called values.
    • In JSON/JavaScript, the dataset can therefore be represented as an Array of length m, with each element being an Object containing n property-value pairs.
    • Likewise, in python the dataset can be represented as a list of length m, with each element being a dictionary containing n key-value pairs.
  • The type of a feature can be field or tag ⬅ this is InfluxDB lingo. You might also see them referred to as facts and dimensions, as in fact and dimension tables.

    • A fact is a measurable data value for the respective data point in each row. You might simply refer to this as a (quantitative or continuous) value.
    • A dimension is a descriptive tag for the respective data point in each row. You might refer to this as a tag, a label or a nominal value.
    • Sometimes the fact columns of the data table (the fact table) are simply called data, and the dimension columns (the dimension table) are called metadata.
    • Somewhat incorrectly and confusingly, dimension is also used colloquially to refer to a feature in general. This comes from the fact that the size of the data = number of columns x number of rows, which suggests that the data is n-dimensional, where n equals the number of columns, i.e. the number of data features.
    • To avoid confusion, we prefer to use the column/feature ➡ field and tag nomenclature.
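The correspondence between the JSON form, the python list of dictionaries and the table can be sketched as below. The records and the field/tag assignment are hypothetical placeholders chosen for illustration.

```python
import json
import pandas as pd

# Hypothetical records in the JSON layout shown above (not real data).
records_json = """[
  {"index": "2021-05-30", "feature1": 1.5, "feature2": "a"},
  {"index": "2021-05-31", "feature1": 2.5, "feature2": "b"},
  {"index": "2021-06-01", "feature1": 3.5, "feature2": "c"}
]"""

records = json.loads(records_json)        # list of dictionaries, length m
df = pd.DataFrame.from_records(records)   # table with m rows and n columns

m, n = df.shape

# Assumed classification for this toy example:
fields = ["feature1"]   # measurable values
tags = ["feature2"]     # descriptive labels

# Going back from the table to the list-of-dictionaries form.
round_trip = df.to_dict(orient="records")

print(m, n)
```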

Formats

  • Time series datasets have dates in the yyyy-mm-dd format as their index and are sorted in increasing order.
  • Data series datasets have an increasing numerical range index starting from 0.
  • *_mirror type datasets are local mirrors of external datasets and typically retain the format of their respective original sources.
  • Column names are typically self-explanatory, unless otherwise noted in the Comments column.
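A minimal sketch of how the two index conventions above could be checked after loading a dataset. The function names are hypothetical helpers, not part of the repository.

```python
import pandas as pd


def is_time_series_index(index: pd.Index) -> bool:
    """True if every label is a yyyy-mm-dd date and the dates never decrease."""
    labels = pd.Series(index.astype(str))
    # Labels that do not match the format become NaT instead of raising.
    dates = pd.to_datetime(labels, format="%Y-%m-%d", errors="coerce")
    return bool(dates.notna().all() and dates.is_monotonic_increasing)


def is_data_series_index(index: pd.Index) -> bool:
    """True if the index is the range 0, 1, ..., m-1."""
    return list(index) == list(range(len(index)))


good = pd.Index(["2021-05-30", "2021-05-31", "2021-06-01"])
unsorted = pd.Index(["2021-06-01", "2021-05-30"])
wrong_format = pd.Index(["30/05/2021", "31/05/2021"])

print(is_time_series_index(good))
print(is_time_series_index(unsorted))
print(is_time_series_index(wrong_format))
print(is_data_series_index(pd.RangeIndex(3)))
```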

Datasets

This is the main situation update dataset, containing daily COVID-19 case, testing and vaccination updates. It contains both Cumulative values and Daily rates.

| Column name | Column type | Data type | Data subtype | Comments |
| --- | --- | --- | --- | --- |
| date | index | datetime | date | yyyy-mm-dd |
| cases | field | quantitative | integer | Cumulative |
| heals | field | quantitative | integer | Cumulative |
| deaths | field | quantitative | integer | Cumulative |
| total_administered | field | quantitative | integer | Cumulative |
| total_administered_pfizer | field | quantitative | integer | Cumulative |
| total_immunized | field | quantitative | integer | Cumulative |
| total_immunized_pfizer | field | quantitative | integer | Cumulative |
| total_administered_moderna | field | quantitative | integer | Cumulative |
| total_immunized_moderna | field | quantitative | integer | Cumulative |
| total_administered_astra_zeneca | field | quantitative | integer | Cumulative |
| total_immunized_astra_zeneca | field | quantitative | integer | Cumulative |
| active | field | quantitative | integer | Daily rate |
| case | field | quantitative | integer | Daily rate |
| heal | field | quantitative | integer | Daily rate |
| death | field | quantitative | integer | Daily rate |
| administered | field | quantitative | integer | Daily rate |
| administered_pfizer | field | quantitative | integer | Daily rate |
| immunized | field | quantitative | integer | Daily rate |
| immunized_pfizer | field | quantitative | integer | Daily rate |
| administered_moderna | field | quantitative | integer | Daily rate |
| immunized_moderna | field | quantitative | integer | Daily rate |
| administered_astra_zeneca | field | quantitative | integer | Daily rate |
| immunized_astra_zeneca | field | quantitative | integer | Daily rate |
| tests | field | quantitative | integer | Cumulative |
| test | field | quantitative | integer | Daily rate |
| case14 | field | quantitative | integer | Rolling cumulative |
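The relation between the three kinds of columns can be sketched as follows. This assumes that a Daily rate is the day-over-day change of the matching Cumulative column and that case14 is the sum of the last 14 daily values; the page does not state either definition explicitly, and the numbers are made up.

```python
import pandas as pd

dates = pd.date_range("2021-05-01", periods=20, freq="D")

# Made-up daily new cases (1, 2, ..., 20), used only to build a cumulative column.
daily_truth = pd.Series(range(1, 21), index=dates)
df = pd.DataFrame({"cases": daily_truth.cumsum()})
df.index.name = "date"

# Assumption: daily rate = first difference of the cumulative column.
# The first day has no predecessor, so it keeps its cumulative value.
df["case"] = df["cases"].diff().fillna(df["cases"]).astype(int)

# Assumption: rolling cumulative = sum of the last 14 daily values
# (fewer at the start of the series, where less history exists).
df["case14"] = df["case"].rolling(14, min_periods=1).sum().astype(int)

print(df.tail())
```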

This is the main county-level dataset, containing daily COVID-19 case updates per county. It contains both Cumulative and 14-day Rolling cumulative values, in absolute and per capita forms.

| Column name | Column type | Data type | Data subtype | Comments |
| --- | --- | --- | --- | --- |
| date | field | datetime | date | yyyy-mm-dd |
| cases | field | quantitative | integer | Cumulative |
| case_cap | field | quantitative | float | Cumulative - per capita |
| pop | field | quantitative | integer | Constant - Population |
| county | tag | nominal | | county in Romanian |
| iso | tag | nominal | 2-letter label | County code in Romanian |
| case_14 | field | quantitative | integer | Rolling cumulative |
| case_14_cap | field | quantitative | float | Rolling cumulative - per capita |
| id | tag | ordinal | integer | County code in topojson |
| lang | tag | nominal | 2-letter label | Constant = "RO" |
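A small sketch of how the per capita columns could relate to the absolute ones. It assumes a plain division by the population column; the actual scaling used in the dataset (for example per 1000 people) is not stated here, and all numbers below are made up.

```python
import pandas as pd

# Made-up rows in the county-level layout; counts and populations are placeholders.
county = pd.DataFrame({
    "date": ["2021-05-31", "2021-05-31"],
    "county": ["Cluj", "Harghita"],
    "cases": [1000, 500],
    "case_14": [100, 20],
    "pop": [700000, 300000],
})

# Assumption: "per capita" means the absolute value divided by the population.
county["case_cap"] = county["cases"] / county["pop"]
county["case_14_cap"] = county["case_14"] / county["pop"]

print(county[["county", "case_cap", "case_14_cap"]])
```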

This is the main UAT-level (local administrative unit) dataset, which drives the incidence map. It contains infection incidence rates per UAT: the total of new cases over a 14-day window per 1000 people, where the window runs from 17 to 3 days before the date displayed.

  • 📅 Updated daily (for the previous day) at 🕑 10:02 by @roeimbot

  • 📊 Data sources:

| Column name | Column type | Data type | Data subtype | Comments |
| --- | --- | --- | --- | --- |
| date | field | datetime | date | yyyy-mm-dd |
| judet | tag | nominal | | county in Romanian, source |
| uat | tag | nominal | | local administrative unit in Romanian, source |
| siruta | tag | nominal | integer | SIRUTA codes |
| judet_norm | tag | nominal | | county in Romanian, normalized |
| uat_norm | tag | nominal | | local administrative unit in Romanian, normalized |
| incidence | field | quantitative | float | Incidence rate / 1000 people |
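The incidence definition can be sketched as a function. One assumption needs stating: "17 to 3 days before" is read here as the 14 days ending 3 days before the displayed date (that is, from 16 to 3 days before, inclusive). The exact boundary handling of the source data may differ, and the function name and numbers are hypothetical.

```python
import pandas as pd


def incidence_per_1000(daily_cases: pd.Series, population: int, shown: str) -> float:
    """New cases in the 14 days ending 3 days before `shown`, per 1000 people."""
    day = pd.Timestamp(shown)
    start = day - pd.Timedelta(days=16)   # assumed inclusive start of the window
    end = day - pd.Timedelta(days=3)      # inclusive end of the window
    window = daily_cases.loc[start:end]   # label slicing includes both ends
    return 1000 * float(window.sum()) / population


# Made-up example: 2 new cases every day in a unit of 5000 people.
dates = pd.date_range("2021-05-01", periods=30, freq="D")
daily = pd.Series(2, index=dates)

rate = incidence_per_1000(daily, population=5000, shown="2021-05-30")
print(rate)
```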

Country level, cumulative cases, recovered and deaths dataset

The data is in a non-standard format. The index is a datetime in the yyyy-mm-dd format, but the data values are spread across columns rather than rows: each country is a column, labeled by its lowercase 2-letter ISO code.
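Because this dataset keeps one column per country, it has to be reshaped before it can be combined with the tidy datasets. A minimal sketch, with made-up values and hypothetical output column names (`country`, `value`):

```python
import pandas as pd

# Made-up values in the country-per-column layout, indexed by date.
wide = pd.DataFrame(
    {"ro": [10, 12], "hu": [5, 6]},
    index=pd.to_datetime(["2021-05-30", "2021-05-31"]),
)
wide.index.name = "date"

# Move the date back into a column, then turn the country columns into rows.
tidy = wide.reset_index().melt(id_vars="date", var_name="country", value_name="value")

print(tidy)
```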

Date-tagged list of economic, financial and social measures introduced by the Romanian Government during the pandemic

  • 📅 Updated every few days, manually

  • 📊 Own dataset, based on Știri Oficiale

| Column name | Column type | Data type | Data subtype | Comments |
| --- | --- | --- | --- | --- |
| date | field | datetime | date | yyyy-mm-dd |
| desc | tag | nominal | date | Summary |
| link | tag | nominal | url | Announcement link |
| lang | tag | nominal | 2-letter label | Measure language |
| desc2 | tag | nominal | label | Announcement type |
| desc3 | tag | nominal | label | Measure type |