needforheat-dataset-template

TO DO: Replace this text with a very short description of the dataset.

General info

TO DO: Replace this text by general info about the dataset, e.g. generic description of subjects and when the data was collected.

Recruitment

TO DO: Replace this text with a description of how subjects were recruited, possibly including links to the recruitment material used.

Inclusion criteria

Inclusion criteria were:

replace with inclusion criterion 1;
replace with inclusion criterion 2;
etc.

Data management

TO DO: Replace this text with (links to) data management plan, privacy policy and (if applicable), DPIA.

Data

The sections below describe the data pre-processing and the data formats used in the data files.

Subjects

TO DO: Replace this text with a description of the subjects.

Measurement Devices

We used the following measurement device types to collect data. Some devices consisted of a main device and one or two satellite devices.

TO DO: Change the markdown table below as needed.

Source type	Category	Main device repo	Satellite device 1 repo	Satellite device 2 repo
`OpenTherm-Monitor`	comfort + installation + occupancy	twomes-opentherm-monitor-firmware
`DSMR-P1-gateway`	energy	twomes-p1-gateway-firmware
`DSMR-P1-gateway-Tin`	energy + comfort	twomes-p1-gateway-firmware	twomes-room-monitor-firmware
`DSMR-P1-gateway-TinTsTr`	energy + comfort + installation	twomes-p1-gateway-firmware	twomes-room-monitor-firmware	twomes-boiler-monitor-firmware
`DSMR-P1-gateway-TinTsTrCO2`	energy + comfort + installation + occupancy/ventilation	twomes-p1-gateway-firmware	twomes-room-monitor-firmware	twomes-boiler-monitor-firmware

Date and time information

All timestamps were recorded in Unix time, using device clocks that were regularly synchronized with UTC via NTP. Synchronizing the device clock via NTP was one of the first steps a measurement device performed after it was connected to the internet via the subject's home Wi-Fi network; after that, each device resynchronized its clock every 6 hours. Uploads of measurement data (which could contain more than one measurement) were timestamped both by the measurement device, according to its local clock, and by the server. We have not yet checked for deviations between the last device timestamp in an upload and the upload timestamp at the server.

Timestamps were converted to timezone-aware pandas.Timestamp values in the Europe/Amsterdam timezone. In the csv files we use ISO 8601 format with a time offset: YYYY-MM-DDThh:mm:ss±hhmm.
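
The conversion described above can be sketched as follows. This is a minimal illustration, not the code used to produce the dataset: the helper names `unix_to_amsterdam` and `to_csv_format` are hypothetical, and the sketch assumes that the Unix timestamps are expressed in whole seconds.

```python
import pandas as pd


def unix_to_amsterdam(unix_seconds):
    """Convert Unix time (assumed: seconds) to Europe/Amsterdam timestamps."""
    # Unix time is defined relative to UTC, so parse as UTC first.
    utc = pd.to_datetime(unix_seconds, unit="s", utc=True)
    return utc.tz_convert("Europe/Amsterdam")


def to_csv_format(timestamps):
    """Format timestamps as YYYY-MM-DDThh:mm:ss±hhmm."""
    return timestamps.strftime("%Y-%m-%dT%H:%M:%S%z")


# One winter and one summer timestamp, to show both UTC offsets.
stamps = unix_to_amsterdam([1672531200, 1688169600])
print(list(to_csv_format(stamps)))
# → ['2023-01-01T01:00:00+0100', '2023-07-01T02:00:00+0200']
```

Because the offset changes with daylight saving time, a single csv file can contain both +0100 and +0200 timestamps.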

Raw measurements

Raw measurements will be available in the folder /raw-measurements/ in two formats:

twomes_raw_measurements.parquet: a single parquet file with data for all subject ids;
nnnnnn_raw_measurements.zip: zipped csv files, one for each subject id;

All measurement data is structured according to the table below. By importing the parquet variant using pandas.read_parquet(), you automatically get a DataFrame with the recommended indices and data types.

Alternatively, you can read the zipped csv files, but this typically takes much longer. To end up with a DataFrame with the recommended indices and data types, set the index levels and data types listed in the table below explicitly after reading.

Index/Column	Name	Type	Description
index	`id`	`category`	unique code of the home
index	`source_category`	`category`	category, e.g. device, cloud_feed, energy_query, batch-import
index	`source_type`	`category`	device type name of the measurement device
index	`timestamp`	`Timestamp`	start of the interval (timezone aware)
index	`property`	`category`	property name of the measurement
column	`value`	`object`	value of the measurement
column	`unit`	`category`	unit of the measurement value
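
The csv route described above can be sketched as follows. This is a hedged example, not the project's own loader: the function name `read_raw_measurements_csv` is hypothetical, and the sketch assumes that the csv header uses exactly the names from the table and that each zip archive holds a single csv file (pandas can only infer zip compression in that case).

```python
import io

import pandas as pd

INDEX_COLUMNS = ["id", "source_category", "source_type", "timestamp", "property"]
CATEGORY_COLUMNS = ["id", "source_category", "source_type", "property", "unit"]


def read_raw_measurements_csv(path_or_buffer):
    """Read one raw-measurements csv and apply the recommended index and dtypes."""
    # Read everything as text first, so that ids keep any leading zeros.
    df = pd.read_csv(path_or_buffer, dtype=str)
    # A file can mix +0100 and +0200 offsets; parse via UTC, then convert.
    df["timestamp"] = pd.to_datetime(
        df["timestamp"], format="%Y-%m-%dT%H:%M:%S%z", utc=True
    ).dt.tz_convert("Europe/Amsterdam")
    df = df.astype({column: "category" for column in CATEGORY_COLUMNS})
    df["value"] = df["value"].astype(object)
    return df.set_index(INDEX_COLUMNS)


# Made-up sample rows, for illustration only.
sample = io.StringIO(
    "id,source_category,source_type,timestamp,property,value,unit\n"
    "000001,device,DSMR-P1-gateway-TinTsTrCO2,2023-01-01T01:00:00+0100,CO2concentration,415,ppm\n"
    "000001,device,DSMR-P1-gateway-TinTsTrCO2,2023-07-01T02:00:00+0200,CO2concentration,420,ppm\n"
)
raw = read_raw_measurements_csv(sample)
```

For a file on disk, passing a path such as `nnnnnn_raw_measurements.zip` to the same function should work under the single-csv assumption stated above.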

Raw properties

In the folder /raw-properties/ we will make various measured properties available in an 'unstacked' format with each property in its own column and an appropriate datatype. Similar to measurements, we will make data available in two formats:

twomes_raw_properties.parquet: a single parquet file with data for all subject ids;
nnnnnn_raw_properties.zip: zipped csv files, one for each subject id;

All property data is structured according to the table below. By importing the parquet variant using pandas.read_parquet(), you automatically get a DataFrame with the recommended indices and data types.

Alternatively, you can read the zipped csv files, but this typically takes much longer. To end up with a DataFrame with the recommended indices and data types, set the index levels and data types listed in the table below explicitly after reading.

Index/Column	Name	Type	Description
index	`id`	`category`	unique code of the home
index	`source_category`	`category`	category, e.g. device, cloud_feed, energy_query, batch-import
index	`source_type`	`category`	device type name of the measurement device
index	`timestamp`	`Timestamp`	start of the interval (timezone aware)
column	property_1; see property table below	data_type_1	measured value of this property
column	property_2	data_type_2	measured value of this property
...	...	...	...
column	property_n	data_type_n	measured value of this property
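
To illustrate what 'unstacked' means here, the sketch below turns a raw-measurements DataFrame into one column per property. It is an assumption-laden example, not the code that produced the files: `unstack_properties` and `PROPERTY_MAP` are hypothetical names, and the single mapping entry is taken from the `co2__ppm` row of the property table in this README.

```python
import pandas as pd

# Database property name -> (column name, data type); extend as needed.
PROPERTY_MAP = {"CO2concentration": ("co2__ppm", "float32")}


def unstack_properties(measurements):
    """Pivot the 'property' index level into typed columns."""
    # Requires at most one value per (id, source, timestamp, property).
    wide = measurements["value"].unstack("property")
    properties = pd.DataFrame(index=wide.index)
    for db_name, (column, dtype) in PROPERTY_MAP.items():
        if db_name in wide.columns:
            properties[column] = pd.to_numeric(wide[db_name]).astype(dtype)
    return properties


# Made-up sample measurements, for illustration only.
index = pd.MultiIndex.from_tuples(
    [
        ("000001", "device", "DSMR-P1-gateway-TinTsTrCO2",
         pd.Timestamp("2023-01-01 01:00", tz="Europe/Amsterdam"), "CO2concentration"),
        ("000001", "device", "DSMR-P1-gateway-TinTsTrCO2",
         pd.Timestamp("2023-01-01 01:05", tz="Europe/Amsterdam"), "CO2concentration"),
    ],
    names=["id", "source_category", "source_type", "timestamp", "property"],
)
measurements = pd.DataFrame(
    {"value": ["415", "420"], "unit": ["ppm", "ppm"]}, index=index
).astype({"value": object})
properties = unstack_properties(measurements)
```

The result keeps `id`, `source_category`, `source_type` and `timestamp` as index levels, matching the table above.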

Measured Properties

Below is a table that lists all properties that were measured, the data type in the raw-properties DataFrame, the measurement unit, the measurement interval, the source device and sensor that measured it, as well as the property name and value format as retrieved from the Twomes database.

TO DO: Change the markdown table below as needed.

Property	Type	Unit	Measurement interval [h:mm:ss]	Description	Source Type	Sensor	Database property	Database format
`co2__ppm`	`float32`	ppm	0:05:00	CO₂ concentration	`DSMR-P1-gateway-TinTsTrCO2`	SCD41	`CO2concentration`	%d

Weather data was collected and geospatially interpolated using HourlyHistoricWeather from the Royal Netherlands Meteorological Institute (KNMI), based on average hourly values.

For all subject ids, we used the same location for geospatial interpolation of weather data: lat, lon = 52.xxxxx, 6.yyyyy. Average values were converted from the source units to the units as indicated in the table below.

Index/Column	Property	Type	Unit	Measurement interval [h:mm:ss]	Description	Source Type	Source property	Source value format	Source unit
index	`timestamp`	`Timestamp`			start of the measurement interval	KNMI	`YYYYMMDD`, `H`		H=1: 0:00:00 - 0:59:59; H=24: 23:00:00 - 23:59:59;
column	`temp_out__degC`	`float32`	°C	1:00:00	outdoor temperature	KNMI	`T`	%d	0.1 °C
column	`wind__m_s_1`	`float32`	m/s	1:00:00	wind speed	KNMI	`FH`	%d	0.1 m/s
column	`ghi__W_m_2`	`float32`	W/m²	1:00:00	global horizontal irradiance	KNMI	`Q`	%d	J/(h·cm²)
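
The unit conversions implied by the table above can be sketched as follows. This is an illustration only: `convert_knmi_hourly` is a hypothetical helper, not part of HourlyHistoricWeather, and the sketch assumes that the KNMI date and hour are expressed in UTC. One J/(h·cm²) equals 10000 J/m² per 3600 s, i.e. about 2.78 W/m².

```python
import pandas as pd


def convert_knmi_hourly(raw):
    """Convert KNMI hourly source columns to the units used in this dataset."""
    # H=1 covers 0:00:00 - 0:59:59, so the interval starts at H - 1 hours.
    start = pd.to_datetime(raw["YYYYMMDD"].astype(str), format="%Y%m%d")
    start = start + pd.to_timedelta(raw["H"] - 1, unit="h")
    weather = pd.DataFrame(
        {
            "temp_out__degC": raw["T"] / 10,         # 0.1 °C -> °C
            "wind__m_s_1": raw["FH"] / 10,           # 0.1 m/s -> m/s
            "ghi__W_m_2": raw["Q"] * 10000 / 3600,   # J/(h·cm²) -> W/m²
        }
    ).astype("float32")
    # Assumption: KNMI timestamps are in UTC.
    weather.index = (
        pd.DatetimeIndex(start, name="timestamp")
        .tz_localize("UTC")
        .tz_convert("Europe/Amsterdam")
    )
    return weather


# Made-up source values, for illustration only.
raw_weather = pd.DataFrame(
    {"YYYYMMDD": [20230101, 20230101], "H": [1, 24],
     "T": [53, -12], "FH": [40, 0], "Q": [36, 0]}
)
weather = convert_knmi_hourly(raw_weather)
```

In this example, T=53 becomes 5.3 °C, FH=40 becomes 4.0 m/s and Q=36 becomes 100 W/m².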

Preprocessed data

TO DO: change preprocessing description below.

Preprocessing of measurements from the measurement database was done using get_preprocessed_homes_data(). Preprocessing steps include:

removal of duplicate measurements;
calculation of derived properties as a combination of other properties, as indicated in the column Calculation in the table below;
removal of absolute outliers, i.e. measurement values smaller than the value in the column Min or larger than the value in the column Max in the table below;
removal of statistical outliers, i.e. measurement values with an absolute z-score higher than the value indicated in the column Sigma in the table below;
interpolation of measurements to intervals of 15 minutes (no interpolation between measurements that were 60 minutes apart or more).

All column values represent the average during the interval that starts at the timestamp indicated.

TO DO: Change the markdown table below.

Index/Column	Name	Type	Unit	Description	Min	Max	Sigma
index	`id`	`Int16`		unique code of the home	000000	999999
index	`timestamp`	`Timestamp`		start of the interpolated interval (timezone aware)
column	`T_out__degC`	`float32`	°C	outdoor temperature	-28	40
column	`wind__m_s_1`	`float32`	m/s	wind speed	0	35
column	`ghi__W_m_2`	`Int16`	W/m²	global horizontal irradiance	0	1000
column	`T_in__degC`	`float32`	°C	indoor temperature	0	40	3
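
The duplicate and outlier removal steps listed above can be sketched for a single property as follows. This is not the implementation of get_preprocessed_homes_data(): `remove_outliers` is a hypothetical helper, duplicates are assumed to be rows with an identical timestamp, and the interpolation step is deliberately left out. The example uses the Min, Max and Sigma values of the `T_in__degC` row.

```python
import pandas as pd


def remove_outliers(series, min_value, max_value, sigma=None):
    """Drop duplicates, absolute outliers and (optionally) statistical outliers."""
    # Assumption: a duplicate is a repeated timestamp; keep the first one.
    cleaned = series[~series.index.duplicated(keep="first")]
    # Absolute outliers: values below Min or above Max.
    cleaned = cleaned[(cleaned >= min_value) & (cleaned <= max_value)]
    # Statistical outliers: absolute z-score higher than Sigma (if given).
    if sigma is not None:
        z_score = (cleaned - cleaned.mean()) / cleaned.std()
        cleaned = cleaned[z_score.abs() <= sigma]
    return cleaned


# Made-up indoor temperatures: 20 plausible values, one statistical
# outlier (35.0), one absolute outlier (85.0) and one duplicated row.
timestamps = pd.date_range(
    "2023-01-01", periods=22, freq="5min", tz="Europe/Amsterdam"
)
values = [20.0, 20.5] * 10 + [35.0, 85.0]
t_in = pd.Series(values, index=timestamps, name="T_in__degC")
t_in = pd.concat([t_in, t_in.iloc[[0]]])

t_in_clean = remove_outliers(t_in, min_value=0, max_value=40, sigma=3)
```

The absolute filter runs before the z-score filter, following the order of the steps above, so extreme values do not inflate the standard deviation used for the z-score.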

Status

Dataset is: collected, anonymization-in-progress

License

This data is made available under the CC BY 4.0 license by the Research group Energy Transition, Windesheim University of Applied Sciences.

Credits

Data collection was a joint effort of:

<contributor name 1> · @Github_handle_1 · Twitter @Twitter_handle_1
<contributor name 2> · @Github_handle_2 · Twitter @Twitter_handle_2
<contributor name 3> · @Github_handle_3 · Twitter @Twitter_handle_3
etc.

Thanks go to those who are the ultimate source of this dataset:

all anonymous subjects who volunteered to make their measurement data available

We use and gratefully acknowledge the efforts of the makers of the following source code and libraries:

HourlyHistoricWeather, by @stephanpcpeters, licensed under an MIT-style licence
