Welcome to the official GitHub repository and leaderboard page for the WeatherReal weather forecasting benchmark! This repository includes code for running the evaluation of weather forecasting models on the WeatherReal benchmark dataset, as well as links to the LFS storage for the data files. If you would like to contribute a model evaluation result to this page, please open an Issue with the tag "Submission". If you use this repository or its data, please cite the following:
- WeatherReal: A Benchmark Based on In-Situ Observations for Evaluating Weather Models
- Synoptic Data (for the Synoptic data used in the benchmark)
- NCEI ISD database
Weather forecasting is a critical application for many scenarios, including disaster preparedness, agriculture, energy, and day-to-day life. Recent advances in AI for weather forecasting have demonstrated enormous potential, with models like GraphCast, Pangu-Weather, and FengWu achieving state-of-the-art performance on benchmarks like WeatherBench-2, even outperforming traditional numerical weather prediction (NWP) models such as ECMWF's IFS. However, these models have not been fully tested on real-world data collected from observing stations; rather, evaluation has focused on reanalysis datasets such as ERA5, which are generated by NWP models and thus inherit those models' biases and approximations.
WeatherReal is the first benchmark to provide comprehensive evaluation against observations from the Integrated Surface Database (ISD) and the Synoptic Data API, in addition to aggregated user-reported observations collected from MSN Weather's reporting platform. We provide these datasets separately and offer a benchmark leaderboard for each, with leaderboards further divided by task, such as short-range and medium-range forecasting. We also provide evaluation code that can be run on a variety of model forecast schemas to generate scores for the benchmark. By interpolating gridded forecasts to station locations, and by matching point forecasts to stations via nearest-neighbor lookup, we can fairly evaluate models that produce either gridded or point forecasts.
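As a concrete illustration of this matching, here is a minimal sketch using xarray. The variable and coordinate names (`t2m`, `lat`, `lon`, `station`) and the forecast layouts are assumptions for illustration only, not the benchmark's actual schema; see the evaluation code for the real implementation.

```python
# Illustrative sketch: match forecasts to station locations.
# Variable/coordinate names (t2m, lat, lon, station) are assumptions
# for illustration; the evaluation code defines the actual schema.
import xarray as xr

def grid_to_stations(forecast: xr.Dataset, stations: xr.Dataset) -> xr.DataArray:
    """Bilinearly interpolate a lat/lon-gridded forecast to station points."""
    # stations["lat"] / stations["lon"] are assumed 1-D arrays over a
    # "station" dimension; interp then returns values per station.
    return forecast["t2m"].interp(
        lat=stations["lat"], lon=stations["lon"], method="linear"
    )

def points_to_stations(point_fcst: xr.Dataset, stations: xr.Dataset) -> xr.Dataset:
    """Match a point forecast to stations by nearest neighbor.

    Assumes the point forecast is indexed by lat/lon dimension
    coordinates; selection is nearest along each coordinate axis.
    """
    return point_fcst.sel(
        lat=stations["lat"], lon=stations["lon"], method="nearest"
    )
```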
WeatherReal is:
- A benchmark dataset of quality-controlled weather observations spanning the year 2023, with ongoing updates planned
- An evaluation framework to score many tasks for either grid-based or point-based forecasts
- Still in its infancy - we welcome feedback on how to improve the benchmark, the evaluation code, and the leaderboards
WeatherReal is not:
- A dataset for training weather forecasting models
- An inference platform for running models
- An exclusive benchmark for weather forecasting
Task: Forecasts are initialized twice daily at 00 and 12 UTC over the entire evaluation year 2023. Forecasts are evaluated every 6 hours of lead time up to 168 hours (7 days). The headline metric is RMSE for each predicted variable (ETS for precipitation), averaged over all forecasts and lead times. Note: The leaderboard is provisional due to incomplete forecast initializations for the provided models, and is therefore subject to change.
WeatherReal-ISD | 2-m temperature (RMSE, K) | 10-m wind speed (RMSE, m/s) | Mean sea-level pressure (RMSE, hPa) | Total cloud cover (RMSE, okta) | 6-hour precipitation > 1 mm (ETS) |
---|---|---|---|---|---|
Microsoft-Point | 2.258 | 1.753 | - | 2.723 | - |
Aurora-9km | 2.417 | 2.186 | 2.939 | - | - |
ECMWF | 2.766 | 2.251 | 3.098 | 3.319 | 0.248 |
GFS | 3.168 | 2.455 | 3.480 | - | - |
We welcome feedback from the modeling community on how best to use the data to evaluate forecasts in a way that reflects the end consumer's experience with various forecasting models. We propose the following common tasks:
- Short-range forecasting. Forecasts are initialized four times daily at 00, 06, 12, and 18 UTC. Forecasts are evaluated every 1 hour of lead time up to 72 hours. The headline metric is RMSE (ETS for precipitation) for each predicted variable, averaged over all forecasts and lead times (both metrics are sketched in code after this list).
- Nowcasting. Forecasts are initialized every hour. Forecasts are evaluated every 1 hour of lead time up to 24 hours. The headline metric is RMSE (ETS for precipitation) for each predicted variable, averaged over all forecasts and lead times.
- Sub-seasonal-to-seasonal forecasting. Following the schedule of ECMWF's long-range forecasts prior to June 2023, forecasts are initialized twice weekly at 00 UTC on Mondays and Thursdays. Forecasts, provided at 6-hour lead-time intervals, are averaged either daily or weekly. Ideally, forecasts should be probabilistic, enabling proper scoring rules such as the continuous ranked probability score (CRPS). Headline metrics are the week 3-4 and week 5-6 average scores.
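For reference, here is a compact sketch of the two headline metrics as they are commonly defined (RMSE, and the equitable threat score for events such as 6-hour precipitation > 1 mm). This mirrors the standard formulations; the evaluation code in this repository remains the authoritative implementation.

```python
# Sketch of the headline metrics as commonly defined; the evaluation
# code in this repository is the authoritative implementation.
import numpy as np

def rmse(forecast: np.ndarray, observed: np.ndarray) -> float:
    """Root-mean-square error over all paired forecast/observation samples."""
    return float(np.sqrt(np.mean((forecast - observed) ** 2)))

def ets(forecast: np.ndarray, observed: np.ndarray, threshold: float = 1.0) -> float:
    """Equitable threat score for an event such as 6-hour precipitation > 1 mm."""
    f, o = forecast > threshold, observed > threshold
    hits = np.sum(f & o)
    false_alarms = np.sum(f & ~o)
    misses = np.sum(~f & o)
    # Hits expected by chance, given forecast and observed event frequencies
    hits_random = (hits + false_alarms) * (hits + misses) / f.size
    return float((hits - hits_random) / (hits + false_alarms + misses - hits_random))
```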
WeatherReal includes several versions, all derived from global near-surface in-situ observations:
- WeatherReal-ISD: An observational dataset based on the Integrated Surface Database (ISD), subjected to rigorous post-processing and quality control through our independently developed algorithms.
- WeatherReal-Synoptic: An observational dataset from Synoptic Data PBC, a data service platform for 150,000+ in-situ surface weather stations, offering a much more densely distributed network. Quality-control information is provided as additional attributes delivered alongside the data from their API services.

The following table lists the available variables in WeatherReal-ISD and WeatherReal-Synoptic.
Variable | Short Name | Unit¹ | Variable | Short Name | Unit |
---|---|---|---|---|---|
2m Temperature | t | °C | Total Cloud Cover | c | okta⁴ |
2m Dewpoint Temperature | td | °C | 1-hour Precipitation | ra1 | mm |
Surface Pressure² | sp | hPa | 3-hour Precipitation | ra3 | mm |
Mean Sea-level Pressure | msl | hPa | 6-hour Precipitation | ra6 | mm |
10m Wind Speed | ws | m/s | 12-hour Precipitation | ra12 | mm |
10m Wind Direction | wd | degree³ | 24-hour Precipitation | ra24 | mm |
¹ Refers to the units used in the WeatherReal-ISD we publish. For the units provided by the raw ISD and Synoptic data, please consult their respective documentation.
² For in-situ weather stations, surface pressure is measured at the sensor's height, typically 2 meters above ground level at the weather station.
³ The direction is measured clockwise from true north, ranging from 1° (north-northeast) to 360° (north), with 0° indicating calm winds.
⁴ Okta is a unit of measurement describing the amount of cloud cover, ranging from 0 (clear sky) to 8 (completely overcast).
The data source of WeatherReal-ISD, ISD [Smith et al., 2011], is a global near-surface observation dataset compiled by the National Centers for Environmental Information (NCEI). More than 100 original data sources, including SYNOP (surface synoptic observations) and METAR (meteorological aerodrome report) weather reports, are incorporated.
There are currently more than 14,000 active reporting stations in ISD and it already includes the majority of known station observation data, making it an ideal data source for WeatherReal. However, the observational data have only undergone basic quality control, resulting in numerous erroneous data points. Therefore, to improve data fidelity, we performed extensive post-processing on it, including station selection and merging, and comprehensive quality control. For more details on the data processing, please refer to the paper.
The data in WeatherReal-Synoptic are obtained from Synoptic Data PBC, which brings together observation data from hundreds of public and private station networks worldwide, providing a comprehensive and accessible data service platform for critical environmental information. For further details, please refer to Synoptic Data's official site. The WeatherReal-Synoptic dataset used in the paper was retrieved in real time from their Time Series API services in 2023 to meet our operational requirements; the same data are available from them as a historical dataset. For precipitation, Synoptic also supports an advanced API that allows data retrieval with custom accumulation and interval windows. WeatherReal-Synoptic encompasses a greater volume of data, a more extensive observation network, and a larger number of stations than ISD. Note that Synoptic provides a quality control system as an additional attribute alongside the data from their API services; therefore, the quality control algorithm we developed independently has not been applied to the WeatherReal-Synoptic dataset.
The WeatherReal datasets are available from the following locations:
- WeatherReal-ISD: A single file in netCDF format for year 2023, available via GitHub LFS (see the loading sketch after this list).
- WeatherReal-Synoptic: Please reach out directly to Synoptic Data PBC for access to the data.
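As a quick start, here is a minimal sketch for inspecting the WeatherReal-ISD file with xarray. The file name follows the `--obs-path` example below, and the dimension names are assumptions that may differ in the released file.

```python
# Minimal sketch for inspecting WeatherReal-ISD with xarray.
# The file name and the "time" dimension are assumptions; check the
# released file for the actual naming.
import xarray as xr

obs = xr.open_dataset("weatherreal-isd.nc")
print(obs)  # variables listed in the table above: t, td, sp, msl, ws, wd, c, ra1, ...

# Example: annual-mean 2 m temperature at each station
print(obs["t"].mean(dim="time"))
```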
The evaluation code is written in Python and is available in the `evaluation` directory. The code is designed to be flexible and can be used to evaluate a wide range of forecast schemas. The launch script `evaluate.py` can be used to run the evaluation. The following example illustrates how to evaluate temperature forecasts from gridded and point-based models:
```bash
python evaluate.py \
    --forecast-paths /path/to/grid_forecast_1.zarr /path/to/grid_forecast_2.zarr /path/to/point_forecast_1.zarr \
    --forecast-names GridForecast1 GridForecast2 PointForecast1 \
    --forecast-var-names t2m t2m t \
    --forecast-reformat-funcs grid_v1 grid_v1 point_standard \
    --obs-path /path/to/weatherreal-isd.nc \
    --obs-var-name t \
    --variable-type temperature \
    --convert-fcst-temperature-k-to-c \
    --output-directory /path/to/output
```
We welcome all submissions of evaluation results using WeatherReal data! To submit your model's evaluation metrics to the leaderboard, please open an Issue with the tag "Submission" and include all metrics/tasks you would like to submit. Please also include a reference paper or link to a public repository that can be used to peer-review your results. We will review your submission and add it to the leaderboard if it meets the requirements.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.