Belgium_en.Rmd

---
title: "An analysis of excess mortality in 2020 in Belgium"
author: "Gael Fraiteur"
output:
  html_document:
    toc: yes
    toc_depth: 2
    fig_width: 9
    fig_height: 6
---

```{r setup, include=FALSE}
source("Belgium.R", local = knitr::knit_global())
knitr::opts_chunk$set(echo = FALSE, message = FALSE)


this_theme <- function() {
  theme_minimal() + theme(legend.position='bottom')   
}
```


## Introduction

The COVID19 epidemic has caused disruptions of our societies that have not been seen in Western Europe since World Ware II. However, it is still difficult to 
understand how exceptional this epidemic actually is compared to historic mortality data. Is it a once-in-a-century event, or does it repeat every ten years?
This article attempts to shed some light on these questions.

When discussing the effect of the COVID19 epidemic, Belgium is an interesting country to study for several reasons. First, it has
the highest reported death rate from COVID19 in the world. Secondly, Statbel, the Belgian governmental agency for statistics, publishes mortality
information every week with a delay of just 3 weeks. Last and not least, this is the country where the author grew up.

Our approach is to compare the number of deaths in 2020 (for weeks 1 to `r death_2020_max_week`) with the number of deaths that would be _expected_ based on 
the structure of the population on January 1^st^, 2020 and on the usual death rates. The difference between expectations and reality
gives us a metric named _excess deaths_ when it is positive, or _deaths deficit_ when it is negative.

We compare this metric with the number of deaths due to COVID as reported by Sciensano, and with historical death and mortality data.

At the end of the article, we try to answer a few controversial questions based on data.

This article, as well as all R scripts that have been used to compute the data, are hosted on GitHub at https://github.com/gfraiteur/mortality.
All data comes from open sources and can be freely downloaded. If you have a question or remark related to this article, 
or if you have found a bug or inaccuracy, please open an issue on GitHub or, better, submit a pull request.


## About the author

Gael Fraiteur graduated from the Louvain School of Engineering in 2001 as a civil engineer in applied mathematics. He also holds two minor degrees
in philosophy from UCLouvain. He has worked since then in the software industry. In 2004, he started an open-source project named PostSharp. In 2009, he founded a company
to market and develop the product. As the CEO and principal engineer of PostSharp Technologies, the author now shares his time between R&D, management and marketing. 


## Data sources

The data were downloaded from the following sources:

* Number of deaths: Statbel, 
  [Number of deaths per day, sex, age, region, province, district](https://statbel.fgov.be/en/open-data/number-deaths-day-sex-district-age).
  For 2020, this data is available until the `r death_2020_max_week`^th^ week.
  
* Structure of population: Statbel, 
  [Population by place of residence, nationality, marital status, age and sex](https://statbel.fgov.be/en/open-data?category=23)

* Cause of mortality: Statbel, 
  [Causes of death by month, sex, age group and region](https://statbel.fgov.be/en/open-data/causes-death-month-sex-age-group-and-region)

* COVID mortality: Sciensano, [Mortality by date, age, sex, and region](https://epistat.wiv-isp.be/covid/)

* General mortality: Human Mortality Database, [Death rates for Belgium, 1x1](https://www.mortality.org/cgi-bin/hmd/country.php?cntr=BEL&level=1)

* Temperatures: Royal Meteorological Institute of Belgium, 
  [Automatic weather station (AWS) daily observations](https://opendata.meteo.be/geonetwork/srv/eng/catalog.search;jsessionid=B123D8FF9D843F6B8721B8878EB55479#/metadata/RMI_DATASET_AWS_1DAY)

The script gives the exact URLs of the source data files we used, as well as the commands we have used to parse and prepare the dataset.

## Building a predictive model

If we want to compare the actual number of deaths in 2020 with what would be a "normal" number of deaths, we need to define what "normal" is.
Many authors have taken the average of the past 5 or 10 years as their point of comparison. However, this choice gives numbers that are
lower than what is realistic because the number of deaths has constantly increased of about 0.5% per year in the last 10 years. Comparing 2020 to 2015
would then cause a difference of 2,500 deaths, which is already significant.

```{r}

death_per_year <-
  death %>%
  filter ( year < 2020 ) %>%
  group_by( year ) %>%
  summarize( death_observed = sum( death_observed ))

yearly_death_increase <-
  coefficients( lm( death_per_year$death_observed ~ death_per_year$year) )[2] / 
  death_per_year %>% summarise( avg_death = mean(death_observed) )

death_per_year %>%
  ggplot( aes( x = year, y = death_observed)) + 
    geom_line(aes()) + 
    geom_smooth(method='lm', se = FALSE) +
      ggtitle("Number of deaths per year") +
    scale_x_continuous( breaks = 2009:2019,name = "year") +
    scale_y_continuous( labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "deaths") +
    this_theme()

```

Our model of yearly mortality is built in three steps: 

1. The first input is the structure of the population the 1^st^ of January of each year between 2009 and 2020 (i.e. the number of residents of a given age and sex alive on that day).

2. The second input is the mortality data for Belgium for the corresponding age, sex and year, i.e. the probability that
a person who was alive on January 1^st^ morning would be dead on December 31^st^ evening. The expected number of deaths for a given group and year is simply the
product of the first input with the second input.

3. We split the yearly data into weeks based on weekly historic death data.


### 1. Structure of the population

The first input of our model is the structure of the population published by Statbel (see data sources). The data set gives us the number
of inhabitants of each sex who are alive and have a specific age on January 1^st^ of a given year. This data is available since 2009.

As you can see on the next graph, the number of people aged 65 years or more is increasing every year by approximately 34,000, a 1.6% growth.

```{r}

population_structure %>%
  filter( age_group %in% c("85+", "75-84", "65-74")) %>%
  group_by(age_group, year) %>%
  summarise( population_count = sum(population_count) ) %>%
 ggplot( aes( x = year, y = population_count, fill = age_group)) + 
  geom_area(aes()) +
  scale_x_continuous(name = "year", breaks = 2009:2020) +
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "inhabitants") +
  this_theme() +
  ggtitle("Number of inhabitants aged 65 or more")

population_older_65 <- population_structure %>%
  filter( age_group %in% c("85+", "75-84", "65-74")) %>%
  group_by(year) %>%
  summarise( population_count = sum(population_count) ) 
  
  # lm(population_older_65$population_count ~ population_older_65$year )
  
  # population_older_65

```

### 2. Death rate

The _death rate_ for a given period is the probability that someone of a given group (i.e. given age group and sex) who is alive
at the beginning of the period would be alive at the end of the period. Unless stated otherwise, the period is a calendar year.

Let's look at the death rate for a few ages. As you can see, the death rate is continuously decreasing except for
the age group over 100. This graph also shows a greater variance of the death rate for older age groups, which seems
to mean that these age groups are more sensitive to events that happen less than once a year, such as epidemics
or exceptional weather.


```{r}

mortality_for_graph <- mortality %>%
  filter( year >= 2009 & age %in% seq( 10, 100, 10)  )  %>%
  group_by( age, year ) %>%
  summarise( mortality = mean(mortality)) %>%
  mutate( is_old = age >= 50 )
  
mortality_for_graph$age <- as.factor(mortality_for_graph$age)

mortality_for_graph %>%
  filter( is_old == TRUE ) %>%
   ggplot( aes( x = year, y = mortality, color = age) ) +
    geom_line() +
  scale_color_discrete() +
  scale_x_continuous(name = "year") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), name = "death rate")  +
  geom_smooth( method='lm', se = FALSE, linetype="dotted", weight=1) +
  this_theme() +
  ggtitle("Death rate for different ages over 50 in 2009-2018")

mortality_for_graph %>%
  filter( is_old == FALSE ) %>%
   ggplot( aes( x = year, y = mortality, color = age) ) +
    geom_line() +
  scale_color_discrete() +
  scale_x_continuous(name = "year") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 0.01), name = "death rate", limits = c(0,NA))  +
  geom_smooth( method='lm', se = FALSE, linetype="dotted", weight=1) +
  this_theme() +
  ggtitle("Death rate for different ages under 50 in 2009-2018")

```

We model the death rate with a linear regression for each age and sex. This model allows us to extrapolate
the data to 2019 and 2020, and removes the year-to-year variations for all previous years. That is, this
death rate model removes the effect of epidemics and weather conditions that happen less frequently than yearly.

Once we have a death rate model for each group and year, we multiply this coefficient by the actual
population for this group and year, which should give us the number of expected deaths. However, the expected and observed
number of deaths, summed from 2009 to 2019, don't match exactly. This discrepancy is expected and its cause is not important.
To cancel the discrepancy, we apply the following correction factors:

```{r}
pivot_wider(model_correction, names_from = sex, values_from = correction  ) %>% 
  knitr::kable( col.names = c("Age group", "F", "M"), caption = "Correction factors", format = "markdown")
```

The following graphs shows the resulting projections of number of deaths per year and age group.

```{r}

# Graph: expected number of deaths per year and age group
death_expected_yearly %>%
  group_by(year, age_group) %>%
  summarise( death_expected_yearly = sum(death_expected_yearly)) %>%
  ggplot( aes( x = year, y = death_expected_yearly, fill=age_group)) + 
    geom_area() + 
    ggtitle("Expected number of deaths per year") +
    scale_x_continuous(name = "year", breaks = 2009:2020) +
    scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), 
                       name = "deaths", 
                       breaks = 10000 * (0:10), limits = c(0,NA)) +
    this_theme() +
     guides(fill = guide_legend(reverse=TRUE))

```

Let's now compare the projections with the reality:

```{r}
# Graph: expected number of deaths per year vs reality
death_expected_yearly %>%
  filter(year <= 2019) %>%
  group_by(year) %>%
  summarise( death_expected_yearly = sum(death_expected_yearly), death_observed_yearly = sum(death_observed_yearly)) %>%
  ggplot( aes(year, y = value)) + 
    geom_line(aes(y = death_expected_yearly, col = "expected")) + 
    geom_line(aes(y = death_observed_yearly, col = "real")) +
    ggtitle("Expected vs observed number of deaths per year") +
    scale_x_continuous(name = "year", breaks = 2009:2019) +
    scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE)) +
    this_theme()+ theme(legend.title = element_blank()) 

```


### 3. Splitting years into weeks

We now have a yearly model, but we need weekly projections. For our weekly model, we first compute, for each sex and age group, the percent of deaths
that happens in a given week of the year, and we multiply this coefficient with the yearly death rate for this
year. Note that we applied a 3-week centered rolling mean to the data series before aggregating per week of year.


```{r}

death_per_week_of_year %>% 
  group_by(week_of_year, age_group) %>% 
  summarise( week_death_percent = mean(week_death_percent) ) %>%
  ggplot( aes( x = week_of_year, y = week_death_percent)) + 
  geom_line(aes(color=age_group)) + 
  ggtitle("Distribution of deaths per week") +
  scale_x_continuous(name = "week of year") +
  scale_y_continuous(labels = scales::percent_format(), name = "percent of deaths in this week")  +
  this_theme()

```

We can notice a higher seasonality of death rate for the more aged groups. For the younger groups, there is almost
no seasonal correlation but there is however more noise due to the lower amount of data points. We chose not to
address this problem as it should have only minor impact on the conclusions.

With this weekly model, we can finally compute the weekly predictive model and compare it with the reality:


```{r}
# Graph: expected number of deaths per week vs reality
death_expected_weekly %>%
  group_by(date) %>%
  summarise( death_expected = sum(death_expected), death_observed = sum(death_observed)) %>%
  ungroup() %>%
  filter(death_observed>1500) %>%
  arrange(date ) %>%
  ggplot( aes(x = date)) + 
    this_theme() + theme(legend.title = element_blank())  +
#    geom_ribbon( aes(x = date, ymin = death_expected, ymax = death_observed, fill = death_expected >= death_observed, size =0),
#                 alpha = 0.4, show.legend = FALSE) +
    geom_line(aes(y = death_expected, col = "expected")) + 
    geom_line(aes(y = death_observed, col = "real")) +
    scale_x_date( date_breaks = "1 year", labels = date_format("%Y"), name = "year") +
    scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "deaths per week", limits = c(0,5000)) +
    ggtitle("Expected vs observed number of deaths per week") 
```

## Excess death rate

_Excess deaths_ is the different between the actual number and the expected number of deaths. When there are fewer deaths
than expected, this is called _death deficit_.

The following graphs represents excess deaths from 2009:

```{r}

# Graph: excess death rate
death_expected_weekly %>%
  group_by(date) %>%
  summarise( excess_death = sum(excess_death)) %>%
  ungroup() %>%
  filter(excess_death>-500) %>%
  ggplot( aes(x = date, y = excess_death)) +
    geom_line( color = 'cyan3') +
    this_theme() +
    scale_x_date( date_breaks = "1 year", labels = date_format("%Y"), name = "year") +
    ggtitle("Excess deaths per week") 
```

## Cumulative excess mortality

Weekly excess mortality does not give a good insight of the impact of a specific _event_, such as an epidemic or a special weather condition,
over a long period of time. It does not answer the following questions:

* How many people died during a specific event?
* How long does it take to "recover" from the event?

To answer these questions, the cumulative excess mortality (i.e. the integral of the curve over time) is a better metric.

The following graphs show the cumulative death, first starting from January 1^st^, 2009, and then zooming in on 2019-2020.


```{r}

all_cum_death_expected_weekly <-
cum_death_expected_weekly %>%
  group_by(date) %>%
  summarise( excess_death = sum(excess_death), cum_excess_deaths = sum(cum_excess_deaths), year = mean(year)) %>%
  ungroup()

all_cum_death_expected_weekly %>%
  ggplot( aes(x = date)) +
    geom_line(aes(y = excess_death, col = "weekly excess deaths")) + 
    geom_line(aes(y = cum_excess_deaths, col = "cumulative excess deaths")) +
    scale_x_date( date_breaks = "1 year", labels = date_format("%Y"), name = "year") +
    ggtitle("Cumulative excess deaths") +
    this_theme() + theme(legend.title = element_blank()) +
    scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "excess deaths", breaks = seq( -2000, 20000, 2000))

max_cum_death_deficit <- 
  (all_cum_death_expected_weekly %>%
  filter(year>=2019) %>%
  summarise( max_cum_death_deficit = min(cum_excess_deaths) ))$max_cum_death_deficit
  
max_cum_death_deficit_date <-
(all_cum_death_expected_weekly %>%
  filter( cum_excess_deaths == max_cum_death_deficit ))$date


all_cum_death_expected_weekly %>%
  filter(year>=2019) %>%
  group_by(date) %>%
  summarise( excess_death = sum(excess_death), cum_excess_deaths = sum(cum_excess_deaths)) %>%
  ungroup() %>%
  ggplot( aes(x = date)) +
  geom_line(aes(y = excess_death, col = "weekly excess deaths")) + 
  geom_line(aes(y = cum_excess_deaths, col = "cumulative excess deaths")) +
  geom_point(aes(x = max_cum_death_deficit_date, y = max_cum_death_deficit), color = "green") +
  scale_x_date( date_breaks = "1 month", labels = date_format("%m"), name = "month") +
  ggtitle("Cumulative excess deaths (2020)") +
  this_theme() +
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "excess deaths", breaks = seq( -2000, 20000, 2000)) +
  theme(legend.title = element_blank())

```

Note that the initial value of the cumulative metric is arbitrary. Here, we chose that the zero point would be January 1^st^, 2009. Because
our model is based on a linear regression over years, and because of the yearly fluctuations, we can expect that the zero point is crossed 
every second year.

The second arbitrary zero is the one of January 1^st^, 2020. This one is the consequence of the calibration of our model with real data for the period from 2009 
to 2019. It is important to understand how arbitrary this zero is.

The cumulative maximal death deficit was observed on March 11th, 2020, with a value of `r abs(round(max_cum_death_deficit))` (point represented in green in the 
graph above). So when comparing the expected cumulative mortality with the real cumulative mortality, a difference of `r abs(round(max_cum_death_deficit))` can 
be obtained just by choosing a different beginning of the comparison time window.

### Excess death rate by age group

Taking this remark into account, we can now look at the increased death rate for the year 2020 (between weeks 1 and `r death_2020_max_week`) for different age 
groups. The following table shows the difference in death rate during the studied period in absolute terms and in terms that are relative to the expected values 
for 2020.


```{r}

excess_death_2020 %>% 
  select( age_group, sex, expected_mortality, observed_mortality, excess_mortality, excess_death_percent  )   %>% 
  mutate( expected_mortality = label_percent(accuracy = 0.01 )(  expected_mortality ) ,
          observed_mortality = label_percent(accuracy = 0.01 )(  observed_mortality ) ,
          excess_mortality = label_percent(accuracy = 0.001 )(  excess_mortality ) ,
          excess_death_percent = label_percent(accuracy = 1 )(  excess_death_percent ) ) %>%
  knitr::kable( col.names = c("Age group", "Sex", "Expected death rate", "Actual death rate", "Absolute excess death rate", "Relative excess death rate"), 
                caption = sprintf( "Death rates by age group during week 1-%d of 2020", death_2020_max_week ), 
                format = "markdown",
                format.args = list(decimal.mark = ",", big.mark = "'"))

```


The following graph shows the relative values only:

```{r}
# Graph: excess death rate in 2020 compared
ggplot(excess_death_2020, aes(x=age_group, y=excess_death_percent, fill=sex)) +
  geom_bar(stat="identity", color="black", position=position_dodge())+
  this_theme() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), "excess death rate") +
  ggtitle("Relative increase or decrease of death rate by age group and sex in 2020 compared to a normal") 

excess_death_2020_by_age_group <-
  excess_death_2020 %>%
     group_by( age_group ) %>%
    summarise( excess_death_percent = mean(excess_death_percent))

excess_death_2020_age_20 <- (excess_death_2020_by_age_group %>% filter( age_group == "0-24" )) $ excess_death_percent
excess_death_2020_age_84 <- (excess_death_2020_by_age_group %>% filter( age_group == "75-84" )) $ excess_death_percent

```

As you can see, the age group whose death rate is the most different in 2020 than the expectation is the younger one,
which had `r paste(abs(round(100*excess_death_2020_age_20)), "%", sep="")` less risk to die in 2020 than in another year.
We will come back to this number.

The age group whose death rate is the most negatively affected in 2020 is the 75-84 one, with a risk of dying 
`r paste(abs(round(100*excess_death_2020_age_84)), "%", sep="")` higher than expected.

### Absolute death rate in historical context

To give some context, here is the historic death rate since 2008 compared to what was _expected_ for 2020. The dashed
horizontal lines show the _actual_ death rate in 2020. It can be seen on this graph that the death rates observed in 2020 would
be typical for year 2012.


```{r}

mortality_in_2020 <- mortality_lr_coefficients
mortality_in_2020$mortality_2020 <- mortality_in_2020$intercept + 2020 * mortality_in_2020$year
mortality_in_2020 <- mortality_in_2020 %>% select( -year, -intercept )


mortality_compared_2020 <-
  merge( complete( mortality, sex, year, age ), mortality_in_2020, by = c("age", "sex" ))
  
mortality_compared_2020$compared_mortality = mortality_compared_2020$mortality / mortality_compared_2020$mortality_2020

mortality_compared_2020 <- na.omit( mortality_compared_2020 )

compared_mortality_for_graph <- mortality_compared_2020 %>%
  group_by( age_group, year ) %>%
  summarise( mortality_compared_2020 = mean(compared_mortality))

compared_mortality_for_graph_hline_85 <-
  1+(excess_death_2020_by_age_group %>% filter( age_group == "85+"))$excess_death_percent

compared_mortality_for_graph_hline_75 <-
  1+(excess_death_2020_by_age_group %>% filter( age_group == "75-84"))$excess_death_percent

compared_mortality_for_graph_hline_65 <-
  1+(excess_death_2020_by_age_group %>% filter( age_group == "65-74"))$excess_death_percent


compared_mortality_for_graph %>%
  filter( year >= 2008 & age_group != "0-24" ) %>%
   ggplot( aes( x = year, y = mortality_compared_2020, color = age_group) ) +
    geom_line() +
  scale_color_discrete() +
  scale_x_continuous(name = "year", limits = c(2008, 2020), breaks = 2*1000:1010) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), name = "death rate relatively to 2020", limits = c(0.9,1.5), breaks = seq(1,3,0.2) ) +
  geom_hline( aes(yintercept = compared_mortality_for_graph_hline_85, color = "85+"),  alpha = 0.5,  linetype = "dashed" ) +
  geom_hline( aes(yintercept = compared_mortality_for_graph_hline_75, color = "75-84"), alpha = 0.5,  linetype = "dashed"  ) +
  geom_hline( aes(yintercept = compared_mortality_for_graph_hline_65, color = "65-74"), alpha = 0.5,  linetype = "dashed"  ) +
  this_theme() +
  ggtitle("Historic death rate relative to the expected death rate in 2020")

```


For more perspective, here the same graph starting from 1900 on a logarithmic vertical scale.

```{r}

compared_mortality_for_graph %>%
  filter( year >= 1900 ) %>%
   ggplot( aes( x = year, y = mortality_compared_2020, color = age_group) ) +
    geom_point( size = 1 ) +
  scale_color_discrete() +
  scale_x_continuous(name = "year", breaks = seq(1900,2020,10)) +
  scale_y_log10(labels = scales::percent_format(accuracy=1), name = "death rate relatively to 2020 normal", breaks = 10 ^ seq(0,2,0.2), limit = c(1,100))  +
  this_theme() +
  ggtitle("Historic death rate relative to the expected death rate in 2020")
```

### Death rate variation in historical context

We have seen that the death rate in 2020 was typical for 2012. How can we then explain the framing of the
epidemic, by the government and media, as a catastrophe of unseen since Word War II or even the Spanish flu? 
Looking at the _year-to-year variation_ of death rate instead of the death rate itself gives some hints. 

How far should we look in the past to see a death rate that is, in absolute terms, 1% to 2% higher than normal?

To answer this question, we will use the death rate data of the Human Mortality Database since 1900. We will
define "normal" as being a 5-year rolling median. The median does a better job than the mean to eliminate exceptional events like epidemics.

The following graph shows only strictly positive deviations and the vertical axis is logarithmic, with horizontal
lines for 0.2%, 1% and 2%, the approximate values of excess death rate for the three oldest age groups.

```{r}

mortality_deviation_min_year <- 1900
mortality_deviation_max_year <- 2020

mortality_deviation %>%
  filter(year>=mortality_deviation_min_year & deviation > 0 ) %>%
  ggplot( aes( x = year, y = deviation, color = age_group ) ) +
    geom_point(aes( shape = sex ), size = 1 ) +
    scale_y_log10(labels = scales::percent_format(accuracy=0.001), name = "deviation", breaks = c( 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1, 1e1))  +
    scale_x_continuous(name = "year", limits = c(mortality_deviation_min_year, mortality_deviation_max_year), 
                       breaks =  seq(mortality_deviation_min_year, mortality_deviation_max_year, 10), minor_breaks = NULL )  +
    this_theme() +
    geom_hline( yintercept = 0.02, color = "blue", alpha = 0.4, linetype = "dashed" ) +
    geom_hline( yintercept = 0.01, color = "blue", alpha = 0.4, linetype = "dashed" ) +
    geom_hline( yintercept = 0.002, color = "blue", alpha = 0.4, linetype = "dashed" ) +
    ggtitle("Relative deviation of the yearly death rate from the 5-year rolling median death rate")

```


The next graph  aggregates the data by decade and shows the maximum deviation of the yearly death rate from the median death rate during the whole decade, 
again with horizontal lines for 0.2%, 1% and 2%.

```{r warning=FALSE}

mortality_deviation %>%
  filter(year>=mortality_deviation_min_year  ) %>%
  mutate( year_rounded = year - ( year %% 10) + 5) %>%
  group_by( age_group, year_rounded ) %>%
  summarise( max_deviation = max(deviation)) %>%
  ggplot( aes( x = year_rounded, y = max_deviation, color = age_group ) ) +
  geom_line(aes(  )) +
  scale_y_log10(labels = scales::percent_format(accuracy=0.001), name = "deviation", 
                limit = c(1e-5,1),
                breaks = c( 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1))  +
  scale_x_continuous(name = "year", limits = c(mortality_deviation_min_year, 2020), 
                     breaks =  seq(mortality_deviation_min_year, 2020, 10), minor_breaks = NULL )  +
  this_theme() +
    geom_hline( yintercept = 0.02, color = "blue", alpha = 0.4, linetype = "dashed" ) +
    geom_hline( yintercept = 0.01, color = "blue", alpha = 0.4, linetype = "dashed" ) +
    geom_hline( yintercept = 0.002, color = "blue", alpha = 0.4, linetype = "dashed" ) +
    ggtitle("Maximal deviation the yearly death rate from the 5-year rolling median")
```

It is clear from these 2 graphs that a 2% deviation is usual and even on the small end for the age group over 85 years.

However, for the groups 65-74 and 75-84, we have to go back to the 1960s to see a similar deviation.


### Compared death rate: conclusion

How exceptional is 2020 to other years? The conclusion is paradoxical.

If we look at the absolute death rates, 2020 is similar to 2012. If we look at the excess death rate for the 85+ age group, we also find numbers 
that would be usual around 2010. We could then say that 2020 is a one-in-ten-years occurrence, which would make it difficult to justify the
extraordinary restrictions on freedom of movement and gathering.

However, the real originality of 2020 from the point of view of historic death data is not the absolute death rate, not even the excess death rate
for the oldest age group, but the _excess death rate of the 64-85 age group_, for which there is no equivalent since the 1960s.


This paradox brings light to the polarized discussion between those who believe that the epidemic is a normal event that
happens every 10 years (it is, in terms of absolute death rate) and those who believe there was no worse event since World War II
(this is true but the 1960s were not far).


## Comparison with reported COVID deaths

The following graph compares the number of weekly excess deaths with the number of deaths attributed to COVID19, as
reported by Sciensano.

```{r}
 
compared_covid_death %>%
  group_by( date ) %>%
  summarise( covid_death = sum(covid_death), excess_death = sum(excess_death) ) %>%
  ggplot( aes(x = date)) +
    geom_line(aes(y = covid_death, col = "covid deaths")) + 
    geom_line(aes(y = excess_death, col = "excess deaths")) +
    geom_hline(yintercept = 0) + 
    scale_x_date( labels = date_format("%m-%y"), date_breaks = "1 month", name = "date") +
    scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "deaths per week") +
    ggtitle("Weekly COVID19 deaths vs excess deaths") + 
    this_theme() +
    theme(legend.title = element_blank())
```


We can see that the number of deaths attributed to COVID19 by Sciensano copies quite well the number of excess deaths
estimated by our model, at least during the two epidemic peaks. The most notable exception is during the deaths
peak of August. The second difference that needs to be discussed is the death deficit after the spring peak.


### Death peak in August

```{r}

heat_death_2020 <- round(
(death_expected_weekly %>%
  filter( date >= as.Date("2020-08-01")  & date < as.Date("2020-08-31") ) %>%
  summarise( excess_death = sum(excess_death) ))$excess_death )

heat_death_2019 <- round(
(death_expected_weekly %>%
  filter( date >= as.Date("2019-07-01")  & date < as.Date("2019-07-31") ) %>%
  summarise( excess_death = sum(excess_death) ))$excess_death )


```

The number excess deaths during the heat wave of August 2020 was `r heat_death_2020`. By comparison, the
heat wave of July 2019 caused an excess death of `r heat_death_2019`: 10 times less.

Can this be explained by exceptional temperatures? The following graphs shows the history of maximum daily temperatures in Belgium as well of a 7-day and a 
30-day right-aligned moving average of this metric.


```{r warning=FALSE}

weather %>%
  filter( date >= as.Date("2018-06-01") ) %>%
  group_by( date ) %>%
  summarise( air_temperature = max(air_temperature_max) ) %>%
  ungroup() %>%
  ggplot( aes( x = date, y = air_temperature ) ) +
      geom_point(aes(),alpha=0.1, size = 1 )+
      geom_line(aes(y=rollmeanr(air_temperature, 7, na.pad=TRUE)),color = "blue")+
      geom_line(aes(y=rollmeanr(air_temperature, 30, na.pad=TRUE)),color = "red")+
      ggtitle("Maximum daily temperatures and 7-day and 30-day moving averages") +
      scale_x_date( labels = date_format("%m-%y"), date_breaks = "2 month", 
                    date_minor_breaks = "1 month", limits = c(as.Date("2018-06-01"),as.Date("2020-12-31") ) ) 

```

As we can see, the maximal weekly temperatures were not exceptional compared to 2019 or even 2018. Therefore, temperature alone cannot
be a sufficient explanation of the peek of deaths in August 2020. Other factors need to be considered, for instance a generally
weaker condition of the population due to the epidemic or to the lockdown itself.

### Death deficit between death peeks

The second difference between the COVID19 death curve and the excess death curve is that the second shows a death deficit
after the first epidemic peak. This means that a part of people who died during the epidemic peak would have died anyway a couple
of weeks later.

This is more visible on the cumulative excess deaths graph

```{r}
all_cum_death_expected_weekly %>%
  filter(date>=as.Date("2020-03-01")) %>%
  group_by(date) %>%
  summarise( excess_death = sum(excess_death), cum_excess_deaths = sum(cum_excess_deaths)) %>%
  ungroup() %>%
  ggplot( aes(x = date)) +
  geom_line(aes(y = excess_death, col = "weekly excess deaths"), linetype="dashed") + 
  geom_line(aes(y = cum_excess_deaths, col = "cumulative excess deaths")) +
  scale_x_date( date_breaks = "1 month", labels = date_format("%m"), name = "month") +
  ggtitle("Cumulative excess deaths (2020)") +
  this_theme() +
  scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "excess deaths") +
  theme(legend.title = element_blank())
```

```{r}
death_excess_summer <-
  all_cum_death_expected_weekly %>%
  filter( date >= as.Date("2020-05-01") & date < "2020-08-01") %>%
  summarise( min_cum_excess_deaths = min(cum_excess_deaths), max_cum_excess_deaths = max(cum_excess_deaths))

death_excess_spring <-
  all_cum_death_expected_weekly %>%
  filter( date >= as.Date("2020-03-01") & date < "2020-06-01") %>%
  summarise( min_cum_excess_deaths = min(cum_excess_deaths), max_cum_excess_deaths = max(cum_excess_deaths))

net_death_excess_spring <- round(death_excess_spring$max_cum_excess_deaths - death_excess_spring$min_cum_excess_deaths)
net_death_excess_summer <- round(death_excess_summer$max_cum_excess_deaths - death_excess_summer$min_cum_excess_deaths)

```

During the spring peak, the cumulative excess death was `r net_death_excess_spring`. During the summer, before the August peak, the death deficit was 
`r net_death_excess_summer`, that is, approximately `r paste(abs(round(100*net_death_excess_summer/net_death_excess_spring)), "%", sep="")`
of people who died during of COVID19 during spring had their life shortened by less than 3 months.

### Comparision of number of deaths by age group

The following chart compares the number of excess deaths in 2020 with the number of COVID19 deaths.

```{r}

compared_covid_death_by_age_group <-
  compared_covid_death %>%
                 group_by( age_group ) %>%
                 summarise( covid_death = sum(covid_death), excess_death = sum(excess_death))

compared_covid_death_age_group_85 <- 
    compared_covid_death_by_age_group %>% filter(age_group == "85+")

reshape2::melt(compared_covid_death_by_age_group,id=("age_group")) %>%
  ggplot( aes(x=age_group, y=value, fill=variable)) +
    geom_bar(stat='identity', position='dodge') +
    scale_y_continuous(labels=function(x) format(x, big.mark = ",", scientific = FALSE), name = "deaths") +
    scale_x_discrete((name='age group')) +
    scale_fill_discrete(labels=c('COVID deaths', 'excess deaths') ) +
    ggtitle("COVID mortality vs excess mortality per age group")  + 
    theme(legend.title = element_blank())

```

The numbers are quite similar, except for the oldest age group where COVID19 deaths are 
`r round(compared_covid_death_age_group_85$covid_death - compared_covid_death_age_group_85$excess_death)` 
(`r paste(abs(round(100-100*compared_covid_death_age_group_85$covid_death/compared_covid_death_age_group_85$excess_death)), "%", sep="")`)
higher than excess deaths. This means than an approximate 20% of the people over 85 who died of COVID19 during 2020 (or were diagnosed as such)
would have died from another cause in the same period

## Decreased death rate of the youngest age group

The death rate of the 0-24 age group is `r paste(abs(round(100*excess_death_2020_age_20)), "%", sep="")` lower in 2020 than expected. How
can we explain this significant difference? Intuitively, this could be explained by the lower exposure to risk of accidents. Let's test
this hypothesis based on the _Causes of death by month, sex, age group and region_ data set published by Statbel (see data sources).

The cause code V01-Y98 includes all deaths by external causes, including accidents and homicides. This data is available from 2009 to 2019.
Let's visualize the percent of accidental deaths by age group and sex:

```{r}
 
accidental_death %>% 
  filter( year == 2017 ) %>%
  summarise( percent_accidental_death = sum( accidental_death_observed ) / sum(all_death_observed)) %>%
  ggplot( aes( x = age_group, y = percent_accidental_death, fill = sex )) +
    geom_bar(position = "dodge", stat="identity") +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1), name = "percent of deaths from external causes", limit = c(0,NA))  +
    ggtitle("Percent of deaths from external causes") +
    this_theme()  

```

```{r}
accidental_death %>% 
  filter( age_group == "0-24") %>%
  ggplot( aes( x = year, y = percent_accidental_death, color = sex )) +
    geom_line() +
    ggtitle("Percent of deaths from external causes in the 0-24 age group") +
    this_theme()  +
    scale_x_continuous(name = "year", limits = c(2009, 2020), breaks = 2009:2020, minor_breaks = NULL ) +
    scale_y_continuous(labels = scales::percent_format(accuracy = 1), name = "percent of deaths from external causes", limit = c(0,NA))  +
    geom_smooth( method = lm, se=FALSE, linetype="dotted")
  
```

If we extrapolate this data linearly to 2020, we get a percent of accidental deaths of
`r paste(round(100*accidental_death_2020[accidental_death_2020$sex=='M',"percent_accidental_death"]), "%", sep="") ` for males and 
`r paste(round(100*accidental_death_2020[accidental_death_2020$sex=='F',"percent_accidental_death"]), "%", sep="") ` for females

This shows that we cannot explain the decrease of the death rate of the youth in 2020 solely by the decrease
of accidental mortality caused by the lockdowns. For males, it would mean that there would be almost deaths
by accident during the whole period (not just during the lockdowns). For females, accidental death is 
smaller than the mortality drop.

Therefore, other factors have to be considered to explain this drop, and the data we analyzed does not provide
any useful hint. 

Because of this, it is also useless to try to interpret the death rate in 2020 as a negative factor to the
the virus itself, and a positive factor due to the restrictions of movements and activities. Both factors
probably existed, but the data available now do not allow to quantify them.


## Revisiting controversial questions

This article allows to answer a few polemic questions that rose around the COVID19 epidemic and the response of the
government and mainstream media:

* **Is the number of deaths reported by Sciensano exaggerated, either by biased diagnostic methods or deliberately? **
  
  **No**, there is no indication that the death data reported by Sciensano are incorrect.
  
  
* **Would a majority of people who died from COVID19 have died from another cause very soon?**

  **No**. Approximately `r paste(abs(round(100*net_death_excess_summer/net_death_excess_spring)), "%", sep="")`
  of people who died during of COVID19 would have died less than 3 months from other causes. Based on death data,
  we cannot take any conclusion for a longer time period or for the people who died in autumn.

  
* **Has the restrictions of freedom of movement impacted the death rate?**

  **Yes**, the death rate on the 0-24 age group has decreased by 35% relatively to normal in 2020. However this decrease cannot be explained by a
  decrease of accidental mortality alone.
  
    
* **How exceptional was year 2020 in terms of mortality?**

  By definition, an _exception_ is an event that does not occur frequently. Therefore, to assess the exceptionally of an event, we
  to find a comparable precedent.
  
  Whether we look at the _absolute_ or _excess_ death rate, we get a different level of exceptionally.
  
  **Once in 50 years** if you look at excess death rate for the age group from 65 to 74 years, i.e. the risk of dying in 2020 compared to expectations.
  There is no equivalent of the excess death rate since the 1960s.
  
  **Once in 10 years** if you look at absolute death rate number. The absolute death rates of 2020 are close to the ones of 2010. That is, if there
  were in 2010 the same number of deaths than in 2020 for the same population structure, this would be a normal year.
  
  2020 is _at most_ 20% worse than a normal year. For people in the most affected age group, the risk to dying of coronavirus in 2020 (_additionally_ to
  normal risks), was 20% than the risk of dying from another reason. 
  

This article also lets a few questions unanswered:

* **How to explain the 35% decrease of mortality in the 0-24 age group?**

  A decrease of accidental mortality is not a sufficient explanation. This mortality deficit should be modeled and applied to the 25-45 age group,
  which is typically also subject to higher accidental death. This would result in a higher death rate attributed to COVID19 in this age group.
  The data analyzed here did not allow us to perform such analysis.

* **How to explain the increased mortality during August 2020?**
  
  The mortality was 10 times higher during the heat peak of July 2019 than during the one of July 2019. Temperatures
  were not significantly higher, therefore another explanation is necessary. 
  

Probably the most important question that article _cannot_ answer is: **What would have been the mortality _without_ the lockdowns?** Analyzing
past mortality data in only one country certainly cannot answer this question. Lockdowns do not heal, but they help decreasing the mortality
in two ways: first by avoiding the overload of the health system, secondly because the knowledge of the virus and its cures increase overtime

  
## Conclusion

This article challenges the believe that the 85+ age group is the most affected by the COVID19 epidemic.
In terms of absolute mortality. However you look at the _relative_ increase of the death rate, the 85+ age group is twice less affected than the 75-74 one. 

Perhaps the most striking conclusion is the paradox between the fact that, on one side,
the death rate in 2020 was equivalent to the one around 2010 and, on the other side, there has not been a comparable
year-to-year variation of death rate for the 65-84 age group until the 1960s.

What happened that a death rate by age group that would have been normal in 2010 now causes major disruptions of our societies?

An element of answer is that disruptions were imposed, not based on real mortality numbers, but based on _expectations_ of mortality in order 
to _prevent_ higher mortality. It would be more fair to judge the efficiency of political action by the _avoided_ mortality rather than by the actual mortality,
which means that we would need to compare actual mortality with expected mortality for the COVID-19 epidemic in Belgium. It will not be possible
to evaluate the impact of the lockdowns until the epidemic is completely finished. Until such analysis is possible, we can only reason based
on actual mortality.


Historic data clearly shows both the death rate and the year-to-year deviation from the expected death rate have constantly decreased since the 1950s. 
This constant reduction of risk may have major sociological, psychological and political implications. Rulers and politicians alike have instrumentalized fear 
for centuries in order to control populations, and instrumentalized victims in order to justify their oppressive policies and consolidation of power. With an increasingly lower 
tolerance to the risk of dying, the political opportunity to instrument epidemics is also growing. This effect is amplified by the simple fact that the number 
of older voters is increasing and politicians have a record of implementing short-sighted strategies just to win more votes.

We have seen that 2020 can be compared to 2010 in terms of absolute mortality. We should also compare 2020 to past years using
other criteria: quality of life, actuation of civic rights, possibility to die in dignity, social justice, respect of the State of Law,
freedom of speech, ... When was the last time when we saw similar regressions in Belgium? We would probably need to look back to 1940-1945. 

Are we willing to live in wartime conditions to maintain a mortality under what would be acceptable just one or two decades ago? What it the
acceptable risk? When should we stop accepting restrictions to live in the name of reducing risk?

Once the COVID-19 epidemic will belong to the past, once will be able to estimate the real costs and benefits of the lockdowns, these questions will become unavoidable. 

It has been repeated _ad nauseam_ that the economics must not prevail over human life. This oversimplified moral imperative has been used to justify the sacrifice
of small businesses, students and other segments of the population, or the air transportation industry. This motto must be questioned. At least, the media should return
it to a few big companies in industries like food, oil, tobacco or pharmacy -- to see if really still applies to everybody.

The real question of 2020 is rather: should mere biological survival prevail over the possibility a fully human life, a human
life with friends, family, sports, politics, arts, religion, ... all the facets that make life really human.


## Related works

* Scientific Directorate of Epidemiology and public health, [All-cause mortality supports the COVID-19 mortality in Belgium and comparison with major fatal events of the last century](https://archpublichealth.biomedcentral.com/articles/10.1186/s13690-020-00496-x) (covers only March-June 2020)