This repository has been archived by the owner on Mar 10, 2023. It is now read-only.

case numbers for Germany are diverging further from official numbers #1008

Open
ms82494 opened this issue Mar 19, 2020 · 31 comments

Comments

@ms82494

ms82494 commented Mar 19, 2020

This issue is very substantial and getting bigger. Let me illustrate:

Date      JHU data    Official German data
16-Mar    7,272       6,012
17-Mar    9,257       7,156
18-Mar    12,327      8,198

This has been reported in this thread: #826, complete with an explanation of the likely source of the error. Unfortunately, the maintainers don't seem to have taken note of issue #826, and we are getting further and further away from the official numbers. It's getting implausible to ascribe this to timing issues (today's official numbers are more than 1,000 below YESTERDAY's JHU numbers) or to presumptive cases (how could there be 4,000 or more presumptive positive tests piling up in the processing queue?).

The JHU numbers are now widely reported back into Germany, where they cause confusion. Here's a German news article that discusses the discrepancy between JHU and official numbers. Note that the article vaguely attributes the difference to JHU having "real-time" data and/or knowledge of (or somehow modeling) so-far undetected cases that likely exist in the population. I don't think either of these theories holds any water.

Maybe I am missing something, in which case I'd welcome an explanation. Otherwise, I think it would be best to ignore the numbers reported on German COVID cases in this repo.

@ms82494 changed the title from "case numbers for Germany are diverge further from official numbers" to "case numbers for Germany are diverging further from official numbers" on Mar 19, 2020
@mwargan

mwargan commented Mar 19, 2020

Good comments here! I wouldn't ignore the numbers, though, but rather look at how they can be fixed...

@ms82494
Author

ms82494 commented Mar 19, 2020

Looks like the data in this repo aggregates cases reported in each German state, as opposed to the totals reported by the Robert Koch Institut (the entity charged by the federal health administration with reporting). To me, that doesn't explain the divergence, but I at least understand where JHU is coming from. I wonder, though, whether it's possible that an individual case may get reported to multiple states' health authorities if detected in a metropolitan area that extends across more than one state.

@jgehrcke

jgehrcke commented Mar 19, 2020

The JHU numbers are now widely reported back into Germany, where they cause confusion

@ms82494 not sure if you are German or English, and whether or not you can read my blog post. I want to confirm explicitly that as far as I have understood the situation the JHU data are actually the better ones. The RKI data is lagging behind by 1-3 days by now. The JHU data represent at every point in time the current count of the individual Gesundheitsaemter in Germany. These are official, good numbers, to my knowledge. ZEIT ONLINE is rather fast adding and publishing those (< 1 day), and I believe that's what JHU then uses. The RKI and WHO need 2-3 days to add and publish. That's ironic in an exponential growth phase, of course.

@JiPiBi

JiPiBi commented Mar 19, 2020

In contrast, for France, JHU often gives smaller values than the official ones.
On my side, I correct them in code with my own dictionary when the difference is significant, because otherwise the growth rates become meaningless.
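
A minimal sketch of what such a dictionary-based correction could look like. The `OVERRIDES` table, the function names, the data layout and every number below are hypothetical and only illustrate the idea; this is not the code actually used.

```python
from typing import Dict, Tuple

# Hypothetical manual corrections: (country, ISO date) -> corrected cumulative count.
OVERRIDES: Dict[Tuple[str, str], int] = {
    ("France", "2020-03-15"): 5400,  # illustrative value only, not real data
}


def apply_overrides(series, country, overrides=OVERRIDES):
    """Return a copy of {date: cumulative count} with manual corrections applied."""
    fixed = dict(series)
    for (name, day), value in overrides.items():
        if name == country and day in fixed:
            fixed[day] = value
    return fixed


def growth_rates(series):
    """Day-over-day growth factors for a {date: cumulative count} mapping."""
    days = sorted(series)
    return {
        today: series[today] / series[prev]
        for prev, today in zip(days, days[1:])
        if series[prev] > 0
    }
```

Keeping the corrections in a separate table leaves the raw series untouched, so each override stays visible and can be dropped once the upstream value is fixed.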

@johannesschobel

I have explained the discrepancy in German here: https://gehrcke.de/2020/03/deutschlands-covid-19-fallzahlen-des-rki-und-der-who-haben-inzwischen-2-3-tage-verzogerung/

Dear @jgehrcke,

this would, however, imply that the OFFICIAL GERMAN numbers are higher than the JHU numbers, right? However, it's the other way round: the JHU numbers are higher than the official numbers 🤔

@JiPiBi

JiPiBi commented Mar 19, 2020

@ms82494 I went to the RKI site, and the values you gave for 18/03 are announced as of 18/03 01:00 am, so they must correspond to the 17th, am I right?
So the difference still exists, but it is not 2,000, rather about 1,000 (which obviously remains too high).
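
The effect of such a one-day label offset can be checked with a small sketch. It uses the three data points from the table in the opening post; the `gap` helper and its `shift_days` parameter are made up for illustration, and whether the RKI label really is one day late is an assumption, not an established fact.

```python
from datetime import date, timedelta

# Figures from the table in the opening post.
jhu = {date(2020, 3, 16): 7272, date(2020, 3, 17): 9257, date(2020, 3, 18): 12327}
rki = {date(2020, 3, 16): 6012, date(2020, 3, 17): 7156, date(2020, 3, 18): 8198}


def gap(jhu_series, rki_series, shift_days=0):
    """JHU minus RKI per day, after moving each RKI value back by shift_days."""
    shifted = {d - timedelta(days=shift_days): v for d, v in rki_series.items()}
    return {d: jhu_series[d] - shifted[d] for d in sorted(jhu_series) if d in shifted}


same_day = gap(jhu, rki)          # compare values carrying the same date label
one_day_shift = gap(jhu, rki, 1)  # assume each RKI value describes the previous day

print(same_day[date(2020, 3, 17)])       # 2101
print(one_day_shift[date(2020, 3, 17)])  # 1059
```

Under the one-day assumption the gap for 17 March shrinks from 2,101 to 1,059, which matches the "about 1,000" estimate above.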

@JiPiBi

JiPiBi commented Mar 19, 2020

A complementary question:
on the RKI site, I read

Among these cases, 4,605 are male (56%) and 3,568 female (44%). The age range is from 0 to 96 years, including 67 children under the age of 5, 199 children aged 5 to 14 years, 6,557 persons aged 15 to 59 years and 1,337 persons 60 years and older (see Figure 2). The age of 38 notified cases is unknown. The median age is 47 years.

Does this low median age explain the very low death ratio and the very low absolute number of deaths in Germany?

@ms82494
Author

ms82494 commented Mar 19, 2020

@jgehrcke Yes, I speak German and I did read your blog post. This AM the German RKI numbers caught up a little, being only about 3k behind the JHU/state numbers. That supports your thesis that this is simply processing delay at RKI.

Your blog post actually captures the reason for my initial skepticism towards the "processing delay" explanation: Official state numbers are electronically submitted to the RKI. If the RKI only needs to tabulate and add these numbers and publish for the federal level, why does it take two days to accomplish that? And why did the numbers not just lag, but actually diverge?

I don't know whether there's more to the RKI processing pipeline and whether that would be justified. Maybe the RKI numbers will even be found to be more accurate in the end. But, as I acknowledged in my second post, for the purpose of justifying the JHU numbers it doesn't really matter. This repo is following the rules it has set out for itself, and I am not arguing (anymore) that any revisions need to be made here.

@ms82494
Author

ms82494 commented Mar 19, 2020

@JiPiBi I think a lot of the infections can be traced to German vacationers on ski trips to Northern Italy. Those would skew to a younger age. Also, German health authorities had both the testing capacity and the determination to run down the contacts of each person who tested positive. This probably led to infections being caught and treated a little earlier than in some other countries (looking at you, US).

@JiPiBi

JiPiBi commented Mar 19, 2020

@ms82494 In France too, every new case was tracked at the beginning, with numerous people acting like police squads and local confinement for villages (remember, France was the first European country to be affected, at the end of January). With the current numbers, however, testing is no longer used so widely, even if you are sick and confined at home.

About the treatment: apart from treating symptoms like fever, I don't think any country can do more. Some clinical trials are beginning, but ...

When I read people's stories, some older people feel tired for a few days, like with a flu but not much more, probably less, even those in good shape, and then suddenly they have breathing difficulties and die within a few hours (same stories in Italy).
Even countries like South Korea, which took the first important decisions, have higher values than Germany:

Country/Region
Italy 7.944519
Iraq 7.142857
Algeria 6.666667
San Marino 6.422018
Philippines 6.417112
Iran 6.110458
Spain 4.536942
China 3.984801
Japan 3.302961
Indonesia 2.906977
United Kingdom 2.857143
Netherlands 2.517564
France 2.268308
Poland 2.100840
Egypt 2.040816
US 1.681981
Greece 1.291990
Australia 1.106195
Canada 1.046025
Cruise Ship 1.005747
Switzerland 1.000000
Korea, South 0.973558
Belgium 0.804505
Sweden 0.588235
Denmark 0.390625
Germany 0.259263
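
The table above carries no column header. A plausible reading, and only an assumption here, is deaths per 100 confirmed cases. Under that assumption, a ranking like this could be produced as sketched below; the function names and the input counts are hypothetical.

```python
def death_ratio_percent(confirmed: int, deaths: int) -> float:
    """Deaths per 100 confirmed cases (the assumed meaning of the column)."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return 100.0 * deaths / confirmed


def rank_by_death_ratio(counts):
    """Sort {country: (confirmed, deaths)} by ratio, highest first."""
    ratios = {
        country: death_ratio_percent(confirmed, deaths)
        for country, (confirmed, deaths) in counts.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Hypothetical counts, for illustration only (not real data).
example = {"Country A": (1000, 50), "Country B": (2000, 10)}
ranking = rank_by_death_ratio(example)
```

Such a ratio depends heavily on how widely each country tests, so differences between countries do not by themselves say anything about treatment quality.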

Confinement limits the number of infected people, but the number of deaths is a mystery: in France it seems that 50% of people in ICU eventually die, and the others remain in ICU for 2-3 weeks.

So the challenge now is to open as many ICUs as we can while maintaining confinement to limit the need for ICU at the same time.

@jgehrcke

this would, however, imply that the OFFICIAL GERMAN numbers are higher than the JHU numbers, right? However, it's the other way round

@johannesschobel ouch, which statement of mine sounds like that? That's of course not at all what I wanted to say. I re-read my article and couldn't quite find the sentence that led to this impression. Would however love to understand. Finding the wording problems in your own words is of course always a challenge. Please let me know!

Btw, in the meantime I have published https://github.com/jgehrcke/covid-19-germany-gae -- please have a look! Feedback welcome.

@jgehrcke

jgehrcke commented Mar 19, 2020

@jgehrcke Yes, I speak German and I did read your blog post. T

Thanks for the feedback and your explanations @ms82494 -- love this discussion.

I don't know whether there's more to the RKI processing pipeline and whether that would be justified. Maybe the RKI numbers will even be found to be more accurate in the end. But, as I acknowledged in my second post, for the purpose of justifying the JHU numbers it doesn't really matter. This repo is following the rules it has set out for itself, and I am not arguing (anymore) that any revisions need to be made here.

Yes! Seems like we're on the same page. Especially about the fact that there are some mysteries left... what is the RKI doing with the data, and how important is their processing step? I would hate to copy premature data, but I also tend to believe that the manual processing across the desks of some classical office workers in a German Amt simply takes a while (read: my theory is that there is no magic to the delay, it's just old-school office work).

Edit: @ms82494 as a shameless plug I'd also like to ask you to please have (another?) look at https://github.com/jgehrcke/covid-19-germany-gae and to share it with interested parties if you can. I invested a little bit of love and plan to maintain this properly. Thanks!

@johannesschobel

Dear @jgehrcke ,

sorry, I may have misunderstood your statement.

In the table of the original poster, the JHU numbers were higher than the official numbers.
I thought that, because there is a delay of 1-3 days between RKI and JHU, the RKI numbers should be higher (because JHU gets the numbers later), right? Or maybe I got this wrong.

All the best

@TG9541

TG9541 commented Mar 21, 2020

Referring to #1165

@jgehrcke: thanks for your blog post!
The problem with the "RKI data lags behind" thesis is this:

  • RKI receives data from local authorities in a (presumably) accumulative and structured way
  • JHU CSSE scrapes the data from sources of (presumably) unknown structure

Right now, I don't know how to test if the JHU gets all the data or if it counts data twice because a source entity is already included in a different source. In a situation like this, the credibility of sources is critical, and as @ms82494 pointed out, news media are increasingly publishing hypotheses about the shortcomings of the infrastructure used by RKI (i.e. the official data acquisition pipeline).

I feel that alternative sources are detrimental to a coordinated response unless sources are transparent, differences between sources can be explained (e.g. in terms of lag), and error margins are known.

@coezbek

coezbek commented Mar 21, 2020

It doesn't make any sense to me that JHU CSSE has more up-to-date data than RKI, because, as stated on the JHU CSSE main page, JHU just uses the European data from https://www.ecdc.europa.eu/en/cases-2019-ncov-eueea

(which I guess they just download here: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide)

First: Why would there be any difference between the European numbers and the RKI?

Second: Why would JHU differ from the European numbers? The JHU CSSE Repository does not give me any indication that they do any sophisticated scraping from German states or at an even more local level (Gesundheitsämter der Kreise).

@jgehrcke

jgehrcke commented Mar 22, 2020

@TG9541 and @coezbek thanks for the discussion! You bring up interesting points, and the confusion sadly is way too big. Let me try to clarify (maybe again, maybe just in other words?) what I believe I have learned in the last few days. Bear with me for a moment :-). Thanks!

RKI receives data from local authorities in a (presumably) accumulative and structured way

yes

JHU CSSE scrapes the data from sources of (presumably) unknown structure

I don't know about that but I think the answer is "no".

Let's take JHU out of the picture. Let's explore where we think the best data for Germany comes from, in terms of credibility (quality data, official, and so on) and freshness, ok?

In Germany we have 16 states. They report their case data in an official, highly credible fashion. Not always via CSV or JSON files, but usually instead via HTML pages with high-quality tables, with annotated maps, with press announcements, sometimes even Excel sheets. I have linked some of these resources here. Take a look. The point is: everyone can look these data up at any point in time. And then sum the numbers up for the 16 states, yielding a credible case count (for that point in time) for all of Germany.
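
The summing step described above can be sketched as follows. This is a minimal illustration, assuming the per-state counts have already been collected by hand or by a scraper; the function name and the input format (a dict keyed by ISO 3166-2:DE state code) are hypothetical.

```python
# The 16 German states, by ISO 3166-2:DE code.
GERMAN_STATES = (
    "BW", "BY", "BE", "BB", "HB", "HH", "HE", "MV",
    "NI", "NW", "RP", "SL", "SN", "ST", "SH", "TH",
)


def national_total(state_counts):
    """Sum per-state case counts, refusing to return a total if a state is missing."""
    missing = sorted(set(GERMAN_STATES) - set(state_counts))
    if missing:
        raise ValueError(f"no count for: {', '.join(missing)}")
    unknown = sorted(set(state_counts) - set(GERMAN_STATES))
    if unknown:
        raise ValueError(f"unknown state code: {', '.join(unknown)}")
    return sum(state_counts[code] for code in GERMAN_STATES)
```

Rejecting incomplete or unexpected input is a deliberate choice in this sketch: a silently missing state would make the national total look lower than it is, and an unexpected key could hide double counting.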

I have tried to explain this in a concise, complete walk-through in this (German) blog post: https://gehrcke.de/2020/03/ard-zdf-covid-19-fallzahlen/

I have followed the situation and data flow in depth (well, as a non-insider) for more than a week now.

Major conclusions are:

  • The JHU data for Germany correspond pretty accurately to the current and official case counts published by the individual state ministries.
  • The RKI data are the same data (from the same state ministries), but published with a delay that is not constant, between 1 and 2 days. The RKI acknowledges that a delay exists.

The authorities on the data are the state ministries. Not so much the RKI, seemingly.

It does not look like the RKI (noticeably) changes data as part of their processing; they only "delay" it (probably because of slow double-checking, curating, and some kind of manual processing that we do not entirely understand). We have strong indicators that data quality assurance happens perfectly well on the state level, and that we can use state level data as-is.

In other words: as far as I can tell, and I looked very closely, the data published by the individual German states is fine, credible, and pretty recent, whereas the RKI data is not as recent.

For understanding the real German situation of yesterday we better look at the data published by individual German states today.

When we look at the RKI data of today we probably see the state of reality of roughly two or three days ago.

Another important conclusion that I've drawn is that the grunt work of constantly watching out for data updates published by the individual states is done, very well, by two newspapers: ZEIT ONLINE, and Berliner Morgenpost. Getting their data count at any time is a fair approach to get the unmodified state level data (they don't invent numbers, they simply and predictably use what the states publish, pretty immediately after state-level updates).

We have two noticeable transparency challenges: within the JHU, and within the RKI. We don't know where the JHU takes the data for Germany from (we have strong indicators, but it would be good to see someone from their side comment on it), and we don't know what the RKI is doing exactly when they delay data publishing a bit (and why they are doing it).

These two transparency problems are not so bad, at least as far as I am concerned, because I know of a data source that seems to be pretty credible and transparent to me: the individual states. That's as transparent as it gets. And two newspapers, doing the summing-up grunt work for me, when I am too lazy.

These assumptions/conclusions/insights are built into https://github.com/jgehrcke/covid-19-germany-gae.

I hope that helps to clarify some things. If one of my observations and conclusions above sounds fishy or wrong then I'd love to learn!

@coezbek

coezbek commented Mar 22, 2020

@jgehrcke Thank you! Your analysis makes complete sense to me.

I would like to summarize the questions:

  • Where does JHU CSSE take their data for Germany from?
  • Why does RKI delay publication of data for 2-3 days?

In the meantime I found this link on the RKI site that gives access to their statistics data (fine-grained by region, age, etc.). Unfortunately it is only refreshed weekly. So I guess we remain reliant on the interns at Zeit.de and Morgenpost manually entering data.

@TG9541

TG9541 commented Mar 22, 2020

@coezbek: the RKI has an updated explanation of the data collection method here.

@jgehrcke: the data collection process from local authorities, through federal states, and up to the RKI preserves time (Meldetag) on a case-by-case level. This means that the information can be used to draw conclusions in an epidemiologically sound way. The JHU CSSE data doesn't preserve case information. In the best case it can be used to get an informal overview. In the worst case it leads to attrition of trust, and that's the last thing we need. At least the federal state numbers should be listed independently by JHU so as to create a minimum of transparency.

@jgehrcke

jgehrcke commented Mar 22, 2020

@coezbek: the RKI has an updated explanation of the data collection method here.

I've followed updates to this page carefully. At any given time the RKI has had statements like the following (taken from the page now):

Data entry and data transmission cause a delay between the time a case becomes known and its publication by the RKI, so that the case numbers may deviate from those of other sources. (translated from German)

Acknowledging the delay, implying that other sources may be more recent (and not wrong).

the data collection process from local authorities, through federal states, and up to the RKI preserves time (Meldetag) on a case-by-case level.

@TG9541 yes and of course I really hope that's the case! We could always infer that from one of the info graphics in the RKI situation report PDF documents. They have this kind of data. But they don't seem to publish these data, not even implicitly in their case count table.

This gets much more tangible by looking at an example. One specific question I asked here is (translated, modified for this context)

"why does the RKI report ~16,600 cases for the end of the day of March 20, whereas the state ministries reported ~19,800 cases for the end of the day of March 20?"

I'd love for others to answer this question in detail :-) My working hypothesis is said delay.

It's great that the Meldetag is preserved case-by-case in the RKI-internal data processing flow, but the RKI does not seem to use the Meldetag when publishing the "sum-per-individual-day", because that's pretty clearly behind, by at least one day.
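
The difference between the two ways of counting can be illustrated with a small sketch. The case records below are hypothetical; each pairs a Meldetag with the day the case first showed up in a published total. Nothing here reflects the RKI's actual data model.

```python
from collections import Counter
from datetime import date
from itertools import accumulate

# Hypothetical case records: (Meldetag, day the case first appeared in a published total).
cases = [
    (date(2020, 3, 18), date(2020, 3, 19)),
    (date(2020, 3, 18), date(2020, 3, 20)),
    (date(2020, 3, 19), date(2020, 3, 20)),
    (date(2020, 3, 20), date(2020, 3, 22)),
]


def cumulative_by(records, index):
    """Cumulative case count per day, keyed by the date at position `index`."""
    per_day = Counter(record[index] for record in records)
    days = sorted(per_day)
    return dict(zip(days, accumulate(per_day[d] for d in days)))


by_meldetag = cumulative_by(cases, 0)     # counted on the day the case was reported
by_publication = cumulative_by(cases, 1)  # counted on the day the case was published
```

In this toy data the cumulative count for 20 March is 4 when keyed by Meldetag but only 3 when keyed by publication day, which is the kind of lag the working hypothesis describes.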

There also is this statement on the RKI page you linked:

Only cases for which laboratory confirmation is available (regardless of the clinical picture) are published. (translated from German)

Pretty certainly that's also true for the individual state reports. I sincerely do hope so.

@nikola

nikola commented Mar 24, 2020

The particular problem with the "inaccuracies" in the German datapoints as claimed by JHU is that it's JHU's numbers that seem to drive policies in Germany, not official RKI numbers.

I have read through all available FAQs and explanations provided by JHU's team, and so far it seems that JHU's project is maintained by three key people: Lauren Gardner, Ensheng Dong, and Hongru Du. I see no indications that any of them speaks German, or that any native German speaker is involved in collecting datapoints for Germany. (Disclosure: I speak native German, English and other languages.)

This makes the claims that, somehow, the team or person behind the datapoints for Germany developed a collection of highly sophisticated and versatile scrapers for German websites dubious, to put it mildly. I have read other theories where one person claimed that, indeed, JHU's developers simply scrape the official German numbers from the ECDC and then extrapolate from a simplistic model such as "cases double every 4 days". It'd be very easy for JHU to refute such claims by publishing, at a minimum, the sources of their scraped data.

And I'm not even singling out JHU here, as Zeit Online has the same credibility challenge. They have quite a large GitHub presence, and yet there's no transparency at all about where and how their data was collected. How hard would it be to set up a simple repository with a single CSV file listing the case sources?

I find it quite ironic that, rightly so, politicians and newspapers have spent the better part of the last 4 years decrying the manufacturing of unsubstantiated sources to drive a false narrative, and setting up codes and policies that demand at a minimum the disclosure of publicly verifiable sources, and yet, here we are, with billions of clicks being generated every day on websites that do not reveal where and how their data was aggregated. Why the secrecy?

@jgehrcke

@nikola sorry for replying so late and sporadically. Thanks for your lovely contribution to this discussion. I'd like to reply to this statement for now:

How hard would it be to set up a simple repository with a single CSV file listing the case sources?

Right, agree. The official institutions should provide this data set, fresh data from every Landkreis, providing a timeline for every case (at least providing the actual day of taking the sample that was then later physically tested).

They don't, however. We now have two very recent initiatives that try to get a little closer:

@jgehrcke

jgehrcke commented Mar 24, 2020

and yet, here we are, with billions of clicks being generated every day on websites that do not reveal where and how their data was aggregated. Why the secrecy?

I also hate that. I really do. At least the media and politicians in Germany should stop citing JHU, even referring to them. They could just as well cite Landesbehörden, if they don't want to cite RKI.

@nikola

nikola commented Mar 24, 2020

@jgehrcke I'm a Python coder myself, so more than happy to help out with the scrapers in the repo you mentioned.

@coezbek

coezbek commented Mar 24, 2020

@nikola

here we are, with billions of clicks being generated every day on websites that do not reveal where and how their data was aggregated. Why the secrecy?

That is not secrecy, that is just sloppiness. RKI's data just isn't any fresher, and their internal processes do not keep up. Lawmakers didn't put in any rules to make it any faster (reporting could go directly from counties to RKI) because they did not imagine you would need it any faster. As strange as it seems: the pandemic did not cause everyone to work weekends.

If you want to see citizens in action as a crowd-sourced juggernaut, then you can get your data a day faster than JHU CSSE:

https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/edit?pli=1#gid=0

@jgehrcke

@coezbek very cool. Where can I find background information about the project behind this Google Sheet?

@coezbek

coezbek commented Mar 24, 2020

@coezbek very cool. Where can I find background information about the project behind this Google Sheet?

https://twitter.com/risklayer - RiskLayer together with KIT's Center for Disaster Management and Risk Reduction Technology (CEDIM)

@nikola

nikola commented Mar 24, 2020

@coezbek the approach over there is prone both to false positives and false negatives. It uses the term "Coronavirus" to aggregate cases of SARS-CoV-2 infections and symptomatic COVID-19. Asymptomatic cases of SARS-CoV-2 are not challenging our societies - symptomatic cases of COVID-19 are. Hospitals are not at the edge due to a rising number of positive SARS-CoV-2 test results, they are at the edge due to a rising number of severe cases of COVID-19, which are caused by SARS-CoV-2 and which require intensive care. A carrier of SARS-CoV-2 does not require intensive care if COVID-19 hasn't developed, hence most infected people being quarantined at home instead of sent to the hospital. Given that we can't quantify the moving target of SARS-CoV-2 infections we can't estimate how many carriers of SARS-CoV-2 will develop COVID-19.

A table that lumps both together and spits out primitive numbers is not helpful in my humble opinion. We all know that plenty of organisations will use that table's numbers and put the title "Confirmed cases of COVID-19" over the result.

@entron

entron commented Mar 25, 2020

@nikola

here we are, with billions of clicks being generated every day on websites that do not reveal where and how their data was aggregated. Why the secrecy?

That is not secrecy, that is just sloppiness. RKI's data just isn't any fresher, and their internal processes do not keep up. Lawmakers didn't put in any rules to make it any faster (reporting could go directly from counties to RKI) because they did not imagine you would need it any faster. As strange as it seems: the pandemic did not cause everyone to work weekends.

If you want to see citizens in action as a crowd-sourced juggernaut, then you can get your data a day faster than JHU CSSE:

https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/edit?pli=1#gid=0

Thanks for sharing. Is it possible to get historical data for each city from there?

@asmaier

asmaier commented Mar 26, 2020

It seems to me the most reliable data is still from Berliner Morgenpost. At least they store a source url and a time stamp for each data point: #826 (comment)
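
Storing provenance next to each number could look roughly like the sketch below. The `DataPoint` fields, the `to_csv` helper and the example URL are hypothetical; this is not the schema Berliner Morgenpost actually uses.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class DataPoint:
    region: str
    cases: int
    source_url: str    # where the number was read
    retrieved_at: str  # ISO 8601 timestamp of the lookup


def to_csv(points):
    """Serialize data points to CSV text, one row per point, with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[field.name for field in fields(DataPoint)],
        lineterminator="\n",
    )
    writer.writeheader()
    for point in points:
        writer.writerow(asdict(point))
    return buffer.getvalue()
```

With a source URL and a timestamp on every row, anyone can re-check an individual number against the page it came from.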

@stefan123t

I see multiple sources published here:

  1. local health authorities in each county (Gesundheitsämter)
  2. state level health authorities reporting aggregated numbers of counties
  3. CDC reported numbers by RKI
    As far as the newly introduced digital submission of case data from health authorities to RKI is concerned, they seem to back-propagate data based on the time the SARS-CoV-2 or COVID-19 case was reported / qualified. Given that they seem to update figures even more than seven days in the past, they probably use the corrected figures obtained by the local Gesundheitsamt based on interviews.
  4. figures published by EuroStat.
    I do not know if they are corrected for the past. I only checked one instance, reported as sixteen days ago with 6012 reported cases, which reportedly matches RKI figures for 17/03/2020.
    A) The figures from Morgenpost are obviously scraped from the state level Gesundheitsamt / ministry and are therefore a bit newer than the RKI data released the next day.
    B) The figures scraped from district Gesundheitsämter, e.g. by Jan-Philip Gehrcke's API or by coezbek's and Risklayer's crowdsourcing approach with a Google sheet, might be even one day earlier.
    Please bear in mind that neither A nor B has been validated by the RKI's additional automation, nor have they been back-propagated to the original reporting date as RKI supposedly does.
    Therefore they represent a raw number of newly positive tested cases of SARS-CoV-2 with/without serious symptoms and/or reports of deceased persons who supposedly died of COVID-19 symptoms or with a positive SARS-CoV-2 test.
    Because of the reporting delay, and because the numbers on levels 1 and 2 are not propagated back to earlier dates, the numbers resulting from approaches A and B must be higher than those corrected by report date by RKI some 1-2 days later.
    Regarding the distinction, mentioned by Nikola, between SARS-CoV-2 infections tested positive and COVID-19 cases with ICU (Intensive Care Unit) treatment: this is true, but the numbers are not reported by every Gesundheitsamt on district or state level, nor by the RKI as of now. The same is true for recovered persons, though figures from China suggest that recoveries lag some 21-28 days behind the new cases, whereas the deceased cases from ICU unfortunately lag behind by only some 14-21 days.
    Maybe it is good to note that Morgenpost already added a recovered column to their data.
