case numbers for Germany are diverging further from official numbers #1008
Comments
Good comments here! I wouldn't ignore though, but look at how it can be fixed... |
Looks like the data in this repo aggregates cases reported in each German state, as opposed to the totals reported by the Robert Koch Institut (the entity charged by the federal health administration with reporting). To me, that doesn't explain the divergence, but I at least understand where JHU is coming from. I wonder, though, whether it's possible that an individual case may get reported to multiple states' health authorities if detected in a metropolitan area that extends across more than one state. |
I have explained the discrepancy in German here: https://gehrcke.de/2020/03/deutschlands-covid-19-fallzahlen-des-rki-und-der-who-haben-inzwischen-2-3-tage-verzogerung/ |
@ms82494 not sure if you are German or English, and whether or not you can read my blog post. I want to confirm explicitly that as far as I have understood the situation the JHU data are actually the better ones. The RKI data is lagging behind by 1-3 days by now. The JHU data represent at every point in time the current count of the individual Gesundheitsaemter in Germany. These are official, good numbers, to my knowledge. ZEIT ONLINE is rather fast adding and publishing those (< 1 day), and I believe that's what JHU then uses. The RKI and WHO need 2-3 days to add and publish. That's ironic in an exponential growth phase, of course. |
On the contrary, for France, JHU often gives smaller values than official ones. |
Dear @jgehrcke, this would, however, imply, that the OFFICIAL GERMAN numbers are higher than the JHU numbers, right? However, it's the other way round.. The JHU numbers are higher than the official numbers 🤔 |
@ms82494 I went to the RKI site, and the values you gave for 18/03 are announced as of 18/03 01:00 am, so they must correspond to the 17th, am I right? |
Complementary question:
Please, does this low median age explain the very low death ratio and the very low absolute number of deaths in Germany? |
@jgehrcke Yes, I speak German and I did read your blog post. This AM the German RKI numbers caught up a little, being only about 3k behind the JHU/state numbers. That supports your thesis that this is simply processing delay at RKI. Your blog post actually captures the reason for my initial skepticism towards the "processing delay" explanation: Official state numbers are electronically submitted to the RKI. If the RKI only needs to tabulate and add these numbers and publish for the federal level, why does it take two days to accomplish that? And why did the numbers not just lag, but actually diverge? I don't know whether there's more to the RKI processing pipeline and whether that would be justified. Maybe the RKI numbers will even be found to be more accurate in the end. But, as I acknowledged in my second post, for the purpose of justifying the JHU numbers it doesn't really matter. This repo is following the rules it has set out for itself, and I am not arguing (anymore) that any revisions need to be made here. |
@JiPiBi I think a lot of the infections can be traced to German vacationers on ski trips to Northern Italy. Those would skew to a younger age. Also, German health authorities had both testing capacity and the determination to run down the contacts of each person tested positive. This probably led to infections being caught and treated a little earlier than in some other countries (looking at you, US). |
@ms82494 In France too, every new case was tracked at the beginning, with numerous people acting like police squads and local confinement for villages (remember, France was the first European country to be affected, at the end of January). But with the current numbers, testing is no longer used so widely, even if you are sick and confined at home. About the treatment: apart from treating symptoms like fever, I don't think any country can do more. Some clinical tests are beginning, but ... When I read people's stories, some older people are tired for a few days, like with a flu, but not much more, probably less, even in good shape, and suddenly they have breathing difficulties and within a few hours they die (same stories in Italy). Country or regional confinement limits the number of infected people, but the number of deaths is a mystery: it seems that in France 50% of people in ICU eventually die ... and the others remain in ICU for 2-3 weeks. So the challenge now is to open as many ICU beds as we can and maintain confinement to limit the need for ICU at the same time. |
@johannesschobel ouch, which statement of mine sounds like that? That's of course not at all what I wanted to say. I re-read my article and couldn't quite find the sentence that led to this impression. Would however love to understand. Finding the wording problems in your own words is of course always a challenge. Please let me know! Btw, in the meantime I have published https://github.com/jgehrcke/covid-19-germany-gae -- please have a look! Feedback welcome. |
Thanks for the feedback and your explanations @ms82494 -- love this discussion.
Yes! Seems like we're on the same page. Especially about the fact that there are some mysteries left... what is the RKI doing with the data, and how important is their processing step? I would hate to copy premature data, but I also tend to believe that the manual processing through some desks of some classical office workers in a German Amt simply takes a while (read: my theory is that there is no magic to the delay, it's just old-school office work). Edit: @ms82494 as a shameless plug I'd also like to ask you to please have (another?) look at https://github.com/jgehrcke/covid-19-germany-gae and to share it with interested parties if you can. I invested a little bit of love and plan to maintain this properly. Thanks! |
Dear @jgehrcke, sorry, I may have misunderstood your statement. In the table of the original poster the JHU numbers were higher than the official numbers. All the best |
Referring to #1165 @jgehrcke: thanks for your blog post!
Right now, I don't know how to test if the JHU gets all the data or if it counts data twice because a source entity is already included in a different source. In a situation like this, the credibility of sources is critical, and as @ms82494 pointed out, news media are increasingly publishing hypotheses about the shortcomings of the infrastructure used by RKI (i.e. the official data acquisition pipeline). I feel that alternative sources are detrimental to a coordinated response unless sources are transparent, differences between sources can be explained (e.g. in terms of lag), and error margins are known. |
It doesn't make any sense to me that JHU CSSE has more up-to-date data than RKI, because as stated on the JHU CSSE main page, JHU just uses the European data from https://www.ecdc.europa.eu/en/cases-2019-ncov-eueea (which I guess they just download here: https://www.ecdc.europa.eu/en/publications-data/download-todays-data-geographic-distribution-covid-19-cases-worldwide). First: Why would there be any difference between the European numbers and the RKI? Second: Why would JHU differ from the European numbers? The JHU CSSE Repository does not give me any indication that they do any sophisticated scraping from German states or at an even more local level (Gesundheitsämter der Kreise). |
@TG9541 and @coezbek thanks for the discussion! You bring up interesting points, and the confusion sadly is way too big. Let me try to clarify (maybe again, maybe just in other words?) what I believe I have learned in the last days. Bear with me, for a moment :-). Thanks!
yes
I don't know about that but I think the answer is "no". Let's take JHU out of the picture. Let's explore where we think the best data for Germany comes from, in terms of credibility (quality data, official, and so on) and freshness, ok? In Germany we have 16 states. They report their case data in an official, highly credible fashion. Not always via CSV or JSON files, but usually instead via HTML pages with high-quality tables, with annotated maps, with press announcements, sometimes even Excel sheets. I have linked some of these resources here. Take a look. The point is: everyone can look these data up at any point in time. And then sum the numbers up for the 16 states, yielding a credible case count (for that point in time) for all of Germany. I have tried to explain this in a concise, complete walk-through in this (German) blog post: https://gehrcke.de/2020/03/ard-zdf-covid-19-fallzahlen/ I have followed the situation and data flow in depth (well, as a non-insider) for more than a week now. Major conclusions are:
The authorities on the data are the state ministries. Not so much the RKI, seemingly.
It does not look like the RKI (noticeably) changes data as part of their processing; they only "delay" it (which is probably due to slow double-checking, curating, and some kind of manual processing that we do not entirely understand).
We have strong indicators that data quality assurance happens perfectly well on the state level, and that we can use state level data as-is. In other words: as far as I can tell, and I looked very closely, the data published by the individual German states is fine, credible, and pretty recent, whereas the RKI data is not as recent. For understanding the real German situation of yesterday we had better look at the data published by individual German states today. When we look at the RKI data of today we probably see the state of reality of roughly two or three days ago.
Another important conclusion that I've drawn is that the grunt work of constantly watching out for data updates published by the individual states is done, very well, by two newspapers: ZEIT ONLINE, and Berliner Morgenpost. Getting their data count at any time is a fair approach to get the unmodified state level data (they don't invent numbers, they simply and predictably use what the states publish, pretty immediately after state-level updates).
We have two noticeable transparency challenges: within the JHU, and within the RKI. We don't know where the JHU takes the data for Germany from (we have strong indicators, but it would be good to see someone from their side comment on it), and we don't know what the RKI is doing exactly when they delay data publishing a bit (and why they are doing it). These two transparency problems are not so bad, at least as far as I am concerned, because I know of a data source that seems to be pretty credible and transparent to me: the individual states. That's as transparent as it gets. And two newspapers, doing the summing-up grunt work for me, when I am too lazy.
These assumptions/conclusions/insights are built into https://github.com/jgehrcke/covid-19-germany-gae. I hope that helps to clarify some things. If one of my observations and conclusions above sounds fishy or wrong then I'd love to learn! |
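The summing-up step described in the comment above can be sketched in a few lines of Python. This is a minimal illustration, not code from covid-19-germany-gae: `sum_state_reports` is a hypothetical helper, and the counts and timestamps are made-up placeholders rather than real case numbers.

```python
from datetime import datetime, timezone

# Placeholder values for illustration only (NOT real case counts).
# A real run would carry one entry per German state, 16 in total.
state_reports = {
    "Bayern": {"cases": 300, "updated": datetime(2020, 3, 20, 14, 0, tzinfo=timezone.utc)},
    "Berlin": {"cases": 100, "updated": datetime(2020, 3, 20, 9, 30, tzinfo=timezone.utc)},
    "Hamburg": {"cases": 50, "updated": datetime(2020, 3, 19, 18, 0, tzinfo=timezone.utc)},
}


def sum_state_reports(reports, expected_states=16):
    """Sum state-level counts and keep track of how fresh the total is."""
    return {
        "total": sum(r["cases"] for r in reports.values()),
        # The national total is only as recent as its stalest input.
        "oldest_update": min(r["updated"] for r in reports.values()),
        "missing_states": expected_states - len(reports),
    }


summary = sum_state_reports(state_reports)
print(summary["total"], summary["oldest_update"].isoformat(), summary["missing_states"])
```

Reporting the oldest update time and the number of missing states alongside the total makes explicit that a sum across states mixes snapshots taken at different times.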
@jgehrcke Thank you! Your analysis makes complete sense to me. I would like to summarize the questions:
In the meantime I found this link on the RKI site that gives access to their statistics data (fine-grained by region, age, etc.). Unfortunately it is only refreshed weekly. So I guess we remain reliant on the interns at Zeit.de and Morgenpost for manually entering data. |
@coezbek: the RKI has an updated explanation of the data collection method here. @jgehrcke: the data collection process from local authorities, through federal states, and up to the RKI preserves time (Meldetag) on a case-by-case level. This means that the information can be used to draw conclusions in an epidemiologically sound way. The JHU CSSE data doesn't preserve case information. In the best case it can be used to get an informal overview. In the worst case it leads to attrition of trust, and that's the last thing we need. At least the federal state numbers should be listed independently by JHU so as to create a minimum of transparency. |
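To illustrate why a per-case Meldetag matters, here is a small Python sketch that counts the same hypothetical cases once by reporting day and once by publication day. The records and field names (`meldetag`, `published`) are invented for illustration and do not reflect the RKI's actual data format.

```python
from collections import Counter
from datetime import date

# Hypothetical case-level records; both dates are invented for illustration.
cases = [
    {"meldetag": date(2020, 3, 18), "published": date(2020, 3, 20)},
    {"meldetag": date(2020, 3, 18), "published": date(2020, 3, 21)},
    {"meldetag": date(2020, 3, 19), "published": date(2020, 3, 21)},
    {"meldetag": date(2020, 3, 20), "published": date(2020, 3, 22)},
]


def cumulative_by(records, key):
    """Cumulative case count per day, grouped by the given date field."""
    per_day = Counter(r[key] for r in records)
    running, series = 0, {}
    for day in sorted(per_day):
        running += per_day[day]
        series[day] = running
    return series


by_meldetag = cumulative_by(cases, "meldetag")
by_published = cumulative_by(cases, "published")
for day in sorted(set(by_meldetag) | set(by_published)):
    # A day with no entry in one of the series prints as None.
    print(day, by_meldetag.get(day), by_published.get(day))
```

With case-level dates, the curve can be re-aligned to the reporting day after the fact. A feed that only stores daily totals cannot be corrected this way.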
I've followed updates to this page carefully. At any given time the RKI has had statements like the following (taken from the page now):
Acknowledging the delay, implying that other sources may be more recent (and not wrong).
@TG9541 yes and of course I really hope that's the case! We could always infer that from one of the info graphics in the RKI situation report PDF documents. They have this kind of data. But they don't seem to publish these data, not even implicitly in their case count table. This gets much more tangible by looking at an example. One specific question I asked here is (translated, modified for this context) "why does the RKI report ~16,600 cases for the end of the day of March 20, whereas the state ministries reported ~19,800 cases for the end of the day of March 20?" I'd love for others to answer this question in detail :-) My working hypothesis is said delay. It's great that the Meldetag is preserved case-by-case in the RKI-internal data processing flow, but the RKI does not seem to use the Meldetag when publishing the "sum-per-individual-day", because that's pretty clearly behind, by at least one day. There also is this statement on the RKI page you linked:
Pretty certainly that's also true for the individual state reports. I sincerely do hope so. |
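The two March 20 figures quoted above give a feel for how much delay would be needed to explain the gap. The Python sketch below assumes pure exponential growth and a few candidate doubling times; both are simplifying assumptions of this example, not something stated by the RKI or the state ministries, and `implied_lag_days` is a hypothetical helper.

```python
import math


def implied_lag_days(fresh_count, lagging_count, doubling_time_days):
    """Delay (in days) that would explain the gap under pure exponential growth.

    If cases grow as N(t) = N0 * 2**(t / T), a source lagging by d days
    reports N(t - d), so d = T * log2(N(t) / N(t - d)).
    """
    return doubling_time_days * math.log2(fresh_count / lagging_count)


states_total = 19_800  # approximate state-ministry sum quoted in the thread
rki_total = 16_600     # approximate RKI figure quoted in the thread

for doubling_time in (3.0, 4.0, 5.0):  # assumed values, for illustration
    lag = implied_lag_days(states_total, rki_total, doubling_time)
    print(f"doubling time {doubling_time} d -> implied lag {lag:.2f} d")
```

Under these assumptions, a doubling time of about four days puts the implied lag at roughly one day, and shorter doubling times imply an even shorter lag. The estimate is only as good as the exponential-growth assumption and the two rounded inputs.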
The particular problem with the "inaccuracies" in the German datapoints as claimed by JHU is that it's JHU's numbers that seem to drive policies in Germany, not official RKI numbers. I have read through all available FAQs and explanations provided by JHU's team, and so far it seems that JHU's project is maintained by three key people: Lauren Gardner, Ensheng Dong, and Hongru Du. I see no indications that any of them speaks German, or that any native German speaker is involved in collecting datapoints for Germany. (Disclosure: I speak native German, English and other languages.) This makes the claims that, somehow, the team or person behind the datapoints for Germany developed a collection of highly sophisticated and versatile scrapers for German websites dubious, to put it mildly. I have read other theories where one person claimed that, indeed, JHU's developers simply scrape the official German numbers from the ECDC and then extrapolate from a simplistic model such as "cases double every 4 days". It'd be very easy for JHU to refute such claims by publishing, at a minimum, the sources of their scraped data. And I'm not even singling out JHU here, as Zeit Online has the same credibility challenge. They have quite a large GitHub presence, and yet there's no transparency at all about where and how their data was collected. How hard would it be to set up a simple repository with a single CSV file listing the case sources? I find it quite ironic that, rightly so, politicians and newspapers have spent the better part of the last 4 years decrying the manufacturing of unsubstantiated sources to drive a false narrative, and setting up codes and policies that demand at a minimum the disclosure of publicly verifiable sources, and yet, here we are, with billions of clicks being generated every day on websites that do not reveal where and how their data was aggregated. Why the secrecy? |
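As a rough idea of what such a single CSV file of case sources could look like, here is a Python sketch using the standard `csv` module. The column names, the `example.org` URL, and the row values are hypothetical placeholders; nothing here describes a file that JHU or Zeit Online actually publish.

```python
import csv
import io

# Hypothetical column layout: one row per data point, with its provenance.
FIELDS = ["region", "cases", "source_url", "retrieved_at_utc"]

# Placeholder row for illustration only (not a real count or a real source).
rows = [
    {
        "region": "Berlin",
        "cases": 100,
        "source_url": "https://example.org/berlin-press-release",
        "retrieved_at_utc": "2020-03-20T09:30:00Z",
    },
]


def to_csv(records):
    """Serialize provenance records to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv(rows)
print(csv_text)
```

Keeping a source URL and a retrieval timestamp next to every count is what lets a third party re-check an individual data point.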
@nikola sorry for replying so late and sporadically. Thanks for your lovely contribution to this discussion. I'd like to reply to this statement for now:
Right, agree. The official institutions should provide this data set, fresh data from every Landkreis, providing a timeline for every case (at least providing the actual day of taking the sample that was then later physically tested). They don't, however. We now have two very recent initiatives that try to get a little closer:
|
I also hate that. I really do. At least the media and politicians in Germany should stop citing JHU, even referring to them. They could just as well cite Landesbehörden, if they don't want to cite RKI. |
@jgehrcke I'm a Python coder myself, so more than happy to help out with the scrapers in the repo you mentioned. |
That is not secrecy, that is just sloppiness. RKI's data just isn't any fresher and their internal processes do not keep up. Lawmakers didn't put in any rules to make it any faster (reporting could go directly from counties to RKI) because they did not imagine you would need it any faster. As strange as it seems: the pandemic did not cause everyone to work weekends. If you want to see citizens in action as a crowd-sourced juggernaut, then you can get your data a day faster than JHU CSSE: https://docs.google.com/spreadsheets/d/1wg-s4_Lz2Stil6spQEYFdZaBEp8nWW26gVyfHqvcl8s/edit?pli=1#gid=0 |
@coezbek very cool. Where can I find background information about the project behind this Google Sheet? |
https://twitter.com/risklayer - RiskLayer together with KIT's Center for Disaster Management and Risk Reduction Technology (CEDIM) |
@coezbek the approach over there is prone both to false positives and false negatives. It uses the term "Coronavirus" to aggregate cases of SARS-CoV-2 infections and symptomatic COVID-19. Asymptomatic cases of SARS-CoV-2 are not challenging our societies - symptomatic cases of COVID-19 are. Hospitals are not at the edge due to a rising number of positive SARS-CoV-2 test results, they are at the edge due to a rising number of severe cases of COVID-19, which are caused by SARS-CoV-2 and which require intensive care. A carrier of SARS-CoV-2 does not require intensive care if COVID-19 hasn't developed, hence most infected people being quarantined at home instead of sent to the hospital. Given that we can't quantify the moving target of SARS-CoV-2 infections we can't estimate how many carriers of SARS-CoV-2 will develop COVID-19. A table that lumps both together and spills out primitive numbers is not helpful in my humble opinion. We all know that plenty of organisations will use that table's numbers and put the title "Confirmed cases of COVID-19" over the result. |
Thanks for sharing. Is it possible to get historical data for each city from there? |
It seems to me the most reliable data is still from Berliner Morgenpost. At least they store a source url and a time stamp for each data point: #826 (comment) |
I see multiple sources published here:
|
This issue is very substantial and getting bigger. Let me illustrate:
This has been reported in this thread: #826, complete with an explanation of the likely source of the error. Unfortunately, the maintainers don't seem to have taken note of issue #826, but we are getting further and further away from the official numbers, and it's getting implausible to ascribe this to timing issues (today's official numbers are by more than 1,000 below YESTERDAY's JHU numbers), or presumptive cases (how could there be 4,000 or more presumably positive tests piling up in the processing queue?)
The JHU numbers are now widely reported back into Germany, where they cause confusion. Here's a German news article that discusses the discrepancy between JHU and official numbers. Note that the article vaguely attributes the difference to JHU having "real-time" data and/or knowledge of (or modeling somehow) so-far undetected cases that likely exist in the population. I don't think either of these theories holds any water.
Maybe I am missing something, in which case I'd welcome an explanation. Otherwise, I think it would be best to ignore the numbers reported on German COVID cases in this repo.