Split cases collection into collections by sourceId #2553

Open
abhidg opened this issue Feb 25, 2022 · 3 comments
Labels
RFC Request for comments / enhancement proposal

Comments

@abhidg
Contributor

abhidg commented Feb 25, 2022

Is your feature request related to a problem? Please describe.
Currently, all data lives in a single cases collection. This causes problems for operations such as prune, which must take a write lock on the collection and update millions of cases by setting flags. Operations such as export also slow down as the collection grows.

Describe the solution you'd like
Split the cases collection by sourceId. Parallel ingestions should then be faster, since no two simultaneous ingestions would operate on the same collection. Operations such as export also become simpler, especially if we change the export unit from country to source. Prune need not be a separate, time-consuming operation: for a single source, we can renameCollection() the current collection to collection-old and replace it with the staging collection. As renameCollection() does not copy data when the rename stays within one database (it only changes metadata), this should be much faster. The benefit is that ingestion-related housekeeping (export, prune) can run as part of the ingestion process, or at the database level via triggers on collections, as suggested by @jim-sheldon. Smaller collections will also make these database operations faster.
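
To make the rename-based prune concrete, here is a minimal sketch of the swap for one source. It assumes `db` is a pymongo `Database` (or anything exposing `list_collection_names()` and `db[name].rename()`); the collection naming scheme (`cases_<sourceId>`, `_old`, `_staging`) and the function names are hypothetical and not part of the current codebase.

```python
def collection_names(source_id: str) -> dict:
    """Hypothetical naming scheme for the per-source collections."""
    base = f"cases_{source_id}"
    return {"live": base, "old": f"{base}_old", "staging": f"{base}_staging"}


def promote_staging(db, source_id: str) -> None:
    """Replace the live collection for one source with its staging collection.

    `db` is assumed to behave like a pymongo Database.
    """
    names = collection_names(source_id)
    existing = set(db.list_collection_names())

    if names["staging"] not in existing:
        raise ValueError(f"no staging collection for source {source_id}")

    if names["live"] in existing:
        # Keep the previous data as a backup; dropTarget=True overwrites
        # any backup left over from an earlier ingestion.
        db[names["live"]].rename(names["old"], dropTarget=True)

    # Promote the freshly ingested data to be the live collection.
    db[names["staging"]].rename(names["live"])
```

Note that the two renames are separate operations, so there is a short window in which the live collection does not exist; readers would need to tolerate that, or the swap would need to be serialized with queries in some other way.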

Describe alternatives you've considered
Keep the status quo. It mostly works, though we would expect scaling issues to get worse if we get 2-3x the current number of cases (~100m).

@abhidg abhidg added the RFC Request for comments / enhancement proposal label Feb 25, 2022
@iamleeg
Contributor

iamleeg commented Feb 28, 2022

It's important to consider the trade-offs for searching. For example, if I want all Brazil data, or all cases with a particular symptom, the search will span multiple collections. What is the performance impact, both in time and in memory, wherever we end up merging the results?

@abhidg
Contributor Author

abhidg commented Feb 28, 2022

@iamleeg Most of the performance implications would come from sorting in the UI and/or in the API, as the rest is parallelizable (assuming MongoDB allows parallel reads across collections).
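
One possible way to limit the sorting and memory cost of merging is a lazy k-way merge. The sketch below assumes each per-source query already returns its results sorted on the same field (for example via a server-side sort), so the API only has to interleave them. The function name `merged_page` and the field name `confirmationDate` are illustrative assumptions, not existing code.

```python
import heapq
from itertools import islice
from operator import itemgetter


def merged_page(sorted_results, sort_field, limit):
    """Return the first `limit` documents across several sorted iterables.

    Each element of `sorted_results` must already be sorted ascending by
    `sort_field` (e.g. one cursor per source collection). heapq.merge
    pulls documents lazily, holding only one pending document per input.
    """
    merged = heapq.merge(*sorted_results, key=itemgetter(sort_field))
    return list(islice(merged, limit))


# Usage with plain lists standing in for per-collection cursors.
source_a = [
    {"id": 1, "confirmationDate": "2022-01-01"},
    {"id": 3, "confirmationDate": "2022-01-05"},
]
source_b = [
    {"id": 2, "confirmationDate": "2022-01-03"},
    {"id": 4, "confirmationDate": "2022-01-07"},
]
first_page = merged_page([source_a, source_b], "confirmationDate", 3)
```

With this approach, memory on the merging side grows with the number of collections queried and the page size rather than with the total number of matching cases.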

@jim-sheldon
Collaborator

We could also move toward using a read database, a separate write database, and a replication script.
