This repository has been archived by the owner on Dec 10, 2018. It is now read-only.

CosmosDB / DocDB storage support #30

Open
clehene opened this issue May 19, 2017 · 6 comments

Comments


clehene commented May 19, 2017

Creating this as a child of #8
@praveenbarli has something working.

Some things to consider and ideally discuss before an implementation.
Zipkin backend model and queries. There's no formal specification but @adriancole has mentioned

As CosmosDB has multiple APIs (key-value, document, and graph), it would be interesting to know which makes the most sense for this backend, and to have a discussion about the model. Ideally we'd make sure it's cost-efficient at good performance.
See pricing https://azure.microsoft.com/en-us/pricing/details/cosmos-db/

  • $0.25/GB
  • $0.008 per 100 RU/s (minimum 400 RU/s)
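For a rough sense of scale, the list prices above can be turned into a monthly estimate. This is only a sketch: the billing interval for the $0.008-per-100-RU/s rate is assumed to be hourly (how Cosmos DB throughput was billed at the time), so verify against the pricing page before relying on the numbers.

```python
# Rough Cosmos DB monthly cost from the list prices above.
# Assumption: the $0.008 rate is per hour per 100 RU/s provisioned;
# confirm on the linked pricing page before using these figures.
STORAGE_PER_GB_MONTH = 0.25    # $/GB/month
RU_RATE_PER_100_HOUR = 0.008   # $/hour per 100 RU/s
HOURS_PER_MONTH = 730          # ~365 * 24 / 12

def monthly_cost(gb_stored: float, provisioned_ru_s: float) -> float:
    """Estimate monthly USD cost for storage plus provisioned throughput."""
    ru_s = max(provisioned_ru_s, 400)  # 400 RU/s minimum per the pricing note
    storage = gb_stored * STORAGE_PER_GB_MONTH
    throughput = (ru_s / 100) * RU_RATE_PER_100_HOUR * HOURS_PER_MONTH
    return storage + throughput

# Example: 100 GB of spans with 10,000 RU/s provisioned.
print(round(monthly_cost(100, 10_000), 2))  # 609.0
```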

Perhaps data retention could also be discussed? What's typically used?

Note that if Event Hubs is used as a queue before data lands in storage, that will impose some limits on the throughput (EH is limited to 20K rps).
Since Cosmos DB can handle way more than that, perhaps it would make sense to be able to push directly without going through EH?


aliostad commented Jun 20, 2017

Hi, sorry, I've been away. Would you mind sharing where you saw 20K RPS for Event Hubs? AFAIK Microsoft never defined an official limit, although I have heard stories of issues at scale.

Update

Huh, found it - small print in here https://azure.microsoft.com/en-gb/pricing/details/event-hubs/

When we first used EH there was no such thing.

Update 2

Contacted Azure friends. With Dedicated tier you can go up to 2 million events per second:

Scalable between 1 and 8 capacity units (CU) – providing up to 2 million ingress events per second.

From https://docs.microsoft.com/en-us/azure/event-hubs/event-hubs-dedicated-overview


clehene commented Jun 20, 2017

@aliostad glad to see there's a new tier for Event Hubs. I think it would be worth documenting these limits/options for Zipkin. Unfortunately I'm not sure how to keep them up to date, as there seems to be no standard way (e.g. an API) to get the units, limits, and pricing.

Also this is an interesting detail about dedicated pricing:

Dedicated Event Hubs is only available to EA customers and is sold at a retail price of $733.13 per day for 1 Capacity Unit.
https://azure.microsoft.com/en-us/pricing/details/event-hubs/

It seems to be a big jump to over $250K / year.
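The jump is easy to check from the quoted daily rate (simple annualization only; no reserved-capacity discounts assumed):

```python
# Annualize the quoted Dedicated Event Hubs rate: $733.13/day per capacity unit.
daily_rate_usd = 733.13
annual_per_cu = daily_rate_usd * 365
print(round(annual_per_cu))  # ~ $267,592/year per CU, i.e. over $250K
```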

@aliostad

@clehene hi. Running at that level will not be cheaper with DocumentDB. I have done some PoC work under extreme load, and frankly the cost will be higher, since it does more. Here are some details:

The test we were doing used 4 KB docs. Storing each one consumed ~70 RU. So 200K RU/s will cost £108K a year while only giving you enough RU to store a mere ~3K events per second (assuming 4 KB each)!
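The throughput figure follows directly from the RU-per-write observation. A sketch of the arithmetic, taking the ~70 RU per 4 KB write measured in the PoC as the input:

```python
# How many 4 KB writes per second a given RU/s budget allows,
# assuming ~70 RU consumed per write as measured in the PoC above.
RU_PER_WRITE = 70

def max_writes_per_second(provisioned_ru_s: float) -> float:
    """Upper bound on sustained writes given provisioned throughput."""
    return provisioned_ru_s / RU_PER_WRITE

print(round(max_writes_per_second(200_000)))  # 2857 events/s, i.e. "mere 3K"
```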

Also, on the read side you will start having problems with one-by-one read-and-delete, and you do not have the checkpointing that comes free with Event Hubs.

So in short, CosmosDB, once it has table support, will be good, but right now it only works if you use it with Azure Search (there is an option to store docs and have Azure Search index them). I can work on this if enough people are interested, but it seems we already have a working version?


codefromthecrypt commented Jun 21, 2017 via email


clehene commented Jun 21, 2017

@adriancole the main reason for discussing EH here is the original question of whether it's worth putting it in front of Cosmos DB.
My assumption is/was that EH may be more expensive, with lower availability and throughput, and that we might implement CosmosDB as both storage and transport at the same time.

That said, @aliostad's points about CosmosDB are valid and relevant to the CosmosDB discussion, and it would be worth expanding on a few topics, e.g. what the ideal data model for Zipkin data in CosmosDB would be from a size and query-capability perspective.
This may be useful for determining size/cost mappings: https://docs.microsoft.com/en-us/azure/cosmos-db/request-units.
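As a starting point for that mapping, a back-of-envelope sizing sketch. The span rate, RU-per-write figure, and headroom factor below are all illustrative assumptions; real RU charges should be measured (Cosmos DB returns the charge of each request in the `x-ms-request-charge` response header) before provisioning anything.

```python
# Back-of-envelope RU/s sizing for Zipkin span writes.
# All inputs are assumptions for illustration; measure real per-write RU
# charges via the x-ms-request-charge response header before provisioning.

def required_ru_s(spans_per_second: float, ru_per_write: float,
                  headroom: float = 1.5) -> float:
    """Provisioned RU/s needed for a steady write rate, with headroom
    left over for queries and traffic spikes."""
    return spans_per_second * ru_per_write * headroom

# Example: 1,000 spans/s at an assumed ~6 RU per small span write,
# with 50% headroom.
print(required_ru_s(1_000, 6))  # 9000.0 RU/s
```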

@codefromthecrypt

@clehene ok, I think I understand. Yeah, for example there are folks who use storage directly instead of having a separate transport (I've heard of this for both Elasticsearch and Cassandra, although it isn't common practice). There are impacts on how you'd design the data model if you think people would be doing this, and yeah, there'd be no way for Zipkin to prevent people from skipping a separate transport if they wanted to.
