
Cassandra serializes Dependencies when only dependencies.links is ever read #1008

codefromthecrypt opened this issue Mar 1, 2016 · 4 comments


@codefromthecrypt

This may be a historical thing, but the Dependencies type is only used by Cassandra, and only its links field is ever read. I'd recommend we remove the type and change the data stored in Cassandra to serialize just the links.

Here's a plan:

  • Add CassandraDependencyStore.storeDependencies(epochDayMillis: Long, links: Seq[DependencyLink]), which serializes only the link list as a byte buffer.
  • Deprecate Dependencies and DependencyStore.storeDependencies(dependencies: Dependencies).
    • Note that Cassandra is currently the only implementation of storeDependencies anyway, and its only caller, the Spark job, uses CassandraDependencyStore directly.
  • Change the deserializer to peek at the first byte to tell whether the stored bytes are a bare list of links or a Dependencies object (like #995, Supports reading multiple spans per Kafka message); a sketch follows this list. That means the query server can operate on both formats without a schema change.
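
Here's a minimal sketch of the peek, assuming TBinaryProtocol encoding and that Dependencies and DependencyLink are the existing thrift-generated types: a serialized Dependencies struct starts with a field header whose first byte is a field-type code (TType.I64 for start_ts), while a bare list of DependencyLink starts with the element-type byte TType.STRUCT followed by the element count. The two decode functions are stand-ins for whatever thrift codec the store already uses:

  import org.apache.thrift.protocol.TType

  def readLinks(bytes: Array[Byte],
                decodeLinkList: Array[Byte] => Seq[DependencyLink],
                decodeDependencies: Array[Byte] => Dependencies): Seq[DependencyLink] =
    if (bytes(0) == TType.STRUCT) decodeLinkList(bytes) // new format: bare list of links
    else decodeDependencies(bytes).links                // old format: whole Dependencies struct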

Once this code is out, zipkin-dependencies-spark could release a version that moves to the new method, serializing only the link list. Its instructions would then note a minimum zipkin-query-server version.

Note that this is different from changes around spans, which often affect end-user apps. Currently, only zipkin-dependencies-spark depends on this data structure and is directly affected by this change.

@codefromthecrypt

@yurishkuro any thoughts on this?

@yurishkuro

The other fields of this struct are also used. The Spark job does this:

  def saveToCassandra(sc: SparkContext, keyspace: String, dependencies: DependenciesInfo): Unit = {
    val thrift = dependencies.toDependencies(startTs = startTs, endTs = endTs).toThrift

and the Cassandra store looks at these timestamps when filtering, iirc. That's not to say it must stay this way; a cleaner approach would have been to extend the table schema and store those values as proper fields in the table.

Is anything else dependent on the Link struct? If not, it could be a candidate for de-thrifting as well, e.g. by using the native collection types that Cassandra supports (not sure how easy it is to write those from Spark).

As far as the transition from the old schema to the new goes, I would instead go with a simple migration script that copies from the thrift-based dependencies table into a new table with a clean, thrift-free schema; a rough sketch follows. Then people can upgrade both the server and the Spark job and still have all the historical data (e.g. we have no TTL on the dependencies table; we keep it forever).
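
Something like the following, using the DataStax Java driver (2.x-era API) from Scala. The table and column names (dependencies, dependencies_v2, day, links) and the decodeLinks helper are assumptions for illustration, not the actual Zipkin schema:

  import com.datastax.driver.core.Cluster
  import java.nio.ByteBuffer
  import scala.collection.JavaConverters._

  object MigrateDependencies extends App {
    // stand-in: decode the old thrift Dependencies blob, re-encode only its links
    def decodeLinks(thrift: ByteBuffer): ByteBuffer = ???

    val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
    val session = cluster.connect("zipkin")
    session.execute(
      "CREATE TABLE IF NOT EXISTS dependencies_v2 (day timestamp PRIMARY KEY, links blob)")
    val insert = session.prepare("INSERT INTO dependencies_v2 (day, links) VALUES (?, ?)")
    // copy every historical row, stripping the thrift envelope along the way
    for (row <- session.execute("SELECT day, dependencies FROM dependencies").asScala) {
      session.execute(insert.bind(row.getDate("day"), decodeLinks(row.getBytes("dependencies"))))
    }
    cluster.close()
  }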

@codefromthecrypt

@yurishkuro nothing in the CQL deserializes thrift. While it's true that we store these dates, they aren't used anywhere except to make epochDayMillis, which could be computed directly (sketch below).
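
For example, something like this one-liner would do, assuming the day key is UTC midnight in epoch millis (a sketch, not the exact code in the store):

  import java.util.concurrent.TimeUnit.DAYS

  // floor a timestamp to the start of its UTC day, in epoch millis
  def epochDayMillis(epochMillis: Long): Long =
    epochMillis - (epochMillis % DAYS.toMillis(1))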

I'd like to separate a complete model overhaul from the topic of extra fields, if possible.

@codefromthecrypt

@yurishkuro PS: I opened #1009 on a thrift-less schema. If you don't mind, please jot your thoughts on what it would look like over there, so the comments don't end up lost. Cheers!
