
Mesos 1.1.2 #1571

Merged: ssalinas merged 58 commits into master from mesos_1 on Oct 11, 2017

Conversation

@ssalinas (Member) commented Jun 20, 2017

This upgrades us to mesos 1.1.2. We can't go straight to the newest version because masters in 1.2 onward will no longer accept connections from 0.x slaves, so the upgrade path would be 😭.

The api and protos objects also change quite a bit, meaning that all of our current task history with MesosTaskInfo, Offer, etc. saved in the json will not be readable in the new version unless we write the code to convert it. As an alternative to keeping the pieces in json, which requires Singularity client users to pull in a mesos dep, I would propose we find a way to wrap the data in our own POJOs instead (like we have done for much of the rest of the objects).
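To make that concrete, here's a rough sketch of the sort of wrapper I'm imagining; the class name, fields, and accessor names are illustrative only, not an actual implementation:

```java
import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.google.common.base.Optional;
import org.apache.mesos.Protos.TaskInfo;

// Hypothetical wrapper: persist only the fields we need, so task history
// json no longer depends on mesos protos classes.
// (Assumes jackson's GuavaModule is registered for Optional support.)
public class SingularityMesosTaskInfoHolder {
  private final String taskId;
  private final String slaveId;
  private final Optional<String> executorId;

  @JsonCreator
  public SingularityMesosTaskInfoHolder(@JsonProperty("taskId") String taskId,
                                        @JsonProperty("slaveId") String slaveId,
                                        @JsonProperty("executorId") Optional<String> executorId) {
    this.taskId = taskId;
    this.slaveId = slaveId;
    this.executorId = executorId;
  }

  // Translate from the mesos protos object at write time.
  public static SingularityMesosTaskInfoHolder fromProtos(TaskInfo taskInfo) {
    return new SingularityMesosTaskInfoHolder(
        taskInfo.getTaskId().getValue(),
        taskInfo.getSlaveId().getValue(),
        taskInfo.hasExecutor()
            ? Optional.of(taskInfo.getExecutor().getExecutorId().getValue())
            : Optional.<String>absent());
  }

  public String getTaskId() {
    return taskId;
  }

  public String getSlaveId() {
    return slaveId;
  }

  public Optional<String> getExecutorId() {
    return executorId;
  }
}
```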

FYIs

  • --work_dir flag must be set for all mesos slaves/agents
  • internally the slave -> agent rename is there, but all API endpoints and fields still reference slave as they did before, for now

Eventual things we should do to keep up with newer features:

  • executor should honor the kill_policy setting of kill messages
  • executor shutdown grace period is set in ExecutorInfo
  • slave -> agent rename
  • use labels in ExecutorInfo in place of source

The Fun Stuff

Things that we can explore once we upgrade:

  • support --http_command_executor
  • explore use of mesos-native health checks
  • per container linux capabilities
  • partition-aware mesos

> Frameworks can opt-in to the new PARTITION_AWARE capability. If they do this, their tasks will not be killed when a partition is healed. This allows frameworks to define their own policies for how to handle partitioned tasks. Enabling the PARTITION_AWARE capability also introduces a new set of task states: TASK_UNREACHABLE, TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR, and TASK_UNKNOWN. These new states are intended to eventually replace the TASK_LOST state.
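For reference, opting in would just be a matter of adding the capability when we build our FrameworkInfo. A minimal sketch, assuming the v1 protos; the user and name values are placeholders:

```java
import org.apache.mesos.v1.Protos.FrameworkInfo;

public class PartitionAwareSketch {
  // Minimal sketch: add the PARTITION_AWARE capability to the FrameworkInfo
  // we send in the SUBSCRIBE call.
  public static FrameworkInfo buildFrameworkInfo(String user) {
    return FrameworkInfo.newBuilder()
        .setUser(user)
        .setName("Singularity")
        .addCapabilities(FrameworkInfo.Capability.newBuilder()
            .setType(FrameworkInfo.Capability.Type.PARTITION_AWARE))
        .build();
  }
}
```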

@ssalinas (Member Author) commented
Updates here:

  • current code runs and schedules tasks locally. Subscription, offers, framework messages, etc. are all working smoothly
  • When written to json, the new protos objects still conform to the same structure as the old ones. So, while I would like to remove mesos protos from things written to json in zk, it isn't a requirement for upgrading
  • The executor can be left on the unversioned mesos library binding for now. It can still connect to newer masters with the older library as long as it is running the newer mesos libs underneath
  • a user for the framework is now required and is used in certain isolator calls. This defaults to root and is overridable in the yaml config (see the sketch after this list)
  • docker images are using 1.1.1 because mesosphere never published a 1.1.2 image. Not a large enough difference to build my own
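As mentioned above, a hedged sketch of the framework-user wiring; the config accessor here is hypothetical, not the real SingularityConfiguration API:

```java
import com.google.common.base.Optional;
import org.apache.mesos.v1.Protos.FrameworkInfo;

public class FrameworkUserSketch {
  // Sketch: the framework user defaults to root but can be overridden in
  // the yaml config. configuredFrameworkUser would come from a hypothetical
  // accessor like configuration.getMesosConfiguration().getFrameworkUser().
  public static FrameworkInfo.Builder frameworkInfoWithUser(Optional<String> configuredFrameworkUser) {
    return FrameworkInfo.newBuilder()
        .setUser(configuredFrameworkUser.or("root"));
  }
}
```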

Also /cc @tpetr in case you're interested in the upgrade at all ;)

@ssalinas (Member Author) commented Jun 23, 2017

Next round of updates:

  • finished refactoring to use the http api via https://github.com/mesosphere/mesos-rxjava . The scheduler lock is still in place, but the addition of observables frees us up to do some more interesting things internally if we want to (see the sketch after this list). Best of all, this means we are free from native mesos lib bindings!
  • Likely going to leave the protos stuff as-is for now. I'll do some additional tests for backwards compatibility, but it looks like we should be all set there.
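As referenced above, a rough sketch of the kind of event handling the observable stream opens up. Obtaining the `Observable<Event>` from the mesos-rxjava client is elided here, and the dispatch logic is illustrative rather than our actual scheduler code:

```java
import org.apache.mesos.v1.scheduler.Protos.Event;
import rx.Observable;
import rx.schedulers.Schedulers;

public class EventDispatchSketch {
  // Illustrative only: fan scheduler events out by type, keeping the
  // processing off of the network thread via observeOn().
  public static void dispatch(Observable<Event> events) {
    events
        .observeOn(Schedulers.computation())
        .subscribe(event -> {
          switch (event.getType()) {
            case SUBSCRIBED:
              System.out.println("subscribed as " + event.getSubscribed().getFrameworkId().getValue());
              break;
            case OFFERS:
              event.getOffers().getOffersList()
                  .forEach(offer -> System.out.println("offer from " + offer.getHostname()));
              break;
            case UPDATE:
              System.out.println("status update: " + event.getUpdate().getStatus().getState());
              break;
            default:
              break;
          }
        });
  }
}
```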

@ssalinas ssalinas modified the milestone: 0.17.0 Jun 23, 2017
@ssalinas ssalinas changed the title (WIP) Mesos 1.1.2 Mesos 1.1.2 Jun 23, 2017
@ssalinas (Member Author) commented Jun 30, 2017

Remaining TODO on this PR:

  • Update the new client so that we can take a list of mesos master hosts to attempt to connect to. Right now it's just a single one (see the sketch below)
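A minimal sketch of what that could look like; `tryConnect` is a stand-in for the real subscribe call, not an actual method on the client:

```java
import java.net.URI;
import java.util.List;

public class MasterFailoverSketch {
  // Sketch only: walk the configured master hosts and use the first one
  // that accepts a scheduler stream.
  public static URI connectToAnyMaster(List<String> masterHosts) {
    for (String hostAndPort : masterHosts) {
      URI candidate = URI.create(String.format("http://%s/api/v1/scheduler", hostAndPort));
      try {
        tryConnect(candidate); // hypothetical: open the event stream here
        return candidate;
      } catch (Exception e) {
        // couldn't connect, fall through to the next master in the list
      }
    }
    throw new IllegalStateException("no reachable mesos master in " + masterHosts);
  }

  private static void tryConnect(URI uri) throws Exception {
    // placeholder for the mesos-rxjava subscribe/openStream call
  }
}
```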

@ssalinas ssalinas added the hs_qa label Jul 20, 2017
@baconmania (Contributor): 🚢

Review comment on the `SingularityMesosSchedulerClient` constructor:

```java
private Thread subscriberThread;

@Inject
public SingularityMesosSchedulerClient(SingularityConfiguration configuration, @Named(SingularityMainModule.SINGULARITY_URI_BASE) final String singularityUriBase) {
```

@baconmania (Contributor): Would it make sense to have this implement AutoCloseable?

@ssalinas (Member Author): My eventual thought for this is to have it be able to fix #273, meaning the same singleton scheduler client could subscribe again or renegotiate its connection. The client isn't ever used in a single try-with-resources type of scope, so the only closing that would have to be done is on shutdown. Also, the dropwizard guicier we use here doesn't have the auto-close-singleton-closeables bits that the other version does.
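Something along these lines is what I have in mind; purely a sketch, with illustrative method names:

```java
import java.net.URI;
import rx.Subscription;

public class ReconnectableSchedulerClientSketch {
  private Subscription eventStreamSubscription;

  // Re-subscribing tears down any existing stream first, so the same
  // singleton client can renegotiate its connection (the #273 case).
  public synchronized void subscribe(URI mesosMasterUri) {
    close();
    eventStreamSubscription = openStream(mesosMasterUri); // hypothetical
  }

  // Only ever called on shutdown, not in a try-with-resources scope.
  public synchronized void close() {
    if (eventStreamSubscription != null && !eventStreamSubscription.isUnsubscribed()) {
      eventStreamSubscription.unsubscribe();
    }
  }

  private Subscription openStream(URI mesosMasterUri) {
    throw new UnsupportedOperationException("stand-in for the mesos-rxjava call");
  }
}
```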

@baconmania (Contributor): 🚢

@ssalinas ssalinas merged commit b5b9bfb into master Oct 11, 2017
@ssalinas ssalinas deleted the mesos_1 branch October 11, 2017 15:34