
Evaluate moving to Circle CI #2319

Closed
kailuowang opened this issue Jul 9, 2018 · 30 comments

@kailuowang
Contributor

Travis's memory issues are becoming a bit too much, and our build there now takes more than 3 hours.

@tpolecat
Member

tpolecat commented Jul 9, 2018

Might also look at BuildKite if we can get people to donate hardware. It's easy enough for me to set up, which means it's easy. An agent running on @larsrh's 3874629847653-core machine would be 🔥

@larsrh
Contributor

larsrh commented Jul 9, 2018 via email

@ghost

ghost commented Aug 13, 2018

@djspiewak
Member

@kailuowang I'm a bit out of the loop, I think. The build takes 3 hours? How long does it take locally? What is it doing? Thanks to working at SlamData, I have a remarkably vast swath of experience debugging slow Travis builds. I'd be happy to take a look if you want.

@ghost

ghost commented Aug 13, 2018

The 2-3 hours is the combined total across all of the build's jobs; each individual job typically takes 20-30 minutes - https://travis-ci.org/typelevel/cats/builds/415542808

And a lot of that time is coverage testing, tut testing, doc testing, site building and so on.

@djspiewak
Member

OK, taking a quick look at things, these are literally the first things that occur to me:

  • Oh god, you're using a separated build script… I hate that convention.
  • Why is sudo: required? I'm relatively certain those VMs are slower. Is it just for codecov? See below.
  • The travis-publish.sh script goes to great lengths to push things all into a single SBT instance. In my experience, this is exactly the opposite of what you want to do when you have a slow build. Separate SBT processes, sequentially invoked, give you better memory characteristics and are better understood by Travis (especially if you don't split the build script out of .travis.yml). (A sketch of this follows at the end of this comment.)
  • .jvmopts uses -Xmx6g. This is problematic because Travis doesn't have that much memory! You should strongly consider dropping that option altogether and allowing it to be the default (ditto with -Xms), which will be scaled off of the reported system memory.
  • We should have a discussion about whether or not code coverage is actually worth anything. Frankly, I've never seen it provide any value whatsoever, and it doubles the duration of the JVM build.
  • Why is the Ivy cache not being sanitized prior to publication? This is resulting in re-caching quite often.
  • Random best-practice: consider commenting on each of the secure variables so we know which one is which.

I didn't look at SBT itself. Looks like a lot of the logic is in tasks, so that may also contribute.
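To make the "separate sbt processes" idea concrete, here's a rough sketch (the module names below are illustrative, not necessarily cats' real ones): expose each stage as a command alias so that .travis.yml can call each one in its own short-lived sbt invocation.

```scala
// Sketch only - the module names (catsJS, catsJVM, docs) are hypothetical.
// Each alias is intended to be run as its own CI step, e.g.
//   script:
//     - sbt ++$TRAVIS_SCALA_VERSION validateJS
//     - sbt ++$TRAVIS_SCALA_VERSION validateJVM
// so every stage gets a fresh JVM instead of sharing one long-lived sbt.
addCommandAlias("validateJS", ";catsJS/test")
addCommandAlias("validateJVM", ";catsJVM/test ;docs/makeSite")
```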

@ghost

ghost commented Aug 13, 2018

The build script actually does invoke sbt multiple times, but for the JVM we could split even further, as per the JS build - though the JVM issue normally happens relatively early in the build.

As for sudo - that means a slower startup, but you get the 7.5 GB of memory; we could try a lower setting. Ref: https://docs.travis-ci.com/user/reference/overview/

@rossabaker
Member

> Why is sudo: required? I'm relatively certain those VMs are slower.

sudo: required gets 7.5 GB as opposed to 4 GB. http4s adopted it because the IO was untenable on the container builds, but that should be far less of a factor in cats.

> We should have a discussion about whether or not code coverage is actually worth anything. Frankly, I've never seen it provide any value whatsoever, and it doubles the duration of the JVM build.

👍

@ghost

ghost commented Aug 13, 2018

My main concern before moving would be to make sure that it really isn't our build that's at fault! One simple option is to add parallelExecution := false to the JVM settings; it's already in the JS settings.
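For reference, a minimal sketch of that setting as it would sit in the shared JVM settings (the JS settings already carry the equivalent):

```scala
// Minimal sketch: run the JVM test suites sequentially to lower peak memory use.
// (On older sbt the same setting reads `parallelExecution in Test := false`.)
Test / parallelExecution := false
```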

@ghost

ghost commented Aug 13, 2018

Re: scoverage times... be careful here. The scoverage run also executes the ScalaCheck tests, but with larger parameters than the JS build. And after a successful coverage run, the code is just rebuilt, not re-tested.

So whilst coverage will always be slower, I doubt it's causing any issues. What we might want to do is try running scoverage with very low parameters (just to get coverage) and then run the full ScalaCheck tests without scoverage.

IMHO, keeping/ditching coverage is best discussed as a separate issue.
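As a rough illustration of the "low parameters under coverage" idea (the cats.coverage property below is made up, not an existing flag):

```scala
// Sketch only: shrink the ScalaCheck parameters when a hypothetical
// -Dcats.coverage=true flag is set, so the instrumented run stays cheap,
// while the plain (uninstrumented) test run keeps the full parameters.
import org.scalacheck.Test.Parameters

object CoverageAwareParams {
  // 5 iterations under the (made-up) coverage flag, 50 otherwise.
  val minSuccessful: Int =
    if (sys.props.get("cats.coverage").contains("true")) 5 else 50

  val checkParams: Parameters =
    Parameters.default.withMinSuccessfulTests(minSuccessful)
}
```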

@kailuowang
Contributor Author

kailuowang commented Aug 13, 2018

@djspiewak thanks so much for helping. And @BennyHill thanks for answering some of the questions.

To answer your questions above:

  • I'm not a fan of the separate build script either. Maybe we can replace it, but it hasn't bothered me enough to spend time on it.
  • That is, errr, a way to tell Travis to use a different VM (see @BennyHill's answer above). I don't believe sudo is actually needed for the build to run. We added it at least a year ago, the last time we had memory issues with Travis. It might be worth trying to remove it if we can squeeze into the smaller VM.
  • +1 on dropping -Xmx6g, especially if we can use a different VM.
  • Code coverage combined with the Codecov Chrome extension made it very easy to identify uncovered code in PRs. I agree that the overall coverage number for a PR isn't that critical. We could probably improve the build by limiting code coverage to a single Scala 2.12 JVM build job (see the sketch after this list); right now it's performed on both the Scala 2.11 and 2.12 JVM jobs.
  • No idea. Worth a try.
  • Also +1 on adopting that best practice. I think the two we have are the Sonatype credentials.
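A rough sketch of what limiting coverage to the 2.12 JVM job could look like from the build side, assuming Travis exports TRAVIS_SCALA_VERSION to the build (this is an illustration, not our current config):

```scala
// Hypothetical: wrap the test run in coverage only on the Scala 2.12 JVM job.
val coverageJob: Boolean =
  sys.env.get("TRAVIS_SCALA_VERSION").exists(_.startsWith("2.12"))

addCommandAlias(
  "testWithCoverage",
  if (coverageJob) ";coverage ;test ;coverageReport" else ";test"
)
```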

@ghost

ghost commented Aug 13, 2018

Re: the parallelExecution := false idea - this came up the other day on the Scala Native channel: https://gitter.im/scala-native/scala-native?at=5b6d631fa6af14730b170260

@ghost

ghost commented Aug 13, 2018

Finally, re: the "separate build script" - this was originally done as per the CI docs.

But of course, that was a while back, so perhaps we can revisit it.

@ghost

ghost commented Aug 13, 2018

And finally, finally... one small advantage of the separate build script is that it's far easier to "run" from the command line without having a local Travis setup - see https://github.com/typelevel/cats/blob/master/scripts/travis-publish.sh#L17-L18

@DavidGregory084
Member

If you drop sudo: required, it would be a good idea to add -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap to ensure that the heap size is set according to the container's memory limits.
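If the tests end up being forked, the same flags could also be passed to the test JVMs through sbt; a sketch, assuming forked tests (.jvmopts only governs the sbt process itself):

```scala
// Sketch only: pass the cgroup-aware heap flags to forked test JVMs.
Test / fork := true
Test / javaOptions ++= Seq(
  "-XX:+UnlockExperimentalVMOptions",
  "-XX:+UseCGroupMemoryLimitForHeap"
)
```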

@softinio
Member

If a decision is made to move to CircleCI, let me know, as I would gladly help. I have used CircleCI exclusively for the last few years.

Are there any other alternatives being considered?

@DavidGregory084
Member

I have also heard really good things about Semaphore and BuildKite, although BuildKite requires its own infrastructure (I can highly recommend packet.com for that), and Semaphore's OSS policy seems to have mysteriously become a "Please email us if you are an OSS project" policy.

@DavidGregory084
Member

I had a look at this a few days ago, and my overwhelming impression was that it's hard to define a build matrix in a nice way in any of the hosted services other than Travis. It's possible in Circle CI, but it relies on YAML dictionary operations rather than being a construct in its own right.

@DavidGregory084
Member

Whether that's a problem or not depends on how much faster (if at all) the builds run on those services IMO 😀

@kailuowang
Contributor Author

kailuowang commented Apr 26, 2019

Thanks, guys. We haven't seriously looked at any of the alternatives yet, but we probably should soon, given the elevated uncertainty about Travis's future and its suboptimal reliability lately. An easier migration from Travis would be a nice-to-have; the reason being that if we have to switch yet again, we're slightly more likely to find another service that somewhat conforms to the Travis way. How easy is it to set up a trial on Circle CI?

@DavidGregory084
Member

I'd be glad to give a few different services a go and report back @kailuowang?

@kailuowang
Contributor Author

@DavidGregory084 that would be amazing. Thanks!

@DavidGregory084
Member

There's something interesting about testing new CI systems that brings out all the weird bugs 😄 (screenshots of the failures omitted)

This was referenced Apr 30, 2019
@DavidGregory084
Member

DavidGregory084 commented Apr 30, 2019

Guys, I've opened a few PRs which demonstrate the config required to use different services.

I evaluated CircleCI too, but I found that the container memory limit of 4 GB was just not enough to run cats builds reliably. The configuration is also quite verbose, and I had issues where the config validation in the CircleCI CLI disagreed with the service itself, so my build didn't run despite passing validation locally.

These services do experience intermittent build failures, but those failures all seem to be caused by a single flaky test (ApplicativeTests.monoid.combineAll).

I think we should focus on fixing that whatever we decide to do about CI in the future.

So far my instinct is that Drone.io is probably the best option as it is free for open source, easy to configure and super fast.

Semaphore has a very unclear open source policy and although Buildkite is very nice, I think that managing hardware in addition to the build itself could become a bit of a chore.

@kailuowang
Contributor Author

Thanks, @DavidGregory084, that's a lot of work. I will check out their configs in your PRs and take a stab at ApplicativeTests.monoid.combineAll.

@softinio
Member

@DavidGregory084 Out of curiosity, what specific memory-related issues did you hit with CircleCI? Were you leveraging any of Circle's parallel processing features?

@DavidGregory084
Member

@softinio you can see the config I used here. I tried using the cgroup memory limit detection (-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap), which didn't work correctly on CircleCI and resulted in the JVM allocating way too much memory. I also tried reducing the JVM memory allocation to 3.5G but I was still getting multiple jobs on each build killed by the CircleCI infra (Exited with code 137). You can see some example runs here.

@DavidGregory084
Member

DavidGregory084 commented May 21, 2019

@softinio it seems like exceeding 4GB of available memory requires using a paid plan; as an open source project we could probably use the resource_class: large if we contacted CircleCI support.

@kailuowang
Contributor Author

Update on this: Semaphore would like to donate 8 bare-metal performance agents for Cats CI. In my tests, that cuts Cats' build time in half. I think we should consider migrating to Semaphore, the main reason being that we have so many TL projects on Travis all sharing 6 slow agents; it would be nice to have some more powerful CI resources.

@larsrh
Contributor

larsrh commented Oct 20, 2020

Since nobody has worked on this for quite a while, I'm closing all old CI-related PRs.

@larsrh larsrh closed this as completed Oct 20, 2020