Upgrade Semaphore CI [Approved]

Topic: Upgrade Semaphore CI instance to 4x boxes

Proposal type: Hardware [ ] , Software [ ] , Other [ X ] : CI services

Description: The current state of the CI tools is incredibly painful. From a developer feedback perspective, the lag time on getting results is far too long. At the time of writing, Travis is running 5 hours behind activity. This has spillover effects on other ArduPilot-associated repos like ArduPilot/mavlink, whose build is a very fast task but is stuck behind all other scheduled Travis builds.

I’d like to see us shift more work onto Semaphore CI, as it is much faster at a given task than Travis is. Semaphore CI is completing 3 Linux board builds in 8 minutes, while Travis takes 14 minutes to do 3 Linux board builds.
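
As a rough check on the figures above, the comparison can be worked out directly. This is only a sketch: the function name is made up for illustration, and the two inputs are simply the 8 and 14 minute timings quoted for 3 Linux board builds.

```python
def compare_build_times(semaphore_min, travis_min):
    """Compare two wall-clock times for the same set of builds."""
    return {
        # How many times longer Travis takes than Semaphore.
        "speedup": travis_min / semaphore_min,
        # Share of the Travis time that Semaphore saves, in percent.
        "time_saved_pct": 100.0 * (travis_min - semaphore_min) / travis_min,
    }


result = compare_build_times(semaphore_min=8, travis_min=14)
print(f"Travis takes {result['speedup']:.2f}x as long as Semaphore")
print(f"Semaphore saves {result['time_saved_pct']:.0f}% of the Travis time")
```

On those two timings Travis takes 1.75x as long, so Semaphore finishes the same three builds in a bit under 60% of the Travis time.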

At the moment we only have 2 boxes from Semaphore CI on the free plan. I’d like to propose moving this to 4 boxes (which per Semaphore docs would only cost us the price of 2 boxes as an open source project). This would allow us to offload more work onto Semaphore from Travis to bring the entire CI time frame back down to a more reasonable value.

The other alternative that doesn’t require money would be to look into CircleCI, which offers 4x instances to open source projects. CircleCI used to be much slower than Semaphore, but has apparently sped up with their 2.0 release. My inclination is to stick with Semaphore given the known build times and the easier setup for increasing the number of builds; having 3 CI tools also starts to add a fundamental overhead.

Planned amount $$ (USD): $747 annually (originally $996). EDIT: Updated total to reflect NFP status.

Estimated time for completion: 1 day

@OXINARF pointed me at this: https://semaphoreci.com/docs/does-semaphore-offer-a-discount-for-non-profits.html which would reduce the cost by 25%. I’ve contacted SemaphoreCI to see if this would in fact be possible in addition to the 2 free boxes we are currently using.

@WickedShell and I were talking about the whole CI situation before he made the proposal, so he basically knows my thoughts, but here they go so everyone (and especially the funding committee) knows.

I agree with Michael that CI is taking too much time currently. In the last few months we added Sub, px4-v3 and px4-v4pro, which together have impacted Travis in a big way; mainly the last two, because, just like all other px4-v* jobs, they are doing two builds, each one using one of the two build systems we still support.

When we stop doing builds with the Make build system in Travis, runtime of five jobs will be cut by half (or more). It will certainly help a lot, but I’m not sure if it will be enough - it’s the kind of change that can only be tested on production, with PRs being opened, updated and merged. We have now switched our firmware server to build everything using waf and I think this is a good time to change CI too. For a long time, Linux boards and SITL have been built in CI just with waf, with a Cron job running once a day to make sure the Make build system still works - that has lately timed out in Travis too.

Regarding Semaphore, I have two issues with it:

  • its cache feature is almost useless, because it is shared by all parallel jobs (meaning we can’t cache ccache database for example)
  • its configuration is done on their website instead of using a file in our repo like Travis and others do

I’ll explain why the second one is important: just like ArduPilot has free Travis and Semaphore accounts, everyone can have them too. This means that everyone can put their code to the test before even opening a PR. The good part about Travis is that anyone using master as a base will run exactly the same test that we do when a PR is open. Semaphore doesn’t allow that - if we change the Semaphore configuration and a user doesn’t know about it, their personal Semaphore account will run with any older configuration the user has put there.

I’ve started reading CircleCI documentation, but with the amount of work I have daily, I can’t give any promises of a deadline to make it work.

It pains me a bit to put this amount of money to get just two more jobs, but if the team wants to, I’ll certainly work to put them to good use.

I’m pretty certain that while it will help reduce times a lot, it will be insufficient and we will still be looking at painfully long build times. And even if we decide to tolerate it for a bit, it’s just going to slowly build back up to being very painful again.

While I agree that more caching can only be better, I’m going to stick with the point that even without optimal caching Semaphore is 33% faster than Travis, which has the better caching. I know a lot of time went into trying to get this to work better, but given that it’s already a fundamentally faster service I’m willing to accept not having some caches.

This I will agree is a bit more problematic, but it can be somewhat mitigated by leveraging scripts to do most of the actual CI running. I grant that there are external steps that aren’t scriptable, but I’m of the opinion it’s an acceptable tradeoff for now.

I think we should only do make for px4-v3 and sitl now. That will be enough to ensure it doesn’t break completely until we get all devs on waf, but should reduce time a lot.

Also, I support the idea of buying some more CI time if it will help. I think we should have someone in the team be the CI maintainer, and let them choose the best option.

OK, maybe we should define what are the acceptable times? Just remember that we also have a build server and I’m starting to feel that we are duplicating effort here. Is CI supposed to be much better than our build server? If yes, then we are wasting resources in our server because we can just deploy the artifacts from CI.

Sure, I concede that it may be my countless hours of trying to make Semaphore cache work properly talking on my behalf.
I just remembered another point I don’t like about Semaphore: when you close a PR, all build information is lost. I know that at first sight this doesn’t look problematic, but I’ve wanted to look at Semaphore logs in the past that were just not there anymore.
And so that I don’t just point out negatives, I love that you can do a debug build in Semaphore and SSH into their machine.

I’ll say this: should it block buying 2 more jobs? No, I don’t think so. Is it a huge bummer? Yes, I don’t know anyone that updates their Semaphore configuration (if they even have one) when we change it - well, except me, but that’s because I usually test changes in my personal account first.

I was thinking (and I am actually testing it now) of doing it like we’ve done for Linux: every build is just waf and then once a day master is built with Make. I would also remove testing Linux boards with Make, it has been enough time.
By the way, SITL has been running like this for a long time too (just as long as Linux boards) and there hasn’t been any problem.
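
One way that split could be expressed in a script kept in the repo is sketched below. This is an illustration under assumptions, not the actual ArduPilot CI setup: the function name is hypothetical, and it relies on the `TRAVIS_EVENT_TYPE` environment variable that Travis sets to `cron` for scheduled builds.

```python
import os


def build_systems_for(event_type):
    """Pick which build systems a CI run should exercise.

    Hypothetical helper: the daily cron run checks that the Make
    build still works, every other run (pushes, PRs) uses waf only.
    """
    if event_type == "cron":
        return ["make"]
    return ["waf"]


if __name__ == "__main__":
    # Travis sets TRAVIS_EVENT_TYPE; it is unset when run locally,
    # in which case the waf-only default applies.
    event = os.environ.get("TRAVIS_EVENT_TYPE")
    for system in build_systems_for(event):
        print(f"would build with {system}")
```

Keeping the decision in a script rather than in the CI service's settings means a personal fork picks up the same behaviour as soon as it is rebased on master.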

Although not officially I’m sure I’ve been the CI maintainer for quite some time. I’m happy to pass it along, as long as it continues to work well - CI is also important for my work as a reviewer.

I actually think deploying from CI would be an absolutely acceptable outcome. It seems to me, that there is a lot of duplication in terms of maintenance and functionality between CI and the build server. I’d be in favor of reducing that to either be using the build servers for CI, or deploying from the CI system. It’s highly possible I don’t know enough about what’s happening in the build server though, and it might have some feature we couldn’t otherwise get.

I’d prefer to keep one build in the CI system with Make as long as Make is supported. Otherwise there is quite a high chance with new libs/files of not appropriately updating the scripts for it. I do agree it can be reduced to only a single target though.

I don’t mind taking on some of this if you want, but my main concerns are:

  1. I’m much more familiar with Semaphore than with Travis, and would need some support with the Travis stuff at first.
  2. My interaction with CI is primarily as a dev, and my complaints are when it is causing me headaches. You are much more tied to the CI side/interactions than I am. I don’t want to volunteer for it if there is a chance I’d be slowing you down on a regular basis as maintenance.
  3. You’ve already invested the time into evaluating some of the CircleCI side for testing, and are probably better informed on the tradeoffs/setup than I am at this point.

Just butting in: this is clearly something for the dev team to decide.
As a somewhat ignorant question, could some of the CI work be offloaded to the firmware server by setting up Jenkins? Jenkins is free, and the firmware server is already a sunk cost.
I think this topic should also be linked in to Lucas’ build target question: reducing the build targets, or offloading less commonly used targets to a lower priority service, might help alleviate congestion.

@james_pattison I admit that I’m a bit leery of the extra overhead of self-hosting and maintaining it, but if someone is volunteering to do all the maintenance on it then I’m fine with it. I suspect that even with Jenkins we would end up needing a couple of extra VDSs to actually perform all the builds, which would still be a rise in hardware cost. (It could easily come out as a cheaper cost rise though.)

I am very reluctant to do that. I just don’t trust the config of the CI servers to stay consistent enough, plus our build server is also our firmware download server.

We would need far too powerful a server to run our CI for all contributor branches on our own build server, plus it would be a massive security risk. We’d be running unreviewed code on our servers. The server admin costs of properly constraining that would be huge.
I also don’t like releasing firmwares from the CI servers, as I don’t think it will give us the reliable and consistent releases we need.

As a follow up I have gotten confirmation that as a not for profit we do qualify for the 25% discount in addition to the 2 free boxes we get for being an open source project. I’ve updated the proposal to take the lower total into account.
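
For anyone checking the numbers, the updated total follows from applying the 25% discount to the original $996 figure. A minimal sketch, with a function name chosen just for this example:

```python
def discounted_total(list_price_usd, discount):
    """Annual cost after applying a fractional discount (0.25 = 25%)."""
    return list_price_usd * (1.0 - discount)


# $996 was the original annual figure for the 2 paid boxes;
# 0.25 is the not-for-profit discount.
total = discounted_total(996, 0.25)
print(f"${total:.0f} per year")
```

That gives the $747 per year now shown in the proposal.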

Just as a note, if we wanted to spend ~3,000 USD a year we could shift all CI load to Semaphore and get CI times under 10 minutes. The only one that really is a question for that would be sitltest, but the others all look like they finish within that time on Semaphore. I’m not necessarily pushing this, but I thought it was worth noting.

@OXINARF has indicated he thinks current Travis times can come down to the 20 minute mark (without spending anything). (I’m not sure if this is for the full build or per task.)

The funding committee has consulted with several dev team members and has decided to approve this proposal.

@WickedShell is responsible for setting up the required procedures and will forward an invoice to the funding committee, to be processed and paid, up to a total of 1000 USD/year.

Thanks to all who commented here and were kind enough to put up with the questions from the funding committee, and to Michael for putting forward this initiative.