The world demands compromises. If you want to run a 100-mile ultramarathon, you’re going to have to make sacrifices to train for that. We simply have limited time and energy.

Ship that product

Software development often requires more compromises than I’d like to make, and I’d like to share a time when the wrong compromise led us down a dark path.

Infrastructure as a crutch

The scene: Our entire office is based on macOS, except for one lonely Windows server that provides integration with a vendor. Our entire development department is comfortable with macOS and can debug issues on this operating system.

I had developed some proofs of concept that showed the many advantages of containerizing services. Our development team was on board.

Containers are Linux. Our office is macOS. Can you feel the tension?

Compromise incoming

If our entire staff is proficient with macOS and has limited Linux experience, can we rely on the one person who knows Linux and hypervisors to keep the systems administered? Will that mean my development work actually slows down? Why doesn’t the rest of the team feel comfortable with Linux, and how can we make them comfortable? How long will that take? What if I get hit by a bus?

As a side note, before we dive into it, you should know that at this point we were already on Apple’s ARM chips.

Shot to the heart

“Can we stay on macOS as a hypervisor?” Yes, at a cost. There’s a neat piece of software called UTM which is a type 2 hypervisor that runs virtual machines on top of macOS. UTM is what we used to build the proof-of-concept.

macOS > UTM > Ubuntu server > Docker engine > Containerized services

One advantage would be that we could easily run it on premises without acquiring additional hardware. If it worked for the developers, it should work for production, right?

That assumption has led to after-hours calls, and even Christmas Day calls, about production outages.

The problem

UTM virtual machines were crashing. We initially had all Docker containers running on one large virtual machine. It was stable for only a few months before we started experiencing intermittent crashes, for no reason that we could diagnose. Nothing in the debug logs showed errors.

We split the services into their own virtual machines, in the hopes that we’d find it was one service running amok. No luck. That just turned into multiple machines with intermittent crashes.

The only pattern we could find was that I/O-heavy services were crashing most frequently. It seemed a few other users had similar issues with UTM. Some blamed UTM, some blamed Ubuntu.
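When the guest's own logs say nothing, one option is to watch the virtual machines from the outside and record exactly when each one stops answering, so the timestamps can be lined up against I/O load. What follows is a minimal sketch of that idea, not the tooling we actually ran. The guest names, IP addresses, and the choice of probing the SSH port are all assumptions made for illustration.

```python
import socket
import time
from datetime import datetime, timezone

# Hypothetical guests: names and addresses are placeholders, not our real VMs.
GUESTS = {
    "db-vm": ("192.168.64.10", 22),
    "queue-vm": ("192.168.64.11", 22),
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unroutable: treat all of these as "down".
        return False


def check_once(guests, probe=is_reachable):
    """Probe every guest once and return a {name: bool} mapping."""
    return {name: probe(host, port) for name, (host, port) in guests.items()}


def format_transition(name, up, now=None):
    """Build one log line with a UTC timestamp for a state change."""
    now = now or datetime.now(timezone.utc)
    state = "UP" if up else "DOWN"
    return f"{now.isoformat()} {name} {state}"


def watch(guests, interval=30.0):
    """Loop forever, printing a line only when a guest changes state."""
    last = {}
    while True:
        for name, up in check_once(guests).items():
            if last.get(name) != up:
                print(format_transition(name, up))
                last[name] = up
        time.sleep(interval)
```

Running `watch(GUESTS)` on the macOS host (or any machine that can reach the guests) prints a line each time a guest goes down or comes back. It won't explain why a VM crashed, but it gives you precise crash times to compare against disk and network activity.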

We were out of time

Services going down was unacceptable.

The solution

We decided to make the move to AWS and use their managed products. Amazon’s Elastic Container Service fit the bill for running our containerized services, and it addressed several concerns at once, including administration, reliability, and performance.

AWS architecture is a topic for another article, maybe a part 2. If you are looking for a quick solution to your problem: Have you tried throwing money at it?