Building Infrastructure

In this post I’m going to focus on new projects that need to deal with infrastructure. I’m sure I will speak to fixing infrastructure for existing projects at some point in the future.

Infrastructure is generally not a very complicated problem domain, but a few things make it complicated. I generally find it to be a combination of:

  1. Willingness to accrue technical debt
  2. Unwillingness to deeply invest in building robust infrastructure early
  3. Scale

I have a lot of experience dealing with infrastructure at scale. In three years at Uber I watched us grow literally 10x. It was a tremendously hard problem to solve, but mostly because of 1 & 2. Scale is intrinsically difficult, but you’ll notice it is the last item on my list.

First I should define infrastructure. To me, infrastructure is everything from the power and cooling in the datacenter up to the point where software is serving a business need. This is a gigantic slice of pie, and dozens of disciplines are involved in this definition. Thus it is unfortunate when people try to defer their infrastructure work until they have an MVP. Such a vision of your product or business is not holistic. This is not to argue that you must speak to every component of your infrastructure up front with a long-term plan, but you should think more holistically.

I define infrastructure as starting with the datacenter. Most of my career has been spent managing bare metal deployments and as such I am biased towards running my own hardware. You should be able to ask the question “is it more economical for me to run my own hardware?” and not be saddled with the issue of whether or not such a migration is even possible. At the very least you should be able to migrate between cloud providers. Money and costs are always an important component of serving business needs, and infrastructure teams have a lot of power to be economical and save the business money in the long run.

To this end you must think about how you are building your infrastructure from the ground up. Generally this just means that when you are using your favorite cloud provider (because we all know that’s how you’re going to start out) you should strongly consider how you are using networking and OS components. These fundamental components are often the most difficult to migrate to a better pattern, depending on how they were first architected. For instance, if you deeply embrace cloud provider components (like AMIs), you are going to have a very large system to untangle down the line. It may be impossible to change your network architecture to be more compatible with a new provider without major disruption. This is not to say you shouldn’t use these cloud features, but if your entire deployment process involves building an AMI to deploy to EC2, you’re going to have a hard time migrating somewhere else.
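As a rough sketch of what keeping provider details at the edge might look like, here is a small Python example. Every name in it (Artifact, DeployTarget, Ec2AmiTarget, BareMetalTarget, release) is hypothetical and invented for illustration, and the targets only describe their steps instead of calling any real cloud API. The point is just that the pipeline depends on a narrow interface, so the AMI-specific part is one replaceable piece and not the whole deployment process.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A provider-neutral build output, e.g. a tarball or OS package."""
    name: str
    version: str


class DeployTarget(ABC):
    """Hypothetical interface: the only place provider details may live."""

    @abstractmethod
    def plan(self, artifact: Artifact) -> list[str]:
        """Return the ordered steps needed to ship the artifact."""


class Ec2AmiTarget(DeployTarget):
    """Provider-specific steps stay inside this one class."""

    def plan(self, artifact: Artifact) -> list[str]:
        return [
            f"bake AMI containing {artifact.name}-{artifact.version}",
            "launch EC2 instances from the new AMI",
        ]


class BareMetalTarget(DeployTarget):
    """Same artifact, different destination."""

    def plan(self, artifact: Artifact) -> list[str]:
        return [
            f"copy {artifact.name}-{artifact.version} to each host",
            "restart the service on each host",
        ]


def release(artifact: Artifact, target: DeployTarget) -> list[str]:
    """The pipeline depends on DeployTarget, never on a concrete provider."""
    return ["run tests", "build artifact", *target.plan(artifact)]
```

Calling release(Artifact("api", "1.2.0"), BareMetalTarget()) and release(Artifact("api", "1.2.0"), Ec2AmiTarget()) yields the same first two steps; only the provider-specific tail differs, which is the part you would rewrite in a migration.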

Willingness to accrue technical debt

I understand that bootstrapping a business from scratch is already difficult, and there are time-sensitive components that must be addressed at all costs. Since we are talking about costs, you should consider what it means to accrue technical debt.

First, nobody wants to fix technical debt. It means cleaning up after other people, it means having to be the bad guy, it means being faced with impossible decisions. It generally means sacrifice to some degree, and while we value being a team player, nobody wants to work selflessly all the time. This also means that engineers who do not step up as team players put more burden on others to address tech debt.

Second, it instills a culture in which bottom-line results are the only thing that matters. If people are not discouraged from accruing technical debt, it is likely that engineering values are such that cleaning up is not properly rewarded. There are many great reasons to clean up tech debt whenever possible, but without motivation or reward nobody will actually do it.

Third, technical debt slows down progress long term. I’ve heard so many managers talk about how important it is to automate (spending x hours to automate instead of spending y minutes every z occurrences is worth it!). These same managers tend to underestimate the amount of time that goes into working around tech debt. The same mentality toward automation should be applied to tech debt.
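The same back-of-the-envelope math managers use for automation can be written down for tech debt. The sketch below is only an illustration: the function name and the numbers (an 8 hour fix, 15 minutes lost per occurrence) are made up, not measurements from any real project.

```python
def break_even_occurrences(fix_hours: float, minutes_lost_per_occurrence: float) -> float:
    """How many times must the workaround be hit before the fix pays for itself?"""
    if minutes_lost_per_occurrence <= 0:
        raise ValueError("minutes_lost_per_occurrence must be positive")
    return fix_hours * 60 / minutes_lost_per_occurrence


# Illustrative numbers only: a fix costing 8 hours, versus a workaround
# that costs 15 minutes every time someone runs into the debt.
occurrences = break_even_occurrences(fix_hours=8, minutes_lost_per_occurrence=15)
print(occurrences)  # 32.0
```

If a team of several engineers hits the workaround a few times a week each, 32 occurrences goes by quickly, which is the time that tends to get underestimated.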

Unwillingness to deeply invest in building robust infrastructure early

This likely will be a major point of contention. I am not advocating for a pie in the sky approach here, but there is plenty of middle ground to meet at relative to some approaches I’ve witnessed. “It works” is generally the bar people rise to when building components before moving on.

I think there is a great fear that too many hours spent building infrastructure instead of attending to business needs is time wasted. Instead of thinking of these efforts as costs, they should be considered investments. The time your engineers spend making reliable CI that has reliable builds and allows for quick onboarding of future projects may not ever pay off, but when it does it pays dividends.

This philosophy should be carefully balanced against the mistake of over-engineering. Your CI system doesn’t need to be useful for every use case, but it definitely shouldn’t be built for only a single use case.

The goal of investing in infrastructure should be that the time spent is useful both to your business and to your engineers. Future engineers should be able to learn from and utilize these investments, even indirectly. A well constructed project may not be used to its full potential, but maybe it becomes a good example to build other projects from. Often these investments mean your engineers become more experienced and educated, which improves the overall quality of your engineering. Trying to quantify these intangible things up front is difficult if not impossible, but it’s fairly obvious when these intangibles become beneficial. Accepting a holistic approach toward engineering will generally result in these intangible benefits more often than a lean and aggressive approach. This is especially true for infrastructure.

Scale

I think most people think of scale as a super sexy problem to solve. In my experience, scale was often about finding new bottlenecks in frustrating ways. I fully maintain that humans are terrible at building for scale, and often the only way we succeed is trial and error.

By investing in modular, high quality infrastructure, you set your engineers up for long-term success by increasing the agility with which they can respond to scaling pains.

Everybody wants their business to find explosive success and growth, but this often isn’t going to be the case. In the event your wildest dreams do come true, you may be caught by surprise, but you shouldn’t be without a plan or foundation to work through it.

Conclusion

I understand that there are plenty of arguments against my approach that I haven’t addressed here. You don’t need to look very hard for people advocating nearly the complete opposite of what I’m saying: the product is everything and everything else is secondary. I’ll let those of that opinion speak for themselves. I generally try to take the most pragmatic approach I can to solving problems, which almost always means compromise and finding middle ground. To me this means holistically solving problems, as there are rarely shortcuts worth taking.