Taming your microservices
Organizations that operate microservices or service-oriented architectures face a variety of technical and people-related problems. Over the past few years, companies have been migrating massive monolithic applications into smaller microservices (think services that do one thing, like archiving a PDF) or service-oriented architectures (e.g. a service per domain: payments, billing, auth).
There are undoubtedly many issues that come from monolithic architectures: it's easy to tightly couple domains and data, the deployment process is slower, and test runs are long, to name a few. Breaking up the monolith can seem like a panacea for these issues. An SOA can provide team-level ownership of domains and their systems, decoupled data, and independently deployable features and fixes. Unfortunately, an SOA comes with its own set of issues.
Monoliths have been around for a while, and a lot has been written about "taming the monolith". As we get to work building Cortex, we've been talking to engineers about their experiences dealing with microservice architectures. Here are some things we've learned about taming your microservices.
Why do my microservices need to be tamed?
A monolith, for all its warts, is simple to reason about. There are no network calls. There's no overhead to calling a function elsewhere in the codebase. If you're using a typed language, you can catch API changes at compile time. All the code is in one place – if you're so inclined, you can easily peek under the hood.
An SOA brings with it a suite of technical challenges, including operational complexity and performance hits. You now have to think about:
- Network latency
- API versioning - what if the payments team makes a breaking change to its API without notifying me?
- Network issues - what happens if I don't receive confirmation that the payment went through?
- Change management/releases - when an issue occurs, tracing it back to one service's deploy (and coordinating a rollback) is much harder
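The payment-confirmation question above is usually answered with idempotency keys: the client retries the same request with the same key, and the server deduplicates. A minimal sketch, assuming a toy in-memory service (the `PaymentService` class and its API are illustrative, not a real library):

```python
import uuid

class PaymentService:
    """Toy in-memory payment service that deduplicates by idempotency key."""

    def __init__(self):
        self._processed = {}  # idempotency_key -> original charge result

    def charge(self, idempotency_key, amount_cents):
        # If we've already seen this key, return the original result
        # instead of charging the customer a second time.
        if idempotency_key in self._processed:
            return self._processed[idempotency_key]
        result = {"status": "succeeded", "amount_cents": amount_cents}
        self._processed[idempotency_key] = result
        return result

# Client side: generate one key per logical payment, reuse it on retry.
service = PaymentService()
key = str(uuid.uuid4())
first = service.charge(key, 1999)
retry = service.charge(key, 1999)  # e.g. the confirmation was lost
assert first is retry  # safe: the customer was only charged once
```

Real payment providers expose the same idea over HTTP, typically via an idempotency header the client sets on each logical request.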
Yet, these technical challenges are the least of your worries.
PEBCAK - Problem exists between chair and keyboard
While it may sound strange to think that cultural and people problems are a significant challenge stemming from architectural choices, keep in mind that most organizations move to an SOA to scale their team, not their software.
Distributing ownership of domains, data, and release cadence across services and teams can increase the velocity of your engineering org.
Unfortunately, distributing ownership does just that - teams start diverging in their operational procedures, documentation practices, and more. As an organization scales, this can wreak havoc on productivity, and ultimately bring your velocity back to square one.
People, process, and cultural issues around service-oriented architectures have far-reaching consequences:
- Ramping up engineers becomes costly - each team maintains their own processes. An engineer moving between teams may have to learn as much as a brand new hire.
- Operational triage takes longer - tribal knowledge dissipates as an organization grows, making it harder to answer questions such as: what services exist, what they do, and who owns them
- Communication overhead between teams - oversharing information is noisy, undersharing (e.g. when making API changes) can, in the worst case, cause outages or functionality issues and directly impact the bottom line
Looking to scale your team? It's time to start thinking about these challenges and how you can counter them.
Service-oriented teams
There's no silver bullet for these problems. No combination of tools or processes can fully protect you. However, prevention really is better than cure.
Think of it like a flu shot - you know it's flu season, getting the flu sucks, and even though you may still get the flu, you'll never regret setting yourself up for success.
Spec first design
An excellent habit to promote is thinking about APIs and data models before jumping into the implementation.
It's important to ensure your SOA doesn't turn into a distributed monolith.
Signs of a distributed monolith:
- No separation of concerns
- Shared databases and data models
- Releasing one service requires coordination of deploys across multiple other services
- Consuming a service's APIs requires knowledge about its data models, side effects, etc
One way to prevent this is to follow the Amazon model - APIs are the only way teams communicate. This means that a service owner should:
- Think about their API request/responses before implementation
- Share the API for review so that use cases for dependents are well covered
- Communicate API changes ahead of time
- Version their APIs to maintain backwards compatibility
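Versioning to maintain backwards compatibility can be as simple as keeping the old response shape alive by adapting the new one. A sketch under assumed names (the invoice handlers and fields below are illustrative, not from any particular API):

```python
def get_invoice_v2(invoice_id):
    """Current representation: amount split out into currency and cents."""
    return {
        "id": invoice_id,
        "amount": {"currency": "USD", "cents": 1999},
        "status": "paid",
    }

def get_invoice_v1(invoice_id):
    """Legacy shape: adapt the v2 payload so existing consumers keep working."""
    v2 = get_invoice_v2(invoice_id)
    return {
        "id": v2["id"],
        "amount_cents": v2["amount"]["cents"],  # v1 implicitly assumed USD
        "status": v2["status"],
    }

# Both versions can be served side by side, e.g. under /v1/ and /v2/ routes,
# giving consumers time to migrate before v1 is retired.
```

The key design choice is that v1 is derived from v2, so there is one source of truth and the legacy shape can't silently drift.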
This workflow provides additional benefits when used in conjunction with tooling such as OpenAPI or gRPC. If schemas and APIs are designed ahead of time, you can:
- Generate client code across different languages
- Catch breaking changes to your API at build time
- Use your code review process to loop in the right stakeholders before a change is released
- Run contract tests to ensure you don't drift out of compliance with your predetermined schema
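To make contract testing concrete, here is a deliberately tiny, hand-rolled check that a response matches a declared schema. Real projects would generate the schema from an OpenAPI spec and use proper validation tooling; this stdlib-only sketch just shows the idea:

```python
# The agreed contract: field names and their expected Python types.
INVOICE_SCHEMA = {
    "id": str,
    "amount_cents": int,
    "status": str,
}

def matches_contract(response, schema):
    """True if the response has exactly the agreed fields with the agreed types."""
    if set(response) != set(schema):
        return False
    return all(isinstance(response[field], t) for field, t in schema.items())

# A conforming response passes...
assert matches_contract(
    {"id": "inv_1", "amount_cents": 1999, "status": "paid"}, INVOICE_SCHEMA
)
# ...while a silently renamed field (a breaking change) fails at test time,
# not in a consumer's production environment.
assert not matches_contract(
    {"id": "inv_1", "amount": 1999, "status": "paid"}, INVOICE_SCHEMA
)
```

Run against a staging deployment in CI, even a check this simple turns "the payments team changed their API" from an outage into a failed build.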
Atlassian is an example of a company successfully using spec-first API design to improve its development lifecycle. I found https://www.atlassian.com/blog/technology/spec-first-api-development to be an excellent overview of using OpenAPI as the tooling to enable spec-first development.
The breaking point for a monolith is when multiple teams start depending on the same data models across domains. This slows down the development cadence across the org and the resulting spaghetti data models make it exponentially more difficult to reason about the code base.
It's easy to go down the same route in an SOA as well. There are some habits that help prevent this nightmare:
- Separate data stores for each service
- Thinking carefully about data models and links before implementing APIs
- Reducing dependencies at the data level between services - for example, store reference IDs to resources in a different service, and fetch data through their API
- Ensuring stakeholders (business, developers, service consumers) don't build tooling directly on your data. This locks in your data model and prevents you from iterating as a service owner. Instead, expose data (with its own contract) through APIs or analytical data streams.
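The "store reference IDs, fetch through the API" habit might look like the following sketch, where a billing service stores only a `user_id` and resolves user details through the users service's client (all names here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Invoice:
    # The billing service stores only a reference to the user,
    # never a copy of the users service's data model.
    id: str
    user_id: str
    amount_cents: int

class UsersClient:
    """Stand-in for the users service's API client."""

    def get_user(self, user_id):
        # In reality this would be an HTTP/gRPC call to the users service.
        return {"id": user_id, "email": "ada@example.com"}

def invoice_summary(invoice, users):
    # Join in user data at read time, through the API, not the database.
    user = users.get_user(invoice.user_id)
    return f"{invoice.id}: {invoice.amount_cents} cents for {user['email']}"

print(invoice_summary(Invoice("inv_1", "user_42", 1999), UsersClient()))
```

The trade-off is an extra network call at read time, but in exchange the users team can change their storage and data model freely without breaking billing.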
Standardization of platform/process
There's one obvious solution to "each team doing things their own way" - make it easy for teams to do things in a standardized way.
Atlassian, for example, has built an internal PaaS that is essentially a thin wrapper around AWS. Developers are provisioned their desired resources, and the platform automatically augments them with standard monitoring, logging, and more.
Well-designed tooling that provides a standard way of doing things speeds up developer velocity and ensures teams work in consistent ways.
This even extends to documentation. Spotify and Atlassian have built internal tooling that standardizes information discovery around their microservices. System-Z at Spotify and Microscope at Atlassian provide a source of truth for details about each service, such as system ownership, links to documentation/runbooks, and recent deploys. What makes them even more powerful is that they are the entrypoint into further information about services – on-call rotations, links to monitoring dashboards, logging, etc.
This kind of standardization of information improves engineering ramp up time and operational efficiency - all services are operated in the same way and developers have a single source to look for critical information about their services.
Cortex - the entrypoint into your service architecture
Many organizations end up building an internal service registry for humans after the problem has gotten out of hand – too many services, growing teams and turnover, remote developers, etc.
Cortex provides a standard, opinionated way to organize information about all of your services and enables functionality such as:
- A dashboard to answer questions such as "what services exist?", "who owns this service?", "where are the runbooks for this service?"
- An audit report that shows you the health of your services: does every service have an owner, an on-call rotation, etc.
- Integrations with third-party tools - for example, when PagerDuty triggers an alert, Cortex can send a Slack message with information about the service such as ownership, latest commits from GitHub, latest deploys, and more.
Reach out to us on our home page to request a demo. We'd love to show you what we've built and help prevent some of the headaches that come with a service-oriented architecture.