From the beginning, we knew that a complicated, distributed system like MetaRouter would need serious monitoring. Sure, you can usually see when a message isn't delivered to its destination, but with a microservice architecture processing millions of messages a day (and scaling fast), it's critical to have insight into the guts of the system.
We decided to adopt Google's structured Site Reliability Engineering (SRE) approach, which applies software engineering principles to infrastructure and operations. In essence, it automates tedious, manual tasks, removing the risk of human error. Of course, that means the system has to comprehensively cover important metrics, and be rock solid when it comes to reliability.
Building a scalable, reliable system
As indicated by Tesla’s reports on self-driving cars, we believe machines can be far more reliable than humans. Since reliability is our ultimate goal as we grow, we jumped into the SRE process headfirst.
1. Identify key metrics
The first step was to identify key metrics. Some of them were obvious:
- How many messages do we process per second?
- How many messages could we feasibly expect to handle over a certain time period?
- What percentage of messages were processed successfully over a certain time period, vs. those that caused errors?
- How long does it take for these messages to travel through the system?
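To make these concrete: the questions above reduce to a few simple aggregates over per-message records. Here's a minimal sketch, assuming hypothetical records of the form (timestamp, delivered successfully?, end-to-end latency); the field names and sample values are illustrative, not our actual schema.

```python
# Hypothetical per-message records captured by the pipeline:
# (epoch seconds, delivered successfully?, end-to-end latency in ms)
messages = [
    (0.0, True, 42.0),
    (0.5, True, 55.0),
    (1.0, False, 310.0),
    (1.5, True, 48.0),
]

window_s = 2.0  # length of the observation window in seconds

throughput = len(messages) / window_s                      # messages per second
success_rate = sum(ok for _, ok, _ in messages) / len(messages)
worst_latency = max(lat for _, _, lat in messages)         # slowest transit time

print(f"{throughput:.1f} msg/s, {success_rate:.0%} success, worst {worst_latency:.0f} ms")
```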
On top of those key metrics, however, it was important to keep an eye on the message queues that we leverage to fan out our messages. So we included metrics like: “How long have these messages sat in the queue?” and “Is there an abnormal buildup of messages in the queue?”
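Those two queue questions translate directly into threshold checks. A minimal sketch, assuming a hypothetical snapshot of enqueue timestamps for messages still waiting; the thresholds are placeholders, not our production values.

```python
import time

# Hypothetical queue snapshot: enqueue timestamps of messages still waiting.
now = time.time()
pending_enqueued_at = [now - 2, now - 35, now - 3]

MAX_AGE_S = 30        # alert if any message has waited longer than this
MAX_BACKLOG = 10_000  # alert on an abnormal buildup of messages

oldest_age = now - min(pending_enqueued_at)  # how long has the oldest message sat?
alerts = []
if oldest_age > MAX_AGE_S:
    alerts.append(f"message stuck in queue for {oldest_age:.0f}s")
if len(pending_enqueued_at) > MAX_BACKLOG:
    alerts.append(f"backlog of {len(pending_enqueued_at)} messages")
print(alerts)
```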
When identifying metrics, it's important to get it right the first time so that nothing is missed. Thinking critically through every metric, including those you'll need at scale, sets you up to get the next part right:
2. Gather, Display and Alert on Key Metrics
To gather, display, and alert on these metrics, we implemented a few different systems to build exactly the automation we needed. Currently, we use Google Cloud Platform's robust monitoring solution for insight into the message queue. In addition, we run a locally hosted InfluxDB stack for internal processing metrics and use OpsGenie as our alert paging system. The InfluxDB stack's Chronograf UI gives us access to these metrics in a usable, queryable form, and its Kapacitor service, which also serves as our downsampling engine, alerts our team when performance falls out of tolerance.
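Conceptually, the downsampling step just collapses raw samples into one aggregate per time bucket so dashboards and alerts query far fewer points. A minimal Python sketch of the idea (in practice this runs inside Kapacitor, not application code; the 5-minute bucket and sample values are illustrative):

```python
from collections import defaultdict

BUCKET_S = 300  # 5-minute buckets, a typical downsampling window

# Hypothetical raw samples: (epoch seconds, messages processed in that second)
raw = [(0, 120), (60, 95), (250, 140), (330, 110), (400, 105)]

# Group each raw point into the bucket it falls in.
buckets = defaultdict(list)
for ts, value in raw:
    buckets[ts // BUCKET_S * BUCKET_S].append(value)

# Keep one mean value per bucket instead of every raw point.
downsampled = {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}
print(downsampled)
```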
3. Tackle the Challenges
With so many moving parts, both in our system’s architecture and when it comes to our metrics, there have been a multitude of challenges.
For one thing, processing millions of data points can be cumbersome, even with processes in place to downsample our data into a more usable format. When you're working with that much data, it becomes a game of “cat and mouse”: trying to optimize the amount of resources you give the monitoring services without starving them and corrupting your data.
Of course, erring on the side of generous resources can easily lead to over-allocating and spending more money than necessary. At the same time, a reliable system generates revenue, so it's a balancing act. For now, we have a dedicated Site Reliability Engineer optimizing this system, but there is a push toward moving to a cloud-hosted solution in the near future to reduce internal toil.
Another challenge we've recently faced is determining what level of alerting and insight we should provide for customers. While overall system health is the top priority, we often have instances where a customer's messages may stop being ingested without their knowledge. When there is so much data flowing through the system, a handful of errors can get lost in the noise and that's a big challenge we're trying to solve to give our customers the best user experience possible—whether they are sending us millions of messages a month or simply trying out the product.
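One way to surface a silent ingestion stall is a per-customer heartbeat check: if a customer that normally sends traffic goes quiet for too long, flag it. A minimal sketch, assuming hypothetical customer names and a placeholder one-hour threshold:

```python
import time

# Hypothetical last-seen ingestion timestamps per customer.
now = time.time()
last_event_at = {
    "acme": now - 45,      # healthy: seen under a minute ago
    "globex": now - 7200,  # silent for two hours
}

STALL_AFTER_S = 3600  # flag customers that have been silent for over an hour

stalled = [name for name, ts in last_event_at.items() if now - ts > STALL_AFTER_S]
print(stalled)
```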
By allocating resources to the bigger datasets (generated by our bigger customers), we can catch overall system failures most effectively. Due to the nature of our platform, errors for larger customers are inherently a bigger problem, percentage-wise, than errors for smaller customers. Our solution was to implement alerts on a variety of larger customers with multiple sources, as this gives us better insight into whether a certain integration or source has become problematic.
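The intuition above can be sketched as a volume-gated error-rate check: only alert where the sample is large enough to be statistically meaningful. The customer names, counts, and thresholds below are hypothetical, not our actual alerting rules.

```python
# Hypothetical per-customer counts over the last hour.
customers = {
    "bigco": {"sent": 2_000_000, "errors": 40_000},  # 2% error rate, huge sample
    "trial": {"sent": 50, "errors": 3},              # 6%, but tiny sample
}

ERROR_RATE_THRESHOLD = 0.01  # alert above a 1% error rate
MIN_VOLUME = 10_000          # ...but only where the volume makes it meaningful

alerting = [
    name
    for name, c in customers.items()
    if c["sent"] >= MIN_VOLUME and c["errors"] / c["sent"] > ERROR_RATE_THRESHOLD
]
print(alerting)
```

Gating on volume like this is what lets errors from large, multi-source customers stand out; the next step down is making small-customer errors visible too.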
We've also systematically tried to solve certain weak points in the system evinced by our metrics and have reached a level where a handful of errors from smaller customers are visible enough for us to take action. From a data-loss perspective, working from big problems down to smaller ones has had the most positive impact on everyone.
SRE so far
Since implementing this monitoring system, we've been able to uncover more and more opportunities to improve how we process data, and that really goes to show the power of visibility.
So far, we’ve found that making sure to account for every possible alert or variable can make SRE a tedious process to set up. But ensuring high reliability in a truly sustainable way is absolutely worth the front-end investment.
With the visibility these metrics give us, we can not only see errors in real time, but also find them easily in historical data. This allows us to proactively monitor and improve our system before issues grow into more complex problems. Seeing and reacting to outages right away has minimized downtime considerably, while access to historical data lets our SRE team spot patterns and gain the perspective on issues that only hindsight provides.
With flexibility as a core philosophy of our engineering team, and innovation as a regular discipline, we'll continue to assess our SRE processes. We expect the increased visibility that has already exposed new opportunities will keep driving change and growth. SRE so far has been so good ... and we imagine it'll only get better!