On making our Kubernetes cluster observable

We are a high-growth startup serving millions of users across India, and that number keeps growing. To keep up, we decided to re-architect our platform and make it microservices-driven. The advantages we were seeking from this decision were:

  1. Independent scaling of services.
  2. Freedom to leverage different technology stacks.
  3. Smaller, distributed teams/pods working on individual services.

The primary requirement for this design was high reliability, availability, security, and scalability, and Kubernetes is the obvious choice for that. Kubernetes is the battle-tested gold standard for container orchestration: it removes many obstacles, such as distributing and managing workloads across machines and achieving fault tolerance.

Kubernetes - a double-edged sword

Kubernetes is great at managing containers. It simplifies the Infra team's job by adding layers of abstraction and dynamic behavior. That abstraction can be a double-edged sword, though: it adds complexity, which translates into more components and moving parts.

This essential abstraction can produce a variety of failure modes that are challenging to identify, troubleshoot, and prevent, and the ephemeral, transient nature of containers only amplifies the problem.

The inherent nature of distributed systems makes it difficult to understand the internal state of an application. That's why service owners and the Infra team depend on telemetry data. Combined, the three pillars of observability (logs, metrics, and traces) provide ample insight into how the code behaves at runtime. This makes Kubernetes monitoring and observability even more critical and a top priority for Infra teams.

Requirements

To improve the observability and monitoring of our Kubernetes cluster, we were looking for solutions that could help us:

  • Retrieve near real-time application logs.
  • Aggregate logs in a system that supports live tailing and searching.
  • Retrieve infrastructure metrics for the Infra team and service owners.
  • Send real-time pod crash alerts.
  • Recommend optimal CPU and memory for services.
  • Capture Kubernetes events.
  • Collect and analyze cost trends.

Micro-solutions that helped us achieve observability

The EFK stack

EFK stands for Elasticsearch, Fluentd, and Kibana. Elasticsearch is an open-source analytics and full-text search engine. Fluentd is a log collection tool that can gather logs from almost any source and export them to any data store. Kibana is a frontend application that provides search and data visualization capabilities for the data indexed in Elasticsearch.

For this article, let's focus on Fluentd, as it is the component that retrieves logs from Kubernetes pods. Services hosted on Kubernetes write log records to standard output (stdout); the Fluentd agent collects those logs, processes them, and stores them in Elasticsearch. It takes care of every stage of the log-processing pipeline: input, parsing, filtering, buffering, and output. Fluentd is easy to set up on Kubernetes, where it runs as a DaemonSet.

The most useful feature of Fluentd is its large ecosystem of plugins, which let it integrate with practically any log source and destination. This matters to us because our applications (the sources) are written in multiple languages and frameworks, and we ship logs to more than one destination, such as Elasticsearch and AWS S3.

Fluent Bit is essentially a lightweight version of Fluentd, designed with performance in mind: high throughput with low CPU and memory usage. It also solves the problem of running a full Fluentd agent on every cluster node.

credit: fluentbit.io

At Unacademy, we use a forward-and-aggregate pattern for logs: lightweight Fluent Bit collects logs from the applications and forwards them to Fluentd. This gave us the same performance with lower CPU and memory usage. Fluentd (the aggregator) then processes, manipulates, and enriches the raw logs into a usable form before storing them in Elasticsearch, which also lets us scale the aggregators based on traffic patterns. We retain logs in Elasticsearch for 7 days, since we need them to debug issues, and searching data in Elasticsearch is fast and easy. Finally, logs are archived in AWS S3.
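As a rough illustration of the forward protocol that ties the pipeline together (not our production setup), here is a minimal Go sketch that sends a test record to an aggregator's forward input using the fluent-logger-golang client. The host name, tag, and field names are hypothetical; the default forward port 24224 is assumed.

```go
package main

import (
	"log"

	"github.com/fluent/fluent-logger-golang/fluent"
)

func main() {
	// Connect to the aggregator's forward input. "fluentd-aggregator" and
	// port 24224 are assumptions; use whatever your deployment exposes.
	logger, err := fluent.New(fluent.Config{
		FluentHost: "fluentd-aggregator",
		FluentPort: 24224,
	})
	if err != nil {
		log.Fatalf("failed to connect to fluentd: %v", err)
	}
	defer logger.Close()

	// Post a structured test record under a hypothetical tag; the aggregator
	// can route it to Elasticsearch (and S3) based on this tag.
	record := map[string]string{
		"service": "checkout",
		"level":   "info",
		"message": "pipeline smoke test",
	}
	if err := logger.Post("app.checkout", record); err != nil {
		log.Fatalf("failed to post record: %v", err)
	}
}
```

In normal operation the applications simply write to stdout and Fluent Bit does the forwarding; a direct client like this is just a handy way to sanity-check the aggregation tier.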

We will soon publish a separate article on our EFK logging architecture and how we operate it at scale.

credit: https://www.elastic.co

Prometheus

Prometheus is an open-source tool for event monitoring and alerting with a built-in rich query language. Using an HTTP pull model, it records real-time metrics in a time-series database.

Prometheus scrapes metrics by reading a text-based exposition format from an HTTP endpoint. To bring third-party data into its data store, it relies on exporters, and several ready-to-use exporters are maintained as part of the official Prometheus GitHub organization.
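For a sense of what such an endpoint looks like, here is a minimal Go sketch that exposes a custom counter with the official client_golang library; the metric name, handler, and port are hypothetical examples rather than anything from our stack.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// A hypothetical custom metric; Prometheus scrapes it from /metrics.
var ordersProcessed = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "orders_processed_total",
	Help: "Total number of orders processed.",
})

func main() {
	prometheus.MustRegister(ordersProcessed)

	http.HandleFunc("/order", func(w http.ResponseWriter, r *http.Request) {
		// ... handle the order ...
		ordersProcessed.Inc()
		w.WriteHeader(http.StatusOK)
	})

	// Prometheus pulls the text-based metrics from this endpoint.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```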

Prometheus is notorious for being hard to scale: one must be conscious of the number of metrics processed and always keep an eye on resource utilization. It also lacks built-in authentication/authorization and dashboards. These drawbacks are not deal breakers, though, and the missing dashboards and access control are easily covered by Grafana.

We saw room for cost savings, so we replaced our APM with Prometheus for monitoring infrastructure metrics, not just for Kubernetes but for AWS EC2 as well. We capture 400+ metrics from each EC2 instance at 15-second granularity. We retain metrics for only 45 days, which keeps the storage layer from filling up, and for high availability we run Prometheus in HA mode.

Grafana

Grafana is an open-source visualization tool. It queries data sources such as Prometheus, InfluxDB, Graphite, and SQL databases, and displays the results on Grafana dashboards. According to the Grafana documentation, the tool “allows you to query, visualize, alert on, and understand your metrics no matter where they are stored.”

Grafana also features a built-in alerting system, along with filtering, authentication and authorization, annotations, data-source-specific querying, and more.

Grafana stands out because it connects with a long list of databases. We have Grafana set up to query multiple data sources such as Prometheus, Elasticsearch, MySQL, CloudWatch, etc.

Apart from this, we have custom metrics, which we monitor using the Prometheus-Grafana stack because it is cheaper than AWS CloudWatch. For alerting on these metrics and logs, Grafana is integrated with Slack and PagerDuty. We have built extensive monitoring dashboards so service owners can check all kinds of logs and metrics in one place, reusing many of the ready-made Kubernetes monitoring dashboards.

Kubernetes events watcher

Kubernetes events are one of the most informative data sources, yet they are often overlooked. Kubernetes stores these events for only an hour, and without preprocessing they are difficult to consume, search, and debug.

Kibana lets us filter, search, aggregate, visualize, and debug these events. This surfaces potential scheduling or capacity problems to the Infra team, makes pod scaling behavior more transparent, and helps us plan optimizations of workload distribution.

At Unacademy, we integrated Sentry with Kubernetes to alert us about any error or warning events through Slack and PagerDuty integrations. For introspection and behavioral analysis of our workloads, we store all events in the EFK stack. This empowers our Infra team to debug workloads, plan optimizations, and build visualizations in Kibana.
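A simplified sketch of the idea behind such a watcher (not our actual integration): stream events with client-go and pick out the Warning ones. In-cluster credentials are assumed; forwarding to Sentry or the EFK stack is left out.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the watcher runs in-cluster with a ServiceAccount allowed to watch events.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Stream events from all namespaces.
	watcher, err := clientset.CoreV1().Events(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for we := range watcher.ResultChan() {
		event, ok := we.Object.(*corev1.Event)
		if !ok {
			continue
		}
		// Only surface Warning events; a real watcher would forward these to
		// Sentry / Slack / the EFK stack instead of just logging them.
		if event.Type == corev1.EventTypeWarning {
			log.Printf("[%s] %s/%s (%s): %s",
				event.Reason, event.Namespace, event.InvolvedObject.Name, event.InvolvedObject.Kind, event.Message)
		}
	}
}
```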

credit: https://engineering.opsgenie.com

Kubecost

As per the official documentation, "Kubecost provides real-time cost visibility and insights for teams using Kubernetes, helping you continuously reduce your cloud costs." It analyzes cost data by aggregating over any Kubernetes concept, such as deployment, service, namespace, or label, and lets users view costs across multiple clusters on a single dashboard.

Kubecost queries the Prometheus server for cost-related metrics. With real-time notifications, the Infra team can catch cost overruns and infrastructure outage risks before they become a problem, and for alerting it integrates with tools like PagerDuty and Slack.

We use Kubecost to attribute costs to individual services, which translates into better budgeting. Since we operate a multi-tenant Kubernetes cluster where teams share workloads, we appreciate how Kubecost lets the Infra team track the costs incurred by each tenant.

Apart from cost-optimization recommendations, we use Kubecost to right-size the cluster: based on Kubernetes-native metrics, it recommends appropriate CPU and memory requests for workloads, which enables better resource utilization.

credit: https://medium.com/kubecost

Loki

Loki is an open-source, multi-tenant log aggregation system. It provides a query language called LogQL, inspired by Prometheus' PromQL, for querying logs, and Loki itself can be thought of as a distributed grep that aggregates log sources.

While Kibana and Elasticsearch can be used for advanced data analysis and visualizations, the Loki-based logging stack focuses on being lightweight and easy to operate.

By storing compressed, unstructured logs and indexing only metadata rather than the full log content, Loki is simpler to operate and cheaper to run. Logs are stored directly in cloud storage such as AWS S3, with no need to keep files on local disk.

The trade-off is that queries can be less performant than with everything indexed and loaded in memory. To keep this manageable, we retain logs in Loki for only 24 hours; for older logs and advanced data analysis, we already have Elasticsearch.

The reason for adding Loki to our stack when we already had the EFK stack was service owners' convenience. Loki supports near real-time live tailing of logs filtered with LogQL. To investigate the cause and context of a log line of interest, Loki shows the log messages immediately before and after it, which is super useful during a service outage. Loki also integrates natively with Grafana, so service owners can find logs and metrics on the same page.
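For a flavor of LogQL, a filter such as {app="checkout"} |= "error" greps a labeled stream for matching lines (the app label is a hypothetical example). The Go sketch below runs that query against Loki's query_range HTTP API, assuming a Loki endpoint reachable at the default port; in practice service owners run the same queries from Grafana.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"time"
)

func main() {
	// Hypothetical in-cluster Loki address; 3100 is Loki's default HTTP port.
	base := "http://loki:3100/loki/api/v1/query_range"
	query := `{app="checkout"} |= "error"` // LogQL: stream selector + line filter

	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprint(time.Now().Add(-15*time.Minute).UnixNano()))
	params.Set("end", fmt.Sprint(time.Now().UnixNano()))

	resp, err := http.Get(base + "?" + params.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(string(body)) // JSON response containing the matching log streams
}
```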

Unexec

kubectl exec is an invaluable built-in Kubernetes command that allows one to execute commands remotely inside a running container of a pod.

For kubectl to work on EKS, authentication relies on IAM credentials, which makes access management cumbersome. Audit logs capture only the initial exec command, generally bash, which puts security and compliance at risk.

Containers are dynamic by design: they can be rescheduled, evicted, or scaled at any time, which puts long-running scripts at constant risk of being killed. Also, executing commands in a container serving live traffic is not a good idea, as altering files there may degrade the customer experience.

At Unacademy, to insist on the highest standards without compromising developer productivity, we designed and developed an in-house tool, Unexec, to solve this problem.

Unexec empowers service owners to execute shell commands directly in a Kubernetes container while auditing every command executed. It achieves this by scheduling an isolated pod on on-demand EC2 instances and annotating it so the Cluster Autoscaler cannot evict it. Developers can then log into the container at any time and start long-running scripts, and Unexec ensures they keep running even if the remote connection breaks or the developer logs out. The logs are stored for easy debugging and to confirm that each script ran successfully.
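As a rough sketch of the scheduling part only (not Unexec's actual code), an isolated pod like this can be created with client-go, using the standard Cluster Autoscaler safe-to-evict annotation. The namespace, node-selector label, and image below are hypothetical.

```go
package main

import (
	"context"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "unexec-debug-shell",
			Namespace: "tools", // hypothetical namespace
			Annotations: map[string]string{
				// Tell the Cluster Autoscaler not to evict this pod.
				"cluster-autoscaler.kubernetes.io/safe-to-evict": "false",
			},
		},
		Spec: corev1.PodSpec{
			// Hypothetical label that pins the pod to on-demand capacity.
			NodeSelector: map[string]string{"lifecycle": "on-demand"},
			Containers: []corev1.Container{{
				Name:    "shell",
				Image:   "my-service:latest", // hypothetical image with the service's tooling
				Command: []string{"sleep", "infinity"},
			}},
		},
	}

	if _, err := clientset.CoreV1().Pods(pod.Namespace).Create(context.Background(), pod, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
	log.Println("isolated debug pod scheduled")
}
```

The real tool also wires in the exec session handling, command auditing, and log capture on top of this.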

K8S crash monitor

K8S crash monitor is an in-house tool that monitors pod failures, sends alerts, and captures crash logs.

Our goal was to closely monitor pod failures and retrieve the crash logs while keeping the entire workflow simple and intuitive. Several open-source tools alert when a pod's status changes, but more often than not they are very noisy.

In Kubernetes, when a pod crashes and is rescheduled onto a different node, its logs are effectively lost. Recovering them is complex and time-consuming, as it requires SSHing into the original node and digging through the logs stored in /var/log.

At Unacademy, to combat this problem, we built our own K8S crash monitor. The service is written in Golang, which keeps it scalable, efficient, and fast. It uses the Kubernetes API to watch for container failures and automatically ships the last 10 MB of crash logs to Elasticsearch for debugging. It is integrated with Slack to notify us of failures, sending messages with details like the last container state and the reason for the pod failure, along with a direct link to the crash logs stored in Elasticsearch. A detailed Kibana dashboard keeps track of which cluster or service is experiencing a high failure rate and visualizes the trends.
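A simplified sketch of the core idea (not the actual in-house code): watch pod updates with client-go, detect a terminated last state, and pull the previous container's logs capped at 10 MB. Shipping to Elasticsearch and Slack is omitted, and a production version would also deduplicate by restart count.

```go
package main

import (
	"context"
	"io"
	"log"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

const logLimitBytes = int64(10 * 1024 * 1024) // keep roughly the last 10 MB of crash logs

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	watcher, err := clientset.CoreV1().Pods(metav1.NamespaceAll).Watch(context.Background(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	for ev := range watcher.ResultChan() {
		pod, ok := ev.Object.(*corev1.Pod)
		if !ok {
			continue
		}
		for _, cs := range pod.Status.ContainerStatuses {
			term := cs.LastTerminationState.Terminated
			if term == nil || term.ExitCode == 0 {
				continue // only care about crashed containers
			}
			limit := logLimitBytes
			req := clientset.CoreV1().Pods(pod.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{
				Container:  cs.Name,
				Previous:   true, // logs of the crashed (previous) container instance
				LimitBytes: &limit,
			})
			stream, err := req.Stream(context.Background())
			if err != nil {
				log.Printf("failed to fetch logs for %s/%s: %v", pod.Namespace, pod.Name, err)
				continue
			}
			data, _ := io.ReadAll(stream)
			stream.Close()
			// The real tool indexes these bytes in Elasticsearch and sends a Slack
			// alert with the failure reason and a Kibana link.
			log.Printf("%s/%s (%s) crashed: %s, captured %d bytes of logs",
				pod.Namespace, pod.Name, cs.Name, term.Reason, len(data))
		}
	}
}
```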

This setup increased transparency into Kubernetes and ensured pod failures don't go unnoticed. The direct impact was higher availability, reduced MTTR, and improved developer productivity.

Summary

We saw how complicated it can be to understand the internal state of applications running on Kubernetes, and how we leveraged open-source tools to improve observability. For a few requirements we could not find an ideal open-source tool, so we developed our own in-house auditing, monitoring, and alerting tools.

With this setup, service owners' confidence grew by leaps and bounds. They are now far more comfortable running their services on Kubernetes with all these monitoring tools at their disposal.

This translated into faster and better production incident debugging, a decrease in MTTR, and, in general, higher availability. There were positive outcomes for the Infra team as well: they now have a better understanding of what is happening inside the cluster, along with cost and efficiency metrics to keep spending in check and improve resource utilization.

This essay was written primarily by Rahil Sheth, with substantial contributions from Shivam Gupta.

Rahil Sheth

Senior Software Engineer, SRE | Making platform and infrastructure reliable, performant, and efficient.