Version: 1.29.6

Kubernetes Fury Monitoring

Overview

This module is designed to give you full control and visibility over your cluster operations. Metrics from the cluster and the applications are collected, and clean analytics are offered through Grafana, the visualization platform.

The centerpiece of this module is the prometheus-operator, which provides easy, declarative deployment of the following components:

  • Prometheus: An open-source monitoring and alerting toolkit for cloud-native applications
  • Alertmanager: Manages alerts sent by the Prometheus server and routes them to receiver integrations such as email, Slack, or PagerDuty
  • ServiceMonitor: Declaratively specifies how groups of services should be monitored; the operator automatically generates the Prometheus scrape configuration from the definition (see the example below)
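
For illustration, a minimal ServiceMonitor might look like the following sketch. The name, namespace selector, app label, and port name are hypothetical, and the Prometheus instance must be configured to select this ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name, for illustration only
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app         # scrape Services carrying this (hypothetical) label
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics            # name of the Service port exposing /metrics
      interval: 30s
```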

Since the export of certain metrics can be heavily cloud-provider specific, we provide a set of provider-specific configurations. The setups we currently support include:

  • Google Kubernetes Engine (GKE)
  • Azure Kubernetes Service (AKS)
  • Amazon Elastic Kubernetes Service (EKS)
  • on-premises or self-managed cloud clusters

Module's repository: https://github.com/sighupio/fury-kubernetes-monitoring

Packages

Kubernetes Fury Monitoring provides the following packages:

| Package | Description |
| ------- | ----------- |
| prometheus | Prometheus instance deployed with the Prometheus Operator's CRDs |
| alertmanager | Alertmanager instance deployed with the Prometheus Operator's CRDs |
| grafana | Grafana deployment to query and visualize metrics collected by Prometheus |
| blackbox-exporter | Prometheus exporter that allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP and gRPC |
| kube-proxy-metrics | RBAC proxy to securely expose kube-proxy metrics |
| kube-state-metrics | Service that generates metrics from Kubernetes API objects |
| node-exporter | Prometheus exporter for hardware and OS metrics exposed by *NIX kernels |
| prometheus-adapter | Implementation of the Kubernetes resource metrics, custom metrics, and external metrics APIs |
| x509-exporter | Provides monitoring for certificates |
| mimir | Grafana Mimir, an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics |
info

Most of the components in this module are deployed in the monitoring namespace, unless the functionality requires permissions that force it to be deployed in the kube-system namespace.

Introduction: Monitoring in Kubernetes

Monitoring is a crucial aspect of managing modern applications and infrastructure. It involves collecting, analyzing, and visualizing metrics and logs to ensure systems operate efficiently, detect potential issues, and maintain performance. In a distributed environment, monitoring helps track resource utilization, identify bottlenecks, and gain insights into system behavior over time, enabling proactive issue resolution and informed decision-making.

In Kubernetes, monitoring is especially important due to the complexity and dynamic nature of containerized workloads. Applications are composed of multiple microservices, often scaled across nodes, which can make performance monitoring and issue detection challenging.

Metrics

Metrics are quantitative data points that provide insights into the performance, health, and behavior of the system, its workloads, and the underlying infrastructure. Metrics are typically time-series data, meaning they are measured and recorded over time, allowing for trend analysis, anomaly detection, and capacity planning.

Key aspects of monitoring in Kubernetes include:

  • Cluster Metrics: Monitoring CPU, memory, and network usage across nodes and pods.
  • Application Metrics: Tracking application-specific metrics like request rates, latencies, and error counts.
  • Health and Availability: Ensuring workloads are running as expected, with proper liveness and readiness probes.
  • Event Tracking: Logging Kubernetes events to understand state changes and troubleshoot failures.

Key characteristics of metrics are:

  • Granularity: Metrics can vary in granularity, from high-level summaries (e.g., average CPU usage per node) to fine-grained details (e.g., per-container memory usage).
  • Real-Time Collection: Metrics are typically collected and made available in near real-time, allowing for responsive monitoring.
  • Retention: Metrics are stored for varying periods, depending on the need to analyze trends or historical data.

In Kubernetes, metrics are often exposed by components and applications in a standardized format, such as Prometheus metrics (plain-text key-value pairs), making them easily ingestible by monitoring tools.
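
As an illustrative example, the plain-text exposition format scraped by Prometheus looks like this (the metric names, labels, and values below are made up for illustration):

```text
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.56e+07
```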

These metrics form the foundation of monitoring, alerting, and visualization systems, enabling teams to maintain optimal cluster performance, troubleshoot issues, and plan for future scaling.

KFD: Monitoring module

In KFD clusters, the following components are always installed:

  • Prometheus Operator
  • kube-proxy-metrics
  • kube-state-metrics
  • node-exporter
  • x509-exporter
  • blackbox-exporter

On top of that, you can choose among three different configurations using the type parameter:

  • prometheus: installs a preconfigured Prometheus instance, Alertmanager, a set of alert rules, Grafana with a series of dashboards to view the collected metrics, and more.
  • mimir: installs everything from the prometheus configuration, plus Grafana Mimir to provide long-term storage of metrics using either a dedicated MinIO instance or another S3-compatible bucket.
  • prometheusAgent: installs an instance of Prometheus in Agent mode (no alerting, no queries, no storage). Useful when you have a centralized (remote) Prometheus to ship the metrics to, instead of storing them locally in the cluster.

You can find all the available parameters to configure this module in the provider's reference schemas.
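
As a rough sketch, selecting the configuration in a furyctl.yaml could look like the fragment below. The exact path and accepted values depend on your provider and KFD version, so treat the structure as an assumption and verify it against the reference schemas:

```yaml
# Hedged sketch of a furyctl.yaml fragment, not a complete configuration.
spec:
  distribution:
    modules:
      monitoring:
        type: mimir        # assumed values: prometheus, prometheusAgent, or mimir
```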

note

You will need to define a StorageClass inside your cluster to be able to install these components. If you don't have one, furyctl will let you know that it skipped the installation of some components.

Prometheus

Prometheus is an open-source monitoring and alerting system widely used in Kubernetes environments. It collects metrics from targets using a pull-based model, stores them in a time-series database, and enables querying via PromQL. Key features of Prometheus include:

  • Native Kubernetes Support: Automatically discovers services and pods using Kubernetes APIs.
  • Flexible Metric Collection: Supports custom metrics alongside system metrics.
  • Alerting Rules: Enables setting thresholds to detect anomalies.
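
For instance, the querying via PromQL mentioned above could look like the following illustrative query, which computes the per-pod CPU usage rate over the last 5 minutes from the standard cAdvisor metric:

```promql
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
```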

In KFD, Prometheus is installed using the Prometheus Operator, whose CRDs make it really easy to deploy a Prometheus instance in Kubernetes.

The operator takes care of the Prometheus deployment and monitors Services, as illustrated in this diagram from the Prometheus Operator repository:

(Diagram: Prometheus Operator architecture)

Exporters

Prometheus is a collector, so other software is responsible for generating the relevant metrics. KFD installs the following exporters out of the box:

  • kube-proxy-metrics: kube-proxy is a critical part of every Kubernetes cluster, so it's crucial to monitor it appropriately. This package adds an RBAC proxy to securely expose kube-proxy metrics towards Prometheus.
  • kube-state-metrics: listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects like Deployments, Nodes, or Pods.
  • node-exporter: provides monitoring for hardware and OS metrics exposed by *NIX kernels by installing the node-exporter service.
  • x509-exporter: provides monitoring metrics for certificates.
  • blackbox-exporter: allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP and gRPC.
  • prometheus-adapter: implementation of the Kubernetes Metrics APIs that enables Prometheus metrics to be used in HorizontalPodAutoscalers (see the sketch after this list).
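
As a sketch of how prometheus-adapter enables autoscaling on application metrics, an HPA could reference a custom metric served through the custom metrics API. The Deployment name and the metric name http_requests_per_second below are hypothetical and would have to exist in your cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app            # hypothetical Deployment to scale
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric exposed via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "100"
```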

Alertmanager

Alertmanager handles alerts sent by the Prometheus server and routes them to configured receiver integrations such as email, Slack, PagerDuty, or OpsGenie. It helps you manage alerts flexibly with its grouping, inhibition, and silencing features.
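
As a hedged sketch of that routing, an Alertmanager configuration fragment could group alerts and send critical ones to a Slack receiver. The receiver names, webhook URL, and channel below are hypothetical:

```yaml
# Illustrative Alertmanager configuration fragment, not the module's default.
route:
  receiver: default
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-slack
receivers:
  - name: default
  - name: oncall-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # hypothetical webhook URL
        channel: "#oncall"
```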

To generate alerts, you must provide Prometheus with Prometheus Rules, which instruct the system on what to monitor and what conditions are necessary to trigger an alert.
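
A minimal PrometheusRule, as a sketch (the rule name, expression, threshold, and annotations are illustrative, not one of the module's preconfigured rules):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules                      # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighPodRestartRate        # illustrative alert
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently."
```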

In KFD, a number of preconfigured alerts are provided as Prometheus Rules by all modules. See the Alerts section below for the rules provided by this module.

Mimir

Grafana Mimir is an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics. It integrates out of the box with Grafana to provide a consistent tool that visualizes metrics collected over a specified amount of time.

It stores the collected metrics in an S3-compatible bucket, which is provided by default by a MinIO instance, and it can also be configured to use another object storage of your choice.
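
As a hedged sketch of pointing Mimir at an external S3-compatible bucket in furyctl.yaml, the fragment below shows the general idea. The field names and values are assumptions for illustration only; verify them against the monitoring module's reference schema:

```yaml
# Hedged sketch, field names are assumptions: check the reference schemas before use.
spec:
  distribution:
    modules:
      monitoring:
        type: mimir
        mimir:
          backend: externalEndpoint          # assumed default is MinIO
          externalEndpoint:
            endpoint: s3.example.com         # hypothetical S3-compatible endpoint
            bucketName: kfd-mimir            # hypothetical bucket name
            accessKeyId: <access-key>
            secretAccessKey: <secret-key>
            insecure: false
```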

Grafana

Grafana is a powerful visualization and analytics platform that integrates seamlessly with Prometheus. It provides:

  • Customizable Dashboards: Create interactive dashboards to visualize Kubernetes and application metrics.
  • Multi-Source Data Support: Combine Prometheus metrics with logs, traces, or external databases.
  • Alert Visualization: Displays alerts alongside metrics for deeper insights.

Where relevant, KFD modules will add their own Grafana Dashboards to provide a great default experience.

Alerts

The following alerts are already defined for this module:

kubernetes-apps

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubePodCrashLooping | This alert fires if the per-second rate of the total number of restarts of a given pod, computed over a 15-minute window, was above 0 in the last hour, i.e. the pod is stuck in a crash loop. | warning | 1h |
| KubePodNotReady | This alert fires if at least one pod was stuck in the Pending or Unknown phase in the last hour. | warning | 1h |
| KubeDeploymentGenerationMismatch | This alert fires if in the last hour a Deployment's observed generation (the revision number recorded in the object status) was different from the metadata generation (the revision number in the Deployment metadata). | warning | 15m |
| KubeDeploymentReplicasMismatch | This alert fires if a Deployment's replicas specification was different from the available replicas in the last hour. | warning | 1h |
| KubeStatefulSetReplicasMismatch | This alert fires if a StatefulSet's replicas specification was different from the available replicas in the last hour. | warning | 15m |
| KubeStatefulSetGenerationMismatch | This alert fires if a StatefulSet's observed generation (the revision number recorded in the object status) was different from the metadata generation in the last 15 minutes. | warning | 15m |
| KubeDaemonSetRolloutStuck | This alert fires if the percentage of DaemonSet pods in the ready phase was less than 100% in the last 15 minutes. | warning | 15m |
| KubeDaemonSetNotScheduled | This alert fires if the desired number of scheduled DaemonSet pods was higher than the number of currently scheduled DaemonSet pods in the last 10 minutes. | warning | 10m |
| KubeDaemonSetMisScheduled | This alert fires if at least one DaemonSet pod was running where it was not supposed to run in the last 10 minutes. | warning | 10m |
| KubeCronJobRunning | This alert fires if at least one CronJob took more than one hour to complete. | warning | 1h |
| KubeJobCompletion | This alert fires if at least one Job took more than one hour to complete. | warning | 1h |
| KubeJobFailed | This alert fires if at least one Job failed in the last hour. | warning | 1h |
| KubeLatestImageTag | This alert fires if there are images deployed in the cluster tagged with :latest, which is dangerous. | warning | 1h |

kube-prometheus-node-alerting.rules

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| NodeCPUSaturating | This alert fires if, for a given instance, CPU utilisation and saturation were higher than 90% in the last 30 minutes. | warning | 30m |
| NodeCPUStuckInIOWait | This alert fires if CPU time spent in IOWait mode, calculated over a 5-minute window, was more than 50% for a given instance in the last 15 minutes. | warning | 15m |
| NodeMemoryRunningFull | This alert fires if memory utilisation on a given node was higher than 85% in the last 30 minutes. | warning | 30m |
| NodeFilesystemUsageCritical | This alert fires if in the last minute the filesystem usage was more than 90%. | critical | 1m |
| NodeFilesystemFullInFourDays | This alert fires if in the last 5 minutes the filesystem usage was more than 85% and, based on a linear prediction of the volume usage in the last 6 hours, the volume will be full in four days. | warning | 5m |
| NodeFilesystemInodeUsageCritical | This alert fires if the available inodes in a given filesystem were less than 10% in the last minute. | critical | 1m |
| NodeFilesystemInodeFullInFourDays | This alert fires if, based on a linear prediction of the inode usage in the last 6 hours, the filesystem will exhaust its inodes in four days. | warning | 5m |
| NodeNetworkDroppingPackets | This alert fires if a given physical network interface was dropping more than 10 pkt/s in the last 30 minutes. | warning | 30m |

prometheus

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| PrometheusConfigReloadFailed | This alert fires if Prometheus's configuration failed to reload in the last 10 minutes. | critical | 10m |
| PrometheusNotificationQueueRunningFull | This alert fires if Prometheus's alert notification queue is predicted to run full within the next 30 minutes, based on a linear prediction of the usage in the last 5 minutes. | critical | 10m |
| PrometheusErrorSendingAlerts | This alert fires if the error rate, calculated over a 5-minute time window, was more than 1% in the last 10 minutes. | critical | 10m |
| PrometheusErrorSendingAlerts | This alert fires if the error rate, calculated over a 5-minute time window, was more than 3% in the last 10 minutes. | critical | 10m |
| PrometheusNotConnectedToAlertmanagers | This alert fires if Prometheus was not connected to at least one Alertmanager in the last 10 minutes. | critical | 10m |
| PrometheusTSDBReloadsFailing | This alert fires if Prometheus had any failures to reload data blocks from disk in the last 12 hours. | critical | 12h |
| PrometheusTSDBCompactionsFailing | This alert fires if Prometheus had any failures to compact sample blocks in the last 12 hours. | critical | 12h |
| PrometheusTSDBWALCorruptions | This alert fires if Prometheus detected any corruption in the write-ahead log in the last 4 hours. | critical | 4h |
| PrometheusNotIngestingSamples | This alert fires if the Prometheus sample ingestion rate, calculated over a 5-minute time window, was below or equal to 0 in the last 10 minutes, i.e. Prometheus is failing to ingest samples. | critical | 10m |
| PrometheusTargetScrapesDuplicate | This alert fires if Prometheus was discarding many samples due to duplicated timestamps with different values in the last 10 minutes. | warning | 10m |

general

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| TargetDown | This alert fires if more than 10% of the targets were down in the last 10 minutes. | critical | 10m |
| FdExhaustion | This alert fires if, based on a linear prediction of the file descriptor usage in the last hour, the instance will exhaust its file descriptors in 4 hours. | warning | 10m |
| FdExhaustion | This alert fires if, based on a linear prediction of the file descriptor usage in the last 10 minutes, the instance will exhaust its file descriptors in one hour. | critical | 10m |
| DeadMansSwitch | This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional. | none | |

kubernetes-system

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeNodeNotReady | This alert fires if a given node was not in Ready status in the last hour. | critical | 1h |
| KubeVersionMismatch | This alert fires if the versions of the Kubernetes components were mismatched in the last hour. | warning | 1h |
| KubeClientErrors | This alert fires if the Kubernetes API client error response rate, calculated over a 5-minute window, was more than 1% in the last 15 minutes. | warning | 15m |
| KubeClientErrors | This alert fires if the Kubernetes API client error response rate, calculated over a 5-minute window, was more than 0.1 errors/sec in the last 15 minutes. | warning | 15m |
| KubeletTooManyPods | This alert fires if a given kubelet is running more than 100 pods and is approaching the hard limit of 110 pods per node. | warning | 15m |
| KubeAPILatencyHigh | This alert fires if the API server 99th percentile latency was more than 1 second in the last 10 minutes. | warning | 10m |
| KubeAPILatencyHigh | This alert fires if the API server 99th percentile latency was more than 4 seconds in the last 10 minutes. | critical | 10m |
| KubeAPIErrorsHigh | This alert fires if the request error rate, calculated over a 5-minute window, was more than 5% in the last 10 minutes. | critical | 10m |

kubernetes-storage

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubePersistentVolumeStuck | This alert fires if a given PersistentVolume was stuck in the Pending or Failed phase in the last hour. | warning | 1h |
| KubePersistentVolumeUsageCritical | This alert fires if the available space in a given PersistentVolumeClaim was less than 10% in the last minute. | critical | 1m |
| KubePersistentVolumeFullInFourDays | This alert fires if, based on a linear prediction of the volume usage in the last 6 hours, the volume will be full in four days. | warning | 5m |
| KubePersistentVolumeInodeUsageCritical | This alert fires if the available inodes in a given PersistentVolumeClaim were less than 10% in the last minute. | critical | 1m |
| KubePersistentVolumeInodeFullInFourDays | This alert fires if, based on a linear prediction of the inode usage in the last 6 hours, the volume will exhaust its inodes in four days. | warning | 5m |

kubernetes-absent

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| AlertmanagerDown | This alert fires if Prometheus target discovery was not able to reach Alertmanager in the last 15 minutes. | critical | 15m |
| KubeAPIDown | This alert fires if Prometheus target discovery was not able to reach kube-apiserver in the last 15 minutes. | critical | 15m |
| KubeStateMetricsDown | This alert fires if Prometheus target discovery was not able to reach kube-state-metrics in the last 15 minutes. | critical | 15m |
| KubeletDown | This alert fires if Prometheus target discovery was not able to reach the kubelet in the last 15 minutes. | critical | 15m |
| NodeExporterDown | This alert fires if Prometheus target discovery was not able to reach node-exporter in the last 15 minutes. | critical | 15m |
| PrometheusDown | This alert fires if Prometheus target discovery was not able to reach Prometheus in the last 15 minutes. | critical | 15m |
| PrometheusOperatorDown | This alert fires if Prometheus target discovery was not able to reach the Prometheus Operator in the last 15 minutes. | critical | 15m |

alertmanager

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| AlertmanagerConfigInconsistent | This alert fires if the configurations of the instances of the Alertmanager cluster were out of sync in the last 5 minutes. | critical | 5m |
| AlertmanagerDownOrMissing | This alert fires if in the last 5 minutes an unexpected number of Alertmanagers were scraped or Alertmanagers disappeared from target discovery. | critical | 5m |
| AlertmanagerFailedReload | This alert fires if the Alertmanager's configuration reload failed in the last 10 minutes. | critical | 10m |

kubernetes-resources

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeCPUOvercommit | This alert fires if the cluster-wide CPU requests from pods in the last 5 minutes were too high to tolerate a node failure. | warning | 5m |
| KubeMemOvercommit | This alert fires if the cluster-wide memory requests from pods in the last 5 minutes were too high to tolerate a node failure. | warning | 5m |
| KubeCPUOvercommit | This alert fires if the hard limit of the CPU resource quota in the last 5 minutes was more than 150% of the available resources, i.e. the hard limit is set too high. | warning | 5m |
| KubeMemOvercommit | This alert fires if the hard limit of the memory resource quota in the last 5 minutes was more than 150% of the available resources, i.e. the hard limit is set too high. | warning | 5m |
| KubeQuotaExceeded | This alert fires if a given resource was used for more than 90% of the corresponding hard quota in the last 15 minutes. | warning | 15m |

grafana

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| GrafanaRequestsFailing | This alert fires when the rate of errors for Grafana requests is more than 50% over the last 5 minutes. | warning | 5m |

kube-state-metrics

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeStateMetricsListErrors | This alert fires if the rate of kube-state-metrics list operation errors is > 1% over the last 15 minutes. | critical | 15m |
| KubeStateMetricsWatchErrors | This alert fires if the rate of kube-state-metrics watch operation errors is > 1% over the last 15 minutes. | critical | 15m |
| KubeStateMetricsShardingMismatch | This alert fires if kube-state-metrics sharding is misconfigured. | critical | 15m |
| KubeStateMetricsShardsMissing | This alert fires if kube-state-metrics shards are missing. | critical | 15m |

kubernetes-absent-kubeadm

(Only for OnPremises and KFDDistribution clusters)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeControllerManagerDown | This alert fires if Prometheus target discovery was not able to reach the kube-controller-manager in the last 15 minutes. | critical | 15m |
| KubeSchedulerDown | This alert fires if Prometheus target discovery was not able to reach the kube-scheduler in the last 15 minutes. | critical | 15m |
| KubeClientCertificateExpiration | This alert fires when the Kubernetes API client certificate is expiring in less than 30 days. | warning | |
| KubeClientCertificateExpiration | This alert fires when the Kubernetes API client certificate is expiring in less than 7 days. | critical | |

coredns

(Only for OnPremises and KFDDistribution clusters)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| CoreDNSPanic | This alert fires if the CoreDNS total panic count increased by at least 1 in the last 10 minutes. | warning | |
| CoreDNSRequestsLatency | This alert fires if the CoreDNS 99th percentile request latency was higher than 100ms in the last 10 minutes. | warning | 10m |
| CoreDNSHealthRequestsLatency | This alert fires if the CoreDNS 99th percentile health request latency was higher than 10ms in the last 10 minutes. | warning | 10m |
| CoreDNSProxyRequestsLatency | This alert fires if the CoreDNS 99th percentile proxy request latency was higher than 500ms in the last 10 minutes. | warning | 10m |

etcd3

(Only for OnPremises and KFDDistribution clusters)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| EtcdInsufficientMembers | This alert fires if less than half of the etcd cluster members were online in the last 3 minutes. | critical | 3m |
| EtcdNoLeader | This alert fires if the etcd cluster had no leader in the last minute. | critical | 1m |
| EtcdHighNumberOfLeaderChanges | This alert fires if the etcd cluster changed leader more than 3 times in the last hour. | warning | |
| EtcdHighNumberOfFailedProposals | This alert fires if there were more than 5 proposal failures in the last hour. | warning | |
| EtcdHighFsyncDurations | This alert fires if the WAL fsync 99th percentile latency was higher than 0.5s in the last 10 minutes. | warning | 10m |
| EtcdHighCommitDurations | This alert fires if the backend commit 99th percentile latency was higher than 0.25s in the last 10 minutes. | warning | 10m |

haproxy

(Only for OnPremises clusters with HAProxy instances managed by furyctl)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| HaproxyHighHttp4xxErrorRateBackend | This alert fires if the rate of HTTP 4xx requests is > 5% on a backend in the last minute. | critical | 1m |
| HaproxyHighHttp5xxErrorRateBackend | This alert fires if the rate of HTTP 5xx requests is > 5% on a backend in the last minute. | critical | 1m |
| HaproxyHighHttp4xxErrorRateServer | This alert fires if the rate of HTTP 4xx requests is > 5% on a server in the last minute. | critical | 1m |
| HaproxyHighHttp5xxErrorRateServer | This alert fires if the rate of HTTP 5xx requests is > 5% on a server in the last minute. | critical | 1m |
| HaproxyServerResponseErrors | This alert fires if the rate of error responses is > 5% on a server in the last minute. | critical | 1m |
| HaproxyBackendConnectionErrors | This alert fires if a certain backend had too many (> 100) connection errors in the last minute. | critical | 1m |
| HaproxyServerConnectionErrors | This alert fires if a certain server had too many (> 100) connection errors in the last minute. | critical | 1m |
| HaproxyBackendMaxActiveSession>80% | This alert fires if the number of active sessions is 80% or more of the maximum allowed. | warning | |
| HaproxyPendingRequests | This alert fires if there are pending backend requests for more than 2 minutes. | warning | 2m |
| HaproxyRetryHigh | This alert fires if the number of retries over the last minute is 10 or more. | warning | 1m |
| HaproxyHasNoAliveBackends | This alert fires when HAProxy has no alive or backup backends. | critical | |
| HaproxyFrontendSecurityBlockedRequests | This alert fires if HAProxy has blocked requests for security reasons 10 or more times over the last 2 minutes. | warning | 2m |
| HaproxyServerHealthcheckFailure | This alert fires when there are failed health checks on HAProxy servers. | warning | 1m |

Read More

You can find more information about monitoring in Kubernetes at the following links: