Version: 1.29.6

Kubernetes Fury Monitoring

Overview

This module is designed to give you full control and visibility over your cluster operations. Metrics from the cluster and the applications are collected, and clean analytics are offered through Grafana, the visualization platform.

The centerpiece of this module is the prometheus-operator, which provides easy, declarative deployment of the following components:

  • Prometheus: An open-source monitoring and alerting toolkit for cloud-native applications
  • Alertmanager: Manages alerts sent by the Prometheus server and routes them to receiver integrations such as email, Slack, or PagerDuty
  • ServiceMonitor: Declaratively specifies how groups of services should be monitored; the operator automatically generates the Prometheus scrape configuration from the definition (see the example below)
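
For illustration, a minimal ServiceMonitor might look like the following sketch. The name, namespace selector, app label, and port name are hypothetical, and the Prometheus instance must be configured to select this ServiceMonitor:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app            # hypothetical name, for illustration only
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: example-app         # scrape Services carrying this (hypothetical) label
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics            # name of the Service port exposing /metrics
      interval: 30s
```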

Since the export of certain metrics can be heavily cloud-provider specific, we provide a set of provider-specific configurations. The setups we currently support include:

  • Google Kubernetes Engine (GKE)
  • Azure Kubernetes Service (AKS)
  • Amazon Elastic Kubernetes Service (EKS)
  • on-premises or self-managed cloud clusters

Module's repository: https://github.com/sighupio/fury-kubernetes-monitoring

Packages

Kubernetes Fury Monitoring provides the following packages:

| Package | Description |
| ------- | ----------- |
| prometheus | Prometheus instance deployed with the Prometheus Operator's CRDs |
| alertmanager | Alertmanager instance deployed with the Prometheus Operator's CRDs |
| grafana | Grafana deployment to query and visualize metrics collected by Prometheus |
| blackbox-exporter | Prometheus exporter that allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP and gRPC |
| kube-proxy-metrics | RBAC proxy to securely expose kube-proxy metrics |
| kube-state-metrics | Service that generates metrics from Kubernetes API objects |
| node-exporter | Prometheus exporter for hardware and OS metrics exposed by *NIX kernels |
| prometheus-adapter | Implementation of the Kubernetes resource metrics, custom metrics, and external metrics APIs |
| x509-exporter | Provides monitoring for certificates |
| mimir | Grafana Mimir, an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics |
info

Most of the components in this module are deployed in the monitoring namespace, unless the functionality requires permissions that force it to be deployed in the kube-system namespace.

Introduction: Monitoring in Kubernetes

Monitoring is a crucial aspect of managing modern applications and infrastructure. It involves collecting, analyzing, and visualizing metrics and logs to ensure systems operate efficiently, detect potential issues, and maintain performance. In a distributed environment, monitoring helps track resource utilization, identify bottlenecks, and gain insights into system behavior over time, enabling proactive issue resolution and informed decision-making.

In Kubernetes, monitoring is especially important due to the complexity and dynamic nature of containerized workloads. Applications are composed of multiple microservices, often scaled across nodes, which can make performance monitoring and issue detection challenging.

Metrics

Metrics are quantitative data points that provide insights into the performance, health, and behavior of the system, its workloads, and the underlying infrastructure. Metrics are typically time-series data, meaning they are measured and recorded over time, allowing for trend analysis, anomaly detection, and capacity planning.

Key aspects of monitoring in Kubernetes include:

  • Cluster Metrics: Monitoring CPU, memory, and network usage across nodes and pods.
  • Application Metrics: Tracking application-specific metrics like request rates, latencies, and error counts.
  • Health and Availability: Ensuring workloads are running as expected, with proper liveness and readiness probes.
  • Event Tracking: Logging Kubernetes events to understand state changes and troubleshoot failures.

Key characteristics of metrics are:

  • Granularity: Metrics can vary in granularity, from high-level summaries (e.g., average CPU usage per node) to fine-grained details (e.g., per-container memory usage).
  • Real-Time Collection: Metrics are typically collected and made available in near real-time, allowing for responsive monitoring.
  • Retention: Metrics are stored for varying periods, depending on the need to analyze trends or historical data.

In Kubernetes, metrics are often exposed by components and applications in a standardized format, such as Prometheus metrics (plain-text key-value pairs), making them easily ingestible by monitoring tools.
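
As an illustrative example, the plain-text exposition format scraped by Prometheus looks like this (the metric names, labels, and values below are made up for illustration):

```text
# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="GET",code="200"} 1027
http_requests_total{method="POST",code="500"} 3
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.56e+07
```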

These metrics form the foundation of monitoring, alerting, and visualization systems, enabling teams to maintain optimal cluster performance, troubleshoot issues, and plan for future scaling.

KFD: Monitoring module

In KFD clusters, the following components are always installed:

  • Prometheus Operator
  • kube-proxy-metrics
  • kube-state-metrics
  • node-exporter
  • x509-exporter
  • blackbox-exporter

On top of that, you can choose among three different configurations using the type parameter:

  • prometheus: installs a preconfigured Prometheus instance, Alertmanager, a set of alert rules, Grafana with a series of dashboards to view the collected metrics, and more.
  • mimir: installs everything from the prometheus configuration, plus Grafana Mimir to provide long-term storage of metrics using either a dedicated MinIO instance or another S3-compatible bucket.
  • prometheusAgent: installs an instance of Prometheus in Agent mode (no alerting, no queries, no storage). Useful when you have a centralized (remote) Prometheus to ship the metrics to, instead of storing them locally in the cluster.

You can find all the available parameters to configure this module in the provider's reference schemas.
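
As a rough sketch, selecting the configuration in a furyctl.yaml could look like the fragment below. The exact path and accepted values depend on your provider and KFD version, so treat the structure as an assumption and verify it against the reference schemas:

```yaml
# Hedged sketch of a furyctl.yaml fragment, not a complete configuration.
spec:
  distribution:
    modules:
      monitoring:
        type: mimir        # assumed values: prometheus, prometheusAgent, or mimir
```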

note

You will need to define a StorageClass inside your cluster to be able to install these components. If you don't have one, furyctl will let you know that it skipped the installation of some components.

Prometheus

Prometheus is an open-source monitoring and alerting system widely used in Kubernetes environments. It collects metrics from targets using a pull-based model, stores them in a time-series database, and enables querying via PromQL. Key features of Prometheus include:

  • Native Kubernetes Support: Automatically discovers services and pods using Kubernetes APIs.
  • Flexible Metric Collection: Supports custom metrics alongside system metrics.
  • Alerting Rules: Enables setting thresholds to detect anomalies.
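
For instance, the querying via PromQL mentioned above could look like the following illustrative query, which computes the per-pod CPU usage rate over the last 5 minutes from the standard cAdvisor metric:

```promql
sum by (namespace, pod) (
  rate(container_cpu_usage_seconds_total{container!=""}[5m])
)
```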

In KFD, Prometheus is installed using the Prometheus Operator, whose CRDs make it really easy to deploy a Prometheus instance in Kubernetes.

The operator takes care of the Prometheus deployment and monitors Services, as illustrated in this diagram from the Prometheus Operator repository:

(Diagram: Prometheus Operator architecture)

Exporters

Prometheus is a collector, so other software is responsible for generating the relevant metrics. KFD installs the following exporters out of the box:

  • kube-proxy-metrics: kube-proxy is a critical part of every Kubernetes cluster, so it's crucial to monitor it appropriately. This package adds an RBAC proxy to securely expose kube-proxy metrics towards Prometheus.
  • kube-state-metrics: listens to the Kubernetes API server and generates metrics about the state of Kubernetes objects like Deployments, Nodes, or Pods.
  • node-exporter: provides monitoring for hardware and OS metrics exposed by *NIX kernels by installing the node-exporter service.
  • x509-exporter: provides monitoring metrics for certificates.
  • blackbox-exporter: allows blackbox probing of endpoints over HTTP, HTTPS, DNS, TCP, ICMP and gRPC.
  • prometheus-adapter: implementation of the Kubernetes Metrics APIs that enables Prometheus metrics to be used in HorizontalPodAutoscalers (see the sketch after this list).
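
As a sketch of how prometheus-adapter enables autoscaling on application metrics, an HPA could reference a custom metric served through the custom metrics API. The Deployment name and the metric name http_requests_per_second below are hypothetical and would have to exist in your cluster:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-app            # hypothetical Deployment to scale
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric exposed via prometheus-adapter
        target:
          type: AverageValue
          averageValue: "100"
```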

Alertmanager

Alertmanager handles alerts sent by the Prometheus server and routes them to configured receiver integrations such as email, Slack, PagerDuty, or OpsGenie. It helps you manage alerts flexibly with its grouping, inhibition, and silencing features.
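
As a hedged sketch of that routing, an Alertmanager configuration fragment could group alerts and send critical ones to a Slack receiver. The receiver names, webhook URL, and channel below are hypothetical:

```yaml
# Illustrative Alertmanager configuration fragment, not the module's default.
route:
  receiver: default
  group_by: ["alertname", "namespace"]
  routes:
    - matchers:
        - severity="critical"
      receiver: oncall-slack
receivers:
  - name: default
  - name: oncall-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX   # hypothetical webhook URL
        channel: "#oncall"
```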

To generate alerts, you must provide Prometheus with Prometheus Rules, which instruct the system on what to monitor and what conditions are necessary to trigger an alert.
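
A minimal PrometheusRule, as a sketch (the rule name, expression, threshold, and annotations are illustrative, not one of the module's preconfigured rules):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules                      # hypothetical name
  namespace: monitoring
spec:
  groups:
    - name: example.rules
      rules:
        - alert: HighPodRestartRate        # illustrative alert
          expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Pod {{ $labels.pod }} is restarting frequently."
```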

In KFD, a number of preconfigured alerts are provided as Prometheus Rules by all modules. See the Alerts section below for the rules provided by this module.

Mimir

Grafana Mimir is an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics. It integrates out of the box with Grafana to provide a consistent tool that visualizes metrics collected over a specified amount of time.

It stores the collected metrics in an S3-compatible bucket, which is provided by default by a MinIO instance, and it can also be configured to use another object storage of your choice.
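
As a hedged sketch of pointing Mimir at an external S3-compatible bucket in furyctl.yaml, the fragment below shows the general idea. The field names and values are assumptions for illustration only; verify them against the monitoring module's reference schema:

```yaml
# Hedged sketch, field names are assumptions: check the reference schemas before use.
spec:
  distribution:
    modules:
      monitoring:
        type: mimir
        mimir:
          backend: externalEndpoint          # assumed default is MinIO
          externalEndpoint:
            endpoint: s3.example.com         # hypothetical S3-compatible endpoint
            bucketName: kfd-mimir            # hypothetical bucket name
            accessKeyId: <access-key>
            secretAccessKey: <secret-key>
            insecure: false
```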

Grafana

Grafana is a powerful visualization and analytics platform that integrates seamlessly with Prometheus. It provides:

  • Customizable Dashboards: Create interactive dashboards to visualize Kubernetes and application metrics.
  • Multi-Source Data Support: Combine Prometheus metrics with logs, traces, or external databases.
  • Alert Visualization: Displays alerts alongside metrics for deeper insights.

Where relevant, KFD modules will add their own Grafana Dashboards to provide a great default experience.

Alerts

The following alerts are already defined for this module:

kubernetes-apps

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubePodCrashLooping | This alert fires if the per-second rate of the total number of restarts of a given pod, computed over a 15-minute window, was above 0 in the last hour, i.e. the pod is stuck in a crash loop. | warning | 1h |
| KubePodNotReady | This alert fires if at least one pod was stuck in the Pending or Unknown phase in the last hour. | warning | 1h |
| KubeDeploymentGenerationMismatch | This alert fires if in the last hour a Deployment's observed generation (the revision number recorded in the object status) was different from the metadata generation (the revision number in the Deployment metadata). | warning | 15m |
| KubeDeploymentReplicasMismatch | This alert fires if a Deployment's replicas specification was different from the available replicas in the last hour. | warning | 1h |
| KubeStatefulSetReplicasMismatch | This alert fires if a StatefulSet's replicas specification was different from the available replicas in the last hour. | warning | 15m |
| KubeStatefulSetGenerationMismatch | This alert fires if a StatefulSet's observed generation (the revision number recorded in the object status) was different from the metadata generation in the last 15 minutes. | warning | 15m |
| KubeDaemonSetRolloutStuck | This alert fires if the percentage of DaemonSet pods in the ready phase was less than 100% in the last 15 minutes. | warning | 15m |
| KubeDaemonSetNotScheduled | This alert fires if the desired number of scheduled DaemonSet pods was higher than the number of currently scheduled DaemonSet pods in the last 10 minutes. | warning | 10m |
| KubeDaemonSetMisScheduled | This alert fires if at least one DaemonSet pod was running where it was not supposed to run in the last 10 minutes. | warning | 10m |
| KubeCronJobRunning | This alert fires if at least one CronJob took more than one hour to complete. | warning | 1h |
| KubeJobCompletion | This alert fires if at least one Job took more than one hour to complete. | warning | 1h |
| KubeJobFailed | This alert fires if at least one Job failed in the last hour. | warning | 1h |
| KubeLatestImageTag | This alert fires if there are images deployed in the cluster tagged with :latest, which is dangerous. | warning | 1h |

kube-prometheus-node-alerting.rules

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| NodeCPUSaturating | This alert fires if, for a given instance, CPU utilisation and saturation were higher than 90% in the last 30 minutes. | warning | 30m |
| NodeCPUStuckInIOWait | This alert fires if CPU time spent in IOWait mode, calculated over a 5-minute window, was more than 50% for a given instance in the last 15 minutes. | warning | 15m |
| NodeMemoryRunningFull | This alert fires if memory utilisation on a given node was higher than 85% in the last 30 minutes. | warning | 30m |
| NodeFilesystemUsageCritical | This alert fires if in the last minute the filesystem usage was more than 90%. | critical | 1m |
| NodeFilesystemFullInFourDays | This alert fires if in the last 5 minutes the filesystem usage was more than 85% and, based on a linear prediction of the volume usage in the last 6 hours, the volume will be full in four days. | warning | 5m |
| NodeFilesystemInodeUsageCritical | This alert fires if the available inodes in a given filesystem were less than 10% in the last minute. | critical | 1m |
| NodeFilesystemInodeFullInFourDays | This alert fires if, based on a linear prediction of the inode usage in the last 6 hours, the filesystem will exhaust its inodes in four days. | warning | 5m |
| NodeNetworkDroppingPackets | This alert fires if a given physical network interface was dropping more than 10 pkt/s in the last 30 minutes. | warning | 30m |

prometheus

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| PrometheusConfigReloadFailed | This alert fires if Prometheus's configuration failed to reload in the last 10 minutes. | critical | 10m |
| PrometheusNotificationQueueRunningFull | This alert fires if Prometheus's alert notification queue is predicted to run full within the next 30 minutes, based on a linear prediction of the usage in the last 5 minutes. | critical | 10m |
| PrometheusErrorSendingAlerts | This alert fires if the error rate, calculated over a 5-minute time window, was more than 1% in the last 10 minutes. | critical | 10m |
| PrometheusErrorSendingAlerts | This alert fires if the error rate, calculated over a 5-minute time window, was more than 3% in the last 10 minutes. | critical | 10m |
| PrometheusNotConnectedToAlertmanagers | This alert fires if Prometheus was not connected to at least one Alertmanager in the last 10 minutes. | critical | 10m |
| PrometheusTSDBReloadsFailing | This alert fires if Prometheus had any failures to reload data blocks from disk in the last 12 hours. | critical | 12h |
| PrometheusTSDBCompactionsFailing | This alert fires if Prometheus had any failures to compact sample blocks in the last 12 hours. | critical | 12h |
| PrometheusTSDBWALCorruptions | This alert fires if Prometheus detected any corruption in the write-ahead log in the last 4 hours. | critical | 4h |
| PrometheusNotIngestingSamples | This alert fires if the Prometheus sample ingestion rate, calculated over a 5-minute time window, was below or equal to 0 in the last 10 minutes, i.e. Prometheus is failing to ingest samples. | critical | 10m |
| PrometheusTargetScrapesDuplicate | This alert fires if Prometheus was discarding many samples due to duplicated timestamps with different values in the last 10 minutes. | warning | 10m |

general

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| TargetDown | This alert fires if more than 10% of the targets were down in the last 10 minutes. | critical | 10m |
| FdExhaustion | This alert fires if, based on a linear prediction of the file descriptor usage in the last hour, the instance will exhaust its file descriptors in 4 hours. | warning | 10m |
| FdExhaustion | This alert fires if, based on a linear prediction of the file descriptor usage in the last 10 minutes, the instance will exhaust its file descriptors in one hour. | critical | 10m |
| DeadMansSwitch | This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional. | none | |

kubernetes-system

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeNodeNotReady | This alert fires if a given node was not in Ready status in the last hour. | critical | 1h |
| KubeVersionMismatch | This alert fires if the versions of the Kubernetes components were mismatched in the last hour. | warning | 1h |
| KubeClientErrors | This alert fires if the Kubernetes API client error response rate, calculated over a 5-minute window, was more than 1% in the last 15 minutes. | warning | 15m |
| KubeClientErrors | This alert fires if the Kubernetes API client error response rate, calculated over a 5-minute window, was more than 0.1 errors/sec in the last 15 minutes. | warning | 15m |
| KubeletTooManyPods | This alert fires if a given kubelet is running more than 100 pods and is approaching the hard limit of 110 pods per node. | warning | 15m |
| KubeAPILatencyHigh | This alert fires if the API server 99th percentile latency was more than 1 second in the last 10 minutes. | warning | 10m |
| KubeAPILatencyHigh | This alert fires if the API server 99th percentile latency was more than 4 seconds in the last 10 minutes. | critical | 10m |
| KubeAPIErrorsHigh | This alert fires if the request error rate, calculated over a 5-minute window, was more than 5% in the last 10 minutes. | critical | 10m |

kubernetes-storage

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubePersistentVolumeStuck | This alert fires if a given PersistentVolume was stuck in the Pending or Failed phase in the last hour. | warning | 1h |
| KubePersistentVolumeUsageCritical | This alert fires if the available space in a given PersistentVolumeClaim was less than 10% in the last minute. | critical | 1m |
| KubePersistentVolumeFullInFourDays | This alert fires if, based on a linear prediction of the volume usage in the last 6 hours, the volume will be full in four days. | warning | 5m |
| KubePersistentVolumeInodeUsageCritical | This alert fires if the available inodes in a given PersistentVolumeClaim were less than 10% in the last minute. | critical | 1m |
| KubePersistentVolumeInodeFullInFourDays | This alert fires if, based on a linear prediction of the inode usage in the last 6 hours, the volume will exhaust its inodes in four days. | warning | 5m |

kubernetes-absent

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| AlertmanagerDown | This alert fires if Prometheus target discovery was not able to reach Alertmanager in the last 15 minutes. | critical | 15m |
| KubeAPIDown | This alert fires if Prometheus target discovery was not able to reach kube-apiserver in the last 15 minutes. | critical | 15m |
| KubeStateMetricsDown | This alert fires if Prometheus target discovery was not able to reach kube-state-metrics in the last 15 minutes. | critical | 15m |
| KubeletDown | This alert fires if Prometheus target discovery was not able to reach the kubelet in the last 15 minutes. | critical | 15m |
| NodeExporterDown | This alert fires if Prometheus target discovery was not able to reach node-exporter in the last 15 minutes. | critical | 15m |
| PrometheusDown | This alert fires if Prometheus target discovery was not able to reach Prometheus in the last 15 minutes. | critical | 15m |
| PrometheusOperatorDown | This alert fires if Prometheus target discovery was not able to reach the Prometheus Operator in the last 15 minutes. | critical | 15m |

alertmanager

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| AlertmanagerConfigInconsistent | This alert fires if the configurations of the instances of the Alertmanager cluster were out of sync in the last 5 minutes. | critical | 5m |
| AlertmanagerDownOrMissing | This alert fires if in the last 5 minutes an unexpected number of Alertmanagers were scraped or Alertmanagers disappeared from target discovery. | critical | 5m |
| AlertmanagerFailedReload | This alert fires if the Alertmanager's configuration reload failed in the last 10 minutes. | critical | 10m |

kubernetes-resources

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeCPUOvercommit | This alert fires if the cluster-wide CPU requests from pods in the last 5 minutes were too high to tolerate a node failure. | warning | 5m |
| KubeMemOvercommit | This alert fires if the cluster-wide memory requests from pods in the last 5 minutes were too high to tolerate a node failure. | warning | 5m |
| KubeCPUOvercommit | This alert fires if the hard limit of the CPU resource quota in the last 5 minutes was more than 150% of the available resources, i.e. the hard limit is set too high. | warning | 5m |
| KubeMemOvercommit | This alert fires if the hard limit of the memory resource quota in the last 5 minutes was more than 150% of the available resources, i.e. the hard limit is set too high. | warning | 5m |
| KubeQuotaExceeded | This alert fires if a given resource was used for more than 90% of the corresponding hard quota in the last 15 minutes. | warning | 15m |

grafana

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| GrafanaRequestsFailing | This alert fires when the rate of errors for Grafana requests is more than 50% over the last 5 minutes. | warning | 5m |

kube-state-metrics

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeStateMetricsListErrors | This alert fires if the rate of kube-state-metrics list operation errors is > 1% over the last 15 minutes. | critical | 15m |
| KubeStateMetricsWatchErrors | This alert fires if the rate of kube-state-metrics watch operation errors is > 1% over the last 15 minutes. | critical | 15m |
| KubeStateMetricsShardingMismatch | This alert fires if kube-state-metrics sharding is misconfigured. | critical | 15m |
| KubeStateMetricsShardsMissing | This alert fires if kube-state-metrics shards are missing. | critical | 15m |

kubernetes-absent-kubeadm

(Only for OnPremises and KFDDistribution clusters)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| KubeControllerManagerDown | This alert fires if Prometheus target discovery was not able to reach the kube-controller-manager in the last 15 minutes. | critical | 15m |
| KubeSchedulerDown | This alert fires if Prometheus target discovery was not able to reach the kube-scheduler in the last 15 minutes. | critical | 15m |
| KubeClientCertificateExpiration | This alert fires when the Kubernetes API client certificate is expiring in less than 30 days. | warning | |
| KubeClientCertificateExpiration | This alert fires when the Kubernetes API client certificate is expiring in less than 7 days. | critical | |

coredns

(Only for OnPremises and KFDDistribution clusters)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| CoreDNSPanic | This alert fires if the CoreDNS total panic count increased by at least 1 in the last 10 minutes. | warning | |
| CoreDNSRequestsLatency | This alert fires if the CoreDNS 99th percentile request latency was higher than 100ms in the last 10 minutes. | warning | 10m |
| CoreDNSHealthRequestsLatency | This alert fires if the CoreDNS 99th percentile health request latency was higher than 10ms in the last 10 minutes. | warning | 10m |
| CoreDNSProxyRequestsLatency | This alert fires if the CoreDNS 99th percentile proxy request latency was higher than 500ms in the last 10 minutes. | warning | 10m |

etcd3

(Only for OnPremises and KFDDistribution clusters)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| EtcdInsufficientMembers | This alert fires if less than half of the etcd cluster members were online in the last 3 minutes. | critical | 3m |
| EtcdNoLeader | This alert fires if the etcd cluster had no leader in the last minute. | critical | 1m |
| EtcdHighNumberOfLeaderChanges | This alert fires if the etcd cluster changed leader more than 3 times in the last hour. | warning | |
| EtcdHighNumberOfFailedProposals | This alert fires if there were more than 5 proposal failures in the last hour. | warning | |
| EtcdHighFsyncDurations | This alert fires if the WAL fsync 99th percentile latency was higher than 0.5s in the last 10 minutes. | warning | 10m |
| EtcdHighCommitDurations | This alert fires if the backend commit 99th percentile latency was higher than 0.25s in the last 10 minutes. | warning | 10m |

haproxy

(Only for OnPremises clusters with HAProxy instances managed by furyctl)

| Alert | Description | Severity | Interval |
| ----- | ----------- | -------- | -------- |
| HaproxyHighHttp4xxErrorRateBackend | This alert fires if the rate of HTTP 4xx requests is > 5% on a backend in the last minute. | critical | 1m |
| HaproxyHighHttp5xxErrorRateBackend | This alert fires if the rate of HTTP 5xx requests is > 5% on a backend in the last minute. | critical | 1m |
| HaproxyHighHttp4xxErrorRateServer | This alert fires if the rate of HTTP 4xx requests is > 5% on a server in the last minute. | critical | 1m |
| HaproxyHighHttp5xxErrorRateServer | This alert fires if the rate of HTTP 5xx requests is > 5% on a server in the last minute. | critical | 1m |
| HaproxyServerResponseErrors | This alert fires if the rate of error responses is > 5% on a server in the last minute. | critical | 1m |
| HaproxyBackendConnectionErrors | This alert fires if a certain backend had too many (> 100) connection errors in the last minute. | critical | 1m |
| HaproxyServerConnectionErrors | This alert fires if a certain server had too many (> 100) connection errors in the last minute. | critical | 1m |
| HaproxyBackendMaxActiveSession>80% | This alert fires if the number of active sessions is 80% or more of the maximum allowed. | warning | |
| HaproxyPendingRequests | This alert fires if there are pending backend requests for more than 2 minutes. | warning | 2m |
| HaproxyRetryHigh | This alert fires if the number of retries over the last minute is 10 or more. | warning | 1m |
| HaproxyHasNoAliveBackends | This alert fires when HAProxy has no alive or backup backends. | critical | |
| HaproxyFrontendSecurityBlockedRequests | This alert fires if HAProxy has blocked requests for security reasons 10 or more times over the last 2 minutes. | warning | 2m |
| HaproxyServerHealthcheckFailure | This alert fires when there are failed health checks on HAProxy servers. | warning | 1m |

Read More

You can find more information about monitoring in Kubernetes at the following links: