Alerting

Alerts and pagers for your cluster

Alertmanager

Alertmanager, a tool from the Prometheus stack, handles the alerts sent by the Prometheus server. Alertmanager lets you manage alerts flexibly and route them through receiver integrations such as email, Slack, or PagerDuty.

You can learn how to configure your Alertmanager and its integrations from its own documentation section.
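As a rough illustration of what such a configuration looks like, here is a minimal Alertmanager routing sketch that sends critical alerts to PagerDuty and everything else to Slack. The receiver names, channel, and keys below are placeholders for illustration, not values shipped by the module:

```yaml
# Illustrative Alertmanager configuration: route by severity.
# Receiver names, channel, URL, and keys are placeholders.
route:
  group_by: ['alertname', 'namespace']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-notifications
  routes:
    - match:
        severity: critical
      receiver: pagerduty-oncall
receivers:
  - name: slack-notifications
    slack_configs:
      - channel: '#alerts'
        api_url: 'https://hooks.slack.com/services/REPLACE/ME'
  - name: pagerduty-oncall
    pagerduty_configs:
      - service_key: 'REPLACE-WITH-PAGERDUTY-KEY'
```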

Fury Kubernetes Monitoring module ships with a simple and useful Alertmanager pre-configuration.

Alerts dispatch and support

Fury Kubernetes Monitoring module comes pre-configured with a series of tested and validated alerts and rules, which will cover most of your use cases.

You can add your own alerts on top of ours. Alerts relating to the Fury Kubernetes Cluster can be dispatched to your on-call/SRE team or to the SIGHUP Support team, depending on your Support contract. Alerts relating to your applications can be dispatched to your on-call/dev teams.
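Since the module's rules are managed by the Prometheus Operator, one way to add your own alerts is a separate PrometheusRule object. The sketch below is a hypothetical example; the names, labels, metric, and threshold are illustrative, and the label selector must match what your Prometheus instance is configured to pick up:

```yaml
# Hypothetical PrometheusRule adding an application-level alert on top of
# the module's pre-defined rules. All names and thresholds are examples.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
  namespace: monitoring
  labels:
    prometheus: k8s
    role: alert-rules
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppHighErrorRate
          expr: |
            sum(rate(http_requests_total{job="my-app",code=~"5.."}[5m]))
              / sum(rate(http_requests_total{job="my-app"}[5m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "my-app is serving more than 5% HTTP 5xx responses."
```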

Alerts

The following alerts, listed by the alert group they belong to, come pre-defined with this package.

kube-state-metrics

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeStateMetricsListErrors | kube-state-metrics is experiencing errors at an elevated rate in list operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. | critical |
| KubeStateMetricsWatchErrors | kube-state-metrics is experiencing errors at an elevated rate in watch operations. This is likely causing it to not be able to expose metrics about Kubernetes objects correctly or at all. | critical |

node-exporter

| Alert name | Description | Severity |
| --- | --- | --- |
| NodeFilesystemSpaceFillingUp | Filesystem is predicted to run out of space within the next 24 hours. | warning |
| NodeFilesystemSpaceFillingUp | Filesystem is predicted to run out of space within the next 4 hours. | critical |
| NodeFilesystemAlmostOutOfSpace | Filesystem has less than 5% space left. | warning |
| NodeFilesystemAlmostOutOfSpace | Filesystem has less than 3% space left. | critical |
| NodeFilesystemFilesFillingUp | Filesystem is predicted to run out of inodes within the next 24 hours. | warning |
| NodeFilesystemFilesFillingUp | Filesystem is predicted to run out of inodes within the next 4 hours. | critical |
| NodeFilesystemAlmostOutOfFiles | Filesystem has less than 5% inodes left. | warning |
| NodeFilesystemAlmostOutOfFiles | Filesystem has less than 3% inodes left. | critical |
| NodeNetworkReceiveErrs | Network interface is reporting many receive errors. | warning |
| NodeNetworkTransmitErrs | Network interface is reporting many transmit errors. | warning |
| NodeMachineIDCollision | Machine ID collision. | critical |
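The "predicted to run out of" alerts above are prediction-based: they extrapolate the recent trend of free space forward in time. As a simplified sketch (the upstream rule adds extra guards such as excluding read-only filesystems and requiring free space to already be low), the kind of expression behind them looks like:

```yaml
# Simplified, illustrative version of a predictive filesystem alert in the
# spirit of NodeFilesystemSpaceFillingUp; not the module's exact rule.
- alert: NodeFilesystemSpaceFillingUp
  expr: |
    predict_linear(node_filesystem_avail_bytes{fstype!=""}[6h], 24 * 3600) < 0
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Filesystem is predicted to run out of space within 24 hours."
```

`predict_linear` fits a linear regression over the last 6 hours of available bytes and projects it 24 hours ahead; the alert fires when the projection drops below zero.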

kubernetes-apps

| Alert name | Description | Severity |
| --- | --- | --- |
| KubePodCrashLooping | Pod is restarting too many times over a 5-minute window. | critical |
| KubePodNotReady | Pod has been in a non-ready state for longer than 15 minutes. | critical |
| KubeDeploymentGenerationMismatch | Deployment generation for a deployment does not match; this indicates that the Deployment has failed but has not been rolled back. | critical |
| KubeDeploymentReplicasMismatch | Deployment has not matched the expected number of replicas for longer than 15 minutes. | critical |
| KubeStatefulSetReplicasMismatch | StatefulSet has not matched the expected number of replicas for longer than 15 minutes. | critical |
| KubeStatefulSetGenerationMismatch | StatefulSet generation for a statefulset does not match; this indicates that the StatefulSet has failed but has not been rolled back. | critical |
| KubeStatefulSetUpdateNotRolledOut | StatefulSet update has not been rolled out. | critical |
| KubeDaemonSetRolloutStuck | Only a percentage of the desired Pods of a DaemonSet are scheduled and ready. | critical |
| KubeContainerWaiting | Pod container has been in waiting state for longer than 1 hour. | warning |
| KubeDaemonSetNotScheduled | A number of Pods of a DaemonSet are not scheduled. | warning |
| KubeDaemonSetMisScheduled | A number of Pods of a DaemonSet are running where they are not supposed to run. | warning |
| KubeCronJobRunning | CronJob is taking more than 1 hour to complete. | warning |
| KubeJobCompletion | Job is taking more than 1 hour to complete. | warning |
| KubeJobFailed | Job failed to complete. | warning |
| KubeHpaReplicasMismatch | HPA has not matched the desired number of replicas for longer than 15 minutes. | warning |
| KubeHpaMaxedOut | HPA has been running at max replicas for longer than 15 minutes. | warning |

kubernetes-resources

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeCPUOvercommit | Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. | warning |
| KubeMemOvercommit | Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure. | warning |
| KubeCPUOvercommit | Cluster has overcommitted CPU resource requests for Namespaces. | warning |
| KubeMemOvercommit | Cluster has overcommitted memory resource requests for Namespaces. | warning |
| KubeQuotaExceeded | Namespace is using a high percentage of its resource quota. | warning |
| CPUThrottlingHigh | High throttling of CPU in a namespace for a container in a pod. | warning |

kubernetes-storage

| Alert name | Description | Severity |
| --- | --- | --- |
| KubePersistentVolumeUsage | The PersistentVolume claimed by a PersistentVolumeClaim in a Namespace has only a small percentage of free space left. | critical |
| KubePersistentVolumeFullInFourDays | Based on recent sampling, the PersistentVolume claimed by a PersistentVolumeClaim in a Namespace is expected to fill up within four days. | critical |
| KubePersistentVolumeErrors | The PersistentVolume has status Failed or Pending. | critical |

kubernetes-system

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeVersionMismatch | There are different semantic versions of Kubernetes components running. | warning |
| KubeClientErrors | Kubernetes API server client is experiencing an elevated number of errors. | warning |

kube-apiserver-error

| Alert name | Description | Severity |
| --- | --- | --- |
| ErrorBudgetBurn | Error budget burn rate is too high. | critical |
| ErrorBudgetBurn | Error budget burn rate is high. | warning |

kubernetes-system-apiserver

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeAPILatencyHigh | The API server has an abnormal latency of several seconds for a resource. | warning |
| KubeAPILatencyHigh | The API server has a 99th percentile latency of several seconds for a resource. | critical |
| KubeAPIErrorsHigh | API server is returning errors for a percentage of requests. | critical |
| KubeAPIErrorsHigh | API server is returning errors for a percentage of requests. | warning |
| KubeAPIErrorsHigh | API server is returning errors for a percentage of requests for a resource. | critical |
| KubeAPIErrorsHigh | API server is returning errors for a percentage of requests for a resource. | warning |
| KubeClientCertificateExpiration | A client certificate used to authenticate to the apiserver is expiring in less than 7 days. | warning |
| KubeClientCertificateExpiration | A client certificate used to authenticate to the apiserver is expiring in less than 24 hours. | critical |
| KubeAPIDown | KubeAPI has disappeared from Prometheus target discovery. | critical |

kubernetes-system-kubelet

| Alert name | Description | Severity |
| --- | --- | --- |
| KubeNodeNotReady | A node has been unready for more than 15 minutes. | warning |
| KubeNodeUnreachable | A node is unreachable and some workloads may be rescheduled. | warning |
| KubeletTooManyPods | Kubelet node is running at a high percentage of its Pod capacity. | warning |
| KubeletDown | Kubelet has disappeared from Prometheus target discovery. | critical |
KubeletDown Kubelet has disappeared from Prometheus target discovery. critical

prometheus

| Alert name | Description | Severity |
| --- | --- | --- |
| PrometheusBadConfig | Failed Prometheus configuration reload. | critical |
| PrometheusNotificationQueueRunningFull | Prometheus alert notification queue predicted to run full in less than 30m. | warning |
| PrometheusErrorSendingAlertsToSomeAlertmanagers | Prometheus has encountered more than 1% errors sending alerts to a specific Alertmanager. | warning |
| PrometheusErrorSendingAlertsToAnyAlertmanager | Prometheus has encountered more than 3% errors sending alerts to any Alertmanager. | critical |
| PrometheusNotConnectedToAlertmanagers | Prometheus is not connected to any Alertmanagers. | warning |
| PrometheusTSDBReloadsFailing | Prometheus has issues reloading blocks from disk. | warning |
| PrometheusTSDBCompactionsFailing | Prometheus has issues compacting blocks. | warning |
| PrometheusNotIngestingSamples | Prometheus is not ingesting samples. | warning |
| PrometheusDuplicateTimestamps | Prometheus is dropping samples with duplicate timestamps. | warning |
| PrometheusOutOfOrderTimestamps | Prometheus is dropping samples with out-of-order timestamps. | warning |
| PrometheusRemoteStorageFailures | Prometheus is failing to send samples to remote storage. | critical |
| PrometheusRemoteWriteBehind | Prometheus remote write is behind. | critical |
| PrometheusRemoteWriteDesiredShards | Prometheus remote write desired shards calculation wants to run more than the configured max shards. | warning |
| PrometheusRuleFailures | Prometheus is failing rule evaluations. | critical |
| PrometheusMissingRuleEvaluations | Prometheus is missing rule evaluations due to slow rule group evaluation. | warning |

alertmanager.rules

| Alert name | Description | Severity |
| --- | --- | --- |
| AlertmanagerConfigInconsistent | The configurations of the instances of the Alertmanager cluster are out of sync. | critical |
| AlertmanagerFailedReload | Reloading Alertmanager's configuration has failed. | warning |
| AlertmanagerMembersInconsistent | Alertmanager has not found all other members of the cluster. | critical |

general.rules

| Alert name | Description | Severity |
| --- | --- | --- |
| TargetDown | A percentage of the targets in a namespace are down. | warning |
| DeadMansSwitch | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example, the "DeadMansSnitch" integration in PagerDuty. | none |
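One common pattern for wiring up DeadMansSwitch is a dedicated route that forwards the always-firing alert to an external heartbeat endpoint at a short interval, so that a missed notification signals a broken alerting pipeline. The fragment below is illustrative (merge it into your existing routing tree; the URL is a placeholder):

```yaml
# Illustrative routing fragment for the DeadMansSwitch alert.
# The webhook URL is a placeholder for your heartbeat/snitch endpoint.
route:
  routes:
    - match:
        alertname: DeadMansSwitch
      receiver: heartbeat
      repeat_interval: 5m
receivers:
  - name: heartbeat
    webhook_configs:
      - url: 'https://example.com/heartbeat/REPLACE-ME'
```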

node-time

| Alert name | Description | Severity |
| --- | --- | --- |
| ClockSkewDetected | Clock skew detected on a node-exporter. Ensure NTP is configured correctly on this host. | warning |

node-network

| Alert name | Description | Severity |
| --- | --- | --- |
| NodeNetworkInterfaceFlapping | Network interface is changing its up status too often on a node-exporter. | warning |

prometheus-operator

| Alert name | Description | Severity |
| --- | --- | --- |
| PrometheusOperatorReconcileErrors | Errors while reconciling a controller in a Namespace. | warning |
| PrometheusOperatorNodeLookupErrors | Errors while reconciling Prometheus in a Namespace. | warning |

Last modified 19.05.2020: Fixing typos (d65a551)