Version: 1.27.2

Prometheus Operated

Prometheus Operated deploys Prometheus instances via the Prometheus CRD defined by the Prometheus Operator.

Prometheus is a monitoring tool that collects metrics-based time series data and provides a functional expression language that lets the user select and aggregate time series data in real time. Prometheus's expression browser makes it possible to analyze queried data as a graph or view it as tabular data, and Prometheus can also be integrated with third-party time series analytics tools such as Grafana. Grafana integration is provided in the Fury monitoring katalog; please see the Grafana package's documentation.

Requirements

Image repository and tag

Configuration

Fury distribution Prometheus is deployed with the following configuration (the sketch after this list shows the corresponding Prometheus CRD fields):

  • Replicas: 1
  • Retention: 30 days
  • Retention size: 120 GB (80% of the default 150 Gi of storage)
  • Storage: 150Gi requested (using the provider's default storage class)
  • Listens on port 9090
  • Alertmanager endpoint set to alertmanager-main
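
As a reference, these settings map onto the Prometheus custom resource roughly as in the sketch below; the object name, namespace, and the Alertmanager port name (web) are assumptions based on this package's defaults, so check the manifests shipped with the package for the authoritative values.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring
spec:
  replicas: 1
  # Keep samples for 30 days, or up to 120 GB of data, whichever limit is hit first
  retention: 30d
  retentionSize: 120GB
  storage:
    volumeClaimTemplate:
      spec:
        # No storageClassName set: the provider's default StorageClass is used
        resources:
          requests:
            storage: 150Gi
  alerting:
    alertmanagers:
      # Assumed to be the alertmanager-main service on its web port
      - namespace: monitoring
        name: alertmanager-main
        port: web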

Deployment

You can deploy Prometheus Operated by running the following command:

kustomize build | kubectl apply -f -

To learn how to customize it for your needs, please see the examples folder.
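
For example, a minimal kustomization.yaml consuming this package could look like the sketch below; the vendor path is an assumption and depends on how the Fury katalog is vendored in your project:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
  # Illustrative path: adjust it to where prometheus-operated is vendored in your repository
  - ../vendor/katalog/monitoring/prometheus-operated

You can then layer patches on top of this base to override replicas, retention, storage, and so on.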

Accessing Prometheus UI

You can access the Prometheus expression browser by port-forwarding port 9090:

kubectl port-forward svc/prometheus-k8s 9090:9090 --namespace monitoring

Now, if you go to http://127.0.0.1:9090 in your browser, you can execute queries and visualize the results.
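
For example, the following query, which uses the up metric that Prometheus records for every scrape target, shows how many targets are currently reachable, grouped by scrape job:

sum by (job) (up)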

Service Monitoring

Target discovery is achieved via the ServiceMonitor CRD. To learn more about ServiceMonitor, please refer to the Prometheus Operator's documentation.

To learn how to create ServiceMonitor resources for your services, please see the example.
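
As a minimal sketch (the names, namespace, labels, and port are illustrative, and the ServiceMonitor's labels must match the serviceMonitorSelector of the Prometheus instance), a ServiceMonitor that scrapes the metrics port of Services labeled app: my-app could look like this:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    # Hypothetical label: it must be matched by the Prometheus serviceMonitorSelector
    k8s-app: my-app
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - my-namespace
  endpoints:
    # port refers to the name of a port in the target Service, so the Service must expose a named port
    - port: metrics
      interval: 30s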

Prometheus Rules and Alerts

Alerting rules are created via the PrometheusRule CRD. To learn more about PrometheusRule, please refer to the Prometheus Operator documentation and the Prometheus documentation.

To learn how to define alert rules for your services, please see the example.
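
As a sketch (the rule group, alert name, expression, and labels are illustrative, and the metadata labels must match the ruleSelector of the Prometheus instance), a PrometheusRule that fires when a hypothetical my-app scrape job has been down for 10 minutes could look like this:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-rules
  namespace: monitoring
  labels:
    # Hypothetical label: it must be matched by the Prometheus ruleSelector
    role: alert-rules
spec:
  groups:
    - name: my-app.rules
      rules:
        - alert: MyAppDown
          # up is 0 when the target cannot be scraped
          expr: up{job="my-app"} == 0
          for: 10m
          labels:
            severity: warning
          annotations:
            message: The my-app target has been unreachable for 10 minutes.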

Alerts

The following alerts are already defined for this package.

kubernetes-apps

Alert | Description | Severity | Interval
KubePodCrashLooping | This alert fires if the per-second rate of the total number of restarts of a given pod, measured over a 15-minute window, was above 0 in the last hour, i.e. the pod is stuck in a crash loop. | warning | 1h
KubePodNotReady | This alert fires if at least one pod was stuck in the Pending or Unknown phase in the last hour. | warning | 1h
KubeDeploymentGenerationMismatch | This alert fires if, in the last 15 minutes, a Deployment's observed generation (the revision number recorded in the object status) was different from the metadata generation (the revision number in the Deployment metadata). | warning | 15m
KubeDeploymentReplicasMismatch | This alert fires if a Deployment's specified number of replicas did not match the number of available replicas in the last hour. | warning | 1h
KubeStatefulSetReplicasMismatch | This alert fires if a StatefulSet's specified number of replicas did not match the number of available replicas in the last 15 minutes. | warning | 15m
KubeStatefulSetGenerationMismatch | This alert fires if a StatefulSet's observed generation (the revision number recorded in the object status) was different from the metadata generation in the last 15 minutes. | warning | 15m
KubeDaemonSetRolloutStuck | This alert fires if the percentage of a DaemonSet's pods that are ready was less than 100% in the last 15 minutes. | warning | 15m
KubeDaemonSetNotScheduled | This alert fires if the desired number of DaemonSet pods was higher than the number of currently scheduled pods in the last 10 minutes. | warning | 10m
KubeDaemonSetMisScheduled | This alert fires if at least one DaemonSet pod was running on a node where it was not supposed to run in the last 10 minutes. | warning | 10m
KubeCronJobRunning | This alert fires if at least one CronJob took more than one hour to complete. | warning | 1h
KubeJobCompletion | This alert fires if at least one Job took more than one hour to complete. | warning | 1h
KubeJobFailed | This alert fires if at least one Job failed in the last hour. | warning | 1h
KubeLatestImageTag | This alert fires if there are images deployed in the cluster tagged with :latest, which is considered dangerous. | warning | 1h

kube-prometheus-node-alerting.rules

Alert | Description | Severity | Interval
NodeCPUSaturating | This alert fires if, for a given instance, CPU utilisation and saturation were higher than 90% in the last 30 minutes. | warning | 30m
NodeCPUStuckInIOWait | This alert fires if the CPU time spent in IOWait mode, measured over a 5-minute window, was more than 50% for a given instance in the last 15 minutes. | warning | 15m
NodeMemoryRunningFull | This alert fires if memory utilisation on a given node was higher than 85% in the last 30 minutes. | warning | 30m
NodeFilesystemUsageCritical | This alert fires if in the last minute the filesystem usage was more than 90%. | critical | 1m
NodeFilesystemFullInFourDays | This alert fires if in the last 5 minutes the filesystem usage was more than 85% and, based on a linear prediction of the volume usage in the last 6 hours, the volume will be full in four days. | warning | 5m
NodeFilesystemInodeUsageCritical | This alert fires if the available inodes in a given filesystem were less than 10% in the last minute. | critical | 1m
NodeFilesystemInodeFullInFourDays | This alert fires if, based on a linear prediction of the inode usage in the last 6 hours, the filesystem will exhaust its inodes in four days. | warning | 5m
NodeNetworkDroppingPackets | This alert fires if a given physical network interface was dropping more than 10 packets per second in the last 30 minutes. | warning | 30m

prometheus

Alert | Description | Severity | Interval
PrometheusConfigReloadFailed | This alert fires if Prometheus's configuration failed to reload in the last 10 minutes. | critical | 10m
PrometheusNotificationQueueRunningFull | This alert fires if Prometheus's alert notification queue will run full within the next 30 minutes, based on a linear prediction of its usage in the last 5 minutes. | critical | 10m
PrometheusErrorSendingAlerts | This alert fires if the error rate when sending alerts, measured over a 5-minute window, was more than 1% in the last 10 minutes. | critical | 10m
PrometheusErrorSendingAlerts | This alert fires if the error rate when sending alerts, measured over a 5-minute window, was more than 3% in the last 10 minutes. | critical | 10m
PrometheusNotConnectedToAlertmanagers | This alert fires if Prometheus was not connected to at least one Alertmanager in the last 10 minutes. | critical | 10m
PrometheusTSDBReloadsFailing | This alert fires if Prometheus had any failures to reload data blocks from disk in the last 12 hours. | critical | 12h
PrometheusTSDBCompactionsFailing | This alert fires if Prometheus had any failures to compact sample blocks in the last 12 hours. | critical | 12h
PrometheusTSDBWALCorruptions | This alert fires if Prometheus detected any corruption in the write-ahead log in the last 4 hours. | critical | 4h
PrometheusNotIngestingSamples | This alert fires if Prometheus's sample ingestion rate, measured over a 5-minute window, was below or equal to 0 in the last 10 minutes, i.e. Prometheus is failing to ingest samples. | critical | 10m
PrometheusTargetScrapesDuplicate | This alert fires if Prometheus was discarding many samples due to duplicated timestamps with different values in the last 10 minutes. | warning | 10m

general

Alert | Description | Severity | Interval
TargetDown | This alert fires if more than 10% of the targets were down in the last 10 minutes. | critical | 10m
FdExhaustion | This alert fires if, based on a linear prediction of the file descriptor usage in the last hour, the instance will exhaust its file descriptors in 4 hours. | warning | 10m
FdExhaustion | This alert fires if, based on a linear prediction of the file descriptor usage in the last 10 minutes, the instance will exhaust its file descriptors in one hour. | critical | 10m
DeadMansSwitch | This is a DeadMansSwitch meant to ensure that the entire alerting pipeline is functional. | none | -

kubernetes-system

Alert | Description | Severity | Interval
KubeNodeNotReady | This alert fires if a given node was not in Ready status in the last hour. | critical | 1h
KubeVersionMismatch | This alert fires if the versions of the Kubernetes components were mismatched in the last hour. | warning | 1h
KubeClientErrors | This alert fires if the Kubernetes API client error response rate, measured over a 5-minute window, was more than 1% in the last 15 minutes. | warning | 15m
KubeClientErrors | This alert fires if the Kubernetes API client error response rate, measured over a 5-minute window, was more than 0.1 errors/sec in the last 15 minutes. | warning | 15m
KubeletTooManyPods | This alert fires if a given kubelet is running more than 100 pods and is approaching the hard limit of 110 pods per node. | warning | 15m
KubeAPILatencyHigh | This alert fires if the API server's 99th percentile latency was more than 1 second in the last 10 minutes. | warning | 10m
KubeAPILatencyHigh | This alert fires if the API server's 99th percentile latency was more than 4 seconds in the last 10 minutes. | critical | 10m
KubeAPIErrorsHigh | This alert fires if the request error rate, measured over a 5-minute window, was more than 5% in the last 10 minutes. | critical | 10m

kubernetes-storage

Alert | Description | Severity | Interval
KubePersistentVolumeStuck | This alert fires if a given PersistentVolume was stuck in the Pending or Failed phase in the last hour. | warning | 1h
KubePersistentVolumeUsageCritical | This alert fires if the available space in a given PersistentVolumeClaim was less than 10% in the last minute. | critical | 1m
KubePersistentVolumeFullInFourDays | This alert fires if, based on a linear prediction of the volume usage in the last 6 hours, the volume will be full in four days. | warning | 5m
KubePersistentVolumeInodeUsageCritical | This alert fires if the available inodes in a given PersistentVolumeClaim were less than 10% in the last minute. | critical | 1m
KubePersistentVolumeInodeFullInFourDays | This alert fires if, based on a linear prediction of the inode usage in the last 6 hours, the volume will exhaust its inodes in four days. | warning | 5m

kubernetes-absent

Alert | Description | Severity | Interval
AlertmanagerDown | This alert fires if Prometheus target discovery was not able to reach Alertmanager in the last 15 minutes. | critical | 15m
KubeAPIDown | This alert fires if Prometheus target discovery was not able to reach kube-apiserver in the last 15 minutes. | critical | 15m
KubeStateMetricsDown | This alert fires if Prometheus target discovery was not able to reach kube-state-metrics in the last 15 minutes. | critical | 15m
KubeletDown | This alert fires if Prometheus target discovery was not able to reach the kubelet in the last 15 minutes. | critical | 15m
NodeExporterDown | This alert fires if Prometheus target discovery was not able to reach node-exporter in the last 15 minutes. | critical | 15m
PrometheusDown | This alert fires if Prometheus target discovery was not able to reach Prometheus in the last 15 minutes. | critical | 15m
PrometheusOperatorDown | This alert fires if Prometheus target discovery was not able to reach the Prometheus Operator in the last 15 minutes. | critical | 15m

alertmanager

Alert | Description | Severity | Interval
AlertmanagerConfigInconsistent | This alert fires if the configurations of the instances of the Alertmanager cluster were out of sync in the last 5 minutes. | critical | 5m
AlertmanagerDownOrMissing | This alert fires if, in the last 5 minutes, an unexpected number of Alertmanagers were scraped or Alertmanagers disappeared from target discovery. | critical | 5m
AlertmanagerFailedReload | This alert fires if Alertmanager's configuration reload failed in the last 10 minutes. | critical | 10m

kubernetes-resources

Alert | Description | Severity | Interval
KubeCPUOvercommit | This alert fires if the cluster-wide CPU requests from pods in the last 5 minutes were so high that the cluster could not tolerate a node failure. | warning | 5m
KubeMemOvercommit | This alert fires if the cluster-wide memory requests from pods in the last 5 minutes were so high that the cluster could not tolerate a node failure. | warning | 5m
KubeCPUOvercommit | This alert fires if the hard limit of the CPU resource quota in the last 5 minutes was more than 150% of the available resources, i.e. the hard limit is set too high. | warning | 5m
KubeMemOvercommit | This alert fires if the hard limit of the memory resource quota in the last 5 minutes was more than 150% of the available resources, i.e. the hard limit is set too high. | warning | 5m
KubeQuotaExceeded | This alert fires if a given resource was used for more than 90% of its corresponding hard quota in the last 15 minutes. | warning | 15m