Rules

ansible managed alert rules (last evaluation: 8.992s ago; evaluation time: 30.91ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: Watchdog expr: vector(1) for: 10m labels: severity: warning annotations: description: This is an alert meant to ensure that the entire alerting pipeline is functional. This alert is always firing, therefore it should always be firing in Alertmanager and always fire against a receiver. There are integrations with various notification mechanisms that send a notification when this alert is not firing. For example the "DeadMansSnitch" integration in PagerDuty. summary: Ensure entire alerting pipeline is functional ok 8.992s ago 912us
alert: InstanceDown expr: up == 0 for: 5m labels: severity: critical annotations: description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.' summary: Instance {{ $labels.instance }} down ok 8.991s ago 3.595ms
alert: SshServiceDown expr: probe_success == 0 for: 5m labels: severity: critical annotations: description: '{{ $labels.target }} of job {{ $labels.job }} has been down for more than 5 minutes.' summary: SSH service {{ $labels.target }} down ok 8.988s ago 744.7us
alert: CriticalCPULoad expr: instance:node_cpu:load > 98 for: 30m labels: severity: critical annotations: description: '{{ $labels.instance }} of job {{ $labels.job }} has Critical CPU load for more than 30 minutes.' summary: Instance {{ $labels.instance }} - Critical CPU load ok 8.987s ago 996us
alert: CriticalRAMUsage expr: instance:node_ram:usage > 98 for: 5m labels: severity: critical annotations: description: '{{ $labels.instance }} has Critical Memory Usage for more than 5 minutes.' summary: Instance {{ $labels.instance }} has Critical Memory Usage ok 8.986s ago 914.5us
alert: CriticalDiskSpace expr: instance:node_fs:disk_space < 20 for: 4m labels: severity: critical annotations: description: '{{ $labels.instance }} of job {{ $labels.job }} has less than 20% space remaining.' summary: Instance {{ $labels.instance }} - Critical disk space usage ok 8.986s ago 1.177ms
alert: ClockSkewDetected expr: abs(node_timex_offset_seconds) * 1000 > 30 for: 2m labels: severity: warning annotations: description: Clock skew detected on {{ $labels.instance }}. Ensure NTP is configured correctly on this host. summary: Instance {{ $labels.instance }} - Clock skew detected ok 8.984s ago 1.549ms
alert: NumberRestartPerInitContainerContainer expr: kube_pod_init_container_status_restarts_total > 10 for: 1m labels: severity: critical annotations: description: 'Container {{ $labels.container }} of pod {{ $labels.kubernetes_pod_name }} in namespace {{ $labels.kubernetes_namespace }} keeps restarting.' summary: Container {{ $labels.container }} - Too many restarts ok 8.983s ago 20.75ms
alert: PendingJobs expr: gitlab_runner_jobs{state="pending"} > 0 for: 10m labels: severity: critical annotations: description: '{{ $labels.instance }} of job {{ $labels.job }} has pending jobs for more than 10 minutes.' summary: Instance {{ $labels.instance }} - Gitlab runner pending jobs ok 8.963s ago 235.6us
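
The rows above flatten the underlying YAML definitions into single lines. As a readability aid, here is a sketch of how the first two rules would appear in the rules file that Ansible deploys; the expressions and fields are taken verbatim from the table, while the file framing itself is an assumption (the UI only exposes the flattened form).

    groups:
      - name: ansible managed alert rules
        rules:
          - alert: Watchdog
            expr: vector(1)   # constant series, so the alert fires by design
            for: 10m
            labels:
              severity: warning
            annotations:
              summary: Ensure entire alerting pipeline is functional
          - alert: InstanceDown
            expr: up == 0     # target failed its scrapes
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: Instance {{ $labels.instance }} down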

ceph dashboard rules (last evaluation: 19.758s ago; evaluation time: 25.97ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: Ceph Health Warning expr: ceph_health_status == 1 for: 1m labels: severity: page annotations: description: The Ceph cluster health is in a warning state summary: Ceph Health Warning ok 19.759s ago 361.3us
alert: Ceph Health Error expr: ceph_health_status > 1 for: 1m labels: severity: page annotations: description: The Ceph cluster health is in an error state summary: Ceph Health Error ok 19.759s ago 224.3us
alert: Disk(s) Near Full expr: (ceph_osd_stat_bytes_used / ceph_osd_stat_bytes) * 100 > 85 for: 1m labels: severity: page annotations: description: This shows how many disks are at or above 85% full. Performance may degrade beyond this threshold on filestore (XFS) backed OSDs. summary: Disk(s) Near Full ok 19.759s ago 406.3us
alert: OSD(s) Down expr: ceph_osd_up < 0.5 for: 1m labels: severity: page annotations: description: This indicates that one or more OSDs are currently marked down in the cluster. summary: OSD(s) Down ok 19.759s ago 289.6us
alert: OSD Host(s) Down expr: count by(instance) (ceph_disk_occupation * on(ceph_daemon) group_right(instance) ceph_osd_up == 0) - count by(instance) (ceph_disk_occupation) == 0 for: 1m labels: severity: page annotations: description: This indicates that one or more OSD hosts are currently down in the cluster. summary: OSD Host(s) Down ok 19.758s ago 802.6us
alert: PG(s) Stuck expr: max(ceph_osd_numpg) > scalar(ceph_pg_active) for: 1m labels: severity: page annotations: description: This indicates there are PGs in a stuck state; manual intervention is needed to resolve them. summary: PG(s) Stuck ok 19.758s ago 581.7us
alert: Slow OSD Responses expr: ((irate(node_disk_read_time_seconds_total[5m]) / clamp_min(irate(node_disk_reads_completed_total[5m]), 1) + irate(node_disk_write_time_seconds_total[5m]) / clamp_min(irate(node_disk_writes_completed_total[5m]), 1)) and on(instance, device) ceph_disk_occupation) > 1 for: 1m labels: severity: page annotations: description: This indicates that some OSD Latencies are above 1s. summary: Slow OSD Responses ok 19.757s ago 16.51ms
alert: Network Errors expr: sum by(instance, device) (irate(node_network_receive_drop_total{device=~"(eth|en|bond|ib|mlx|p).*"}[5m]) + irate(node_network_receive_errs_total{device=~"(eth|en|bond|ib|mlx|p).*"}[5m]) + irate(node_network_transmit_drop_total{device=~"(eth|en|bond|ib|mlx|p).*"}[5m]) + irate(node_network_transmit_errs_total{device=~"(eth|en|bond|ib|mlx|p).*"}[5m])) > 10 for: 1m labels: severity: page annotations: description: This indicates that more than 10 dropped/error packets are seen in a 5m interval. summary: Network Errors ok 19.741s ago 5.355ms
alert: Pool Capacity Low expr: (ceph_pool_bytes_used / (ceph_pool_bytes_used + ceph_pool_max_avail) * 100 + on(pool_id) group_left(name) (ceph_pool_metadata * 0)) > 85 for: 1m labels: severity: page annotations: description: This indicates a low capacity in a pool. summary: Pool Capacity Low ok 19.736s ago 567us
alert: MON(s) Down expr: ceph_mon_quorum_status != 1 for: 1m labels: severity: page annotations: description: This indicates that one or more MONs are down. summary: MON(s) Down ok 19.736s ago 193.5us
alert: Cluster Capacity Low expr: sum(ceph_osd_stat_bytes_used) / sum(ceph_osd_stat_bytes) > 0.85 for: 1m labels: severity: page annotations: description: This indicates raw used space crosses the 85% capacity threshold of the ceph cluster. summary: Cluster Capacity Low ok 19.735s ago 447.8us
alert: OSD(s) with High PG Count expr: ceph_osd_numpg > 275 for: 1m labels: severity: page annotations: description: This indicates there are some OSDs with high PG count (275+). summary: OSD(s) with High PG Count ok 19.735s ago 189.6us
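
The Slow OSD Responses expression is the densest rule in this group, so here it is unflattened, with comments. It computes average per-operation latency (time spent divided by operations completed) for reads and writes, clamps the operation rate to at least 1 to avoid division by zero on idle disks, and keeps only devices that ceph_disk_occupation maps to an OSD:

    (
      (
          irate(node_disk_read_time_seconds_total[5m])
        / clamp_min(irate(node_disk_reads_completed_total[5m]), 1)
      +
          irate(node_disk_write_time_seconds_total[5m])
        / clamp_min(irate(node_disk_writes_completed_total[5m]), 1)
      )
      and on(instance, device) ceph_disk_occupation   # restrict to OSD-backing devices
    ) > 1                                             # average latency above 1s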

elasticsearch (last evaluation: 3.168s ago; evaluation time: 3.443ms)

Rule | State | Error | Last Evaluation | Evaluation Time
record: elasticsearch_filesystem_data_used_percent expr: 100 * (elasticsearch_filesystem_data_size_bytes - elasticsearch_filesystem_data_free_bytes) / elasticsearch_filesystem_data_size_bytes ok 3.168s ago 1.28ms
record: elasticsearch_filesystem_data_free_percent expr: 100 - elasticsearch_filesystem_data_used_percent ok 3.167s ago 1.533ms
alert: ElasticsearchTooFewNodesRunning expr: elasticsearch_cluster_health_number_of_nodes < 3 for: 5m labels: severity: critical annotations: description: There are only {{$value}} Elasticsearch nodes running, fewer than the expected 3 summary: Elasticsearch running on fewer than 3 nodes ok 3.165s ago 210.6us
alert: ElasticsearchHeapTooHigh expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"} > 0.9 for: 15m labels: severity: critical annotations: description: The heap usage is over 90% for 15m summary: ElasticSearch node {{$labels.node}} heap usage is high ok 3.165s ago 378us
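
The two recording rules make filesystem usage available as a simple percentage, which keeps downstream queries and dashboards cheap. A hypothetical alert built on top of them might look like the sketch below; the alert name and the 90% threshold are illustrative assumptions, not rules present in this group.

    - alert: ElasticsearchDiskSpaceLow     # hypothetical, for illustration only
      expr: elasticsearch_filesystem_data_used_percent > 90
      for: 15m
      labels:
        severity: critical
      annotations:
        summary: Elasticsearch node {{$labels.node}} disk usage above 90%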

kube-apiserver-slos (last evaluation: 12.89s ago; evaluation time: 2.488ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1h) > (14.4 * 0.01) and sum(apiserver_request:burnrate5m) > (14.4 * 0.01) for: 2m labels: severity: critical annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 12.89s ago 868.1us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate6h) > (6 * 0.01) and sum(apiserver_request:burnrate30m) > (6 * 0.01) for: 15m labels: severity: critical annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 12.89s ago 379.8us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate1d) > (3 * 0.01) and sum(apiserver_request:burnrate2h) > (3 * 0.01) for: 1h labels: severity: warning annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 12.889s ago 461.1us
alert: KubeAPIErrorBudgetBurn expr: sum(apiserver_request:burnrate3d) > (1 * 0.01) and sum(apiserver_request:burnrate6h) > (1 * 0.01) for: 3h labels: severity: warning annotations: message: The API server is burning too much error budget runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorbudgetburn ok 12.889s ago 758.2us
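
The four alert pairs implement the multiwindow, multi-burn-rate pattern from the Google SRE workbook, assuming a 99% availability SLO over a 30-day window (hence the 0.01 error-budget factor in each threshold). With 30d = 720h, the burn-rate thresholds translate into budget consumption as follows; each pair fires once a fixed fraction of the monthly budget would be gone:

    budget consumed = burn rate x long window / SLO period
    14.4 x  1h / 720h = 2%  of the 30d budget   (critical, 1h/5m pair)
    6    x  6h / 720h = 5%                      (critical, 6h/30m pair)
    3    x 24h / 720h = 10%                     (warning, 1d/2h pair)
    1    x 72h / 720h = 10%                     (warning, 3d/6h pair)

The short window in each pair (5m, 30m, 2h, 6h) only gates the alert so it stops firing quickly once the burn subsides.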

kubernetes-apps (last evaluation: 18.224s ago; evaluation time: 487.8ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubePodCrashLooping expr: rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0 for: 15m labels: severity: critical annotations: message: Pod {{ $labels.namespace }}/{{ $labels.pod }} ({{ $labels.container }}) is restarting {{ printf "%.2f" $value }} times / 5 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodcrashlooping ok 18.224s ago 67.01ms
alert: KubePodNotReady expr: sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0 for: 15m labels: severity: critical annotations: message: Pod {{ $labels.namespace }}/{{ $labels.pod }} has been in a non-ready state for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepodnotready ok 18.158s ago 82.5ms
alert: KubeDeploymentGenerationMismatch expr: kube_deployment_status_observed_generation{job="kube-state-metrics"} != kube_deployment_metadata_generation{job="kube-state-metrics"} for: 15m labels: severity: critical annotations: message: Deployment generation for {{ $labels.namespace }}/{{ $labels.deployment }} does not match, this indicates that the Deployment has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentgenerationmismatch ok 18.075s ago 4.775ms
alert: KubeDeploymentReplicasMismatch expr: (kube_deployment_spec_replicas{job="kube-state-metrics"} != kube_deployment_status_replicas_available{job="kube-state-metrics"}) and (changes(kube_deployment_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0) for: 15m labels: severity: critical annotations: message: Deployment {{ $labels.namespace }}/{{ $labels.deployment }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedeploymentreplicasmismatch ok 18.071s ago 7.298ms
alert: KubeStatefulSetReplicasMismatch expr: (kube_statefulset_status_replicas_ready{job="kube-state-metrics"} != kube_statefulset_status_replicas{job="kube-state-metrics"}) and (changes(kube_statefulset_status_replicas_updated{job="kube-state-metrics"}[5m]) == 0) for: 15m labels: severity: critical annotations: message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetreplicasmismatch ok 18.063s ago 36.83ms
alert: KubeStatefulSetGenerationMismatch expr: kube_statefulset_status_observed_generation{job="kube-state-metrics"} != kube_statefulset_metadata_generation{job="kube-state-metrics"} for: 15m labels: severity: critical annotations: message: StatefulSet generation for {{ $labels.namespace }}/{{ $labels.statefulset }} does not match, this indicates that the StatefulSet has failed but has not been rolled back. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetgenerationmismatch ok 18.027s ago 20.41ms
alert: KubeStatefulSetUpdateNotRolledOut expr: max without(revision) (kube_statefulset_status_current_revision{job="kube-state-metrics"} unless kube_statefulset_status_update_revision{job="kube-state-metrics"}) * (kube_statefulset_replicas{job="kube-state-metrics"} != kube_statefulset_status_replicas_updated{job="kube-state-metrics"}) for: 15m labels: severity: critical annotations: message: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} update has not been rolled out. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubestatefulsetupdatenotrolledout ok 18.006s ago 59.04ms
alert: KubeDaemonSetRolloutStuck expr: kube_daemonset_status_number_ready{job="kube-state-metrics"} / kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} < 1 for: 15m labels: severity: critical annotations: message: Only {{ $value | humanizePercentage }} of the desired Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are scheduled and ready. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetrolloutstuck ok 17.948s ago 1.614ms
alert: KubeContainerWaiting expr: sum by(namespace, pod, container) (kube_pod_container_status_waiting_reason{job="kube-state-metrics"}) > 0 for: 1h labels: severity: warning annotations: message: Pod {{ $labels.namespace }}/{{ $labels.pod }} container {{ $labels.container}} has been in waiting state for longer than 1 hour. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontainerwaiting ok 17.946s ago 190.4ms
alert: KubeDaemonSetNotScheduled expr: kube_daemonset_status_desired_number_scheduled{job="kube-state-metrics"} - kube_daemonset_status_current_number_scheduled{job="kube-state-metrics"} > 0 for: 10m labels: severity: warning annotations: message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetnotscheduled ok 17.756s ago 1.375ms
alert: KubeDaemonSetMisScheduled expr: kube_daemonset_status_number_misscheduled{job="kube-state-metrics"} > 0 for: 15m labels: severity: warning annotations: message: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubedaemonsetmisscheduled ok 17.754s ago 629us
alert: KubeCronJobRunning expr: time() - kube_cronjob_next_schedule_time{job="kube-state-metrics"} > 3600 for: 1h labels: severity: warning annotations: message: CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecronjobrunning ok 17.754s ago 331.7us
alert: KubeJobCompletion expr: kube_job_spec_completions{job="kube-state-metrics"} - kube_job_status_succeeded{job="kube-state-metrics"} > 0 for: 1h labels: severity: warning annotations: message: Job {{ $labels.namespace }}/{{ $labels.job_name }} is taking more than one hour to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobcompletion ok 17.754s ago 9.083ms
alert: KubeJobFailed expr: kube_job_failed{job="kube-state-metrics"} > 0 for: 15m labels: severity: warning annotations: message: Job {{ $labels.namespace }}/{{ $labels.job_name }} failed to complete. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubejobfailed ok 17.745s ago 5.492ms
alert: KubeHpaReplicasMismatch expr: (kube_hpa_status_desired_replicas{job="kube-state-metrics"} != kube_hpa_status_current_replicas{job="kube-state-metrics"}) and changes(kube_hpa_status_current_replicas[15m]) == 0 for: 15m labels: severity: warning annotations: message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has not matched the desired number of replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpareplicasmismatch ok 17.739s ago 603.4us
alert: KubeHpaMaxedOut expr: kube_hpa_status_current_replicas{job="kube-state-metrics"} == kube_hpa_spec_max_replicas{job="kube-state-metrics"} for: 15m labels: severity: warning annotations: message: HPA {{ $labels.namespace }}/{{ $labels.hpa }} has been running at max replicas for longer than 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubehpamaxedout ok 17.739s ago 261us
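
A note on the KubePodCrashLooping arithmetic, since the "* 60 * 5" factor is easy to misread: rate() returns restarts per second averaged over 15 minutes, and multiplying by 300 rescales that to restarts per 5-minute window, which is what the annotation reports via {{ printf "%.2f" $value }}:

    # restarts per second over 15m, rescaled to restarts per 5 minutes
    rate(kube_pod_container_status_restarts_total{job="kube-state-metrics"}[15m]) * 60 * 5 > 0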

kubernetes-resources (last evaluation: 25.847s ago; evaluation time: 17.72ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeCPUOvercommit expr: sum(namespace:kube_pod_container_resource_requests_cpu_cores:sum) / sum(kube_node_status_allocatable_cpu_cores) > (count(kube_node_status_allocatable_cpu_cores) - 1) / count(kube_node_status_allocatable_cpu_cores) for: 5m labels: severity: warning annotations: message: Cluster has overcommitted CPU resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuovercommit ok 25.848s ago 2.691ms
alert: KubeMemoryOvercommit expr: sum(namespace:kube_pod_container_resource_requests_memory_bytes:sum) / sum(kube_node_status_allocatable_memory_bytes) > (count(kube_node_status_allocatable_memory_bytes) - 1) / count(kube_node_status_allocatable_memory_bytes) for: 5m labels: severity: warning annotations: message: Cluster has overcommitted memory resource requests for Pods and cannot tolerate node failure. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryovercommit ok 25.845s ago 3.084ms
alert: KubeCPUQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="cpu",type="hard"}) / sum(kube_node_status_allocatable_cpu_cores) > 1.5 for: 5m labels: severity: warning annotations: message: Cluster has overcommitted CPU resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecpuquotaovercommit ok 25.842s ago 1.011ms
alert: KubeMemoryQuotaOvercommit expr: sum(kube_resourcequota{job="kube-state-metrics",resource="memory",type="hard"}) / sum(kube_node_status_allocatable_memory_bytes{job="node-exporter"}) > 1.5 for: 5m labels: severity: warning annotations: message: Cluster has overcommitted memory resource requests for Namespaces. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubememoryquotaovercommit ok 25.841s ago 320.4us
alert: KubeQuotaExceeded expr: kube_resourcequota{job="kube-state-metrics",type="used"} / ignoring(instance, job, type) (kube_resourcequota{job="kube-state-metrics",type="hard"} > 0) > 0.9 for: 15m labels: severity: warning annotations: message: Namespace {{ $labels.namespace }} is using {{ $value | humanizePercentage }} of its {{ $labels.resource }} quota. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubequotaexceeded err found duplicate series for the match group {namespace="ncra", resource="limits.cpu", resourcequota="compute-resources"} on the right hand-side of the operation: [{__name__="kube_resourcequota", instance="192.168.93.109:32080", job="kube-state-metrics", namespace="ncra", resource="limits.cpu", resourcequota="compute-resources", type="hard"}, {__name__="kube_resourcequota", instance="192.168.93.108:32080", job="kube-state-metrics", namespace="ncra", resource="limits.cpu", resourcequota="compute-resources", type="hard"}];many-to-many matching not allowed: matching labels must be unique on one side 25.841s ago 1.799ms
alert: CPUThrottlingHigh expr: sum by(container, pod, namespace) (increase(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by(container, pod, namespace) (increase(container_cpu_cfs_periods_total[5m])) > (25 / 100) for: 15m labels: severity: warning annotations: message: '{{ $value | humanizePercentage }} throttling of CPU in namespace {{ $labels.namespace }} for container {{ $labels.container }} in pod {{ $labels.pod }}.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-cputhrottlinghigh ok 25.84s ago 8.782ms
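
The KubeQuotaExceeded row is the only rule on this page in an error state: kube_resourcequota is being scraped from two kube-state-metrics instances (192.168.93.108 and 192.168.93.109), so the type="hard" side of the division is not unique per match group. One hedged fix, sketched below, is to aggregate the instance label away on both sides before dividing; deduplicating the scrape targets themselves would be the cleaner long-term fix.

    max by(namespace, resource, resourcequota) (kube_resourcequota{job="kube-state-metrics",type="used"})
      / (max by(namespace, resource, resourcequota) (kube_resourcequota{job="kube-state-metrics",type="hard"}) > 0)
      > 0.9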

kubernetes-storage (last evaluation: 15.665s ago; evaluation time: 13.57ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubePersistentVolumeFillingUp expr: kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"} < 0.03 for: 1m labels: severity: critical annotations: message: The PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is only {{ $value | humanizePercentage }} free. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup ok 15.665s ago 772.6us
alert: KubePersistentVolumeFillingUp expr: (kubelet_volume_stats_available_bytes{job="kubelet"} / kubelet_volume_stats_capacity_bytes{job="kubelet"}) < 0.15 and predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0 for: 1h labels: severity: warning annotations: message: Based on recent sampling, the PersistentVolume claimed by {{ $labels.persistentvolumeclaim }} in Namespace {{ $labels.namespace }} is expected to fill up within four days. Currently {{ $value | humanizePercentage }} is available. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumefillingup ok 15.664s ago 2.637ms
alert: KubePersistentVolumeErrors expr: kube_persistentvolume_status_phase{job="kube-state-metrics",phase=~"Failed|Pending"} > 0 for: 5m labels: severity: critical annotations: message: The persistent volume {{ $labels.persistentvolume }} has status {{ $labels.phase }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubepersistentvolumeerrors ok 15.662s ago 10.14ms
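
The second KubePersistentVolumeFillingUp rule pairs a static threshold with a linear forecast: predict_linear fits a regression over the last 6 hours of free-space samples and extrapolates 4 days (4 * 24 * 3600 seconds) ahead, so a negative result means the volume is projected to be empty within that horizon. The forecast half on its own:

    predict_linear(kubelet_volume_stats_available_bytes{job="kubelet"}[6h], 4 * 24 * 3600) < 0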

kubernetes-system (last evaluation: 25.972s ago; evaluation time: 6.076ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeVersionMismatch expr: count(count by(gitVersion) (label_replace(kubernetes_build_info{job!~"kube-dns|coredns"}, "gitVersion", "$1", "gitVersion", "(v[0-9]*.[0-9]*.[0-9]*).*"))) > 1 for: 15m labels: severity: warning annotations: message: There are {{ $value }} different semantic versions of Kubernetes components running. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeversionmismatch ok 25.973s ago 1.854ms
alert: KubeClientErrors expr: (sum by(instance, job) (rate(rest_client_requests_total{code=~"5.."}[5m])) / sum by(instance, job) (rate(rest_client_requests_total[5m]))) > 0.01 for: 15m labels: severity: warning annotations: message: Kubernetes API server client '{{ $labels.job }}/{{ $labels.instance }}' is experiencing {{ $value | humanizePercentage }} errors. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclienterrors ok 25.971s ago 4.2ms

kubernetes-system-apiserver (last evaluation: 13.972s ago; evaluation time: 43.13ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeAPILatencyHigh expr: (cluster:apiserver_request_duration_seconds:mean5m{job="kube-apiserver"} > on(verb) group_left() (avg by(verb) (cluster:apiserver_request_duration_seconds:mean5m{job="kube-apiserver"} >= 0) + 2 * stddev by(verb) (cluster:apiserver_request_duration_seconds:mean5m{job="kube-apiserver"} >= 0))) > on(verb) group_left() 1.2 * avg by(verb) (cluster:apiserver_request_duration_seconds:mean5m{job="kube-apiserver"} >= 0) and on(verb, resource) cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="kube-apiserver",quantile="0.99"} > 1 for: 5m labels: severity: warning annotations: message: The API server has an abnormal latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh ok 13.972s ago 7.769ms
alert: KubeAPILatencyHigh expr: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{job="kube-apiserver",quantile="0.99"} > 4 for: 10m labels: severity: critical annotations: message: The API server has a 99th percentile latency of {{ $value }} seconds for {{ $labels.verb }} {{ $labels.resource }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapilatencyhigh ok 13.964s ago 1.424ms
alert: KubeAPIErrorsHigh expr: sum by(resource, subresource, verb) (rate(apiserver_request_total{code=~"5..",job="kube-apiserver"}[5m])) / sum by(resource, subresource, verb) (rate(apiserver_request_total{job="kube-apiserver"}[5m])) > 0.1 for: 10m labels: severity: critical annotations: message: API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh ok 13.963s ago 16.57ms
alert: KubeAPIErrorsHigh expr: sum by(resource, subresource, verb) (rate(apiserver_request_total{code=~"5..",job="kube-apiserver"}[5m])) / sum by(resource, subresource, verb) (rate(apiserver_request_total{job="kube-apiserver"}[5m])) > 0.05 for: 10m labels: severity: warning annotations: message: API server is returning errors for {{ $value | humanizePercentage }} of requests for {{ $labels.verb }} {{ $labels.resource }} {{ $labels.subresource }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapierrorshigh ok 13.947s ago 13.85ms
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="kube-apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="kube-apiserver"}[5m]))) < 604800 labels: severity: warning annotations: message: A client certificate used to authenticate to the apiserver is expiring in less than 7.0 days. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration ok 13.933s ago 976.9us
alert: KubeClientCertificateExpiration expr: apiserver_client_certificate_expiration_seconds_count{job="kube-apiserver"} > 0 and on(job) histogram_quantile(0.01, sum by(job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="kube-apiserver"}[5m]))) < 86400 labels: severity: critical annotations: message: A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeclientcertificateexpiration ok 13.932s ago 842.3us
alert: AggregatedAPIErrors expr: sum by(name, namespace) (increase(aggregator_unavailable_apiservice_count[5m])) > 2 labels: severity: warning annotations: message: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} has reported errors. The number of errors has increased for it in the past five minutes. High values indicate that the availability of the service changes too often. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapierrors ok 13.932s ago 197.9us
alert: AggregatedAPIDown expr: sum by(name, namespace) (sum_over_time(aggregator_unavailable_apiservice[5m])) > 0 for: 5m labels: severity: warning annotations: message: An aggregated API {{ $labels.name }}/{{ $labels.namespace }} is down. It has not been available for at least the past five minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-aggregatedapidown ok 13.931s ago 1.198ms
alert: KubeAPIDown expr: absent(up{job="kube-apiserver"} == 1) for: 15m labels: severity: critical annotations: message: KubeAPI has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown ok 13.93s ago 229.8us
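
KubeAPIDown (and the other *Down alerts on this page) use absent() rather than "up == 0" because a target that vanishes from service discovery produces no up samples at all, so "up == 0" would return nothing. absent() inverts that: it yields a single series exactly when no matching healthy series exists.

    absent(up{job="kube-apiserver"} == 1)   # fires on failed scrapes and on a missing target alike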

kubernetes-system-controller-manager (last evaluation: 14.232s ago; evaluation time: 620.6us)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeControllerManagerDown expr: absent(up{job="kube-controller-manager"} == 1) for: 15m labels: severity: critical annotations: message: KubeControllerManager has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubecontrollermanagerdown ok 14.232s ago 603.2us

kubernetes-system-kubelet (last evaluation: 9.82s ago; evaluation time: 12.31ms)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeNodeNotReady expr: kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true"} == 0 for: 15m labels: severity: warning annotations: message: '{{ $labels.node }} has been unready for more than 15 minutes.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodenotready ok 9.821s ago 2.645ms
alert: KubeNodeUnreachable expr: kube_node_spec_taint{effect="NoSchedule",job="kube-state-metrics",key="node.kubernetes.io/unreachable"} == 1 for: 2m labels: severity: warning annotations: message: '{{ $labels.node }} is unreachable and some workloads may be rescheduled.' runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodeunreachable ok 9.818s ago 261.5us
alert: KubeletTooManyPods expr: max by(node) (max by(instance) (kubelet_running_pod_count{job="kubelet"}) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) / max by(node) (kube_node_status_capacity_pods{job="kube-state-metrics"} != 1) > 0.95 for: 15m labels: severity: warning annotations: message: Kubelet '{{ $labels.node }}' is running at {{ $value | humanizePercentage }} of its Pod capacity. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubelettoomanypods ok 9.818s ago 2.662ms
alert: KubeNodeReadinessFlapping expr: sum by(node) (changes(kube_node_status_condition{condition="Ready",status="true"}[15m])) > 2 for: 15m labels: severity: warning annotations: message: The readiness status of node {{ $labels.node }} has changed {{ $value }} times in the last 15 minutes. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubenodereadinessflapping ok 9.815s ago 1.665ms
alert: KubeletPlegDurationHigh expr: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile{quantile="0.99"} >= 10 for: 5m labels: severity: warning annotations: message: The Kubelet Pod Lifecycle Event Generator has a 99th percentile duration of {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletplegdurationhigh ok 9.814s ago 309.2us
alert: KubeletPodStartUpLatencyHigh expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pod_worker_duration_seconds_bucket{job="kubelet"}[5m]))) > 60 for: 15m labels: severity: warning annotations: message: Kubelet Pod startup 99th percentile latency is {{ $value }} seconds on node {{ $labels.node }}. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletpodstartuplatencyhigh ok 9.814s ago 4.384ms
alert: KubeletDown expr: absent(up{job="kubelet"} == 1) for: 15m labels: severity: critical annotations: message: Kubelet has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeletdown ok 9.809s ago 352.5us

kubernetes-system-scheduler (last evaluation: 23.64s ago; evaluation time: 558.5us)

Rule | State | Error | Last Evaluation | Evaluation Time
alert: KubeSchedulerDown expr: absent(up{job="kube-scheduler"} == 1) for: 15m labels: severity: critical annotations: message: KubeScheduler has disappeared from Prometheus target discovery. runbook_url: https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeschedulerdown ok 23.64s ago 546us

k8s.rules (last evaluation: 9.793s ago; evaluation time: 378.5ms)

Rule | State | Error | Last Evaluation | Evaluation Time
record: namespace:container_cpu_usage_seconds_total:sum_rate expr: sum by(namespace) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="cadvisor"}[5m])) ok 9.793s ago 13.45ms
record: node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate expr: sum by(cluster, namespace, pod, container) (rate(container_cpu_usage_seconds_total{container!="POD",image!="",job="cadvisor"}[5m])) * on(cluster, namespace, pod) group_left(node) topk by(cluster, namespace, pod) (1, max by(cluster, namespace, pod, node) (kube_pod_info)) ok 9.78s ago 40.32ms
record: node_namespace_pod_container:container_memory_working_set_bytes expr: container_memory_working_set_bytes{image!="",job="cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info)) ok 9.739s ago 48.81ms
record: node_namespace_pod_container:container_memory_rss expr: container_memory_rss{image!="",job="cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info)) ok 9.691s ago 49.21ms
record: node_namespace_pod_container:container_memory_cache expr: container_memory_cache{image!="",job="cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info)) ok 9.642s ago 40.9ms
record: node_namespace_pod_container:container_memory_swap expr: container_memory_swap{image!="",job="cadvisor"} * on(namespace, pod) group_left(node) topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info)) ok 9.601s ago 42.6ms
record: namespace:container_memory_usage_bytes:sum expr: sum by(namespace) (container_memory_usage_bytes{container!="POD",image!="",job="cadvisor"}) ok 9.558s ago 7.891ms
record: namespace:kube_pod_container_resource_requests_memory_bytes:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_memory_bytes{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 9.55s ago 51.09ms
record: namespace:kube_pod_container_resource_requests_cpu_cores:sum expr: sum by(namespace) (sum by(namespace, pod) (max by(namespace, pod, container) (kube_pod_container_resource_requests_cpu_cores{job="kube-state-metrics"}) * on(namespace, pod) group_left() max by(namespace, pod) (kube_pod_status_phase{phase=~"Pending|Running"} == 1))) ok 9.499s ago 51.4ms
record: mixin_pod_workload expr: max by(cluster, namespace, workload, pod) (label_replace(label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="ReplicaSet"}, "replicaset", "$1", "owner_name", "(.*)") * on(replicaset, namespace) group_left(owner_name) topk by(replicaset, namespace) (1, max by(replicaset, namespace, owner_name) (kube_replicaset_owner{job="kube-state-metrics"})), "workload", "$1", "owner_name", "(.*)")) labels: workload_type: deployment ok 9.448s ago 9.739ms
record: mixin_pod_workload expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="DaemonSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: daemonset ok 9.439s ago 5.881ms
record: mixin_pod_workload expr: max by(cluster, namespace, workload, pod) (label_replace(kube_pod_owner{job="kube-state-metrics",owner_kind="StatefulSet"}, "workload", "$1", "owner_name", "(.*)")) labels: workload_type: statefulset ok 9.433s ago 17.15ms
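
Most of the rules in this group share one join idiom for attaching the node label from kube_pod_info onto cadvisor series. The topk by(...) (1, ...) wrapper guarantees at most one kube_pod_info series per (namespace, pod), which keeps the many-to-one group_left join well defined even if the info metric is briefly duplicated, for example while a pod is rescheduled. Taking the container_memory_rss rule as the representative case:

    container_memory_rss{image!="",job="cadvisor"}
      * on(namespace, pod) group_left(node)
        topk by(namespace, pod) (1, max by(namespace, pod, node) (kube_pod_info))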

kube-apiserver.rules (last evaluation: 30.908s ago; evaluation time: 14.62s)

Rule | State | Error | Last Evaluation | Evaluation Time
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[1d])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1d])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[1d]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[1d])) labels: verb: read ok 908ms ago 595.9ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[1h])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[1h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[1h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[1h])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[1h]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[1h])) labels: verb: read ok 313ms ago 39.55ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[2h])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[2h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[2h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[2h])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[2h]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[2h])) labels: verb: read ok 274ms ago 54.22ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[30m])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30m])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[30m]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[30m])) labels: verb: read ok 220ms ago 32.28ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[3d])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[3d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[3d])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[3d])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[3d]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[3d])) labels: verb: read ok 30.125s ago 1.594s
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[5m])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[5m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[5m])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[5m])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[5m]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 28.531s ago 28.11ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[6h])) - (sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[6h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[6h])) + sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[6h])))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"LIST|GET"}[6h]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[6h])) labels: verb: read ok 28.503s ago 221.6ms
record: apiserver_request:burnrate1d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1d])) labels: verb: write ok 28.281s ago 353.5ms
record: apiserver_request:burnrate1h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[1h])) labels: verb: write ok 27.928s ago 28.52ms
record: apiserver_request:burnrate2h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[2h])) labels: verb: write ok 27.9s ago 34.86ms
record: apiserver_request:burnrate30m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30m])) labels: verb: write ok 27.865s ago 35.01ms
record: apiserver_request:burnrate3d expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[3d])) labels: verb: write ok 27.83s ago 986.1ms
record: apiserver_request:burnrate5m expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 26.844s ago 18.68ms
record: apiserver_request:burnrate6h expr: ((sum(rate(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) - sum(rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="1",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) + sum(rate(apiserver_request_total{code=~"5..",job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h]))) / sum(rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[6h])) labels: verb: write ok 26.826s ago 111.7ms
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + (sum(increase(apiserver_request_duration_seconds_count{verb=~"LIST|GET"}[30d])) - (sum(increase(apiserver_request_duration_seconds_bucket{le="0.1",scope="resource",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{le="5",scope="cluster",verb=~"LIST|GET"}[30d])))) + sum(code:apiserver_request_total:increase30d{code=~"5.."})) / sum(code:apiserver_request_total:increase30d) labels: verb: all ok 26.714s ago 3.402s
record: apiserver_request:availability30d expr: 1 - (sum(increase(apiserver_request_duration_seconds_count{job="kube-apiserver",verb=~"LIST|GET"}[30d])) - (sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.1",scope="resource",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="0.5",scope="namespace",verb=~"LIST|GET"}[30d])) + sum(increase(apiserver_request_duration_seconds_bucket{job="kube-apiserver",le="5",scope="cluster",verb=~"LIST|GET"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="read"})) / sum(code:apiserver_request_total:increase30d{verb="read"}) labels: verb: read ok 23.313s ago 2.285s
record: apiserver_request:availability30d expr: 1 - ((sum(increase(apiserver_request_duration_seconds_count{verb=~"POST|PUT|PATCH|DELETE"}[30d])) - sum(increase(apiserver_request_duration_seconds_bucket{le="1",verb=~"POST|PUT|PATCH|DELETE"}[30d]))) + sum(code:apiserver_request_total:increase30d{code=~"5..",verb="write"})) / sum(code:apiserver_request_total:increase30d{verb="write"}) labels: verb: write ok 21.028s ago 1.292s
record: code:apiserver_request_total:increase30d expr: sum by(code) (increase(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[30d])) labels: verb: read ok 19.736s ago 1.614s
record: code:apiserver_request_total:increase30d expr: sum by(code) (increase(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[30d])) labels: verb: write ok 18.122s ago 898.9ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="kube-apiserver",verb=~"LIST|GET"}[5m])) labels: verb: read ok 17.223s ago 8.842ms
record: code_resource:apiserver_request_total:rate5m expr: sum by(code, resource) (rate(apiserver_request_total{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m])) labels: verb: write ok 17.214s ago 5.31ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",verb=~"LIST|GET"}[5m]))) > 0 labels: quantile: "0.99" verb: read ok 17.209s ago 191.8ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(le, resource) (rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",verb=~"POST|PUT|PATCH|DELETE"}[5m]))) > 0 labels: quantile: "0.99" verb: write ok 17.018s ago 112.1ms
record: cluster:apiserver_request_duration_seconds:mean5m expr: sum without(instance, pod) (rate(apiserver_request_duration_seconds_sum{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) / sum without(instance, pod) (rate(apiserver_request_duration_seconds_count{subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m])) ok 16.906s ago 13.58ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.99" ok 16.892s ago 198.4ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.9" ok 16.694s ago 201.3ms
record: cluster_quantile:apiserver_request_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(apiserver_request_duration_seconds_bucket{job="kube-apiserver",subresource!="log",verb!~"LIST|WATCH|WATCHLIST|DELETECOLLECTION|PROXY|CONNECT"}[5m]))) labels: quantile: "0.5" ok 16.493s ago 197.4ms
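
These burnrate and availability recording rules exist so that the kube-apiserver-slos alerts stay cheap at alert-evaluation time: note the 1.594s and 3.402s evaluation times for the 3d and 30d expressions here, versus well under a millisecond for the alerts that consume them. Each KubeAPIErrorBudgetBurn expression reduces to summing two precomputed series, e.g. the first one:

    sum(apiserver_request:burnrate1h) > (14.4 * 0.01)
      and
    sum(apiserver_request:burnrate5m) > (14.4 * 0.01)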

kube-scheduler.rules (last evaluation: 14.594s ago; evaluation time: 7.479ms)

Rule | State | Error | Last Evaluation | Evaluation Time
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.99" ok 14.594s ago 1.232ms
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.99" ok 14.593s ago 989.4us
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum without(instance, pod) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.99" ok 14.592s ago 901.9us
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.9" ok 14.592s ago 793.2us
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.9" ok 14.591s ago 788.2us
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum without(instance, pod) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.9" ok 14.59s ago 629us
record: cluster_quantile:scheduler_e2e_scheduling_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(scheduler_e2e_scheduling_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.5" ok 14.59s ago 762.5us
record: cluster_quantile:scheduler_scheduling_algorithm_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(scheduler_scheduling_algorithm_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.5" ok 14.589s ago 718.7us
record: cluster_quantile:scheduler_binding_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum without(instance, pod) (rate(scheduler_binding_duration_seconds_bucket{job="kube-scheduler"}[5m]))) labels: quantile: "0.5" ok 14.588s ago 587.6us

kubelet.rules (last evaluation: 25.765s ago; evaluation time: 9.897ms)

Rule | State | Error | Last Evaluation | Evaluation Time
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.99, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) labels: quantile: "0.99" ok 25.765s ago 3.719ms
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.9, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) labels: quantile: "0.9" ok 25.761s ago 3.065ms
record: node_quantile:kubelet_pleg_relist_duration_seconds:histogram_quantile expr: histogram_quantile(0.5, sum by(instance, le) (rate(kubelet_pleg_relist_duration_seconds_bucket[5m])) * on(instance) group_left(node) kubelet_node_name{job="kubelet"}) labels: quantile: "0.5" ok 25.758s ago 3.091ms

node.rules (last evaluation: 24.784s ago; evaluation time: 67.23ms)

Rule | State | Error | Last Evaluation | Evaluation Time
record: :kube_pod_info_node_count: expr: sum(min by(cluster, node) (kube_pod_info)) ok 24.785s ago 28.53ms
record: node_namespace_pod:kube_pod_info: expr: topk by(namespace, pod) (1, max by(node, namespace, pod) (label_replace(kube_pod_info{job="kube-state-metrics"}, "pod", "$1", "pod", "(.*)"))) ok 24.756s ago 34.95ms
record: node:node_num_cpu:sum expr: count by(cluster, node) (sum by(node, cpu) (node_cpu_seconds_total{job="node-exporter"} * on(namespace, pod) group_left(node) node_namespace_pod:kube_pod_info:)) ok 24.721s ago 3.255ms
record: :node_memory_MemAvailable_bytes:sum expr: sum by(cluster) (node_memory_MemAvailable_bytes{job="node-exporter"} or (node_memory_Buffers_bytes{job="node-exporter"} + node_memory_Cached_bytes{job="node-exporter"} + node_memory_MemFree_bytes{job="node-exporter"} + node_memory_Slab_bytes{job="node-exporter"})) ok 24.718s ago 471.8us

ansible managed record rules (last evaluation: 18.011s ago; evaluation time: 21.52ms)

Rule | State | Error | Last Evaluation | Evaluation Time
record: instance:node_cpu:load expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) ok 18.011s ago 14.41ms
record: instance:node_ram:usage expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 ok 17.996s ago 2.745ms
record: instance:node_fs:disk_space expr: node_filesystem_avail_bytes{job="node",mountpoint="/"} / node_filesystem_size_bytes{job="node",mountpoint="/"} * 100 ok 17.994s ago 4.333ms
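
The three recording rules in this group back the ansible managed alerts at the top of the page: instance:node_cpu:load feeds CriticalCPULoad, instance:node_ram:usage feeds CriticalRAMUsage, and instance:node_fs:disk_space feeds CriticalDiskSpace. The pairing, with thresholds copied from that alert group:

    instance:node_cpu:load > 98          # CriticalCPULoad, for: 30m
    instance:node_ram:usage > 98         # CriticalRAMUsage, for: 5m
    instance:node_fs:disk_space < 20     # CriticalDiskSpace, for: 4m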