Prometheus · Complete White Paper | Programming Language Panorama Handbook

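📌 Part 1: Prometheus Overview and Positioning

1.1 Definition and Origin

Prometheus is an open-source monitoring and alerting toolkit created at SoundCloud in 2012 and designed by Matt T. Proud and Julius Volz. It joined the CNCF in 2016 as the second hosted project after Kubernetes, and in 2018 it became the second project to graduate. Prometheus is widely regarded as the de facto standard for cloud-native monitoring.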

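1.2 Core Positioning

Prometheus is positioned as a cloud-native monitoring and alerting platform. It provides:

A multi-dimensional data model (metric names plus labels)
Pull-based metric collection over HTTP
PromQL, a powerful query language
A built-in time series database (TSDB)
Alerting through alert rules and the separate Alertmanager component
Service discovery (Kubernetes, Consul, DNS)
A rich set of exporters (Node Exporter, MySQL Exporter, and others)
Deep integration with Grafana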

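1.3 Main Use Cases

Infrastructure monitoring: CPU, memory, disk, network
Application performance monitoring: request latency, error rate, throughput
Kubernetes monitoring: Pod, Node, and Service metrics
Database monitoring: MySQL, PostgreSQL, Redis, and others
Business metrics monitoring: user counts, order volume, payment success rate
Alerting: anomaly detection and notification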

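1.4 Notable Adopters

Uber: uses Prometheus to monitor microservices
Spotify: uses Prometheus for system monitoring
GitLab: ships with built-in Prometheus monitoring
DigitalOcean: uses Prometheus to monitor infrastructure
Alibaba: uses Prometheus to monitor cloud services
Tencent: uses Prometheus to monitor Kubernetes clusters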

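📜 Part 2: History and Evolution of Prometheus

2.1 Origins (2012)

In 2012 SoundCloud ran into scalability problems with its existing monitoring system. Inspired by Google's Borgmon, its engineers created Prometheus. The project was publicly announced in 2015, joined the CNCF in 2016, and in 2018 became the second CNCF project to graduate.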

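2.2 Key Version Milestones

Prometheus 0.1 (2012): internal version
Prometheus 0.5 (2015): public open-source release
Prometheus 1.0 (2016): first stable release
Prometheus 2.0 (2017): major performance gains from a new TSDB storage engine
Prometheus 2.10 (2019): improved query performance
Prometheus 2.20 (2020): enhanced service discovery
Prometheus 2.30 (2021): alerting improvements
Prometheus 2.40 (2022): performance optimizations
Prometheus 2.53 (2024): long-term support (LTS) release of the 2.x line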

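2.3 Core Architecture

Prometheus Server: the core service (scraping, storage, and querying)
Alertmanager: deduplicates, groups, and routes alerts to receivers
Exporters: expose metrics from third-party systems (Node, MySQL, Redis, and others)
Pushgateway: accepts pushed metrics from short-lived jobs
Client Libraries: instrument application code
Service Discovery: finds scrape targets automatically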

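⚙️ Part 3: Core Concepts and Configuration

3.1 Data Model

# Sample notation: <metric name>{<label name>=<label value>, ...}

# Examples
http_requests_total{method="GET", status="200", endpoint="/api/users"}
node_cpu_seconds_total{cpu="0", mode="user"}
container_memory_usage_bytes{pod="nginx-123", namespace="prod"}

# Metric types
# Counter: a cumulative value that only increases (or resets to zero on restart)
# Gauge: a value that can go up and down
# Histogram: counts observations in cumulative buckets (e.g. request latency)
# Summary: quantiles computed on the client side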
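The notation above implies that a time series is identified by its metric name together with its full label set, independent of label order. The following is a minimal Python sketch of that idea, not Prometheus code; the helper names `escape_label_value`, `format_series`, and `series_key` are hypothetical and exist only for illustration.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def format_series(name: str, labels: dict) -> str:
    """Render a series identifier as name{label="value", ...}."""
    if not labels:
        return name
    # Sort label names so the same label set always renders identically.
    pairs = ", ".join(
        f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
    )
    return f"{name}{{{pairs}}}"


def series_key(name: str, labels: dict) -> tuple:
    """Identity of a series: metric name plus the order-independent label set."""
    return (name, tuple(sorted(labels.items())))


print(format_series("http_requests_total", {"status": "200", "method": "GET"}))
```

Changing any label value produces a different key, which is why high-cardinality labels such as user IDs multiply the number of stored series.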

3.2 Configuration File

# prometheus.yml
global:
  scrape_interval: 15s      # how often targets are scraped
  evaluation_interval: 15s  # how often rules are evaluated
  external_labels:
    monitor: "my-monitor"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

# Alerting rule files
rule_files:
  - "alerts/*.yml"

# Scrape targets
scrape_configs:
  # Prometheus scrapes its own metrics
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node Exporter targets
  - job_name: "node"
    static_configs:
      - targets: ["node1:9100", "node2:9100"]

  # MySQL Exporter target
  - job_name: "mysql"
    static_configs:
      - targets: ["mysql:9104"]

  # Kubernetes pod service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (.+)
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
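The `__address__` rewrite in the relabel rules above is the least obvious step, so here is a rough Python sketch of what a single `replace` action does, under simplifying assumptions: it only handles `$1`-style group references, and the function name `relabel_replace` is hypothetical. The source label values are joined with `;`, the regex must match the whole joined string, and the target label is overwritten only on a match.

```python
import re


def relabel_replace(labels, source_labels, regex, replacement, target_label, separator=";"):
    """Apply one simplified 'replace' relabel step and return a new label dict."""
    # Source label values are concatenated; a missing label counts as "".
    joined = separator.join(labels.get(name, "") for name in source_labels)
    # Relabel regexes are anchored at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return dict(labels)  # no match: labels stay unchanged
    # Convert $1-style group references to Python's \1 style before expanding.
    template = re.sub(r"\$(\d+)", r"\\\1", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


PORT_LABEL = "__meta_kubernetes_pod_annotation_prometheus_io_port"
ADDRESS_REGEX = r"([^:]+)(?::\d+)?;(\d+)"

pod = {"__address__": "10.0.0.5:8080", PORT_LABEL: "9102"}
rewritten = relabel_replace(
    pod, ["__address__", PORT_LABEL], ADDRESS_REGEX, "$1:$2", "__address__"
)
print(rewritten["__address__"])
```

When the pod carries no port annotation, the joined string ends in `;` with no digits, the regex does not match, and the discovered address is kept as-is.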

3.3 Alerting Rules

# alerts/node_alerts.yml
groups:
  - name: node_alerts
    rules:
      # High CPU usage (aggregated per instance so that $labels.instance is populated)
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "CPU usage on {{ $labels.instance }} is above 80%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "Memory usage on {{ $labels.instance }} is above 90%"

      # Low disk space on the root filesystem
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Less than 15% disk space remains on {{ $labels.instance }}"

      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
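The `for` field in the rules above delays firing: an alert whose expression starts returning results is first pending, and it fires only once the condition has held for the whole duration. The following is a simplified Python model of that behavior, assuming evenly spaced evaluations and ignoring details such as state restoration after a restart. The function `alert_states` is hypothetical and not part of Prometheus.

```python
def alert_states(condition_by_eval, for_seconds, interval_seconds):
    """Return the alert state after each rule evaluation.

    condition_by_eval: list of booleans, True when the rule expression
    returned a result at that evaluation.
    """
    states = []
    active_for = None  # seconds the condition has held continuously; None if inactive
    for holds in condition_by_eval:
        if not holds:
            # Any evaluation without a result resets the alert.
            active_for = None
            states.append("inactive")
            continue
        active_for = 0 if active_for is None else active_for + interval_seconds
        states.append("firing" if active_for >= for_seconds else "pending")
    return states


# for: 1m with a 15s evaluation interval, as in the InstanceDown rule.
timeline = alert_states([True, True, True, True, True, False, True], 60, 15)
print(timeline)
```

A single evaluation where the condition does not hold resets the timer, which is why brief flaps never reach the firing state.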

3.4 PromQL Query Language

# Basic selectors
node_cpu_seconds_total
http_requests_total

# Label filtering
node_cpu_seconds_total{mode="user"}
http_requests_total{method="GET"}

# Range selector
node_cpu_seconds_total[5m]  # samples from the last 5 minutes

# Aggregation
sum(http_requests_total)                           # sum across all series
avg(node_cpu_seconds_total)                        # average across all series
max(http_requests_total) by (method)               # maximum per method
count(up)                                          # number of series

# Rates (Counter metrics)
rate(http_requests_total[5m])                      # per-second rate averaged over 5 minutes
irate(http_requests_total[1m])                     # instant rate from the last two samples

# Offset
node_cpu_seconds_total offset 1h                   # value from 1 hour ago

# Comparison
http_requests_total > 1000

# Arithmetic between series
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes

# Common query examples
# CPU usage per instance in percent (last 5 minutes)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory usage in percent
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

# QPS (requests per second)
sum(rate(http_requests_total[1m]))

# Error ratio (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# P99 request latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Pod restart count
sum(kube_pod_container_status_restarts_total) by (pod)
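Two of the functions used above, `rate()` and `histogram_quantile()`, are easier to reason about with the arithmetic spelled out. The Python sketch below is a deliberately simplified model, not Prometheus' implementation: `simple_rate` skips the extrapolation to the window boundaries that the real `rate()` performs, and `estimate_quantile` assumes classic cumulative buckets whose lowest bucket starts at 0. Both function names are hypothetical.

```python
import math


def simple_rate(samples):
    """Per-second increase of a counter over (timestamp_seconds, value) samples."""
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. a restart); count from zero.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


def estimate_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Quantile lies in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count


counter = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]  # reset at t=45
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(simple_rate(counter), estimate_quantile(0.99, latency_buckets))
```

Because the quantile is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.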

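3.5 Common Exporters

Node Exporter: system metrics (CPU, memory, disk, network)
MySQL Exporter: MySQL database metrics
PostgreSQL Exporter: PostgreSQL metrics
Redis Exporter: Redis cache metrics
MongoDB Exporter: MongoDB metrics
Nginx Exporter: Nginx metrics
Elasticsearch Exporter: Elasticsearch metrics
Blackbox Exporter: endpoint probing over HTTP, HTTPS, DNS, TCP, and ICMP
JMX Exporter: JVM application metrics exposed through JMX
Pushgateway: metrics pushed by short-lived jobs (a push cache rather than an exporter in the strict sense)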
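Every exporter in the list above follows the same pattern: it serves the current metric values as plain text on an HTTP endpoint, conventionally `/metrics`, which Prometheus then scrapes. The following sketch shows that pattern using only the Python standard library. The metric names and values in `METRICS` are made up for illustration, and a real exporter would normally use an official client library instead of formatting text by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory metrics: name -> (type, help text, current value).
METRICS = {
    "demo_queue_length": ("gauge", "Items waiting in the demo queue.", 7),
    "demo_jobs_processed_total": ("counter", "Jobs processed since start.", 1234),
}


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in sorted(metrics.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; the port number is an arbitrary choice."""
    HTTPServer(("", port), MetricsHandler).serve_forever()


print(render_metrics(METRICS), end="")
```

Calling `serve()` and adding `localhost:8000` as a target under `static_configs` would let Prometheus scrape these demo values.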

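⚖️ Part 4: Prometheus vs. Other Monitoring Tools

Aspect	Prometheus	Zabbix	Datadog	Graphite
Data model	Multi-dimensional labels	Hosts + items	Multi-dimensional tags	Dotted hierarchical paths
Collection	Pull	Push/Pull	Push (agent)	Push
Query language	PromQL	Built-in	Built-in	Graphite functions
Cost	Free (open source)	Free (open source)	Paid (SaaS)	Free (open source)
Cloud native	✅ Native	❌ Traditional	✅	❌
K8s integration	✅ Deep	❌ Limited	✅	❌
Typical use	Cloud-native monitoring	Traditional IT monitoring	SaaS monitoring	Time series storage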

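🧠 Part 5: Learning Path

Fundamentals

Installing and configuring Prometheus, basic PromQL queries

Core Skills

Using exporters, configuring scrape jobs, service discovery

Advanced Features

Alert rule configuration, Alertmanager, advanced PromQL queries

Practical Application

Kubernetes monitoring, Grafana integration, production best practices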

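🎯 Summary

Prometheus is the "monitoring operating system" of the cloud-native era.

With its multi-dimensional data model, pull-based collection, and PromQL, it reshaped how monitoring systems are built. Prometheus is a flagship CNCF project and, together with Kubernetes, a cornerstone of the cloud-native stack.

Whether you are a DevOps engineer, an SRE, or a developer, Prometheus is a monitoring tool worth mastering.

"Prometheus makes monitoring as simple as writing a query." 📊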


📄 This document is the Prometheus Complete White Paper · Last updated June 28, 2026
