Collecting MSK Metrics with Prometheus

Enabling Open Monitoring on the Cluster

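Open monitoring was not enabled when we created the cluster in Chapter 1. That is not a problem: it can still be enabled after the cluster has been created.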


Select Enable open monitoring, and turn on both the JMX Exporter and the Node Exporter.


After you confirm, the cluster update completes within a few minutes.

Once open monitoring is enabled, MSK serves JMX Exporter metrics on port 11001 and Node Exporter metrics on port 11002. Note that this differs from the Node Exporter default port, 9100. Reference: https://docs.aws.amazon.com/msk/latest/developerguide/open-monitoring.html
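Before wiring these ports into Prometheus, it can help to confirm that an exporter actually answers. The sketch below assumes the exporters serve the conventional /metrics path over plain HTTP and that the cluster's security group lets your machine reach ports 11001 and 11002. The names exporter_url and probe_exporter are hypothetical helpers, and the hostname in the usage comment is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a quick reachability check (not part of MSK or Prometheus).

# Build the metrics URL for a broker host and an exporter port.
# Assumption: metrics are served at /metrics over plain HTTP.
exporter_url() {
  printf 'http://%s:%s/metrics\n' "$1" "$2"
}

# Print the first few metric lines from one exporter.
# Prints nothing if the exporter cannot be reached within 5 seconds.
probe_exporter() {
  curl -sf --max-time 5 "$(exporter_url "$1" "$2")" | head -n 5
}

# Usage (replace the host with one of your own broker addresses):
#   probe_exporter b-1.example.kafka.ap-southeast-1.amazonaws.com 11001   # JMX Exporter
#   probe_exporter b-1.example.kafka.ap-southeast-1.amazonaws.com 11002   # Node Exporter
```

If the probe prints nothing, the usual suspects are the security group rules or an update that has not finished yet.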

Installing Prometheus

Find the Prometheus download link at https://prometheus.io/download/.


The current version at the time of writing is 2.32.0. Download and extract it:

wget https://github.com/prometheus/prometheus/releases/download/v2.32.0/prometheus-2.32.0.linux-amd64.tar.gz
tar -zxvf prometheus-2.32.0.linux-amd64.tar.gz 
cd prometheus-2.32.0.linux-amd64/

At this point you could already start Prometheus with ./prometheus, but some additional configuration is needed before it can scrape the MSK metrics.

Configuring Prometheus

Replace the contents of prometheus.yml in the directory above with:

# file: prometheus.yml
# my global config
global:
  scrape_interval: 10s

# Scrape configurations: one job for Prometheus itself,
# and one job for the exporters on the MSK brokers.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    static_configs:
    # Must match --web.listen-address: this walkthrough starts Prometheus on 8080
    # (the Prometheus default would be 9090)
    - targets: ['127.0.0.1:8080']
  - job_name: 'broker'
    file_sd_configs:
    - files:
      - 'targets.json'

The last job, broker, scrapes the exporter metrics from the brokers. It uses file-based service discovery (see the author's earlier write-up on that topic).

targets.json must list the address of every JMX Exporter and Node Exporter. These exporters run on ports 11001 and 11002 of each broker, so the first step is to get the addresses of all brokers:

CLUSTER_ARN=arn:aws:kafka:ap-southeast-1:145197526627:cluster/MSKDemo/89d04308-2643-4e80-b6e2-fe996354f056-4  # replace with the ARN of your own cluster

# Quote the query so the shell does not treat [*] and [] as glob patterns
aws kafka list-nodes --cluster-arn "$CLUSTER_ARN" \
    --query 'NodeInfoList[*].BrokerNodeInfo.Endpoints[]'


Create targets.json in the current directory. Append :11001 and :11002 to each address from the output above, for the JMX Exporter and the Node Exporter respectively:

[
    {
      "labels": {
        "job": "jmx"
      },
      "targets": [
        "b-3.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11001", 
        "b-6.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11001", 
        "b-2.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11001", 
        "b-5.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11001", 
        "b-4.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11001", 
        "b-1.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11001"
      ]
    },
    {
      "labels": {
        "job": "node"
      },
      "targets": [
        "b-3.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11002", 
        "b-6.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11002", 
        "b-2.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11002", 
        "b-5.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11002", 
        "b-4.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11002", 
        "b-1.mskdemo.mxqzz7.c4.kafka.ap-southeast-1.amazonaws.com:11002"
      ]
    }
]
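Typing twelve addresses by hand is error-prone, so the file can also be generated. The following is a minimal sketch, not the author's method: make_targets is a hypothetical helper that reads one broker hostname per line on stdin and prints the two target groups (job "jmx" on port 11001, job "node" on port 11002) as JSON. The usage comment assumes that the AWS CLI's --output text mode prints the endpoints separated by tabs on one line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: turn broker hostnames (one per line on stdin)
# into a file_sd target list with one group per exporter.
make_targets() {
  local hosts=() host pair label port list sep
  # Collect non-empty lines, including a last line without a trailing newline.
  while read -r host || [ -n "$host" ]; do
    if [ -n "$host" ]; then hosts+=("$host"); fi
  done
  printf '[\n'
  sep=''
  for pair in jmx:11001 node:11002; do
    label=${pair%%:*}   # text before the colon: job label
    port=${pair##*:}    # text after the colon: exporter port
    list=''
    for host in "${hosts[@]}"; do
      # Append "host:port", comma-separated after the first entry.
      list="${list:+$list, }\"$host:$port\""
    done
    printf '%s  {"labels": {"job": "%s"}, "targets": [%s]}' "$sep" "$label" "$list"
    sep=$',\n'
  done
  printf '\n]\n'
}

# Usage (assumes CLUSTER_ARN is set as shown earlier):
#   aws kafka list-nodes --cluster-arn "$CLUSTER_ARN" \
#       --query 'NodeInfoList[*].BrokerNodeInfo.Endpoints[]' --output text \
#     | tr '\t' '\n' | make_targets > targets.json
```

Because Prometheus re-reads files referenced by file_sd_configs when they change, regenerating targets.json after a broker is added should not require a restart.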

Because we will open the Prometheus web UI later, make sure port 8080 is not in use in Cloud9. First stop the AKHQ instance started earlier (go to its directory and run docker-compose stop).


Start Prometheus:

./prometheus --web.listen-address="127.0.0.1:8080"


The command above serves the Prometheus web UI on port 8080, where we can now open it.


Go to the Targets page. All 12 endpoints under broker should show the state UP, which means Prometheus is successfully scraping the MSK JMX Exporter and Node Exporter data.
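The same check can be made from the command line through the Prometheus HTTP API, whose instant-query endpoint /api/v1/query can evaluate the built-in up metric. The sketch below is only an illustration: count_up is a hypothetical helper that counts series with value "1" by pattern-matching the compact JSON that Prometheus returns, so it is not a general JSON parser.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count series whose sample value is "1" in a Prometheus
# instant-query response read from stdin.
# Relies on the compact layout "value":[<timestamp>,"1"].
count_up() {
  grep -o '"value":\[[^]]*,"1"]' | wc -l | tr -d ' '
}

# Usage, with Prometheus listening on 127.0.0.1:8080 as in this section:
#   curl -s 'http://127.0.0.1:8080/api/v1/query?query=up' | count_up
```

Compare the result with the 12 broker endpoints. The count may be one higher if the self-scrape target of Prometheus is also up.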


On the Graph page, we can browse the list of available metrics and inspect their data.


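The Prometheus instance in this article runs in a test environment.
In production, you need to consider high availability for Prometheus, for example by deploying it on Kubernetes or by using the AWS-managed Prometheus service, and you also need to plan for backing up the Prometheus data.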