Prometheus Study Notes

Introduction

Overview

  1. Prometheus is an open-source systems monitoring and alerting toolkit.
  2. Prometheus collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels.
  3. Features

    1. a multi-dimensional data model with time series data identified by metric name and key/value pairs
    2. PromQL, a flexible query language to leverage this dimensionality
    3. no reliance on distributed storage; single server nodes are autonomous
    4. time series collection happens via a pull model over HTTP
    5. pushing time series is supported via an intermediary gateway
    6. targets are discovered via service discovery or static configuration
    7. multiple modes of graphing and dashboarding support
  4. Components

    1. the main Prometheus server which scrapes and stores time series data
    2. client libraries for instrumenting application code
    3. a push gateway for supporting short-lived jobs
    4. special-purpose exporters for services like HAProxy, StatsD, Graphite, etc.
    5. an alertmanager to handle alerts
    6. various support tools
  5. Prometheus configuration file: prometheus.yml (a minimal example is sketched after this list)

    1. global.scrape_interval
    2. global.evaluation_interval
    3. rule_files: []
    4. scrape_configs: [{job_name: "...", static_configs: [...]}, ...]
  6. Prometheus server UI

    1. status page: http://<host>:9090/
    2. self metrics page: http://<host>:9090/metrics
    3. expression browser: http://<host>:9090/graph
  7. glossary

    1. The Alertmanager takes in alerts, aggregates them into groups, de-duplicates, applies silences, throttles, and then sends out notifications to email, Pagerduty, Slack etc.
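
For reference, a minimal prometheus.yml sketch covering the keys listed in item 5 above; the job name, target address, and rule file name are placeholders:

global:
  scrape_interval: 15s      # how often to scrape targets
  evaluation_interval: 15s  # how often to evaluate rules

rule_files:
  - "example.rules.yml"     # hypothetical rule file

scrape_configs:
  - job_name: "prometheus"  # scrape Prometheus' own /metrics endpoint
    static_configs:
      - targets: ["localhost:9090"]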

Concepts

  1. Data models

    1. <metric_name>{<label_name>=<label_value>, ...}
    2. metric names must match /[a-zA-Z_:][a-zA-Z0-9_:]*/, i.e. letters, digits, underscores, and colons (colons are only used when defining recording rules)
    3. dimensions are defined via labels
    4. time series: streams of timestamped values belonging to the same metric and the same set of labeled dimensions.
    5. adding, removing, or otherwise changing a label value creates a new time series
    6. label names must match /[a-zA-Z_][a-zA-Z0-9_]*/, i.e. letters, digits, and underscores (label names starting with two underscores (__) are reserved for internal use)
    7. label values may contain any Unicode characters
    8. A label with an empty label value is considered equivalent to a label that does not exist
  2. Metrics Types

    1. Counter: a monotonically increasing counter; it resets to 0 on restart and then counts up again
    2. Gauge: a numeric value that can go up and down
    3. Histogram: exposes multiple time series under a base metric name <basename> (see the example after this list)

      1. <basename>_bucket{le="<upper inclusive bound>"}: cumulative counters for the observation buckets
      2. <basename>_sum: total sum of all observed values
      3. <basename>_count: count of observations, equal to <basename>_bucket{le="+Inf"}
    4. Summary: similar to a histogram, but exposes client-side calculated quantiles in addition to <basename>_sum and <basename>_count
  3. Jobs & Instances

    1. When Prometheus scrapes a target, it attaches some labels automatically to the scraped time series which serve to identify the scraped target: job: <job_name> & instance: <host>:<port>
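
To illustrate the histogram series listed above, a scrape of a hypothetical histogram metric http_request_duration_seconds might expose something like the following (values are made up; the buckets are cumulative and _count equals the +Inf bucket):

http_request_duration_seconds_bucket{le="0.1"} 240
http_request_duration_seconds_bucket{le="0.5"} 290
http_request_duration_seconds_bucket{le="+Inf"} 300
http_request_duration_seconds_sum 147.5
http_request_duration_seconds_count 300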

Prometheus

  1. Configuration

    1. command-line flags configure immutable system parameters (such as storage locations, amount of data to keep on disk and in memory, etc.)
    2. configuration file defines everything related to scraping jobs and their instances, as well as which rule files to load.
    3. Prometheus can reload its configuration at runtime.

      1. send SIGHUP;
      2. HTTP POST request to the /-/reload endpoint
    4. scrape_config

      1. Targets with static_configs or dynamic service-discovery;
    5. rule check -> promtool check rules /path/to/example.rules.yml
  2. PromQL

    1. an expression can evaluate to one of four types: instant vector, range vector, scalar, string
    2. metric_name{} can also be written as {__name__="metric_name"}; this makes it possible to query several metrics at once, e.g. {__name__=~"job:.*"} (see the examples after this list)
    3. subquery: <instant_query> '[' <range> ':' [<resolution>] ']' [ @ <float_literal> ] [ offset <duration> ] (<resolution> is optional; the default is the global evaluation interval)
    4. Vector matching

      1. <vector expr> <bin-op> ignoring(<label list>) <vector expr>
      2. <vector expr> <bin-op> on(<label list>) <vector expr>
  3. Storage

    1. format: https://github.com/prometheus/prometheus/blob/release-2.36/tsdb/docs/format/README.md
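
A few PromQL examples for the items above; the metric names (http_requests_total, http_errors_total) and label sets are hypothetical:

# equivalent selectors, using the internal __name__ label
http_requests_total{job="api"}
{__name__="http_requests_total", job="api"}

# select several recording-rule series at once
{__name__=~"job:.*"}

# subquery: 1m-resolution rate samples over the last 30 minutes
rate(http_requests_total[5m])[30m:1m]

# vector matching with ignoring(): assumes http_requests_total has no "code" label
rate(http_errors_total{code="500"}[5m]) / ignoring(code) rate(http_requests_total[5m])

# vector matching with on(): both sides aggregated to one series per job first
sum by (job) (rate(http_errors_total{code="500"}[5m])) / on(job) sum by (job) (rate(http_requests_total[5m]))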

GDB

GNU Debugger (GDB) is a widely used debugger on Linux & Unix; it can debug programs written in C, C++, Objective-C, Go, and more.

GDB can 1) attach to an already running process and debug it, 2) launch a new program under gdb and debug it from the start, and 3) do post-mortem debugging directly on a core dump, as sketched below.
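
A quick sketch of those three modes (the PID, program, and core file names are placeholders):

gdb -p 1234              # 1) attach to a running process with PID 1234
gdb ./myprog             # 2) launch a new program under gdb ...
(gdb) break main         #    ... set a breakpoint and run it
(gdb) run
gdb ./myprog core.1234   # 3) post-mortem debugging of a core dump
(gdb) backtrace          #    inspect the stack at the time of the crash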

GDB command cheat sheets

  1. https://darkdust.net/files/GDB%20Cheat%20Sheet.pdf
  2. https://gist.github.com/rkubik/b96c23bd8ed58333de37f2b8cd052c30

How GDB works internally

GDB uses the ptrace system call to observe and control the execution of other processes. Breakpoints are implemented by replacing the instruction at the target address with a special instruction (int3); when the program executes it, a SIGTRAP is raised and control returns to GDB.
For the specific ptrace operations, see: https://man7.org/linux/man-pages/man2/ptrace.2.html

io.netty.util.internal.OutOfDirectMemoryError thrown by Netty

We recently ran into a puzzling OutOfDirectMemoryError caused by Java direct memory. Our application starts with -XX:MaxDirectMemorySize=1G, and our continuous monitoring data showed direct memory usage of less than 10 MB, so how could it throw io.netty.util.internal.OutOfDirectMemoryError?

Symptoms

The logs contained many errors like this:

msg=Root cause=OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 1069547527, max: 1073741824)&st=io.netty.util.internal.OutOfDirectMemoryError: failed to allocate 4194304 byte(s) of direct memory (used: 1069547527, max: 1073741824)
    at io.netty.util.internal.PlatformDependent.incrementMemoryCounter(PlatformDependent.java:802)
    at io.netty.util.internal.PlatformDependent.allocateDirectNoCleaner(PlatformDependent.java:731)
    at io.netty.buffer.PoolArena$DirectArena.allocateDirect(PoolArena.java:649)
    at io.netty.buffer.PoolArena$DirectArena.newChunk(PoolArena.java:624)
    at io.netty.buffer.PoolArena.allocateNormal(PoolArena.java:203)
    at io.netty.buffer.PoolArena.tcacheAllocateSmall(PoolArena.java:173)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:134)
    at io.netty.buffer.PoolArena.allocate(PoolArena.java:126)
    at io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:394)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
    at io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:179)
    at io.netty.buffer.AbstractByteBufAllocator.ioBuffer(AbstractByteBufAllocator.java:140)
    at io.netty.channel.DefaultMaxMessagesRecvByteBufAllocator$MaxMessageHandle.allocate(DefaultMaxMessagesRecvByteBufAllocator.java:120)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:150)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:722)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:658)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:584)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

Initial diagnosis

The error above tells us a few details:

  1. the Java application has run out of direct memory;
  2. the maximum direct memory is 1 GB (1073741824 bytes), of which about 1 GB (1069547527 bytes) is already in use;
  3. this allocation requested 4 MB (4194304 bytes).

As many of you know, Java provides an MBean for monitoring direct memory usage, and many frameworks can continuously export this data as time series metrics. The figure below shows the direct memory MBean data as seen in JConsole (shipped with the JDK): 6 DirectByteBuffers using 49152 bytes in total.
mbeandirect.png
Using JConsole on a server is inconvenient, but there are plenty of command-line tools that can read MBeans for us. Our company's framework exposes this data as time series, and from those time series it was easy to see that our direct memory usage was below 10 MB. So here is the contradiction: we set a maximum of 1 GB of direct memory and the MBean data showed less than 10 MB in use, yet Netty kept telling us the 1 GB was exhausted and threw OutOfDirectMemoryError.
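
For completeness, here is a minimal sketch of reading the same MBean programmatically with the standard java.lang.management API (the class name is just for illustration):

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectMemoryMBeanDemo {
    public static void main(String[] args) {
        ByteBuffer.allocateDirect(48 * 1024); // allocate one direct buffer so the pool is not empty

        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        for (BufferPoolMXBean pool : pools) {
            // the "direct" pool is what JConsole shows; "mapped" covers MappedByteBuffers
            System.out.printf("%s: count=%d, memoryUsed=%d, totalCapacity=%d%n",
                    pool.getName(), pool.getCount(),
                    pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}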

Reading the Netty code

Following the stack trace, we found the Netty code: io.netty.util.internal.PlatformDependent.incrementMemoryCounter.

    private static void incrementMemoryCounter(int capacity) {
        if (DIRECT_MEMORY_COUNTER != null) {
            long newUsedMemory = DIRECT_MEMORY_COUNTER.addAndGet(capacity);
            if (newUsedMemory > DIRECT_MEMORY_LIMIT) {
                DIRECT_MEMORY_COUNTER.addAndGet(-capacity);
                throw new OutOfDirectMemoryError("failed to allocate " + capacity
                        + " byte(s) of direct memory (used: " + (newUsedMemory - capacity)
                        + ", max: " + DIRECT_MEMORY_LIMIT + ')');
            }
        }
    }

From this code we can see that, in our case, DIRECT_MEMORY_LIMIT is exactly the MaxDirectMemorySize we set, i.e. 1 GB, and DIRECT_MEMORY_COUNTER is the amount of direct memory Netty has allocated so far.
The OutOfDirectMemoryError we saw at the beginning is thrown by this code: clearly, adding the requested 4 MB would push the counter past the 1 GB limit. To verify this, we later took a heap dump. Even though no OutOfDirectMemoryError was being thrown at that moment and the MBean showed only a little over 4 MB of direct memory in use, the value of DIRECT_MEMORY_COUNTER showed more than 490 MB already in use:
heap.png

Why do the direct memory usage reported by the MBean and the value of Netty's DIRECT_MEMORY_COUNTER differ?

To find the difference, let's look at how each of them does its accounting.

How the MBean accounts for direct memory

The JDK's BufferPoolMXBean reports direct and mapped memory usage. It is only an interface; the actual implementation classes can be found in the JDK. sun.management.ManagementFactoryHelper (JDK 8) contains one of them:

private static BufferPoolMXBean
        createBufferPoolMXBean(final sun.misc.JavaNioAccess.BufferPool pool)
    {
        return new BufferPoolMXBean() {
            private volatile ObjectName objname;  // created lazily
            @Override
            public ObjectName getObjectName() {
                ObjectName result = objname;
                if (result == null) {
                    synchronized (this) {
                        result = objname;
                        if (result == null) {
                            result = Util.newObjectName(BUFFER_POOL_MXBEAN_NAME +
                                ",name=" + pool.getName());
                            objname = result;
                        }
                    }
                }
                return result;
            }
            @Override
            public String getName() {
                return pool.getName();
            }
            @Override
            public long getCount() {
                return pool.getCount();
            }
            @Override
            public long getTotalCapacity() {
                return pool.getTotalCapacity();
            }
            @Override
            public long getMemoryUsed() {
                return pool.getMemoryUsed();
            }
        };
    }

The code above shows that the real usage numbers all come from sun.misc.JavaNioAccess.BufferPool. Following that lead, the two actual BufferPool implementations live in java.nio.Bits, tracking direct and mapped usage respectively. For direct memory, the count, totalCapacity, and memory used values are backed by Bits' count, totalCapacity, and reservedMemory fields.

return new sun.misc.JavaNioAccess.BufferPool() {
    @Override
    public String getName() {
        return "direct";
    }
    @Override
    public long getCount() {
        return Bits.count.get();
    }
    @Override
    public long getTotalCapacity() {
        return Bits.totalCapacity.get();
    }
    @Override
    public long getMemoryUsed() {
        return Bits.reservedMemory.get();
    }
};                       

Let us just trace how Bits.reservedMemory tracks the amount already in use:

DirectByteBuffer(int cap) {                   // package-private
        super(-1, 0, cap, cap);
        // ... other code omitted, including the computation of size ...
        Bits.reserveMemory(size, cap);
}

As the code above shows, every time a java.nio.DirectByteBuffer is constructed, it calls Bits.reserveMemory to track the memory usage. So when we want to check direct memory usage, we only need to look at what Bits has recorded.

How Netty's DIRECT_MEMORY_COUNTER does its accounting

DIRECT_MEMORY_COUNTER's accounting is more straightforward: it is used only inside PlatformDependent. Whenever Netty wants to allocate a new chunk of direct memory, it calls incrementMemoryCounter to add the requested capacity to DIRECT_MEMORY_COUNTER.

Root cause

So although both use MaxDirectMemorySize as their limit, neither sees the other's allocations; each only counts the usage it knows about. In our case the stack trace shows allocateDirectNoCleaner, which allocates off-heap memory without going through the DirectByteBuffer constructor path that calls Bits.reserveMemory, so the MBean never sees Netty's allocations. That is why the value exposed by the MBean does not match the real usage.
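
A rough sketch of that divergence, assuming Netty 4.1 on the classpath; on a setup like ours, where Netty takes the allocateDirectNoCleaner path, the MBean number stays small while Netty's own counter grows:

import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.util.internal.PlatformDependent;

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class CounterDivergenceDemo {
    public static void main(String[] args) {
        // allocate 4 MB of pooled direct memory through Netty, like the failing allocation in the log
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(4 * 1024 * 1024);

        // Netty's counter, the one compared against MaxDirectMemorySize in incrementMemoryCounter
        System.out.println("Netty usedDirectMemory = " + PlatformDependent.usedDirectMemory());

        // the JDK "direct" buffer pool MBean, backed by java.nio.Bits
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                System.out.println("MBean memoryUsed       = " + pool.getMemoryUsed());
            }
        }

        buf.release();
    }
}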

Digression: about ByteBuffer

The class hierarchy of ByteBuffer in the JDK is shown in the figure below. Buffer is an abstract class that defines the common methods, such as limit(), reset(), mark(), flip(), rewind(), isDirect(), etc. Below Buffer there is a buffer class for each primitive type. ByteBuffer is further divided into HeapByteBuffer, which lives on the heap, and MappedByteBuffer & DirectByteBuffer, which live in direct memory. The ByteBuffer classes ending in R are the read-only variants.
ByteBuffer.png
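
A tiny illustration of the hierarchy:

import java.nio.ByteBuffer;

public class ByteBufferKinds {
    public static void main(String[] args) {
        ByteBuffer heap   = ByteBuffer.allocate(1024);        // HeapByteBuffer, backed by a byte[] on the heap
        ByteBuffer direct = ByteBuffer.allocateDirect(1024);  // DirectByteBuffer, tracked by java.nio.Bits
        ByteBuffer ro     = heap.asReadOnlyBuffer();          // HeapByteBufferR, the read-only "R" variant

        System.out.println(heap.isDirect());   // false
        System.out.println(direct.isDirect()); // true
        System.out.println(ro.isReadOnly());   // true
    }
}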

Digression: how Netty exposes this data through its own metrics

Netty actually already provides metrics for this; you just have to export the data yourself. The metrics are exposed through PooledByteBufAllocatorMetric.
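
A small sketch (Netty 4.1) of reading those metrics, which can then be pushed into whatever metrics framework is in use:

import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.PooledByteBufAllocatorMetric;

public class NettyAllocatorMetricsDemo {
    public static void main(String[] args) {
        PooledByteBufAllocatorMetric metric = PooledByteBufAllocator.DEFAULT.metric();
        System.out.println("usedDirectMemory = " + metric.usedDirectMemory());
        System.out.println("usedHeapMemory   = " + metric.usedHeapMemory());
        System.out.println("numDirectArenas  = " + metric.numDirectArenas());
        System.out.println("chunkSize        = " + metric.chunkSize());
    }
}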

The JDK sun.misc.Unsafe class

There is a class in the JDK that you may never have used directly, but that you keep running into at key points when reading JDK source code: sun.misc.Unsafe. Why isn't this class named with a noun? Because it reads like a warning, telling you: don't use it, don't use it.
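
For instance, here is a minimal sketch of getting hold of it on JDK 8 via reflection (theUnsafe is the internal singleton field; this is purely illustrative and exactly the kind of use the name warns against):

import sun.misc.Unsafe;

import java.lang.reflect.Field;

public class UnsafeDemo {
    public static void main(String[] args) throws Exception {
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long addr = unsafe.allocateMemory(16); // raw off-heap memory, invisible to the GC and to java.nio.Bits
        unsafe.putLong(addr, 42L);
        System.out.println(unsafe.getLong(addr)); // 42
        unsafe.freeMemory(addr);
    }
}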


Installing common tools with Docker

I have an idle server, so I installed the commonly used tools on it to make it easy to try things out:

Nginx

sudo docker run --restart always --name nginx -v /home/supra/work/data/nginx_html:/usr/share/nginx/html:ro -v /home/supra/work/data/nginx_config/mime.types:/etc/nginx/mime.types:ro  -p 80:80 -d nginx

mongoDB

sudo docker network create mongo-network
sudo docker run --network mongo-network --restart always -p 27017:27017 --volume /home/supra/work/data/mongo/grafana:/data/db --name mongodb -d mongo
sudo docker run --network mongo-network --restart always -e ME_CONFIG_MONGODB_SERVER=mongodb -p 8081:8081 --name mongoui mongo-express

elasticSearch & kibana

参考: https://www.elastic.co/guide/en/kibana/current/docker.html

sudo docker network create elastic
sudo docker run --restart always --name es01 --network elastic -p 9200:9200 -p 9300:9300 -e "discovery.type=single-node" -d docker.elastic.co/elasticsearch/elasticsearch:7.16.1
sudo docker run --restart always --name kib01 --network elastic -p 5601:5601 -e "ELASTICSEARCH_HOSTS=http://es01:9200" -d docker.elastic.co/kibana/kibana:7.16.1

Splunk

sudo docker run --restart always -p 8000:8000 -e "SPLUNK_START_ARGS=--accept-license"  -e "SPLUNK_PASSWORD=Sre@2021" --name splunk -d splunk/splunk

Clickhouse

docker run -d --name clickhouse-server --ulimit nofile=5120:5120 --volume=/home/supra/work/data/clickhouse:/var/lib/clickhouse -p 8123:8123 -p 9000:9000 yandex/clickhouse-server

Redis

参考: https://hub.docker.com/_/redis

$ docker network create redis-network
$ sudo docker run --network redis-network --restart always --volume /home/supra/work/data/redis/data:/data --name redis -p 6379:6379 -d redis redis-server --save 60 1 --loglevel warning
$ docker run -it --network redis-network --rm redis redis-cli -h redis

Prometheus

参考: https://prometheus.io/docs/prometheus/latest/installation/

sudo docker run -d --restart always --name prometheus -p 9090:9090 -v /home/supra/work/data/prometheus:/etc/prometheus prom/prometheus

MySQL & phpmyadmin

参考: https://hub.docker.com/_/mysql

mkdir /home/supra/work/data/mysql/data
docker run --restart always --name mysqld -v /home/supra/work/data/mysql/data:/var/lib/mysql -e MYSQL_ROOT_PASSWORD=Sre@2022 -d -p 3306:3306 -e "MYSQL_USER=sre" -e "MYSQL_PASSWORD=Sre@2022" mysql
# create the phpmyadmin UI
docker run --restart always --name phpmyadmin -d --link mysqld:db -p 8082:80 phpmyadmin

continuumio/anaconda3

docker run --restart always -d --name anaconda3  -p 8888:8888 continuumio/anaconda3 /bin/bash -c "\
    conda install jupyter -y --quiet && \
    mkdir -p /opt/notebooks && \
    jupyter notebook --NotebookApp.token='' --NotebookApp.password='' \
    --notebook-dir=/opt/notebooks --ip='*' --port=8888 \
    --no-browser --allow-root"

Also, put an index.html in nginx's html directory as a landing page:

<li><a target="_blank" href=":5601/">elastic</a></li>
<li><a target="_blank" href=":8000/">Splunk(admin/Sre@2021)if expired, reinstall</a></li>
<li><a target="_blank" href=":8081/">MongoUI</a></li>
<li><a target="_blank" href=":8123/">ClickHouseUI</a></li>
<li><a target="_blank" href="/">RedisUI</a></li>
<li><a target="_blank" href=":9090/">Prometheus</a></li>

<li><a target="_blank" href="https://www.tianxiaohui.com/display/~xiatian/supra">wiki</a></li>

<script>
    (function() {
        const links = document.querySelectorAll("a");
        links.forEach(function(ele){
                ele.href = ele.href.replace("/:", ":");
        });
    })();
</script>