2020年8月

诊断不正确的 JSONArray 引起的死循环导致 CPU 使用率高

作者: Eric
时间: August 31, 2020
分类: Troubleshooting 相关
评论

最近有个应用经常 CPU 使用率经常非常高, 一旦升上去, 很难再下来, 除非进行重启. 使用 htop 观察, 出现 CPU 使用率高的并不是一个线程. 然后使用 async-profiler 查看, 发现代码很像死循环, 99%以上的栈都被 JSONArray 初始化占用了:

看上去一个很正常的 JSONArray 初始化为什么会占用这么多 CPU 时间呢? 难道是处理的业务较多? 一开始开发人员认为最近的促销导致业务量暴涨导致的. 其实做一个 heap dump 很快发现了事情的元凶.

我们的这段代码是这么写的(这里只是展示基本含义, 没做其它验证):

String str = "some string";
if (str.startWith("[")) {
   new org.codehaus.jettison.json.JSONArray(str);
} else {
   //something else
}

问题就出现在这个 str, 如果它的格式有问题, 不是真正的数组, 那么就会出现问题, 比如下面的代码:

       public static void main(String[] args) {
        String str = "[*/*A25] **";
        try {
            System.out.println("start");
            System.out.println(new JSONArray(str));
            System.out.println("end");
        } catch (JSONException e) {
            e.printStackTrace();
        }
    }

如果 debug 上面的代码, 就会发现出现了死循环, 线程再也无法退出. stuck 在了 JSONTokener L179 ~ 188 -> next() -> back() -> next(). 一直 next(), back().

于是给这个 lib 提了一个 issue: https://github.com/jettison-json/jettison/issues/36

解决chrome extensions - Tabbed Postman - REST Client 无法导入问题

作者: Eric
时间: August 26, 2020
分类: chrome 插件
评论

虽然 Postman 早就出了 Chrome Apps版本, 不过因为积累了很多老版本 Tabbed Postman 的例子, 大多数时间还是使用老版本的 Postman, 张这样:

有次重装电脑, 把老版本里面的数据导出来, 然后想导入到新系统的 chrome的 Tabbed Postman 里面去, 不过每次都显示: it works, 却从来没有导入成功过, 原因在于某段 JavaScript 出错, 今天为了使用某个数据, 只能手工 fix 这个 bug 了.

找到 Tabbed Postman Extension 的代码地址 (你可能要替换掉最后一部分):

~/Library/Application\ Support/Google/Chrome/Default/Extensions/coohjcphdfgbiolnekdpbcijmhambjff

把上面的整个文件夹复制到 ~/Download 目录下

修改 0.8.4.19_0/js/request.js (前面是我使用的 Tabbed Postman 的版本号), 修改的函数是: importCollectionData.

 importCollectionData:function (coll) {
   for (let i = 0; i < coll.collections.length; i++) {
       let collection = coll.collections[i];
       pm.indexedDB.addCollection(collection, function (c) {
           var message = {
               name:collection.name,
               action:"added"
           };
 
           $('.modal-import-alerts').append(Handlebars.templates.message_collection_added(message));
 
           var requests = [];
 
           var ordered = false;
           if ("order" in collection) {
               ordered = true;
           }
 console.log(collection);
           for (var i = 0; i < collection.requests.length; i++) {
               var request = collection.requests[i];
               request.collectionId = collection.id;
 
                 /*Handling rawModeData */
               if(request.hasOwnProperty("rawModeData")) {
                   request.data = request.rawModeData;
               }  
 
 
               var newId = guid();
 
               if (ordered) {
                   var currentId = request.id;
                   var loc = _.indexOf(collection["order"], currentId);
                   collection["order"][loc] = newId;
               }
 
               request.id = newId;
 
               if ("responses" in request) {
                   var j, count;
                   for (j = 0, count = request["responses"].length; j < count; j++) {
                       request["responses"][j].id = guid();
                       request["responses"][j].collectionRequestId = newId;                        
                   }
               }
 
               pm.indexedDB.addCollectionRequest(request, function (req) {});
               requests.push(request);
           }
 
           pm.indexedDB.updateCollection(collection, function() {});
           
           collection.requests = requests;
           pm.collections.render(collection);
       });
   }
 },

然后在 Chrome Extensions 里面选择 Load unpacked, 选择刚才修改的版本, 就可以导入了.

JVM Heap dump 相关知识

作者: Eric
时间: August 26, 2020
分类: Java 相关
评论

最近在研究 JVM heap 相关的东西, 把相关的内容整理在这里

什么是 HPROF
HPROF 是 Heap and CPU Profiling Agent 的工具, 是通过使用 JVM TI 的接口实现的 agent, 它在 JDK 里面是用来展示如何使用 JVM TI 的小工具. 同时它也是 Hotspot JVM 产生的 heap dump 的文件格式. HPROF 的格式参考下面. https://docs.oracle.com/javase/8/docs/technotes/samples/hprof.html. 使用以下命令, 查看帮助

java -agentlib:hprof=help
HPROF: Heap and CPU Profiling Agent (JVMTI Demonstration Code)

hprof usage: java -agentlib:hprof=[help]|[<option>=<value>, ...]

Option Name and Value  Description                    Default
---------------------  -----------                    -------
heap=dump|sites|all    heap profiling                 all
cpu=samples|times|old  CPU usage                      off
monitor=y|n            monitor contention             n
format=a|b             text(txt) or binary output     a
file=<file>            write data to file             java.hprof[{.txt}]
net=<host>:<port>      send data over a socket        off
depth=<size>           stack trace depth              4
interval=<ms>          sample interval in ms          10
cutoff=<value>         output cutoff point            0.0001
lineno=y|n             line number in traces?         y
thread=y|n             thread in traces?              n
doe=y|n                dump on exit?                  y
msa=y|n                Solaris micro state accounting n
force=y|n              force output to <file>         y
verbose=y|n            print messages about dumps     y

什么是 JVM TI
JVM TI 是 JDK 5 开始, 取代早期(JDK 4 及以前) JVM 提供的 DI (Debug Interface) 和 PI (Profiling Interface) 的统一接口. 参考
HPROF 文件里面都可以有啥?
CPU profiling 的数据, 一般以 stack trace sample 的形式存在; Heap 内存分配的信息; 以及锁竞争的信息.

参考:

JDK 自带文档 demo/jvmti/hprof/src/manual.html
http://www.brendangregg.com/blog/2014-06-09/java-cpu-sampling-using-hprof.html

JVM 的 "VM Thread" 线程

作者: Eric
时间: August 25, 2020
分类: Java 相关
评论

最近在帮一个同事看一个 Java 应用为什么会被经常重启的时候, 定位到 Java 应用中一个线程经常以 CPU 使用率100%的方式出现在 htop 的头部, 它是什么线程, 为什么会导致一个 CPU 100%? 这引起了我的兴趣.

VM Thread 做什么的?

"VM Thread" 是 JVM 自身启动的一个线程, 它主要用来协调其它线程达到安全点(Safepoint). 需要达到安全点的线程主要有: Stop the world 的 GC, 做 thread dump, 线程挂起以及偏向锁的撤销.
一般在 Thread dump 中, 它的表现为下面的样子, 没有线程栈信息:

"VM Thread" os_prio=0 tid=0x00007f5a212dc000 nid=0x1ad8d runnable

同时, 在 Thread dump 中, 有很多线程虽然当时不需要锁同步, 却能看到很多线程在 BLOCKED 状态, 如下:


"MyThreadPool-2" #77 daemon prio=5 os_prio=0 tid=0x00007f597a01c000 nid=0x1ae5e waiting for monitor entry [0x00007f591558a000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at com.sun.org.apache.xml.internal.dtm.ref.DTMManagerDefault.<init>(DTMManagerDefault.java:96)
    at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.<init>(XSLTCDTMManager.java:68)
    at com.sun.org.apache.xalan.internal.xsltc.dom.XSLTCDTMManager.newInstance(XSLTCDTMManager.java:78)

为什么这里出现 VM Thread 占用这么多 CPU

原因是这个程序因为内存泄漏, heap 几乎用光, 使用 jstat 命令看到, 每隔大约3s左右, 就有一次新的 full GC, 而CMS 的 full GC 只有单线程在工作, 所以只有一个 CPU core 在忙.

Refer:

Java 无法 attach 到目标进程, 使用 core dump 转换为 Java heap dump

作者: Eric
时间: August 23, 2020
分类: Troubleshooting 相关
评论

有时候, 当我们尝试对一个已经 GC overhead 非常高的 Java 应用进程去做 heap dump 的时候, 或者使用 jstack/jcmd 去做 thread dump 的时候, 发现我们跟本没办法 attach 到目标 Java 进程, 得到下面的错误:

/usr/bin/jcmd 7674 GC.heap_dump /tmp/heap.log.hprof
7674:
com.sun.tools.attach.AttachNotSupportedException: Unable to open socket file: target process not responding or HotSpot VM not loaded
at sun.tools.attach.LinuxVirtualMachine.<init>(LinuxVirtualMachine.java:106)
at sun.tools.attach.LinuxAttachProvider.attachVirtualMachine(LinuxAttachProvider.java:63)
at com.sun.tools.attach.VirtualMachine.attach(VirtualMachine.java:213)
at sun.tools.jcmd.JCmd.executeCommandForPid(JCmd.java:140)
at sun.tools.jcmd.JCmd.main(JCmd.java:129)

一般我们会多尝试几次, 期望有哪次可以 attach 到目标 Java 进程, 不过有时候, 尝试好多次都无法达到目的.

除了这种尝试的方法之外, 我们还可以尝试另外一种方式: 先获得目标进程的 core dump, 再从 core dump 转成 Java heap dump.

具体的方法步骤是:

安装 gcore

$ sudo apt install gdb -y   #gcore command in gdb package

设置 core dump 可允许的大小. 通常生产环境的 core dump 都是禁止的, 一方面 core dump 特别大, 占用磁盘比较大, 另外 core dump 可能包含运行时内存的敏感数据. 如果你的生产环境的 core size 是 unlimited 的或者特别大, 可以忽略这步.

$ sudo cp /etc/security/limits.conf /etc/security/limits.d/   # 复制limits.conf模板文件到配置文件夹
$ sudo vim /etc/security/limits.d/limits.conf     # 编辑配置文件, 修改我们目标 Java 进程的用户的配置
#### 我们假设我们的目标进程使用 appUser 运行, 给他设置 core file 的 hard, soft 允许大小
      appUser hard core unlimited
      appUser soft core unlimited

$ sudo su appUser  # 切换到 appUser
$ ulimit -a        # 查看它的 core file size 的 hard, soft 大小   
### 有时候, 即便你改好了, 还是看到 core size 是0, 不过不要悲观, 继续下去, 可以参数 core dump.                                                                         
$ exit             # 退出 appUser

获取 core dump

$ pgrep java   # 显示查看目标进程的 pid
$ sudo gcore <pid>     # 使用 gcore 获取 core dump, 根据目标进程的内存占用大小, 可能会花费不同的时间
### 有时候使用别的用户, 即使使用 sudo 也不能 ptrace, 会得到 ptrace 无法 attach 的错误, 这时候, 切换到
### appUser 用户, 直接 gcore <pid> 就行. 如果报无法写入, 看你在哪个文件夹, 到一个可写的文件夹就好了
$ ls -lah core.<pid>   # 查看获取的 core dump 的大小, 通常在当前目录

把 core file 的 hard, soft 修改的配置改回来

$ sudo rm -f /etc/security/limits.d/limits.conf # 之前我们把模板加到这个文件, 现在删掉

转换 core dump 到 Java heap dump. 一定要使用运行对应目标进程运行的 JDK.
如果你是 Hotspot JDK (Oracle 或者 OpenJDK):

$ find /app -name jmap     # 找到运行目标进程的 JDK 里面的 jmap 命令
$ sudo /usr/bin/jmap  -dump:format=b,file=heap.hprof  /usr/bin/java ./core.<pid>   #转换

如果你是 IBM J9 JDK:

$ find /app -name jextract  # 找到运行目标进程的 JDK 里面的 jextract 命令
$ jextract core.<pid>       # 从 core dump 转成 heap dump core.<pid>.zip
## 对于这个 core.<pid>.zip 使用 MAT 并且安装的 dtfj 插件, 可以打开, 或者使用 IBM JDK 的 jdmpview 工具里 
## 面的 heapdump 命令, 可以把 这个 zip 文件转换成 IBM PHD 格式的 heap dump, 然后使用带 DTFJ 插件的 MAT 
## 工具分析. 
$ /usr/bin/jdmpview 
$ heapdump
$ ls -lah
-rw-r--r--  1 root    root    7.5G Aug 21 17:51 core.16751
-rw-rw-r--  1 user1   user1   926M Aug 21 19:09 core.16751.zip
-rw-rw-r--  1 user1   user1    19M Aug 21 19:46 core.16751.zip.phd

到这里, 获取 heap dump 的过程结束, 下面是怎么从 prod 环境复制 heap dump 到本地环境. 你可以选择使用 scp 或者 rsync 复制到本地, 也可以向下面一样, 使用 nc 建立一个下载服务器, 从本地下载.

设置 nc 下载服务

$ sudo apt install netcat -y # 若 nc 没有安装, 则安装
$ sudo nc -4 -v -l 7070 < ./heap.hprof   # 一种 nc 支持这种参数
$ sudo nc -v -l -p 7070 < ./heap.hprof   # 另外一种支持这种参数

下载到本地
1. 使用浏览器访问 http://:7070/
2. 或者使用 curl 命令去下载 → curl "http://:7070/" --output heap.hprof
清理
```
$ rm core.<pid>  dump.hprof
```
有了 Heap dump, 你就可以使用 MAT 或者 JVisualVM 查看里面的内容了.