Eric 发布的文章

关于 HandshakeCompletedNotify-Thread 线程

最近诊断一个线程泄漏问题的时候, 发现大量的 HandshakeCompletedNotify-Thread 线程, 观察它的代码栈, 发现源自于 JDK 代码 sun.security.ssl.TransportContext.finishHandshake.

每次 SSL 连接建立都创建一个新的线程去做通知, 显然是有性能问题. 可是这个问题尚没有修复, 看上去为了兼容老的 API没人想修复: https://bugs.openjdk.java.net/browse/JDK-8246039

所以尽量连接重用, 减少性能开销.

创建线程栈:

HandshakeCompletedNotify-Thread
java.lang.Thread.start(Thread.java)
sun.security.ssl.TransportContext.finishHandshake(TransportContext.java:620)
sun.security.ssl.Finished$T12FinishedConsumer.onConsumeFinished(Finished.java:546)
sun.security.ssl.Finished$T12FinishedConsumer.consume(Finished.java:515)
sun.security.ssl.SSLHandshake.consume(SSLHandshake.java:377)
sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:444)
sun.security.ssl.HandshakeContext.dispatch(HandshakeContext.java:422)
sun.security.ssl.TransportContext.dispatch(TransportContext.java:182)
sun.security.ssl.SSLTransport.decode(SSLTransport.java:156)
sun.security.ssl.SSLSocketImpl.decode(SSLSocketImpl.java:1409)
sun.security.ssl.SSLSocketImpl.readHandshakeRecord(SSLSocketImpl.java:1315)
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:439)
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:410)
sun.net.www.protocol.https.HttpsClient.afterConnect(HttpsClient.java:559)
sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:197)
sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1570)
sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1498)
sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:268)

这段代码:

    HandshakeStatus finishHandshake() {
        if (protocolVersion.useTLS13PlusSpec()) {
            outputRecord.tc = this;
            inputRecord.tc = this;
            cipherSuite = handshakeContext.negotiatedCipherSuite;
            inputRecord.readCipher.baseSecret =
                    handshakeContext.baseReadSecret;
            outputRecord.writeCipher.baseSecret =
                    handshakeContext.baseWriteSecret;
        }

        handshakeContext = null;
        outputRecord.handshakeHash.finish();
        inputRecord.finishHandshake();
        outputRecord.finishHandshake();
        isNegotiated = true;

        // Tell folk about handshake completion, but do it in a separate thread.
        if (transport instanceof SSLSocket &&
                sslConfig.handshakeListeners != null &&
                !sslConfig.handshakeListeners.isEmpty()) {
            HandshakeCompletedEvent hce =
                new HandshakeCompletedEvent((SSLSocket)transport, conSession);
            Thread thread = new Thread(
                null,
                new NotifyHandshake(sslConfig.handshakeListeners, hce),
                "HandshakeCompletedNotify-Thread",
                0,
                false);
            thread.start();
        }

        return HandshakeStatus.FINISHED;
    }

关于进来流量从网卡到 kernel 再到 select/poll/epoll 再到 tomcat 的 requst/response

先把这几篇放这里, 等有空再把内容转成图片

网卡到 sk_buffer
https://packagecloud.io/blog/monitoring-tuning-linux-networking-stack-receiving-data/
https://www.bilibili.com/read/cv10228233
https://www.eet-china.com/mp/a32619.html

select/poll/epoll
https://copyconstruct.medium.com/nonblocking-i-o-99948ad7c957
https://copyconstruct.medium.com/the-method-to-epolls-madness-d9d2d6378642

记录一次tcp连接持续断掉的排查过程

有人上报说他们的 app 连接另外一个微服务的时候, 连接经常断掉, 不知道什么原因.

查看网络方面的监控, 网络一切安好.

查看日志, 看到一些异常信息: java.nio.channels.ClosedByInterruptException. 顺着这个信息, 我们找到了根本原因.

  1. 既然是 java.nio.channels.ClosedByInterruptException, 那么我们可以用 btrace 查看为什么这个异常被创建, 得到结果如下:

    java.nio.channels.ClosedByInterruptException.<init>(ClosedByInterruptException.java:51)
    java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:202)
    sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:477)
    com.tianxiaohui.java.BufferedWriteChannel.flush(BufferedWriteChannel.java:281)
    com.tianxiaohui.java.BufferedWriteChannel.write(BufferedWriteChannel.java:223)
    com.tianxiaohui.java.BaseClient.sendBytesActual(BaseClient.java:339)
    com.tianxiaohui.java.BaseClient.sendBytesOverChannel(BaseClient.java:203)
    ...
    rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:48)
    rx.internal.operators.OnSubscribeLift.call(OnSubscribeLift.java:30)
    rx.Observable.unsafeSubscribe(Observable.java:10327) 
    rx.internal.operators.OperatorSubscribeOn$SubscribeOnSubscriber.call(OperatorSubscribeOn.java:100)
    rx.xinternal.schedulers.CachedThreadScheduler$EventLoopWorker$1.call(CachedThreadScheduler.java:230)
    rx.internal.schedulers.ScheduledAction.run(ScheduledAction.java:55)
    ...
     java.lang.Thread.run(Thread.java:748)
  2. 既然是 interrupt 异常, 那么就一定有其它线程给这个线程发 interrupt, 于是通过下面的 btrace 代码, 我们就能找到发 interrupt 的线程

    import org.openjdk.btrace.core.annotations.*;
    import static org.openjdk.btrace.core.BTraceUtils.*;
    
    import org.openjdk.btrace.core.BTraceUtils.Strings;
    
    @BTrace
    public class ThreadInterruptTrace {
    
      @OnMethod( clazz="/java\\.nio\\.channels\\.ClosedByInterruptException/", method="<init>" )
      public static void createException() {
          println(Strings.strcat(name(currentThread()), " --- get ClosedByInterruptException "));
          println(jstackStr());
      }
    
      @OnMethod( clazz="/java\\.lang\\.Thread1/", method="/interrupt/" )
      public static void t1r(@ProbeClassName String probeClass, @Self Thread self) {
          if (Strings.startsWith(name(self), "RxIoScheduler")) {
              println(Strings.strcat(Strings.strcat(name(currentThread()), " try to interrupt ->  "), str(self)));
                println(jstackStr());
            }
        }
    }
  3. 找到发 interrupt 的线程, 那么就去看代码, 找了发生 interrupt 的原因, 最终原因是下面的代码. 每次 create 一个 Observable 都要新起一个线程池做为 scheduler, 其实不需要. 到时每次新建一个 scheduler 线程池导致创建太多线程池, 消耗大量内存, 之后又自动销毁. 然后这些 scheduler 销毁的时候, 还要通知 I/O 线程(这里的 I/O 线程池用的默认的), 也就是给 I/O 线程发一个 interrupt. 最终导致了我们看到的问题.

    Observable.create(new Observable.OnSubscribe<List<T>>() {
      @Override
      public void call(Subscriber<? super List<T>> subscriber) {
               //some code here
          }
      }).}).observeOn(Schedulers.newThread())
           .subscribeOn(Schedulers.io())
           .timeout(threadTimeout, TimeUnit.MILLISECONDS);
    

JVM 的 debug 符号

下面内容是从 https://www.opsian.com/faq/ 页面抄的:

If you've downloaded a JVM from a normal distribution - for example Oracle, Azul or Adopt OpenJDK - then it will come with debug symbols built in. Many distributions do not ship debug symbols with their JDK packages, this will result in some features of the Opsian agent not being available. It is easy to install the debug packages.

Ubuntu - If you are using Java 11 then run apt install openjdk-11-dbg or if you're running Java 8 then run apt install openjdk-8-dbg
Red Hat / Centos - If you are running version 7 or later then you need to Enable the DebugInfo repository on Red Hat or Centos. Then you need to install the appropriate debug info package for your JDK version, for example yum -y install java-1.8.0-openjdk-debuginfo.x86_64 if you're using Java 8.

附带一张 18 年参加 SRE con 上面别人的总结:
srecon18americas_slides_goldshtein_pdf__page_8_of_74_.png

btrace 常见的代码

通常一个 btrace 代码长这样:

import org.openjdk.btrace.core.annotations.*;
import static org.openjdk.btrace.core.BTraceUtils.*;
 
import org.openjdk.btrace.core.BTraceUtils.Strings;
 
@BTrace
public class ClosedByInterruptExceptionTracer {
 
    @OnMethod( clazz="/java\\.nio\\.channels\\.ClosedByInterruptException/", method="<init>" )
    public static void createException() {
        println(Strings.strcat("current thread: ", name(currentThread())));
        println(jstackStr());
    }

    @OnMethod( clazz="/com\\.tianxiaohui\\.Example/", method="/methodA/" )
    public static void createException(@ProbeClassName String probeClass, @ProbeMethodName String probeMethod, Object param1) {
        println(Strings.strcat("praram1 is: ", BTraceUtils.classOf(param1)));
        BTraceUtils.printFields(param1);
    }
}

其它常用的代码:

  1. 获得当前线程

    println(Strings.strcat("current thread name: ", name(currentThread())));
    println(str(currentThread()));
  2. 打印某个对象的自身属性

    str(obj);
  3. 打印返回值

    @BTrace
    public class CaptureReturn {
    
        @OnMethod(clazz = "/com\\.tianxiaohui\\.ProcessToken/", method = "/validate/", location=@Location(Kind.RETURN) )
        public static void capture(@Return Object result) {
            BTraceUtils.printFields(result);
            println(jstackStr());
        }
    }
  4. 打印当前时间

    println("current time " + BTraceUtils.timestamp("yyyy-MM-dd' 'HH:mm:ss"));
  5. 对字符串操作可以用 BTraceUtils.Strings 的方法, 比如

    if (Strings.startsWith(fileName, "MIME")) {
  6. 创建很多 class 造成 metaspace 满 https://github.com/btraceio/btrace/blob/4f9f1c446da2105cd7e83dba63e26761a629103e/btrace-dist/src/main/resources/samples/Classload.java
  7. 创建新线程: https://github.com/btraceio/btrace/blob/4f9f1c446da2105cd7e83dba63e26761a629103e/btrace-dist/src/main/resources/samples/ThreadStart.java
  8. 更多例子: https://github.com/btraceio/btrace/tree/4f9f1c446da2105cd7e83dba63e26761a629103e/btrace-dist/src/main/resources/samples

另外一个例子:

import org.openjdk.btrace.core.annotations.*;
import static org.openjdk.btrace.core.BTraceUtils.*;

import org.openjdk.btrace.core.BTraceUtils.Strings;

@BTrace
public class ThreadInterruptTrace {

    @OnMethod( clazz="/java\\.nio\\.channels\\.ClosedByInterruptException/", method="<init>" )
    public static void createException() {
        println(Strings.strcat(name(currentThread()), " --- get ClosedByInterruptException "));
        println(jstackStr());
    }

    @OnMethod( clazz="/java\\.lang\\.Thread1/", method="/interrupt/" )
    public static void t1r(@ProbeClassName String probeClass, @Self Thread self) {
        if (Strings.startsWith(name(self), "RxIoScheduler")) {
            println(Strings.strcat(Strings.strcat(name(currentThread()), " try to interrupt ->  "), str(self)));
            println(jstackStr());
        }
    }
}