While the cluster was running, all three nodes suddenly started reporting the following exception:
2021-09-13 12:08:08,779 ERROR failed to req API:http://10.12.105.24:8848/nacos/v1/ns/distro/checksum
java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-33476 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
    at java.lang.Thread.run(Thread.java:748)
2021-09-13 12:08:08,828 ERROR failed to req API:http://10.12.105.26:8848/nacos/v1/ns/distro/checksum
java.net.SocketTimeoutException: 10,000 milliseconds timeout on connection http-outgoing-33477 [ACTIVE]
    at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:92)
    at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:39)
    at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:502)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
    at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
    at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
    at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
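Our first theory about this exception was that request traffic keeps hitting every node during leader election, which makes the election fail. So we added an nginx check in front of the cluster: once traffic is switched to a single node, the other two recover quickly. That is only a temporary workaround, and we still cannot explain why all three nodes suddenly ran into this problem. The firewalls are all disabled, and the cluster holds 6000+ services. It looks like a health-check problem to us; please help take a look.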

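In the Nacos source code, this piece of code sets the default number of connections to the number of processors visible to the JVM * 2. Our machines have 8 cores, so the connection pool holds only 16 connections, and none of them is released by a default timeout. Our preliminary conclusion is that this is the cause: no connectionTimeout is set.
The thread dump shows that all 200 Tomcat threads are stuck waiting to acquire a connection. While the cluster performs service checks and syncData synchronization it keeps requesting connections, and none are available. The 200 threads on each of the three nodes are all used up, so the nodes deadlock against each other and cannot handle any other requests.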
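To make the arithmetic concrete, here is a minimal sketch of the suspected failure mode, not the Nacos implementation. It assumes a bounded pool sized at processors * 2 and modeled with a plain `java.util.concurrent.Semaphore`. The class `BoundedPoolSketch` and its methods are hypothetical names introduced only for illustration. With 8 cores and 200 request threads, only 16 leases can succeed. A timed lease lets the remaining callers fail fast, whereas an untimed wait would leave them blocked as in the thread dump.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a bounded HTTP connection pool; not the Nacos implementation.
final class BoundedPoolSketch {
    private final Semaphore permits;

    BoundedPoolSketch(int maxConnections) {
        this.permits = new Semaphore(maxConnections);
    }

    // Sizing rule described in the report: available processors * 2.
    static int defaultPoolSize(int processors) {
        return processors * 2;
    }

    // Lease a connection, waiting at most timeoutMs. A false result means
    // the pool is exhausted, so the caller can fail fast instead of hanging.
    boolean lease(long timeoutMs) {
        try {
            return permits.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Return a leased connection to the pool.
    void release() {
        permits.release();
    }

    // Sequentially simulate `workers` requests that each lease a connection which
    // is never released (as with a stuck peer), and count how many get nothing.
    static int countStarved(int processors, int workers, long timeoutMs) {
        BoundedPoolSketch pool = new BoundedPoolSketch(defaultPoolSize(processors));
        int starved = 0;
        for (int i = 0; i < workers; i++) {
            if (!pool.lease(timeoutMs)) {
                starved++;
            }
        }
        return starved;
    }

    public static void main(String[] args) {
        // 8 cores and 200 Tomcat threads, the figures from the report.
        System.out.println("pool size = " + defaultPoolSize(8));            // 16
        System.out.println("starved workers = " + countStarved(8, 200, 1)); // 184
    }
}
```

In this model, 184 of the 200 callers never obtain a connection. Whether the real pool behaves this way depends on how the HTTP client is configured, so treat the sketch as an illustration of the hypothesis rather than a confirmed diagnosis.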