Incident symptoms
The informer suddenly stopped receiving pod change events.
The log showed the following errors:
INFO [2019/10/22 11:29:15.043][INFO][ReflectorRunnable:98] class io.kubernetes.client.models.V1Pod#Read timeout retry list and watch
INFO [2019/10/22 11:29:16.043][INFO][ReflectorRunnable:45] class io.kubernetes.client.models.V1Pod#Start listing and watching...
Exception in thread "Thread-10" java.lang.IllegalStateException: Queue full
at java.util.AbstractQueue.add(AbstractQueue.java:98)
at java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:312)
at io.kubernetes.client.informer.cache.ProcessorListener.add(ProcessorListener.java:75)
at io.kubernetes.client.informer.cache.SharedProcessor.distribute(SharedProcessor.java:101)
at io.kubernetes.client.informer.impl.DefaultSharedIndexInformer.handleDeltas(DefaultSharedIndexInformer.java:183)
at io.kubernetes.client.informer.cache.DeltaFIFO.pop(DeltaFIFO.java:313)
at io.kubernetes.client.informer.cache.Controller.processLoop(Controller.java:140)
at io.kubernetes.client.informer.cache.Controller.run(Controller.java:107)
at java.lang.Thread.run(Thread.java:748)
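Investigation
Tracing the code shows that ProcessorListener creates its queue with a fixed default capacity of 1000, so it can hold at most 1000 pending notifications. When the number of watched pods exceeds 1000, the exception above is thrown.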
ProcessorListener
private static final int DEFAULT_QUEUE_CAPACITY = 1000;

public ProcessorListener(ResourceEventHandler<ApiType> handler, long resyncPeriod) {
  this.resyncPeriod = resyncPeriod;
  this.handler = handler;
  this.queue = new ArrayBlockingQueue<>(DEFAULT_QUEUE_CAPACITY);
  ......
}

public void add(Notification<ApiType> obj) {
  if (obj == null) {
    return;
  }
  this.queue.add(obj);
}
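Annoyingly, this constructor is the only place where the class initializes its queue, which means there is no way to change the capacity.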
AbstractQueue
public boolean add(E e) {
  if (offer(e))
    return true;
  else
    throw new IllegalStateException("Queue full");
}
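The difference between add and offer on a full bounded queue can be reproduced with the JDK alone. The following is a minimal sketch: the class QueueFullDemo and its method names are hypothetical and are not part of the kubernetes client.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical demo class, not library code.
class QueueFullDemo {

    // Fills a bounded queue to capacity, then calls add() once more.
    // Returns the exception message, or "accepted" if add() succeeded.
    static String addWhenFull(int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.add(i);
        }
        try {
            queue.add(capacity); // one element too many
            return "accepted";
        } catch (IllegalStateException e) {
            return e.getMessage();
        }
    }

    // Same setup, but uses offer(), which reports failure with a boolean.
    static boolean offerWhenFull(int capacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.add(i);
        }
        return queue.offer(capacity);
    }

    public static void main(String[] args) {
        System.out.println(addWhenFull(1000));   // prints "Queue full"
        System.out.println(offerWhenFull(1000)); // prints "false"
    }
}
```

With add, a full queue surfaces as an unchecked exception that the caller has to handle. With offer, the caller gets false and can decide what to do with the rejected element.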
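Once the exception is thrown, nothing catches it, so the controller thread of DefaultSharedIndexInformer terminates and no further pod change events are ever received.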
DefaultSharedIndexInformer
public DefaultSharedIndexInformer(
    Class<ApiType> apiTypeClass, ListerWatcher listerWatcher, long resyncPeriod) {
  this.resyncCheckPeriodMillis = resyncPeriod;
  this.defaultEventHandlerResyncPeriod = resyncPeriod;
  this.processor = new SharedProcessor<>();
  this.indexer = new Cache();
  DeltaFIFO<ApiType> fifo = new DeltaFIFO<ApiType>(Cache::metaNamespaceKeyFunc, this.indexer);
  this.controller =
      new Controller<ApiType, ApiListType>(
          apiTypeClass,
          fifo,
          listerWatcher,
          this::handleDeltas,
          processor::shouldResync,
          resyncCheckPeriodMillis);
  controllerThread = new Thread(controller::run);
}
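Why a single uncaught exception stops all later events: a plain Thread whose run method throws simply ends, and nothing restarts it. The sketch below is a hypothetical stand-in for the controller loop, assuming a producer that pushes into a bounded queue that nobody drains. DeadLoopDemo and runLoop are illustrative names, not library code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not library code.
class DeadLoopDemo {

    // Starts a thread that pushes `events` items into a bounded queue with
    // no consumer. Returns how many items were delivered before the thread ended.
    static int runLoop(int capacity, int events) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(capacity);
        AtomicInteger delivered = new AtomicInteger();

        Thread loop = new Thread(() -> {
            for (int i = 0; i < events; i++) {
                queue.add(i); // throws IllegalStateException once the queue is full
                delivered.incrementAndGet();
            }
        });
        // Swallow the stack trace that the default handler would print.
        loop.setUncaughtExceptionHandler((thread, error) -> { });
        loop.start();
        loop.join(); // returns as soon as the thread has ended, normally or not
        return delivered.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runLoop(1000, 1500)); // prints "1000"
    }
}
```

The remaining 500 events are never processed because the loop is gone, which matches the symptom of an informer that silently stops delivering changes.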
The bug report and its fix can be found on GitHub:
https://github.com/kubernetes-client/java/issues/667
https://github.com/kubernetes-client/java/pull/669
The official fix replaces ArrayBlockingQueue with the unbounded LinkedBlockingQueue, and it has already been merged into master.
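The effect of that change can be checked with the JDK classes alone. This is a minimal sketch, not the code from the pull request. UnboundedQueueDemo and its methods are hypothetical names.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical demo class, not library code.
class UnboundedQueueDemo {

    // Adds n elements to a LinkedBlockingQueue created without a capacity
    // argument and returns the resulting queue size.
    static int enqueue(int n) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < n; i++) {
            queue.add(i); // does not throw for any realistic n
        }
        return queue.size();
    }

    // Capacity left in a freshly created queue.
    static int freshCapacity() {
        return new LinkedBlockingQueue<Integer>().remainingCapacity();
    }

    public static void main(String[] args) {
        System.out.println(enqueue(5000));                        // prints "5000"
        System.out.println(freshCapacity() == Integer.MAX_VALUE); // prints "true"
    }
}
```

A LinkedBlockingQueue built with the no-argument constructor has a capacity of Integer.MAX_VALUE, so add no longer fails at 1000 elements. The trade-off is that a slow consumer now shows up as growing memory use instead of an exception.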
Resolution
Upgrading the kubernetes client to 6.0.1 resolved the problem.