问题
发现线上RocketMQ从节点发现没有流量,主从同步停止。
找到异常日志如下:
broker.log
2020-04-01 18:48:45 INFO BrokerControllerScheduledThread1 - Slave fall behind master: -4123389833770013789 bytes
初以为这个原因导致!
排查
slave落后master怎么可能是负呢?
进一步排查安全部门扫描HA同步端口10912时,发送数据会导致上面异常日志!
但是并不会导致主从同步停止。
进一步查找异常日志,发现storeerror.log 一直在不断的打印下面错误日志!
2020-04-01 18:41:33 WARN WriteSocketService - Offset for /home/xx/rocketmq-broker/store/commitlog/00000005344013058048 not matched. Request offset: 3895535325339, index: -1350, mappedFileSize: 1073741824, mappedFiles count: 61
2020-04-01 18:41:33 WARN WriteSocketService - findMappedFileByOffset failure.
java.lang.ArrayIndexOutOfBoundsException: -1350
at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:387) ~[na:1.8.0_131]
at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:396) ~[na:1.8.0_131]
at org.apache.rocketmq.store.MappedFileQueue.findMappedFileByOffset(MappedFileQueue.java:478) ~[rocketmq-store-4.2.1-SNAPSHOT.jar:na]
at org.apache.rocketmq.store.CommitLog.getData(CommitLog.java:148) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
at org.apache.rocketmq.store.CommitLog.getData(CommitLog.java:143) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
at org.apache.rocketmq.store.DefaultMessageStore.getCommitLogData(DefaultMessageStore.java:807) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
at org.apache.rocketmq.store.ha.HAConnection$WriteSocketService.run(HAConnection.java:271) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
源码分析
HAConnection.java
SelectMappedBufferResult selectResult =
HAConnection.this.haService.getDefaultMessageStore().getCommitLogData(this.nextTransferFromWhere);
//这里selectResult是null导致主从无法同步
if (selectResult != null) {
int size = selectResult.getSize();
if (size > HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig().getHaTransferBatchSize()) {
size = HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig().getHaTransferBatchSize();
}
long thisOffset = this.nextTransferFromWhere;
this.nextTransferFromWhere += size;
selectResult.getByteBuffer().limit(size);
this.selectMappedBufferResult = selectResult;
// Build Header
this.byteBufferHeader.position(0);
this.byteBufferHeader.limit(headerSize);
this.byteBufferHeader.putLong(thisOffset);
this.byteBufferHeader.putInt(size);
this.byteBufferHeader.flip();
this.lastWriteOver = this.transferData();
} else {
HAConnection.this.haService.getWaitNotifyObject().allWaitForRunning(100);
}
selectResult 为null的原因如上面的异常报错,findMappedFileByOffset failure!所以定位到是nextTransferFromWhere这个变量的问题,而这个变量由slave告诉master!
HAConnection.java
if (-1 == this.nextTransferFromWhere) {
if (0 == HAConnection.this.slaveRequestOffset) {
long masterOffset = HAConnection.this.haService.getDefaultMessageStore().getCommitLog().getMaxOffset();
masterOffset =
masterOffset
- (masterOffset % HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig()
.getMapedFileSizeCommitLog());
if (masterOffset < 0) {
masterOffset = 0;
}
this.nextTransferFromWhere = masterOffset;
} else {
this.nextTransferFromWhere = HAConnection.this.slaveRequestOffset;
}
log.info("master transfer data from " + this.nextTransferFromWhere + " to slave[" + HAConnection.this.clientAddr
+ "], and slave request " + HAConnection.this.slaveRequestOffset);
}
在slave机器发现store目录00000003894461595648文件,正好和异常对上!
原因已经很明朗了,slave数据由于某种原因远远落后master,导致主从无法同步!将该文件从store/commitlog移走后,主从同步恢复正常!