RocketMQ主从无法同步问题排查

问题

发现线上RocketMQ从节点发现没有流量,主从同步停止。

找到异常日志如下:

broker.log


2020-04-01 18:48:45 INFO BrokerControllerScheduledThread1 - Slave fall behind master: -4123389833770013789 bytes

初以为这个原因导致!

排查

slave落后master怎么可能是负呢?

进一步排查安全部门扫描HA同步端口10912时,发送数据会导致上面异常日志!

但是并不会导致主从同步停止。

进一步查找异常日志,发现storeerror.log 一直在不断的打印下面错误日志!


2020-04-01 18:41:33 WARN WriteSocketService - Offset for /home/xx/rocketmq-broker/store/commitlog/00000005344013058048 not matched. Request offset: 3895535325339, index: -1350, mappedFileSize: 1073741824, mappedFiles count: 61
2020-04-01 18:41:33 WARN WriteSocketService - findMappedFileByOffset failure.
java.lang.ArrayIndexOutOfBoundsException: -1350
	at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:387) ~[na:1.8.0_131]
	at java.util.concurrent.CopyOnWriteArrayList.get(CopyOnWriteArrayList.java:396) ~[na:1.8.0_131]
	at org.apache.rocketmq.store.MappedFileQueue.findMappedFileByOffset(MappedFileQueue.java:478) ~[rocketmq-store-4.2.1-SNAPSHOT.jar:na]
	at org.apache.rocketmq.store.CommitLog.getData(CommitLog.java:148) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
	at org.apache.rocketmq.store.CommitLog.getData(CommitLog.java:143) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
	at org.apache.rocketmq.store.DefaultMessageStore.getCommitLogData(DefaultMessageStore.java:807) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
	at org.apache.rocketmq.store.ha.HAConnection$WriteSocketService.run(HAConnection.java:271) [rocketmq-store-4.2.1-SNAPSHOT.jar:na]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]


源码分析

HAConnection.java


                    SelectMappedBufferResult selectResult =
                        HAConnection.this.haService.getDefaultMessageStore().getCommitLogData(this.nextTransferFromWhere);
                   //这里selectResult是null导致主从无法同步
                    if (selectResult != null) {
                        int size = selectResult.getSize();
                        if (size > HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig().getHaTransferBatchSize()) {
                            size = HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig().getHaTransferBatchSize();
                        }

                        long thisOffset = this.nextTransferFromWhere;
                        this.nextTransferFromWhere += size;

                        selectResult.getByteBuffer().limit(size);
                        this.selectMappedBufferResult = selectResult;

                        // Build Header
                        this.byteBufferHeader.position(0);
                        this.byteBufferHeader.limit(headerSize);
                        this.byteBufferHeader.putLong(thisOffset);
                        this.byteBufferHeader.putInt(size);
                        this.byteBufferHeader.flip();

                        this.lastWriteOver = this.transferData();
                    } else {

                        HAConnection.this.haService.getWaitNotifyObject().allWaitForRunning(100);
                    }

selectResult 为null的原因如上面的异常报错,findMappedFileByOffset failure!所以定位到是nextTransferFromWhere这个变量的问题,而这个变量由slave告诉master!

HAConnection.java

                    if (-1 == this.nextTransferFromWhere) {
                        if (0 == HAConnection.this.slaveRequestOffset) {
                            long masterOffset = HAConnection.this.haService.getDefaultMessageStore().getCommitLog().getMaxOffset();
                            masterOffset =
                                masterOffset
                                    - (masterOffset % HAConnection.this.haService.getDefaultMessageStore().getMessageStoreConfig()
                                    .getMapedFileSizeCommitLog());

                            if (masterOffset < 0) {
                                masterOffset = 0;
                            }

                            this.nextTransferFromWhere = masterOffset;
                        } else {
                            this.nextTransferFromWhere = HAConnection.this.slaveRequestOffset;
                        }

                        log.info("master transfer data from " + this.nextTransferFromWhere + " to slave[" + HAConnection.this.clientAddr
                            + "], and slave request " + HAConnection.this.slaveRequestOffset);
                    }

在slave机器发现store目录00000003894461595648文件,正好和异常对上!

原因已经很明朗了,slave数据由于某种原因远远落后master,导致主从无法同步!将该文件从store/commitlog移走后,主从同步恢复正常!

0 0 投票数
Article Rating
订阅评论
提醒
guest
0 评论
最旧
最新 最多投票
内联反馈
查看所有评论
0
希望看到您的想法,请您发表评论x