Abnormal oldWALs file count caused by an abnormal HBase peer state
Background
HBase Version 1.0.0-CDH5.5
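This cluster is the primary (source) cluster and had one peer cluster configured. After the peer cluster was taken offline, disable_peer had already been run for it.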
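Incident timeline
An HBase RegionServer failed unexpectedly and went down.
The RegionServer could not serve requests, so RPC read/write traffic became abnormal.
Restarting the RegionServer failed with the following exception: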
Feb 22, 12:01:46.497 PM WARN org.apache.hadoop.hbase.coordination.SplitLogManagerCoordination
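Locating the problem
According to the log, the number of files under /hbase/oldWALs had exceeded the HDFS per-directory item limit of 1048576 (the default value of dfs.namenode.fs-limits.max-directory-items). Check the directory: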
hdfs dfs -du -s /hbase/oldWALs
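The directory had grown to 100 TB, and its child files could no longer be enumerated. /hbase/oldWALs is the archive directory for WALs: once all KV data recorded in a WAL file is confirmed to have been flushed from the MemStore to HFiles, the WAL file is moved into this directory. When a peer is enabled, WALs that have not yet been replicated successfully are also kept here.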
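To compare the directory against the item limit, the file count can be read from `hdfs dfs -count`, which prints DIR_COUNT, FILE_COUNT, CONTENT_SIZE and PATHNAME. The sketch below is illustrative only: the helper names `count_files` and `at_or_over_limit` are hypothetical, and 1048576 is assumed to be the limit because it is the HDFS default for dfs.namenode.fs-limits.max-directory-items. On a directory this large the count may take a long time to return.

```shell
# Read one line of `hdfs dfs -count` output on stdin and print FILE_COUNT.
# Assumed column order: DIR_COUNT FILE_COUNT CONTENT_SIZE PATHNAME
count_files() {
  awk '{ print $2 }'
}

# at_or_over_limit FILE_COUNT LIMIT
# Succeeds (exit status 0) when FILE_COUNT has reached LIMIT.
at_or_over_limit() {
  [ "$1" -ge "$2" ]
}

# Usage against a live cluster (assumes the hdfs CLI is on PATH):
#   files=$(hdfs dfs -count /hbase/oldWALs | count_files)
#   at_or_over_limit "$files" 1048576 && echo "oldWALs has hit the item limit"
```

Note that `-count` is recursive, while the limit applies to the direct children of a directory, so the two numbers only match when oldWALs is flat.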
Fixing the problem
Clean up the data under /hbase/oldWALs:
today=`date +'%s'`
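Only the first line of the original cleanup script survives, so the following is not the author's script. It is a minimal sketch of one way to select archived WALs by age, assuming the usual `hdfs dfs -ls` column layout (permissions, replication, owner, group, size, date, time, path). The function name `wals_older_than` and the 7-day cutoff are hypothetical.

```shell
# wals_older_than CUTOFF_DATE
# Read `hdfs dfs -ls` output on stdin and print the paths whose modification
# date is strictly before CUTOFF_DATE (format YYYY-MM-DD).
# ISO dates sort lexicographically, so a string comparison is sufficient.
# Header lines such as "Found N items" have fewer than 8 fields and are skipped.
wals_older_than() {
  awk -v cutoff="$1" 'NF >= 8 && ($6 "") < (cutoff "") { print $8 }'
}

# Usage against a live cluster (GNU date and GNU xargs assumed):
#   cutoff=$(date -d '7 days ago' +%Y-%m-%d)
#   hdfs dfs -ls /hbase/oldWALs | wals_older_than "$cutoff" \
#     | xargs -r -n 500 hdfs dfs -rm -skipTrash
```

Deleting with -skipTrash is irreversible, and removing WALs by hand is only reasonable when no peer still needs them, as was the case here after the peer cluster had been taken offline.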
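Cause
HLog files have a life cycle:
By default, the active HBase Master starts a background thread that runs at the interval set by hbase.master.cleaner.interval (default 1 minute; HDP3 changed it to 1 hour), checks all expired log files under oldWALs, and decides whether they can be deleted.
A file is deleted only when both of the following conditions are met: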
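- The HLog file is no longer participating in, or needed by, master-slave replication (see the logic in ReplicationHFileCleaner.getDeletableFiles).
- The HLog file has been in the oldWALs directory for longer than hbase.master.logcleaner.ttl (default: 10 minutes).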
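The two conditions can be modelled as a single predicate. This is a simplified sketch and not the HBase implementation: the function name `wal_deletable` and its arguments are hypothetical, and the TTL of 600000 ms in the usage line is the 10-minute default expressed in milliseconds.

```shell
# wal_deletable AGE_MS TTL_MS REFERENCED
#   AGE_MS      time the file has spent in oldWALs, in milliseconds
#   TTL_MS      value of hbase.master.logcleaner.ttl, in milliseconds
#   REFERENCED  "yes" if replication still needs the file, anything else if not
# Succeeds (exit status 0) only when both deletion conditions hold.
wal_deletable() {
  [ "$3" != "yes" ] && [ "$1" -gt "$2" ]
}

# Usage:
#   wal_deletable 700000 600000 no  && echo "can be deleted"
#   wal_deletable 700000 600000 yes || echo "kept: still needed by replication"
```

The model makes the failure mode visible: as long as a file still counts as needed for replication, exceeding the TTL alone never makes it deletable, so the directory keeps growing.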
[zk: localhost:2181(CONNECTED) 6] ls /hbase/replication/peers
[]
[zk: localhost:2181(CONNECTED) 7] ls /hbase/replication/rs
[xxxx-hbase01-node244.xxxx.com,60020,1645515180740, xxxx-hbase01-node238.xxxx.com,60020,1645502396855, xxxx-hbase01-node12.xxxx.com,60020,1645502395651, xxxx-hbase01-node68.xxxx.com,60020,1645502396574, xxxx-hbase01-node56.xxxx.com,60020,1645502396748, xxxx-hbase01-node57.xxxx.com,60020,1645502395942, xxxx-hbase01-node66.xxxx.com,60020,1645502396692, xxxx-hbase01-node11.xxxx.com,60020,1645502396175, xxxx-hbase01-node149.xxxx.com,60020,1645502556596, xxxx-hbase01-node64.xxxx.com,60020,1645502398389, xxxx-hbase01-node62.xxxx.com,60020,1645502396268, xxxx-hbase01-node59.xxxx.com,60020,1645501475468, xxxx-hbase01-node240.xxxx.com,60020,1645502396736, xxxx-hbase01-node75.xxxx.com,60020,1645502396501, xxxx-hbase01-node10.xxxx.com,60020,1645502395833, xxxx-hbase01-node63.xxxx.com,60020,1645502397816, xxxx-hbase01-node150.xxxx.com,60020,1645502396480, xxxx-hbase01-node71.xxxx.com,60020,1645502396275, xxxx-hbase01-node49.xxxx.com,60020,1645502396541, xxxx-hbase01-node65.xxxx.com,60020,1645502398568, xxxx-hbase01-node77.xxxx.com,60020,1645502396304, xxxx-hbase01-node54.xxxx.com,60020,1645502395830, xxxx-hbase01-node78.xxxx.com,60020,1645502396183, xxxx-hbase01-node55.xxxx.com,60020,1645502396742, xxxx-hbase01-node50.xxxx.com,60020,1645502396573, xxxx-hbase01-node70.xxxx.com,60020,1645502396859, xxxx-hbase01-node72.xxxx.com,60020,1645502395894, xxxx-hbase01-node242.xxxx.com,60020,1645502396166, xxxx-hbase01-node53.xxxx.com,60020,1645502395834, xxxx-hbase01-node61.xxxx.com,60020,1645502396141, xxxx-hbase01-node51.xxxx.com,60020,1645502396211, xxxx-hbase01-node148.xxxx.com,60020,1645502396044, xxxx-hbase01-node67.xxxx.com,60020,1645502396172, xxxx-hbase01-node58.xxxx.com,60020,1645502398506, xxxx-hbase01-node76.xxxx.com,60020,1645502397094, xxxx-hbase01-node52.xxxx.com,60020,1645502395760, xxxx-hbase01-node73.xxxx.com,60020,1645502396696, xxxx-hbase01-node60.xxxx.com,60020,1645502396503, xxxx-hbase01-node241.xxxx.com,60020,1645502396087, 
xxxx-hbase01-node69.xxxx.com,60020,1645502396658, xxxx-hbase01-node151.xxxx.com,60020,1645502396097, xxxx-hbase01-node239.xxxx.com,60020,1645502396421, xxxx-hbase01-node74.xxxx.com,60020,1645502396215, xxxx-hbase01-node61.xxxx.com,60020,1626235982224]
[zk: localhost:2181(CONNECTED) 8] rmr /hbase/replication/rs
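Each child of /hbase/replication/rs is named host,port,startcode, where the start code is the start timestamp of the RegionServer process. In the listing above, node61 appears twice with very different start codes. As a hedged aid for reviewing such a listing before removing anything, the sketch below prints entries whose start code is older than another entry for the same host and port. The function name `stale_rs_entries` is hypothetical, and whether a flagged entry is really obsolete still has to be confirmed by hand.

```shell
# Read a zkCli `ls /hbase/replication/rs` result (the bracketed list) on stdin
# and print every host,port,startcode entry that has a newer sibling with the
# same host and port.
stale_rs_entries() {
  tr -d '[] ' | awk '
    { buf = buf $0 }                      # join wrapped lines into one string
    END {
      n = split(buf, f, ",")              # fields come in triples
      for (i = 1; i + 2 <= n; i += 3) {
        key = f[i] "," f[i + 1]
        code = f[i + 2] + 0
        if (!(key in max) || code > max[key]) max[key] = code
        hosts[i] = key
        codes[i] = f[i + 2]
      }
      for (i = 1; i + 2 <= n; i += 3)
        if (codes[i] + 0 < max[hosts[i]]) print hosts[i] "," codes[i]
    }'
}

# Usage: paste the bracketed list into a file, then
#   stale_rs_entries < rs_listing.txt
```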