Big Data Operations and Maintenance Monitoring System
Monitoring Architecture
- Prometheus cluster splitting, high-availability architecture setup, and configuration planning (see the sketch after this list)
- prometheusManager platform development (monitored-node management, alert rule configuration, external API access)
- Grafana dashboard configuration (dashboards for hosts and core components)
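As a minimal sketch of what the per-cluster split and HA pairing could look like: two identical Prometheus replicas scrape the same targets and differ only in an external label. The job names, file paths, and label values below are assumptions for illustration, not the actual deployment.

```yaml
# prometheus.yml for one replica of the "kafka" cluster pair
global:
  scrape_interval: 30s
  external_labels:
    cluster: kafka          # which split this Prometheus instance covers (assumed label)
    replica: A              # the HA twin runs the same config with replica: B
scrape_configs:
  - job_name: kafka-jmx
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/kafka-*.yml   # target lists generated by prometheusManager (assumed path)
  - job_name: node
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/node-*.yml
```

The replica label distinguishes the two twins at query time; for alerting it is typically stripped with alert_relabel_configs so that the pair's duplicate alerts deduplicate in Alertmanager.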
Monitoring Parameters for the Big Data Clusters
- Analyze and scrape core metrics; configure filtering of secondary metrics (see the sketch after this list)
- Port configuration assessment
- Version testing and evaluation of jmx_exporter and the Prometheus exporters
- Test the impact of onboarding monitoring on the clusters
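One way the secondary-metric filtering could be expressed is with Prometheus metric relabelling at scrape time. The job name, target, and regex below are placeholders, not the real configuration:

```yaml
scrape_configs:
  - job_name: kafka-jmx
    static_configs:
      - targets: ['kafka-broker-01:7071']       # assumed broker host and jmx_exporter port
    metric_relabel_configs:
      # drop metric families that are not in the per-component core-metric table
      - source_labels: [__name__]
        regex: 'kafka_server_fetcherlagmetrics_.*|jvm_classes_.*'
        action: drop
```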
Deployment and Rollout
- jmx_exporter can be started in Java agent mode, so no JMX remote-export settings need to be configured (see the sketch after this list)
- Distribute the JMX monitoring configuration to all cluster nodes (node information is already under version control) (Ansible script + jar package)
- Configure monitoring parameters per cluster role
  - CDH cluster: configure jmx_prometheus
  - HDP cluster: configure jmx_prometheus
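A minimal sketch of the agent-mode setup: the jar and a per-component rules file are pushed to each node, and the service's JVM options gain a -javaagent flag. The paths, port, and the single rule below are assumptions for illustration.

```yaml
# /opt/monitor/kafka-jmx.yml — jmx_exporter rules file (illustrative)
lowercaseOutputName: true
rules:
  - pattern: 'kafka.server<type=ReplicaManager, name=(PartitionCount|LeaderCount)><>Value'
    name: kafka_server_replicamanager_$1
    type: GAUGE
# The broker is then started with the agent attached, e.g. appended to KAFKA_OPTS:
#   -javaagent:/opt/monitor/jmx_prometheus_javaagent.jar=7071:/opt/monitor/kafka-jmx.yml
```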
Restart
Following each cluster's SOP, perform a phased (gray-release) restart of every type of service node in each cluster (a rolling-restart sketch follows).
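Driven by Ansible, the gray-release restart could look roughly like the following; the inventory group, service name, and health-check port are assumptions:

```yaml
# rolling-restart.yml — restart one node at a time and wait for it to come back
- hosts: kafka_brokers          # assumed inventory group
  serial: 1                     # gray release: only one node is touched at a time
  tasks:
    - name: Restart the broker with the jmx agent enabled
      ansible.builtin.service:
        name: kafka
        state: restarted
    - name: Wait until the jmx_exporter endpoint answers before moving on
      ansible.builtin.wait_for:
        port: 7071
        timeout: 300
```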
Key Monitoring Metrics per Component

| Component | Metric | Description | Alert? | Alert level (warning / major / critical) | Threshold | Duration (min) | Recipients | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| kafka | Offline Brokers | Offline brokers | √ | major | | | | |
| | Offline Partitions | Offline partitions | √ | major | | | | |
| | kafka_log_log_value | Topic data volume | | | | | | |
| | broker GC | Broker GC time | | | | | | |
| | process_cpu_seconds_total | Process CPU usage | | | | | | |
| | jvm_memory_bytes_used | Process JVM memory usage | | | | | | |
| | kafka_server_replicamanager_value | Partitions per broker | | | | | | |
| | kafka_server_replicamanager_value | Leaders per broker | | | | | | |
| | kafka_server_brokertopicmetrics_oneminuterate | Cluster data produce/consume rate | | | | | | |
| | | Broker data produce/consume rate | | | | | | |
| | | Topic data produce/consume rate | | | | | | |
| | | Topic message produce rate | | | | | | |
| | | Topic message size | √ | warning | | | | |
| | kafka_server_zookeeperclientmetrics_count | ZooKeeper request latency | | | | | | |
| | kafka_burrow_partition_lag | Consumer lag | √ | | | | | Business owners configure this themselves via the dip interface |
| | memory | Node memory | | | | | | |
| | cpu | Node CPU | | | | | | |
| | disk | Disk capacity | | | | | | |
| | | Disk I/O | √ | major | | | | |
| | error_log | Log alerts | √ | warning | | | | |
| zookeeper | status | Node status | √ | major | | | | |
| | NumAliveConnections | Connection count | | | | | | |
| | AvgRequestLatency | Request latency | | | | | | |
| | GC | GC time | | | | | | |
| hdfs | status | Node process | √ | major | to standby or to null | | | Alert fires on state transitions |
| | safe mode | Safe mode | | | | | | |
| | capacity | Cluster disk capacity | | | | | | |
| | corrupt_blocks | Corrupt blocks | | | | | | |
| | missing_blocks | Missing blocks | | | | | | |
| | under_replicated_blocks | Under-replicated blocks | | | | | | |
| | volume_failures_total | Failed volumes | | | | | | |
| | num_live_data_nodes | Live DataNodes | | | | | | |
| | num_dead_data_nodes | Dead DataNodes | | | | | | |
| | num_stale_data_nodes | DataNodes marked stale due to heartbeat delay | | | | | | |
| | blocks_total | Allocated blocks | | | | | | |
| | block_capacity | Maximum allocatable blocks | | | | | | |
| | files_total | Total files | | | | | | |
| | pending_replication_blocks | | | | | | | |
| | pending_deletion_blocks | | | | | | | |
| | under_replicated_blocks | | | | | | | |
| | scheduled_replication_blocks | | | | | | | |
| | excess_blocks | | | | | | | |
| | syncs60s99thpercentilelatencymicros | | | | | | | |
| | gc_time_millis | | | | | | | |
| | num_open_connections | | | | | | | |
| | rpc_queue_time_avg_time | | √ | warning | 10s | | | Escalate to major if >10s |
| | rpc_processing_time_avg_time | | √ | warning | 5s | | | |
| | threads_runnable | | | | | | | |
| | threads_waiting | | | | | | | |
| | block_received_and_deleted_ops | | | | | | | |
| | block_received_and_deleted_avg_time | | √ | warning | 10s | | | Escalate to major if >10s |
| | transactions_since_last_checkpoint | | √ | warning | 2 million | | | Escalate to major if >2 million |
| yarn | status | Node status | | | | | | |
| | numactivenms | Active NodeManagers | | | | | | |
| | numunhealthynms | | | | | | | |
| | numlostnms | | | | | | | |
| | availablevcores | | | | | | | |
| | allocatedvcores | | | | | | | |
| | availablemb | | | | | | | |
| | allocatedmb | | | | | | | |
| | - | Cluster total CPU usage | | | | | | |
| | - | Cluster total memory usage | | | | | | |
| | appsrunning | | | | | | | |
| | appskilled | | | | | | | |
| | appsfailed | | | | | | | |
| | pendingcontainers | | | | | | | |
| hbase | status | Node status | √ | critical | | 2 | | |
| | ritCount | Regions in transition (abnormal regions) | √ | major | | 5 | | |
| | hlogFileCount | HLog file count | | | | | | |
| | hlogFileSize | HLog size | | | | | | |
| | storeFileCount | StoreFile count | | | | | | |
| | storeFileSize | StoreFile size | | | | | | |
| | flushQueueLength | Flush queue length | √ | warning | >500 | 10 | | |
| | compactionQueueLength | Compaction queue length | √ | warning | >1000 | 10 | | |
| | splitQueueLength | Split queue length | √ | warning | >100 | 10 | | |
| | blockCacheHitPercent | BlockCache hit rate | | | | | | |
| | Mutate_num_ops | Put count | | | | | | |
| | ScanNext_num_ops | Scan count | | | | | | |
| | Get_num_ops | Get count | | | | | | |
| | slowGetCount | Slow get count | √ | major | >100 | 2 | | |
| | slowPutCount | Slow put count | √ | major | >1000 | 10 | | |
| | totalRequestCount | Total request count | | | | | | |
| | percentFilesLocal | Data locality rate | √ | major | <90% | 60 | | |
| | Get_99th_percentile | Get latency | √ | major | >5000ms | 2 | | |
| | Scan_99th_percentile | Scan latency | | | | | | |
| | Mutate_99th_percentile | Put latency | √ | warning | >5000ms | 10 | | |
| | FlushTime_99th_percentile | Flush time | | | | | | |
| | SplitTime_99th_percentile | Split time | | | | | | |
| | numOpenConnections | Connection count | | | | | | |
| | TotalCallTime_99th_percentile | Response time | √ | major | >5000ms | 2 | | |
| | ProcessCallTime_99th_percentile | Processing time | √ | major | >5000ms | 2 | | |
| | MemHeapMaxM | Maximum heap memory | | | | | | |
| | MemHeapUsedM | Used heap memory | √ | major | >80% | 5 | | |
| | ThreadsRunnable | Runnable thread count | | | | | | |
| | GcCount | GC count | | | | | | |
| | GcTimeMillis | GC time | | | | | | |
| flume | flume_source_metrics_EventAcceptedCount | | | | | | | |
| | flume_channel_metrics_ChannelSize | | | | | | | |
| | flume_source_metrics_EventReceivedCount | | | | | | | |
| spark | jvm_memory_usage | JVM memory usage | | | | | | |
| | jvm_gcTime_count | JVM GC count | | | | | | |
| | filesystem_usage | HDFS file read/write rate | | | | | | |
| | jvm_memory_pools | JVM memory pool usage | | | | | | |
| | executor_tasks | Tasks completed per executor | | | | | | |
| flink | flink_jobmanager_job_fullRestarts | Job restart count | | | | | | |
| | flink_taskmanager_job_task_operator_KafkaConsumer_records_consumed_rate | Kafka consumption rate | | | | | | |
| | flink_jobmanager_Status_JVM_Memory_Direct_Count | The number of buffers in the direct buffer pool | | | | | | |
| | flink_jobmanager_Status_JVM_Memory_Direct_TotalCapacity | The total capacity of all buffers in the direct buffer pool (in bytes) | | | | | | |
| | flink_taskmanager_Status_JVM_Memory_Direct_Count | The number of buffers in the direct buffer pool | | | | | | |
| | flink_taskmanager_Status_JVM_Memory_Direct_TotalCapacity | The total capacity of all buffers in the direct buffer pool (in bytes) | | | | | | |
| | flink_jobmanager_Status_JVM_Memory_Heap_Used | The amount of heap memory currently used (in bytes) | | | | | | |
| | flink_jobmanager_Status_JVM_Memory_Heap_Max | The maximum amount of heap memory that can be used (in bytes); may differ from -Xmx because some GC algorithms allocate heap memory that is not exposed through the heap metrics | | | | | | |
| | flink_taskmanager_Status_JVM_Memory_Heap_Used | The amount of heap memory currently used (in bytes) | | | | | | |
| | flink_taskmanager_Status_JVM_Memory_Heap_Max | The maximum amount of heap memory that can be used (in bytes); may differ from -Xmx because some GC algorithms allocate heap memory that is not exposed through the heap metrics | | | | | | |
| | flink_jobmanager_Status_JVM_CPU_Load | The recent CPU usage of the JVM | | | | | | |
| | flink_taskmanager_Status_JVM_CPU_Load | The recent CPU usage of the JVM | | | | | | |
| | flink_jobmanager_job_numberOfCompletedCheckpoints | Completed checkpoints for the job | | | | | | |
| | flink_jobmanager_job_numberOfInProgressCheckpoints | In-progress checkpoints for the job | | | | | | |
| | flink_jobmanager_job_lastCheckpointSize | Size of the job's last checkpoint | | | | | | |
| | flink_jobmanager_job_lastCheckpointDuration | Duration of the job's last checkpoint | | | | | | |
| azkaban | executor status | | | | | | | |
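The alertable rows above ultimately become Prometheus alerting rules. A minimal sketch for two of the Kafka rows follows; the offline-partition metric name and the lag threshold are assumptions that depend on the jmx_exporter rule mapping and on what the business owners configure.

```yaml
groups:
  - name: kafka-core
    rules:
      - alert: KafkaOfflinePartitions
        # metric name as produced by an assumed jmx_exporter rule mapping
        expr: kafka_controller_kafkacontroller_offlinepartitionscount > 0
        for: 1m
        labels:
          severity: major
        annotations:
          summary: "Offline partitions on {{ $labels.instance }}"
      - alert: KafkaConsumerLag
        # kafka_burrow_partition_lag is listed in the table; the threshold here is illustrative
        expr: sum by (group, topic) (kafka_burrow_partition_lag) > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Consumer group {{ $labels.group }} is lagging on {{ $labels.topic }}"
```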
Alert Configuration
The core of alerting is alert analysis.
- Alert platform:
  Alerts coming in from the various systems are first suppressed using filtering rules preconfigured by operators → the filtered alerts are then suppressed again by tier → displayed, handled, and dispatched as notifications.
Alert suppression (a routing sketch follows this list):
1. Filter out useless alerts (e.g. alerts caused by system changes or other non-physical faults)
2. Filter out minor alerts (e.g. alerts with a low severity level or very low risk)
3. Filter major alerts according to the rules
4. Notification messages triggered by the filtered alerts are held back by the system in a message queue
5. They are only sent out when the event fires again or a new dispatch plan is generated
6. Alerts that survive filtering are immediately sent to the relevant operators via SMS, email, IM, tickets, etc. so the fault can be handled
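In Alertmanager terms, the rule-based and tiered suppression above could be sketched roughly as follows; the receiver names and the source label used to mark change-window alerts are assumptions:

```yaml
route:
  receiver: ops-oncall
  group_by: ['alertname', 'cluster']
  routes:
    # rule 1: alerts raised during planned change windows go to a silent receiver
    - matchers: ['source = "change-window"']
      receiver: blackhole
    # rules 2/4: low-severity alerts are queued for later review instead of paging
    - matchers: ['severity = "warning"']
      receiver: review-queue
inhibit_rules:
  # tiered suppression: a critical alert on a node mutes major/warning alerts on the same node
  - source_matchers: ['severity = "critical"']
    target_matchers: ['severity =~ "major|warning"']
    equal: ['instance']
receivers:
  - name: ops-oncall
  - name: blackhole        # no integrations: notifications are dropped
  - name: review-queue
```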
Alert analysis:
1. Aggregate, tally, and analyze the collected alerts at configured time intervals
2. Summarize the analysis results into cases and use them as samples for predictive alerting
Alert handling and push:
1. Classify the alerts that remain after filtering, reassemble the information, compress and package it, extract summaries, and push the alert information
2. Call each channel's interface and send the alert information in batches (see the dispatch sketch below)
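Batching per channel maps naturally onto Alertmanager grouping plus one receiver per channel; the gateway URLs below are placeholders:

```yaml
route:
  receiver: im-webhook
  group_wait: 30s        # hold briefly so related alerts go out as one batch
  group_interval: 5m     # at most one batch per group every five minutes
  repeat_interval: 4h
  routes:
    - matchers: ['severity = "critical"']
      receiver: sms-gateway
receivers:
  - name: im-webhook
    webhook_configs:
      - url: http://alert-gateway.internal/api/v1/im      # assumed IM push endpoint
  - name: sms-gateway
    webhook_configs:
      - url: http://alert-gateway.internal/api/v1/sms     # assumed SMS push endpoint
```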
Failure prediction techniques:
1. Prediction based on historical data (see the sketch after this list)
2. Prediction based on state monitoring
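A common historical-data-based prediction is linear extrapolation over recent samples. A minimal sketch using node_exporter disk metrics follows; the 6h window and 24h horizon are assumptions:

```yaml
groups:
  - name: capacity-prediction
    rules:
      - alert: DiskWillFillWithin24h
        # extrapolate the last 6h of free-space samples 24h into the future
        expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[6h], 24 * 3600) < 0
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Filesystem {{ $labels.mountpoint }} on {{ $labels.instance }} is predicted to fill within 24h"
```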