Monitoring LSI RAID controllers with Zabbix 4
Linux side
Prerequisites: a working Zabbix platform, and MegaCli installed at /opt/MegaRAID/MegaCli/MegaCli64
Environment: CentOS 6.10
Zabbix 4
Features
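- Automatic disk discovery (a newly attached disk is picked up and monitored automatically)
- Monitor physical bad sectors on each disk
- Monitor logical bad sectors on each disk
- Monitor each disk's predictive failure count (the most important indicator Dell servers use to decide whether a disk is failing)
- Monitor each disk's state
- Monitor the RAID level state and alert as soon as the array degrades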
Thresholds (values within these limits are treated as healthy)
- Media Error Count on Every Disk <= 30
- Other Error Count on Every Disk <= 1000
- Predictive Failure Count on Every Disk <= 2
- Firmware State on Every Disk != Unconfigured(bad), Failed
- Raid Level State != Degraded
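The thresholds above can be expressed as a small shell check. This is only a sketch: the function name `check_disk_thresholds` and its argument order are hypothetical and not part of the original setup, and the limits are simply the values listed above.

```shell
# Hypothetical helper: compare one disk's counters against the thresholds above.
# Usage: check_disk_thresholds MEDIA_ERRORS OTHER_ERRORS PREDICTIVE_FAILURES FIRMWARE_STATE
check_disk_thresholds() {
    local mec="$1" oec="$2" pfc="$3" state="$4"
    local problems=""

    # Each counter is only a problem once it exceeds its limit
    [ "$mec" -gt 30 ]   && problems="$problems media_error_count=$mec"
    [ "$oec" -gt 1000 ] && problems="$problems other_error_count=$oec"
    [ "$pfc" -gt 2 ]    && problems="$problems predictive_failure_count=$pfc"

    # The firmware state is a string, so match the two bad states by substring
    case "$state" in
        *"Unconfigured(bad)"*|*Failed*) problems="$problems firmware_state=$state" ;;
    esac

    if [ -n "$problems" ]; then
        echo "ALERT:$problems"
        return 1
    fi
    echo "OK"
}

check_disk_thresholds 0 12 0 "Online, Spun Up"   # prints OK
```

In the real setup these comparisons live in the Zabbix triggers, not in a script; the function is only meant to make the limits concrete.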
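Create the data collection script
For safety and convenience, create a zabbix folder under /home to hold the cached data, and make the zabbix user its owner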
mkdir /home/zabbix
chown zabbix:zabbix /home/zabbix
mkdir /home/zabbix/tmp
chown zabbix:zabbix /home/zabbix/tmp
The script is saved as /home/zabbix/diskcheck_megacli.sh here; any location you prefer will do
#!/bin/bash
# Zabbix disk information monitoring script
# By xiangjunyu 20151101
TEMP_DIR="/home/zabbix"
# Collect the relevant fields for every physical disk
sudo /opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0 | grep -Ei '(Slot Number|Media Error Count|Other Error Count|Predictive Failure Count|Raw Size|Firmware state)' | sed -e "s:\[0x.*Sectors\]::g" > "${TEMP_DIR}/tmp/pdinfo.txt"
# Split the output into one 6-line file per disk so each disk can be analysed separately
split -l 6 -d "${TEMP_DIR}/tmp/pdinfo.txt" "${TEMP_DIR}/tmp/pdinfo"
# Get the number of physical disks reported by the controller
PDNUM=$(sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDGetNum -aAll | grep Physical | awk '{ print $8 }')
# Normalize the chunk file names (pdinfo00 -> pdinfo0, pdinfo01 -> pdinfo1, ...)
for((i=0;i<${PDNUM};i++))
do
    mv "${TEMP_DIR}/tmp/pdinfo0${i}" "${TEMP_DIR}/tmp/pdinfo${i}" >/dev/null 2>&1
done
# Second argument: the slot number to report on
SLOT_NUM=$2
# Parse the per-disk file for slot ${SLOT_NUM} into name/value variables
DATAFORMATE()
{
    while read LINE
    do
        if [[ ${LINE} == Slot* ]]
        then
            SLOTNUMNAME=$(echo ${LINE} | awk -F: '{ print $1 }')
            SLOTNUM=$(echo ${LINE} | awk -F: '{ print $2 }')
        elif [[ ${LINE} == Media* ]]
        then
            MECNAME=$(echo ${LINE} | awk -F: '{ print $1 }')
            MEC=$(echo ${LINE} | awk -F: '{ print $2 }')
        elif [[ ${LINE} == Other* ]]
        then
            OECNAME=$(echo ${LINE} | awk -F: '{ print $1 }')
            OEC=$(echo ${LINE} | awk -F: '{ print $2 }')
        elif [[ ${LINE} == Predictive* ]]
        then
            PFCNAME=$(echo ${LINE} | awk -F: '{ print $1 }')
            PFC=$(echo ${LINE} | awk -F: '{ print $2 }')
        elif [[ ${LINE} == Raw* ]]
        then
            RAWNAME=$(echo ${LINE} | awk -F: '{ print $1 }')
            SIZE=$(echo ${LINE} | awk -F: '{ print $2 }')
        elif [[ ${LINE} == Firmware* ]]
        then
            FIRMWARENAME=$(echo ${LINE} | awk -F: '{ print $1 }')
            FIRMWARESTATUS=$(echo ${LINE} | awk -F: '{ print $2 }')
        fi
    done <${TEMP_DIR}/tmp/pdinfo${SLOT_NUM}
}
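# Check the RAID level state: print -1 if any logical drive is Degraded, otherwise 0
CHECKRAIDLEVEL()
{
    # grep -q suppresses the matching line so that only the numeric result is printed
    if sudo /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL | grep -q Degraded
    then
        echo -1
    else
        echo 0
    fi
}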
# First argument: which value to report
OPTION=$1
case $OPTION in
    mec)
        DATAFORMATE
        echo ${MEC}
        ;;
    oec)
        DATAFORMATE
        echo ${OEC}
        ;;
    pfc)
        DATAFORMATE
        echo ${PFC}
        ;;
    firm)
        DATAFORMATE
        # The parsed value keeps the space after the colon, so match by substring
        if [[ "${FIRMWARESTATUS}" == *"Unconfigured(bad)"* || "${FIRMWARESTATUS}" == *"Failed"* ]]
        then
            echo -1
        else
            echo 0
        fi
        ;;
    rdlevel)
        CHECKRAIDLEVEL
        ;;
    *)
        echo 'Please select option: mec <slot_num>; oec <slot_num>; pfc <slot_num>; firm <slot_num>; rdlevel'
        ;;
esac
# Remove the temporary per-disk files
rm -rf ${TEMP_DIR}/tmp/pdinfo*
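The script above assumes that the N-th 6-line chunk belongs to slot N. A possible alternative, sketched here, selects a disk by its "Slot Number" line instead and needs no temporary files. The function name `pd_field` is hypothetical, and the sample text is illustrative only: it is modelled on the field names the script greps for, not captured from a real controller.

```shell
# Hypothetical helper: print one field of one disk from PDList-style text on stdin.
# Usage: pd_field SLOT "Field Name"
pd_field() {
    awk -F': *' -v slot="$1" -v field="$2" '
        $1 == "Slot Number" { current = $2 }               # remember which disk we are in
        current == slot && $1 == field { print $2; exit }  # first match for that disk
    '
}

# Illustrative sample input (assumed layout, two disks)
sample='Slot Number: 0
Media Error Count: 0
Other Error Count: 3
Predictive Failure Count: 0
Firmware state: Online, Spun Up
Slot Number: 1
Media Error Count: 7
Other Error Count: 0
Predictive Failure Count: 1
Firmware state: Failed'

printf '%s\n' "$sample" | pd_field 1 "Media Error Count"   # prints 7
```

On a real host the input would presumably come from `sudo /opt/MegaRAID/MegaCli/MegaCli64 -Pdlist -a0` piped into the function; that wiring is untested here.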
Make the script executable
chmod +x /home/zabbix/diskcheck_megacli.sh
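Zabbix agent configuration
Allow the zabbix user to run MegaCli as root
visudo
Add the line: zabbix ALL=(root) NOPASSWD: /opt/MegaRAID/MegaCli/MegaCli64
If the sudoers file contains Defaults requiretty, also add Defaults:zabbix !requiretty, because the agent runs the script without a terminal
Press Esc
Type :wq to save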
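Edit the Zabbix agent configuration file
vim /etc/zabbix/zabbix_agentd.conf
# Add this line
UnsafeUserParameters=1
# This line should already be present by default: Include=/etc/zabbix/zabbix_agentd.conf.d/
Create the configuration file for the custom items
Save it as /etc/zabbix/zabbix_agentd.conf.d/disk.conf (use the directory your Include line actually points to)
# Disk auto-discovery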
UserParameter=raid.pd.discovery,sudo /opt/MegaRAID/MegaCli/MegaCli64 -PDlist -aAll -NoLog | grep Slot | awk 'BEGIN{printf "{\"data\":["} {if(NR>1) printf ","; printf "\n{\"{#SLOT_NUM}\":\"%s\"}", $NF} END{printf "\n]}\n"}' | sed '/^,$/d'
# Collect Media Error Count
UserParameter=raid.phy.mec[*],/home/zabbix/diskcheck_megacli.sh mec $1
# Collect Other Error Count
UserParameter=raid.phy.oec[*],/home/zabbix/diskcheck_megacli.sh oec $1
# Collect Predictive Failure Count
UserParameter=raid.phy.pfc[*],/home/zabbix/diskcheck_megacli.sh pfc $1
# Check the disk state; returns -1 if the disk has failed
UserParameter=raid.phy.firms[*],/home/zabbix/diskcheck_megacli.sh firm $1
# Check the RAID level state; returns -1 if the array is degraded
UserParameter=raid.level.state,/home/zabbix/diskcheck_megacli.sh rdlevel
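The discovery key has to return JSON of the form {"data":[{"{#SLOT_NUM}":"0"}, ...]}, which is hard to read as an inline awk one-liner. The same idea can be sketched as a standalone function; `slots_to_lld` is a hypothetical name, and it only assumes that each disk contributes one line starting with "Slot Number".

```shell
# Hypothetical helper: turn "Slot Number: N" lines on stdin into Zabbix LLD JSON.
slots_to_lld() {
    awk '
        BEGIN { printf "{\"data\":[" }
        /^Slot Number/ {
            # Emit a comma before every entry except the first
            printf "%s{\"{#SLOT_NUM}\":\"%s\"}", (n++ ? "," : ""), $NF
        }
        END { print "]}" }
    '
}

printf 'Slot Number: 0\nSlot Number: 1\n' | slots_to_lld
# prints {"data":[{"{#SLOT_NUM}":"0"},{"{#SLOT_NUM}":"1"}]}
```

With no matching lines the function prints {"data":[]}, which makes an empty result (for example when sudo is refused) easy to spot.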
Restart the Zabbix agent
service zabbix-agent restart
Test value retrieval
zabbix_agentd -t raid.phy.mec[0]
zabbix_agentd -t raid.pd.discovery
Sample output (the first line comes from a query against slot 3):
raid.phy.mec[3] [t|0]
[root@localhost zabbix_agentd.d]# zabbix_agentd -t raid.pd.discovery
raid.pd.discovery [t|{"data":[
{"{#SLOT_NUM}":"0"},
{"{#SLOT_NUM}":"1"},
{"{#SLOT_NUM}":"2"},
{"{#SLOT_NUM}":"3"}
]}]
A value of 0 means the disk is healthy; if an error comes back instead, investigate it case by case
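When checking several slots by hand it can help to strip the test output down to the bare value. The following is a minimal sketch, assuming the single-line "key [t|VALUE]" layout shown in the sample output; `agent_test_value` is a hypothetical name.

```shell
# Hypothetical helper: extract VALUE from a "key [t|VALUE]" line on stdin.
# Lines without a [t|...] part produce no output.
agent_test_value() {
    sed -n 's/^.*\[t|\(.*\)\]$/\1/p'
}

echo 'raid.phy.mec[3] [t|0]' | agent_test_value   # prints 0
```

On the agent host this would be used as `zabbix_agentd -t 'raid.phy.mec[3]' | agent_test_value`. It is not suitable for the multi-line discovery output.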
Zabbix server configuration
Create the template
Name it Check Raid By MegaCli

Create the discovery rule
Create a new Discovery rule in the template
Name: Physical disk discovery
Type: Zabbix agent (active)
Key: raid.pd.discovery
Update interval (in sec): 3600
Keep lost resources period (in days): 30
Description: Find physical disk
Enabled: ✔

Also add a {#SLOT_NUM} entry under Filters; it is the macro emitted by the discovery key defined in disk.conf

Create the item prototypes
Create the item prototypes under the Physical disk discovery rule
-
Media Error Count On Slot {#SLOT_NUM}
Name: Media Error Count On Slot $1
Type: Zabbix agent (active)
Key: raid.phy.mec[{#SLOT_NUM}] # make sure this key matches the one in disk.conf
Applications: MegaRaid # create this Application yourself
Enabled: ✔
-
Other Error Count On Slot {#SLOT_NUM}
-
Predictive Failure Count On Slot {#SLOT_NUM}
-
Firmware State On Slot {#SLOT_NUM}
-
Raid Level State (key raid.level.state, which takes no slot parameter)

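Create the trigger prototypes
Create the trigger prototypes under the Physical disk discovery rule
Name: {HOST.NAME} RAID array SLOT {#SLOT_NUM} Firmware State error
Expression: {Check Raid By MegaCli:raid.phy.firms[{#SLOT_NUM}].last()}<>0
Severity: High
Follow the same pattern for the other items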
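For the remaining triggers, the expressions can be derived mechanically from the item keys and the thresholds listed earlier. The snippet below is a sketch that merely prints candidate expressions in the same syntax as the Firmware State trigger; the helper name `trigger_expr` is hypothetical, and the comparison values are taken from the threshold list rather than from a tested template.

```shell
# Template name as used in this article
TEMPLATE="Check Raid By MegaCli"

# Hypothetical helper: build "{<template>:<key>.last()}<op><value>"
trigger_expr() {
    printf '{%s:%s.last()}%s%s\n' "$TEMPLATE" "$1" "$2" "$3"
}

# Per-disk trigger prototypes (fire when a value leaves the healthy range)
trigger_expr 'raid.phy.mec[{#SLOT_NUM}]'   '>'  30
trigger_expr 'raid.phy.oec[{#SLOT_NUM}]'   '>'  1000
trigger_expr 'raid.phy.pfc[{#SLOT_NUM}]'   '>'  2
trigger_expr 'raid.phy.firms[{#SLOT_NUM}]' '<>' 0
# Array-wide trigger (the key has no slot parameter)
trigger_expr 'raid.level.state'            '<>' 0
```

Each printed line can be pasted into the Expression field of the corresponding trigger.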


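Troubleshooting
The discovery rule reports an error
Cannot send request
Wrong discovery rule type. The request cannot be sent.
Value should be a JSON object.

Format error
On the server, check what the key returns with zabbix_get -s <agent IP> -k <key>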
zabbix_get -s xxx -k raid.pd.discovery
[root@localhost ~]# zabbix_get -s 192 -k raid.pd.discovery
{"data":[
]}
[root@localhost ~]# zabbix_get -s 192 -k raid.phy.mec[0]
/home/zabbix/diskcheck_megacli.sh: line 5: /var/lib/zabbix/.bash_profile: No such file or directory
/home/zabbix/diskcheck_megacli.sh: line 21: ((: i<: syntax error: operand expected (error token is "<")
/home/zabbix/diskcheck_megacli.sh: line 28: /home/zabbix/tmp/pdinfo0: No such file or directory
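No data comes back, most likely because the zabbix user lacks the required permissions
On the agent machine, run the script with sudo -u zabbix to check

MegaCli only works when run as root; adding the sudoers entry for it fixes the problem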
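The empty {"data":[ ]} result shown above is easy to miss by eye. A quick check can be sketched as follows, assuming the discovery output is piped in on stdin; `lld_has_entries` is a hypothetical name.

```shell
# Hypothetical helper: succeed only if the discovery JSON on stdin
# contains at least one {#SLOT_NUM} entry.
lld_has_entries() {
    grep -qF '"{#SLOT_NUM}"'
}

# An empty discovery result, like the one in the log above
if printf '%s\n' '{"data":[' ']}' | lld_has_entries; then
    echo "disks discovered"
else
    echo "discovery returned no disks: check sudo permissions for the zabbix user"
fi
```

On the server the input would come from the zabbix_get command used above instead of the hard-coded sample.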