@cdmonkey 2017-05-11T02:03:19.000000Z

smartmontools

Nagios


https://github.com/v-zhuravlev/zbx-smartctl

I. Introduction

SMART: Self Monitoring Analysis and Reporting Technology

Download: http://www.smartmontools.org/wiki/Download
Linux.cn: http://linux.cn/article-4461-weixin.html
http://my.oschina.net/u/877567/blog/307336

Original article on hard disk checking: http://czmmiao.iteye.com/blog/1058215

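Hard disks are the server components most prone to problems and failure, and they hold large amounts of important data, so the loss caused by a failure can be incalculable. At best you spend a great deal of time and effort on data recovery; at worst the disk is scrapped and the important data on it cannot be fully recovered. Monitoring disk health is therefore especially important. To avoid being caught out by a failed disk, you can use the smartmontools package, which supervises storage hardware through SMART (Self-Monitoring, Analysis and Reporting Technology).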

II. Installation

Install the build dependencies:

    yum install gcc gcc-c++

Then build and install from source:

    [root@oadb tools]# tar zxvf smartmontools-6.4.tar.gz
    [root@oadb tools]# cd smartmontools-6.4
    [root@oadb smartmontools-6.4]# ./configure --prefix=/usr/local --without-selinux
    make && make install
    ------------------
    # Alternatively, install from the Yum repository:
    yum install -y smartmontools

III. Usage

1. smartctl

    # View the version information:
    [root@oadb ~]# smartctl -V
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
    ...

Testing on a server fitted with eight hard disks:

    [root@kvm-test ~]# smartctl --scan-open
    /dev/sda -d scsi # /dev/sda, SCSI device
    /dev/bus/6 -d megaraid,0 # /dev/bus/6 [megaraid_disk_00], SCSI device
    /dev/bus/6 -d megaraid,1 # /dev/bus/6 [megaraid_disk_01], SCSI device
    /dev/bus/6 -d megaraid,2 # /dev/bus/6 [megaraid_disk_02], SCSI device
    /dev/bus/6 -d megaraid,3 # /dev/bus/6 [megaraid_disk_03], SCSI device
    /dev/bus/6 -d megaraid,4 # /dev/bus/6 [megaraid_disk_04], SCSI device
    /dev/bus/6 -d megaraid,5 # /dev/bus/6 [megaraid_disk_05], SCSI device
    /dev/bus/6 -d megaraid,6 # /dev/bus/6 [megaraid_disk_06], SCSI device
    /dev/bus/6 -d megaraid,7 # /dev/bus/6 [megaraid_disk_07], SCSI device

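Here "megaraid" is the smartctl device type used for disks behind a MegaRAID-family RAID controller, and the 0 in megaraid,0 is the number of the physical disk on that controller.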
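
A monitoring script usually wants this list in machine-readable form. The sketch below is one way to reduce the scan output to device/type pairs with awk. `list_smart_targets` is a hypothetical helper name, not part of smartmontools, and it assumes the `DEVICE -d TYPE # comment` line layout shown above.

```shell
#!/bin/sh
# list_smart_targets (hypothetical helper): read `smartctl --scan-open`
# output on stdin and print one "device type" pair per line, dropping the
# trailing "# ..." comment. Assumes the "DEVICE -d TYPE # ..." layout.
list_smart_targets() {
    awk '$2 == "-d" { print $1, $3 }'
}

# Example using two lines copied from the scan output (no smartctl needed):
printf '%s\n' \
    '/dev/sda -d scsi # /dev/sda, SCSI device' \
    '/dev/bus/6 -d megaraid,0 # /dev/bus/6 [megaraid_disk_00], SCSI device' |
    list_smart_targets
```

On a real host the input would come from `smartctl --scan-open | list_smart_targets`, and each pair can be passed back to smartctl as `smartctl -i -d TYPE DEVICE`.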

Fetch the device's SMART information:

    [root@kvm-test ~]# smartctl --all /dev/sda
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Vendor: DELL
    Product: PERC H710
    Revision: 3.13
    User Capacity: 2,096,078,258,176 bytes [2.09 TB]
    Logical block size: 512 bytes
    Logical Unit id: 0x6c81f660e79791001ec4a6da05d41165
    Serial number: 006511d405daa6c41e009197e760f681
    Device type: disk
    Local Time is: Wed Apr 19 13:50:19 2017 CST
    SMART support is: Unavailable - device lacks SMART capability.
    === START OF READ SMART DATA SECTION ===
    Current Drive Temperature: 0 C
    Drive Trip Temperature: 0 C
    Error Counter logging not supported
    Device does not support Self Test logging

As the output shows, the RAID controller itself (the PERC H710 logical device) does not support SMART. Now look at a single physical disk:

    [root@kvm-test ~]# smartctl -i -d megaraid,0 /dev/sda
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Vendor: SEAGATE
    Product: ST9300653SS
    Revision: YS0A
    Compliance: SPC-4
    User Capacity: 300,000,000,000 bytes [300 GB]
    Logical block size: 512 bytes
    Rotation Rate: 15000 rpm
    Form Factor: 2.5 inches
    Logical Unit id: 0x5000c500716ff47f
    Serial number: 6XN5SGF1
    Device type: disk
    Transport protocol: SAS (SPL-3)
    Local Time is: Wed Apr 19 14:49:20 2017 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    Temperature Warning: Disabled or Not Supported

As the output shows, the physical disk does support SMART.
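
With many disks it gets tedious to read this by eye. A minimal sketch, assuming the `SMART support is:` lines look like the output above; `smart_support` is a made-up helper name, not a smartmontools command.

```shell
#!/bin/sh
# smart_support (hypothetical helper): read `smartctl -i` output on stdin and
# print "enabled", "available" (supported but not enabled) or "unavailable",
# based only on the "SMART support is:" lines.
smart_support() {
    awk -F': *' '
        /^ *SMART support is:/ {
            if ($2 ~ /^Enabled/)        enabled = 1
            else if ($2 ~ /^Available/) available = 1
        }
        END {
            if (enabled)        print "enabled"
            else if (available) print "available"
            else                print "unavailable"
        }
    '
}

# Example using the two lines from the physical disk's output:
printf '%s\n' \
    'SMART support is: Available - device has SMART capability.' \
    'SMART support is: Enabled' |
    smart_support
```

On a real host this would be fed from something like `smartctl -i -d megaraid,0 /dev/sda | smart_support`.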

    [root@oadb ~]# smartctl -i /dev/sda
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
    Smartctl open device: /dev/sda failed: DELL or MegaRaid controller, please try adding '-d megaraid,N'

    # View disk information:
    [root@oadb ~]# smartctl -i -d megaraid,0 /dev/sda
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Vendor: SEAGATE
    Product: ST9300653SS
    Revision: YS09
    Compliance: SPC-4
    User Capacity: 300,000,000,000 bytes [300 GB]
    Logical block size: 512 bytes
    Rotation Rate: 15000 rpm
    Form Factor: 2.5 inches
    Logical Unit id: 0x5000c5005fb0745b
    Serial number: 6XN33ER0
    Device type: disk
    Transport protocol: SAS (SPL-3)
    Local Time is: Thu Oct 22 19:02:55 2015 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    Temperature Warning: Disabled or Not Supported

Check the device's health status:

    # Check the hard disk health status:
    [root@oadb ~]# smartctl -H /dev/sda -d megaraid,0
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART Health Status: OK

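SMART can only report that a disk is no longer healthy; it cannot tell you how long the disk will keep running after the warning. SMART warning thresholds usually leave some margin, so a disk generally does not die the moment it reports an error and can often hold out for a while. Some disks have run for years after reporting errors, while others became unusable within days, so never count on luck.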
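
Since the health line is the one value worth alerting on, here is a rough sketch of turning it into a Nagios-style check. It assumes the Nagios plugin return-code convention (0 OK, 2 CRITICAL, 3 UNKNOWN) and the health lines smartctl prints for SAS disks (`SMART Health Status:`) and ATA disks (`SMART overall-health self-assessment test result:`). `smart_health_state` is a hypothetical name.

```shell
#!/bin/sh
# smart_health_state (hypothetical helper): read `smartctl -H` output on
# stdin, print a one-line state message and return a Nagios-style code:
# 0 = OK, 2 = CRITICAL, 3 = UNKNOWN (no health line found).
smart_health_state() {
    status=$(sed -n \
        -e 's/^ *SMART Health Status: *//p' \
        -e 's/^ *SMART overall-health self-assessment test result: *//p' |
        head -n 1)
    case "$status" in
        OK|PASSED) echo "OK - SMART health: $status"; return 0 ;;
        "")        echo "UNKNOWN - no SMART health line found"; return 3 ;;
        *)         echo "CRITICAL - SMART health: $status"; return 2 ;;
    esac
}

# Example using the health line from the output above:
printf 'SMART Health Status: OK\n' | smart_health_state
```

A real check would pipe in `smartctl -H -d megaraid,0 /dev/sda` and exit with the function's return code. Anything other than OK or PASSED is treated as CRITICAL here, which is a deliberately cautious assumption given that a warning says nothing about how long the disk has left.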

    # Check the hard disk error log:
    [root@oadb ~]# smartctl -l error -d megaraid,0 /dev/sda
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:    931486797        0         0   931486797          0      13847.275           0
    write:           0        0         0           0          0       5962.098           0
    verify: 3042089148        0         0  3042089148          0      18667.773           0
    Non-medium error count: 2

    [root@oadb ~]# smartctl -A -d megaraid,0 /dev/sda
    smartctl 6.4 2015-06-04 r4109 [x86_64-linux-2.6.18-308.el5] (local build)
    Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    Current Drive Temperature: 31 C
    Drive Trip Temperature: 68 C
    Manufactured in week 10 of year 2013
    Specified cycle count over device lifetime: 10000
    Accumulated start-stop cycles: 26
    Elements in grown defect list: 0
    Vendor (Seagate) cache information
      Blocks sent to initiator = 3753898349
      Blocks received from initiator = 3003303195
      Blocks read from cache and sent to initiator = 2259987377
      Number of read and write commands whose size <= segment size = 40109835
      Number of read and write commands whose size > segment size = 190
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 16559.07
      number of minutes until next internal SMART test = 19

    [root@PBSSES01 ~]# smartctl -a /dev/bus/0 -d megaraid,3
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF INFORMATION SECTION ===
    Vendor: SEAGATE
    Product: ST600MP0005
    Revision: VT31
    Compliance: SPC-4
    User Capacity: 600,127,266,816 bytes [600 GB]
    Logical block size: 512 bytes
    Formatted with type 2 protection
    LU is fully provisioned
    Rotation Rate: 15000 rpm
    Form Factor: 2.5 inches
    Logical Unit id: 0x5000c5009599aff7
    Serial number: S7M0WHMN
    Device type: disk
    Transport protocol: SAS (SPL-3)
    Local Time is: Thu Apr 20 14:24:01 2017 CST
    SMART support is: Available - device has SMART capability.
    SMART support is: Enabled
    Temperature Warning: Disabled or Not Supported
    === START OF READ SMART DATA SECTION ===
    SMART Health Status: OK
    Current Drive Temperature: 34 C
    Drive Trip Temperature: 60 C
    Manufactured in week 48 of year 2015
    Specified cycle count over device lifetime: 10000
    Accumulated start-stop cycles: 23
    Specified load-unload count over device lifetime: 300000
    Accumulated load-unload cycles: 489
    Elements in grown defect list: 0 # Reportedly this is where bad sectors show up: defects that developed after manufacture, commonly called grown defects.
    Vendor (Seagate) cache information
      Blocks sent to initiator = 290310405
      Blocks received from initiator = 583431357
      Blocks read from cache and sent to initiator = 1944507
      Number of read and write commands whose size <= segment size = 6490035
      Number of read and write commands whose size > segment size = 51
    Vendor (Seagate/Hitachi) factory information
      number of hours powered up = 10990.77
      number of minutes until next internal SMART test = 7
    Error counter log:
               Errors Corrected by           Total   Correction     Gigabytes    Total
                   ECC          rereads/    errors   algorithm      processed    uncorrected
               fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
    read:   2472572503        0         0  2472572503          0        728.491           0
    write:           0        0         0           0          0        315.816           0
    verify:  597797863        0         0   597797863          0      39623.364           0
    Non-medium error count: 2 # Non-medium errors: not a fault in the disk media itself, usually a cable, transport or checksum problem; can generally be ignored.
    SMART Self-test log
    Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                              number   (hours)
    # 1  Background short  Completed                  80       3                 - [-   -    -]
    # 2  Reserved(7)       Completed                  64       3                 - [-   -    -]
    Long (extended) Self Test duration: 3360 seconds [56.0 minutes]
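
The three counters annotated in that output (grown defects, uncorrected errors, non-medium errors) are the ones most worth tracking over time. Below is a small sketch that pulls them out of `smartctl -a` output for a SAS disk. `defect_summary` is a hypothetical helper, and it assumes the raw smartctl line formats without the explanatory comments added in these notes.

```shell
#!/bin/sh
# defect_summary (hypothetical helper): read `smartctl -a` output for a
# SAS/SCSI disk on stdin and print the grown defect count, the sum of the
# last column ("Total uncorrected errors") over the read/write/verify rows,
# and the non-medium error count.
defect_summary() {
    awk '
        /^ *Elements in grown defect list:/ { grown = $NF }
        /^ *(read|write|verify):/           { uncorrected += $NF }
        /^ *Non-medium error count:/        { nonmedium = $NF }
        END {
            printf "grown=%d uncorrected=%d nonmedium=%d\n",
                   grown, uncorrected, nonmedium
        }
    '
}

# Example using values from the megaraid,3 disk:
printf '%s\n' \
    'Elements in grown defect list: 0' \
    'read: 2472572503 0 0 2472572503 0 728.491 0' \
    'write: 0 0 0 0 0 315.816 0' \
    'verify: 597797863 0 0 597797863 0 39623.364 0' \
    'Non-medium error count: 2' |
    defect_summary
```

Logging this line periodically makes it easy to spot a grown defect count that starts climbing, which a single snapshot cannot show.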

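Configuring smartmontools to check disk health on a schedule is like giving a person regular check-ups: passing the check-up does not mean nothing is wrong, because many illnesses cannot be found by check-up equipment. This matches what Google's hard disk study reported, namely that only about 60% of failures can be detected through SMART, so the results should not be relied on completely.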

Self-test log

    [root@PBSSES01 ~]# smartctl -l selftest /dev/bus/0 -d megaraid,3
    smartctl 6.5 2016-05-07 r4318 [x86_64-linux-2.6.32-573.el6.x86_64] (local build)
    Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
    === START OF READ SMART DATA SECTION ===
    SMART Self-test log
    Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
         Description                              number   (hours)
    # 1  Background short  Completed                  80       3                 - [-   -    -]
    # 2  Reserved(7)       Completed                  64       3                 - [-   -    -]
    Long (extended) Self Test duration: 3360 seconds [56.0 minutes]
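
To pick out self-tests that did not finish cleanly, one option is to filter the log rows. This is only a sketch: `suspect_selftests` is a hypothetical helper, and treating every row whose status is neither "Completed" nor "in progress" as suspect is my assumption, not a rule from smartmontools.

```shell
#!/bin/sh
# suspect_selftests (hypothetical helper): read `smartctl -l selftest` output
# on stdin and print the "# N ..." log rows whose status is neither
# "Completed" nor "in progress". Prints nothing if every test completed.
suspect_selftests() {
    awk '/^ *# *[0-9]+ / && !/Completed/ && !/in progress/ { print }'
}

# Example using the two rows from the log above (both completed, so this
# prints nothing):
printf '%s\n' \
    '# 1  Background short  Completed                  80       3                 - [-   -    -]' \
    '# 2  Reserved(7)       Completed                  64       3                 - [-   -    -]' |
    suspect_selftests
```

A new test can be started with `smartctl -t short -d megaraid,3 /dev/bus/0` (or `-t long`), after which the log gains a new row.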

2. smartd

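smartctl looks like a very good tool, but running it by hand every time is a real chore. Wouldn't it be better if it ran at set intervals and notified the operations staff of the results? That feature already exists: the smartd background daemon does exactly this.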

    [root@oadb ~]# vim /etc/smartd.conf
    # Sample configuration file for smartd. See man smartd.conf.
    DEVICESCAN -H -m root
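
On the MegaRAID servers shown earlier, an alternative to DEVICESCAN is to name each physical disk explicitly. The sketch below generates such entries from the scan output, reusing the same `-H -m root` directives. `smartd_lines` is a hypothetical helper, and whether explicit entries suit your setup is an assumption to check against `man smartd.conf`, which as far as I know also says that lines after a DEVICESCAN entry are ignored, so these entries would replace that line rather than follow it.

```shell
#!/bin/sh
# smartd_lines (hypothetical helper): read `smartctl --scan-open` output on
# stdin and print one smartd.conf entry per MegaRAID physical disk, with the
# same "-H -m root" directives as the DEVICESCAN line (health check, mail
# warnings to root). Non-megaraid devices are skipped.
smartd_lines() {
    awk '$2 == "-d" && $3 ~ /^megaraid,[0-9]+$/ {
        print $1, "-d", $3, "-H -m root"
    }'
}

# Example using two lines from the earlier scan output:
printf '%s\n' \
    '/dev/sda -d scsi # /dev/sda, SCSI device' \
    '/dev/bus/6 -d megaraid,0 # /dev/bus/6 [megaraid_disk_00], SCSI device' |
    smartd_lines
```

The generated lines can be reviewed and pasted into /etc/smartd.conf by hand; the sketch deliberately does not write to the file itself.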