Summary of Common Linux Monitoring Metrics

Published: 2020-03-01 15:47:34


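　　1. Basic Linux operations metrics
　　In operations work, problems themselves are not the real fear. The fear is being unable to capture the scene when one occurs and being left in the dark. That is why it matters to rely on a strong monitoring system and collect as many metrics as possible. But which metrics are actually meaningful? In the spirit of learning from practice, the experience that engineers have distilled from years of hands-on work is the most valuable.
　　From that long-term practice, we have summarized the metrics that operations engineers consult most often when running systems. They fall into the following categories:
　　· CPU
　　· Load
　　· Memory
　　· Disk
　　· IO
　　· Network
　　· Kernel parameters
　　· ss statistics output
　　· Port collection
　　· Liveness of core service processes
　　· Resource consumption of key business processes
　　· NTP offset collection
　　· DNS resolution collection
　　The detailed metrics in each category are listed below. All of them are supported directly by the agent component of open-falcon. falcon-agent collects these metrics at a fixed interval (currently 60 seconds) and reports them to the server.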
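　　2. CPU metrics
　　How it is computed: collected from /proc/stat. The statistics printed by the sar command are a useful reference for understanding these metrics.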
　　· cpu.idle：Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
　　· cpu.busy：the complement of cpu.idle; its value equals 100 minus cpu.idle.
　　· cpu.guest：Percentage of time spent by the CPU or CPUs to run a virtual processor.
　　· cpu.iowait：Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
　　· cpu.irq：Percentage of time spent by the CPU or CPUs to service hardware interrupts.
　　· cpu.softirq：Percentage of time spent by the CPU or CPUs to service software interrupts.
　　· cpu.nice：Percentage of CPU utilization that occurred while executing at the user level with nice priority.
　　· cpu.steal：Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
　　· cpu.system：Percentage of CPU utilization that occurred while executing at the system level (kernel).
　　· cpu.user：Percentage of CPU utilization that occurred while executing at the user level (application).
　　· cpu.cnt：number of CPU cores.
　　· cpu.switches：number of CPU context switches; counter type.
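　　To make the /proc/stat calculation concrete, here is a minimal sketch of how two snapshots of the aggregate cpu line could be turned into percentages. It is not the falcon-agent implementation. The function names and the sample snapshots are hypothetical, and leaving guest out of the total (because the kernel already counts it inside user) is an assumption of this sketch.

```python
# Field order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")


def read_cpu_ticks(stat_text):
    """Return a dict of tick counters taken from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            # Older kernels expose fewer columns; pad the missing ones with 0.
            values += [0] * (len(FIELDS) - len(values))
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_percentages(before, after):
    """Turn two tick snapshots into percentages over the sampling interval."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    # Assumption: guest time is already part of user time, so skip it here.
    total = sum(v for k, v in delta.items() if k != "guest")
    if total <= 0:
        return {}
    result = {"cpu." + k: 100.0 * v / total for k, v in delta.items()}
    result["cpu.busy"] = 100.0 - result["cpu.idle"]
    return result


# Hypothetical snapshots taken one interval apart.
SAMPLE_T0 = "cpu  100 0 50 800 50 0 0 0 0 0\ncpu0 100 0 50 800 50 0 0 0 0 0\n"
SAMPLE_T1 = "cpu  150 0 70 1700 80 0 0 0 0 0\ncpu0 150 0 70 1700 80 0 0 0 0 0\n"

cpu_sample = cpu_percentages(read_cpu_ticks(SAMPLE_T0),
                             read_cpu_ticks(SAMPLE_T1))
```

　　On a real host the two snapshots would come from reading /proc/stat twice, one collection interval apart.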
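　　3. Disk metrics
　　How it is computed: read /proc/mounts to get every mount point, then use syscall.Statfs_t to obtain block and inode usage. Each metric carries a set of tags of the form mount=$mount,fstype=$fstype, where $mount is the mount point (for example /home) and $fstype is the file system type (for example ext4).
　　· df.bytes.free：available disk space, int64
　　· df.bytes.free.percent：available disk space as a percentage of the total, float64, for example 32.1
　　· df.bytes.total：total disk size, int64
　　· df.bytes.used：used disk space, int64
　　· df.bytes.used.percent：used disk space as a percentage of the total, float64
　　· df.inodes.total：total number of inodes, int64
　　· df.inodes.free：number of free inodes, int64
　　· df.inodes.free.percent：percentage of inodes that are free, float64
　　· df.inodes.used：number of used inodes, int64
　　· df.inodes.used.percent：percentage of inodes that are used, float64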
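　　The same idea can be sketched in Python, where os.statvfs plays the role that syscall.Statfs_t plays in Go. This is only an illustration under stated assumptions: the helper names and the FakeStat sample are hypothetical, and the choice of f_bavail for "free" and of the total as the percentage denominator is an assumption of this sketch, not necessarily what the agent does.

```python
import os
from collections import namedtuple


def parse_mounts(mounts_text):
    """Yield (mount_point, fstype) pairs from /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            yield fields[1], fields[2]


def df_metrics(st):
    """Derive the df.* values from an os.statvfs()-like result."""
    total = st.f_blocks * st.f_frsize
    free = st.f_bavail * st.f_frsize            # usable by unprivileged users
    used = (st.f_blocks - st.f_bfree) * st.f_frsize
    inodes_used = st.f_files - st.f_ffree
    metrics = {
        "df.bytes.total": total,
        "df.bytes.free": free,
        "df.bytes.used": used,
        "df.inodes.total": st.f_files,
        "df.inodes.free": st.f_ffree,
        "df.inodes.used": inodes_used,
    }
    if total > 0:
        metrics["df.bytes.free.percent"] = 100.0 * free / total
        metrics["df.bytes.used.percent"] = 100.0 * used / total
    if st.f_files > 0:  # some file systems report zero inodes
        metrics["df.inodes.free.percent"] = 100.0 * st.f_ffree / st.f_files
        metrics["df.inodes.used.percent"] = 100.0 * inodes_used / st.f_files
    return metrics


# Hypothetical numbers standing in for a real os.statvfs() result.
FakeStat = namedtuple("FakeStat",
                      "f_frsize f_blocks f_bfree f_bavail f_files f_ffree")
df_sample = df_metrics(FakeStat(4096, 1000, 400, 350, 2000, 1500))
mounts_sample = list(parse_mounts("/dev/sda1 /home ext4 rw,relatime 0 0\n"))
```

　　On a real host, df_metrics(os.statvfs("/home")) would produce the values for one mount point, tagged with the pair that parse_mounts returns for it.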
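　　4. megacli tool output
　　Uses the megacli tool to read RAID information. Each metric carries a set of tags identifying the PD or VD it belongs to. The PD format is PD=Enclosure_ID:SLOT_ID; for example, PD=32:0 denotes the first physical disk, and VD=0 denotes the first logical disk.
　　· sys.disk.lsiraid.pd.Media_Error_Count：this metric and the three below are currently collected for data only; they do not necessarily mean the disk is damaged (only that damage is more likely)
　　· sys.disk.lsiraid.pd.Other_Error_Count
　　· sys.disk.lsiraid.pd.Predictive_Failure_Count
　　· sys.disk.lsiraid.pd.Drive_Temperature
　　· sys.disk.lsiraid.pd.Firmware_state：a nonzero value means this physical disk has a problem
　　· sys.disk.lsiraid.vd.cache_policy：a nonzero value means the cache policy of this logical disk does not match its configured setting
　　· sys.disk.lsiraid.vd.state：a nonzero value means this logical disk has a problem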
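　　5. SMART tool output
　　Uses the smartctl tool to read disk SMART information. All of these metrics are currently collected for data only; they do not necessarily mean the disk is damaged (only that damage is more likely). Each metric carries a tag identifying the device, for example device=/dev/sda.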
　　· sys.disk.smart.Reallocated_Sector_Ct
　　· sys.disk.smart.Spin_Retry_Count
　　· sys.disk.smart.Reallocated_Event_Count
　　· sys.disk.smart.Current_Pending_Sector
　　· sys.disk.smart.Offline_Uncorrectable
　　· sys.disk.smart.Temperature_Celsius
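　　6. Partition read/write monitoring
　　Tests whether every mounted partition is readable and writable. Each metric carries a tag identifying the mount point, for example mount=/home
　　· sys.disk.rw：a nonzero value means reads or writes on this partition are failing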
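　　7. IO metrics
　　How it is computed: /proc/diskstats is sampled once per second and the differences are computed; these are all counter-type metrics. Each metric carries a tag of the form device=$device identifying the specific device, such as sda1 or sdb. See the iostat documentation for the precise meaning of each metric.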
　　· disk.io.ios_in_progress：Number of actual I/O requests currently in flight.
　　· disk.io.msec_read：Total number of ms spent by all reads.
　　· disk.io.msec_total：Amount of time during which ios_in_progress >= 1.
　　· disk.io.msec_weighted_total：Measure of recent I/O completion time and backlog.
　　· disk.io.msec_write：Total number of ms spent by all writes.
　　· disk.io.read_merged：Adjacent read requests merged in a single req.
　　· disk.io.read_requests：Total number of reads completed successfully.
　　· disk.io.read_sectors：Total number of sectors read successfully.
　　· disk.io.write_merged：Adjacent write requests merged in a single req.
　　· disk.io.write_requests：Total number of writes completed successfully.
　　· disk.io.write_sectors：Total number of sectors written successfully.
　　· disk.io.read_bytes：a number, in bytes
　　· disk.io.write_bytes：a number, in bytes
　　· disk.io.avgrq_sz：this and the values below are the ones shown by iostat -x 1
　　· disk.io.avgqu-sz
　　· disk.io.await
　　· disk.io.svctm
　　· disk.io.util：a percentage; for example, 56.43 means 56.43%
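　　As an illustration of the difference calculation, the sketch below parses two /proc/diskstats samples and derives the iostat-style values from the deltas. The formulas follow the usual iostat definitions. The function names and the sample lines are hypothetical, and this is not the agent's own code.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units

# The first 11 per-device fields of /proc/diskstats, in file order.
COUNTERS = ("read_requests", "read_merged", "read_sectors", "msec_read",
            "write_requests", "write_merged", "write_sectors", "msec_write",
            "ios_in_progress", "msec_total", "msec_weighted_total")


def parse_diskstats(text):
    """Map device name -> dict of the fields listed in COUNTERS."""
    devices = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 3 + len(COUNTERS):
            values = [int(v) for v in parts[3:3 + len(COUNTERS)]]
            devices[parts[2]] = dict(zip(COUNTERS, values))
    return devices


def io_rates(before, after, interval_ms):
    """Derive iostat-style values from two samples of one device."""
    d = {k: after[k] - before[k] for k in COUNTERS}
    ios = d["read_requests"] + d["write_requests"]
    sectors = d["read_sectors"] + d["write_sectors"]
    busy_ms = d["msec_read"] + d["msec_write"]
    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        "disk.io.avgrq_sz": sectors / ios if ios else 0.0,
        "disk.io.avgqu-sz": d["msec_weighted_total"] / interval_ms,
        "disk.io.await": busy_ms / ios if ios else 0.0,
        "disk.io.svctm": d["msec_total"] / ios if ios else 0.0,
        "disk.io.util": 100.0 * d["msec_total"] / interval_ms,
    }


# Hypothetical samples of one device, taken one second apart.
DISK_T0 = "   8       0 sda 1000 10 20000 500 2000 20 40000 1500 0 800 2000\n"
DISK_T1 = "   8       0 sda 1100 10 20800 600 2100 20 41600 1800 0 1300 2400\n"

io_sample = io_rates(parse_diskstats(DISK_T0)["sda"],
                     parse_diskstats(DISK_T1)["sda"], interval_ms=1000)
```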
　　8. Machine load metrics
　　How it is computed: read from /proc/loadavg; all are raw-value type:
　　· load.1min
　　· load.5min
　　· load.15min
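　　A minimal sketch of reading those three values: the first three fields of /proc/loadavg are the 1, 5 and 15 minute load averages. The function name and the sample line are hypothetical.

```python
def parse_loadavg(text):
    """Return the three load averages from /proc/loadavg-style text."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    return {"load.1min": one, "load.5min": five, "load.15min": fifteen}


# Hypothetical content of /proc/loadavg.
load_sample = parse_loadavg("0.20 0.18 0.12 1/80 11206\n")
```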
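　　9. Memory metrics
　　How it is computed: read from /proc/meminfo, where mem.memfree is free + buffers + cached and mem.memused = mem.memtotal - mem.memfree. See the output and documentation of the free command for the meaning of each metric.
　　· mem.memtotal：total memory size
　　· mem.memused：amount of memory in use
　　· mem.memused.percent：percentage of memory in use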
　　· mem.memfree
　　· mem.memfree.percent
　　· mem.swaptotal：total swap size
　　· mem.swapused：amount of swap in use
　　· mem.swapused.percent：percentage of swap in use
　　· mem.swapfree
　　· mem.swapfree.percent
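　　The formulas above can be sketched as follows. /proc/meminfo reports values in kB; converting them to bytes is an assumption of this sketch, and the function names and sample text are hypothetical.

```python
def parse_meminfo(text):
    """Return {field: value in kB} from /proc/meminfo-style text."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def mem_metrics(info):
    """Apply memfree = free + buffers + cached and memused = total - memfree."""
    kb = 1024  # assumption: report bytes rather than kB
    total = info["MemTotal"] * kb
    free = (info["MemFree"] + info["Buffers"] + info["Cached"]) * kb
    used = total - free
    swap_total = info["SwapTotal"] * kb
    swap_free = info["SwapFree"] * kb
    swap_used = swap_total - swap_free
    metrics = {
        "mem.memtotal": total, "mem.memfree": free, "mem.memused": used,
        "mem.swaptotal": swap_total, "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
    }
    if total > 0:
        metrics["mem.memfree.percent"] = 100.0 * free / total
        metrics["mem.memused.percent"] = 100.0 * used / total
    if swap_total > 0:  # hosts without swap report SwapTotal: 0
        metrics["mem.swapfree.percent"] = 100.0 * swap_free / swap_total
        metrics["mem.swapused.percent"] = 100.0 * swap_used / swap_total
    return metrics


# Hypothetical excerpt of /proc/meminfo.
MEMINFO = """MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          500000 kB
Cached:          2500000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""
mem_sample = mem_metrics(parse_meminfo(MEMINFO))
```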
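　　10. Network metrics
　　How it is computed: read from /proc/net/dev. Each metric carries a tag of the form iface=$iface identifying the specific interface, such as eth0. Metrics containing in describe inbound traffic, out describes outbound traffic, and total is the sum in + out. The supported metrics are: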
　　· net.if.in.bytes
　　· net.if.in.compressed
　　· net.if.in.dropped
　　· net.if.in.errors
　　· net.if.in.fifo.errs
　　· net.if.in.frame.errs
　　· net.if.in.multicast
　　· net.if.in.packets
　　· net.if.out.bytes
　　· net.if.out.carrier.errs
　　· net.if.out.collisions
　　· net.if.out.compressed
　　· net.if.out.dropped
　　· net.if.out.errors
　　· net.if.out.fifo.errs
　　· net.if.out.packets
　　· net.if.total.bytes
　　· net.if.total.dropped
　　· net.if.total.errors
　　· net.if.total.packets
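　　A rough sketch of how the columns of /proc/net/dev line up with the metric names above. The column order (eight receive counters followed by eight transmit counters) is the kernel's; the function name and the sample text are hypothetical.

```python
# Column order of /proc/net/dev: 8 receive fields, then 8 transmit fields.
RX = ("bytes", "packets", "errors", "dropped",
      "fifo.errs", "frame.errs", "compressed", "multicast")
TX = ("bytes", "packets", "errors", "dropped",
      "fifo.errs", "collisions", "carrier.errs", "compressed")


def parse_net_dev(text):
    """Map interface name -> dict of net.if.* counters."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        iface, data = line.split(":", 1)
        values = [int(v) for v in data.split()]
        if len(values) < 16:
            continue
        metrics = {}
        for name, value in zip(RX, values[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(TX, values[8:16]):
            metrics["net.if.out." + name] = value
        for name in ("bytes", "dropped", "errors", "packets"):
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name])
        result[iface.strip()] = metrics
    return result


# Hypothetical content of /proc/net/dev.
NET_DEV = """Inter-|   Receive                                   |  Transmit
 face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
  eth0: 1000 10 1 2 0 0 0 3 2000 20 4 5 0 0 0 0
    lo: 500 5 0 0 0 0 0 0 500 5 0 0 0 0 0 0
"""
net_sample = parse_net_dev(NET_DEV)
```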
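　　11. Port metrics
　　How it is computed: ss -ln is used to determine whether the specified port is in the listen state. Raw-value type: the value is either 1 (listening) or 0 (not listening). Each metric carries a tag of the form port=$port, where $port is the specific port.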
　　· net.port.listen
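　　One way to sketch this check is to scan captured ss -ln output for rows in the LISTEN state and extract the port from the local address column. The function names and the sample output are hypothetical, and counting only LISTEN rows (so UDP sockets shown as UNCONN are ignored) is an assumption of this sketch.

```python
def listening_ports(ss_output):
    """Return the set of ports that appear in LISTEN state rows."""
    ports = set()
    for line in ss_output.splitlines():
        parts = line.split()
        if "LISTEN" not in parts:
            continue
        idx = parts.index("LISTEN")
        if len(parts) <= idx + 3:
            continue
        # After State come Recv-Q, Send-Q, then the local address:port.
        local = parts[idx + 3]
        port = local.rsplit(":", 1)[-1]
        if port.isdigit():  # skips unix sockets, whose "address" is a path
            ports.add(int(port))
    return ports


def net_port_listen(ss_output, port):
    """Return 1 if the port is listening, otherwise 0."""
    return 1 if port in listening_ports(ss_output) else 0


# Hypothetical output of `ss -ln`.
SS_LN = """Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port
u_str LISTEN 0      128    /run/demo.sock 12345 * 0
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
tcp   LISTEN 0      128    [::]:8080          [::]:*
udp   UNCONN 0      0      127.0.0.1:323      0.0.0.0:*
"""
```

　　On a real host the text could be captured with subprocess.run(["ss", "-ln"], capture_output=True, text=True).stdout.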
　　12. Kernel configuration
　　· kernel.maxfiles：read from /proc/sys/fs/file-max
　　· kernel.files.allocated：the first field of /proc/sys/fs/file-nr
　　· kernel.files.left：value = kernel.maxfiles - kernel.files.allocated
　　· kernel.maxproc：read from /proc/sys/kernel/pid_max
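　　The file-handle arithmetic is simple enough to show in a few lines. This is a sketch only; the function name and the sample file contents are hypothetical.

```python
def kernel_file_metrics(file_max_text, file_nr_text):
    """Compute the kernel.* file-handle metrics from the two proc files."""
    maxfiles = int(file_max_text.split()[0])
    allocated = int(file_nr_text.split()[0])  # first field of file-nr
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
    }


# Hypothetical contents of /proc/sys/fs/file-max and /proc/sys/fs/file-nr.
kernel_sample = kernel_file_metrics("790000\n", "1216\t0\t790000\n")
```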
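　　13. NTP metrics
　　Uses ntpq -pn to obtain the local clock's offset relative to the NTP server.
　　· sys.ntp.offset：local clock offset, in ms; a value that is too large, or exactly 0, indicates an anomaly and should trigger an alert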
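　　As a sketch of how the offset might be extracted: in ntpq -pn output the peer currently selected for synchronization is marked with a leading asterisk, and offset is the ninth column, in milliseconds. The function name and the sample output are hypothetical, and using only the selected peer is an assumption of this sketch.

```python
def ntp_offset_ms(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if none is selected."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # only the system peer carries the '*' tally code
        fields = line.split()
        # remote refid st t when poll reach delay offset jitter
        if len(fields) >= 10:
            return float(fields[8])
    return None


# Hypothetical output of `ntpq -pn`.
NTPQ = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*192.168.1.1     .GPS.            1 u   33   64  377    0.512   -1.234   0.045
+192.168.1.2     192.168.1.1      2 u   12   64  377    0.800    0.321   0.060
"""
ntp_sample = ntp_offset_ms(NTPQ)
```

　　A None result (no selected peer) is worth alerting on as well, since it means the host is not synchronized to any server.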
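　　14. Process monitoring
　　· proc.num：the number of instances of a given process. There are two cases. One matches on the process name, for example name=sshd. The other matches on cmdline: Java applications, for instance, may all have the process name java, so the first approach cannot tell them apart; in that case configure cmdline, for example cmdline=./falcon_agent -c ./cfg.ini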
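　　The two matching modes can be sketched as below. Exact matching on the name and substring matching on cmdline are assumptions of this sketch, and the function names and the sample process list are hypothetical.

```python
import os


def list_processes(proc_root="/proc"):
    """Return (name, cmdline) pairs for every running process (Linux only)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                # Arguments are NUL-separated; join them with spaces.
                raw = f.read().replace(b"\0", b" ")
                cmdline = raw.decode(errors="replace").strip()
        except OSError:
            continue  # the process exited while we were scanning
        found.append((name, cmdline))
    return found


def proc_num(processes, name=None, cmdline=None):
    """Count processes matching an exact name and/or a cmdline substring."""
    count = 0
    for proc_name, proc_cmdline in processes:
        if name is not None and proc_name != name:
            continue
        if cmdline is not None and cmdline not in proc_cmdline:
            continue
        count += 1
    return count


# Hypothetical process table.
PROCESSES = [
    ("sshd", "/usr/sbin/sshd -D"),
    ("java", "java -jar /opt/app-a/a.jar"),
    ("java", "java -jar /opt/app-b/b.jar"),
]
```

　　On a Linux host, proc_num(list_processes(), name="sshd") would count the live sshd processes.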
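　　15. Process resource monitoring
　　· process.cpu.all：sys + user CPU time used by the process and its child processes, in jiffies
　　· process.cpu.sys：sys CPU time used by the process and its child processes, in jiffies
　　· process.cpu.user：user CPU time used by the process and its child processes, in jiffies
　　· process.swap：swap used by the process and its child processes, in pages
　　· process.fd：number of file descriptors used by the process
　　· process.mem：memory used by the process, in bytes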
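　　For the CPU values, one possible source is /proc/<pid>/stat, whose utime, stime, cutime and cstime fields are expressed in clock ticks. The sketch below is an illustration rather than the agent's method: the function names and the sample line are hypothetical, and it relies on cutime and cstime, which only cover children that have already been waited for.

```python
def process_cpu_jiffies(stat_line):
    """Extract user/sys CPU ticks (process plus reaped children) from a stat line."""
    # The command name may contain spaces or ')', so split after the last ')'.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14-17 sit at indexes 11-14.
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }


def jiffies_to_seconds(jiffies, ticks_per_second):
    """Convert clock ticks to seconds."""
    return jiffies / ticks_per_second


# Hypothetical (truncated) content of /proc/1234/stat.
STAT_LINE = ("1234 (my proc) S 1 1234 1234 0 -1 4194560 500 0 0 0 "
             "120 30 5 2 20 0 1 0 100 1000000 200")
proc_cpu_sample = process_cpu_jiffies(STAT_LINE)
```

　　The tick rate for the conversion is available from os.sysconf("SC_CLK_TCK"), which is commonly 100 on Linux.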
　　16. ss command output
　　· ss.orphaned
　　· ss.closed
　　· ss.timewait
　　· ss.slabinfo.timewait
　　· ss.synrecv
　　· ss.estab
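　　These names line up with the TCP summary line printed by ss -s, so a sketch of a parser for that line may help. It is an assumption that the metrics come from that line. The function name and the sample line are hypothetical, and the exact set of fields varies between iproute2 versions (newer releases may omit synrecv and the slab part of timewait).

```python
import re


def parse_ss_summary(text):
    """Parse the 'TCP:' line of `ss -s` output into ss.* metrics."""
    metrics = {}
    for line in text.splitlines():
        if not line.startswith("TCP:"):
            continue
        inside = re.search(r"\((.*)\)", line)
        if not inside:
            break
        for item in inside.group(1).split(","):
            pieces = item.split()
            if len(pieces) != 2:
                continue
            key, value = pieces
            if key == "timewait" and "/" in value:
                # Older versions print "timewait <count>/<slab count>".
                timewait, slab = value.split("/", 1)
                metrics["ss.timewait"] = int(timewait)
                metrics["ss.slabinfo.timewait"] = int(slab)
            else:
                metrics["ss." + key] = int(value)
        break
    return metrics


# Hypothetical output of `ss -s` (first lines only).
SS_S = """Total: 180 (kernel 0)
TCP:   25 (estab 10, closed 5, orphaned 1, synrecv 0, timewait 4/0), ports 0
"""
ss_sample = parse_ss_summary(SS_S)
```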
