Zabbix est un cms de supervision. Il permet de surveiller l’état de fonctionnement au travers de différentes stats de matériels, switch, routeur, serveur web, pc client, applications et autres. Dans ce billet je présente une procédure simple afin de superviser des RIGS de GPU AMD (rx570/580) via Zabbix.
Dans cet exemple on va pouvoir retrouver le taux d’utilisation d’une GPU, sa température. Il est possible de compléter l’ensemble avec d’autres stats, l’objectif du billet étant de décrire la procédure de mise en place via ces deux stats. Travaillant avec la distribution Ethos qui est une ubuntu modifié il est ainsi possible de surveiller l’utilisation de son RIG de façon simple et rapide sans utiliser l’interface classique de ethos qui est assez sobre.
Pour cela il suffit de déployer l’agent Zabbix de façon « classique », puis ajoutez en fin de fichier de configuration ces quelques lignes :
Les valeurs sont assez explicites, température, taux d’utilisation et le nombres de GPU présent sur le rig.
UserParameter=gpu.discovery,/etc/zabbix/scripts/get_gpus_info.sh
UserParameter=gpu.temp[*],export DISPLAY=:0 && xhost + > /dev/null && amdconfig --adapter=$1 --odgt | grep 'Temperature' | cut -d'-' -f2 | cut -d'.' -f1 | tr -d ' '
UserParameter=gpu.utilization[*],export DISPLAY=:0 && xhost + > /dev/null && amdconfig --adapter=$1 --odgc | grep 'GPU load' | cut -f1 -d'%' | cut -f2 -d':'| tr -d ' '
UserParameter=gpu.number,show stats | grep "gpus:" | tr -s ' ' | cut -d ' ' -f 2
Dans Zabbix importez le template suivant :
<zabbix_export>
<version>4.0</version>
<date>2018-11-08T09:07:06Z</date>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<templates>
<template>
<template>Template AMD GPUs Performance</template>
<name>Template AMD GPUs Performance</name>
<description>LINUX Template AMD GPUs Performance</description>
<groups>
<group>
<name>Templates</name>
</group>
</groups>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<items>
<item>
<name>Number of GPUs</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.number</key>
<delay>30s</delay>
<history>90d</history>
<trends>365d</trends>
<status>0</status>
<value_type>0</value_type>
<allowed_hosts/>
<units/>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description>The number of GPUs present on this system.</description>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<master_item/>
</item>
<item>
<name>Temp GPUs</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.temp</key>
<delay>10s</delay>
<history>90d</history>
<trends>365d</trends>
<status>1</status>
<value_type>0</value_type>
<allowed_hosts/>
<units/>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<master_item/>
</item>
<item>
<name>Utilisation GPUs</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.utilisation</key>
<delay>10s</delay>
<history>90d</history>
<trends>365d</trends>
<status>1</status>
<value_type>0</value_type>
<allowed_hosts/>
<units/>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<master_item/>
</item>
</items>
<discovery_rules>
<discovery_rule>
<name>GPU discovery</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.discovery</key>
<delay>600</delay>
<status>0</status>
<allowed_hosts/>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<filter>
<evaltype>0</evaltype>
<formula/>
<conditions/>
</filter>
<lifetime>30d</lifetime>
<description>Discovery of graphics cards.</description>
<item_prototypes>
<item_prototype>
<name>GPU $1 Fan Speed</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.fanspeed[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<trends>365d</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units>%</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
<item_prototype>
<name>GPU $1 Memory Free</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.memfree[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<trends>365d</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units>MB</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
<item_prototype>
<name>GPU $1 Memory Total</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.memtotal[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<trends>365d</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units>MB</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
<item_prototype>
<name>GPU $1 Memory Used</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.memused[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<trends>365d</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units>MB</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
<item_prototype>
<name>GPU $1 Power in decaWatts</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.power[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<trends>365d</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units>dW</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
<item_prototype>
<name>GPU $1 Temperature</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.temp[{#GPUINDEX}]</key>
<delay>60</delay>
<history>7d</history>
<trends>365d</trends>
<status>0</status>
<value_type>0</value_type>
<allowed_hosts/>
<units>C</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
<item_prototype>
<name>GPU $1 Utilization</name>
<type>0</type>
<snmp_community/>
<snmp_oid/>
<key>gpu.utilization[{#GPUINDEX}]</key>
<delay>60</delay>
<history>90d</history>
<trends>365d</trends>
<status>0</status>
<value_type>3</value_type>
<allowed_hosts/>
<units>%</units>
<snmpv3_contextname/>
<snmpv3_securityname/>
<snmpv3_securitylevel>0</snmpv3_securitylevel>
<snmpv3_authprotocol>0</snmpv3_authprotocol>
<snmpv3_authpassphrase/>
<snmpv3_privprotocol>0</snmpv3_privprotocol>
<snmpv3_privpassphrase/>
<params/>
<ipmi_sensor/>
<authtype>0</authtype>
<username/>
<password/>
<publickey/>
<privatekey/>
<port/>
<description/>
<inventory_link>0</inventory_link>
<applications>
<application>
<name>AMD</name>
</application>
</applications>
<valuemap/>
<logtimefmt/>
<preprocessing/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<output_format>0</output_format>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
<application_prototypes/>
<master_item/>
</item_prototype>
</item_prototypes>
<trigger_prototypes>
<trigger_prototype>
<expression>{Template AMD GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>80</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>GPU {#GPUINDEX} Temperature is extremely high</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>5</priority>
<description>A GPU's temperature is getting extremely high!</description>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{Template AMD GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>70</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>GPU {#GPUINDEX} Temperature is high</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>2</priority>
<description>A GPU's temperature is getting high!</description>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
<trigger_prototype>
<expression>{Template AMD GPUs Performance:gpu.temp[{#GPUINDEX}].last()}>75</expression>
<recovery_mode>0</recovery_mode>
<recovery_expression/>
<name>GPU {#GPUINDEX} Temperature is very high</name>
<correlation_mode>0</correlation_mode>
<correlation_tag/>
<url/>
<status>0</status>
<priority>4</priority>
<description>A GPU's temperature is getting very high!</description>
<type>0</type>
<manual_close>0</manual_close>
<dependencies/>
<tags/>
</trigger_prototype>
</trigger_prototypes>
<graph_prototypes>
<graph_prototype>
<name>GPU {#GPUINDEX} Temperature</name>
<width>900</width>
<height>200</height>
<yaxismin>0.0000</yaxismin>
<yaxismax>100.0000</yaxismax>
<show_work_period>1</show_work_period>
<show_triggers>1</show_triggers>
<type>0</type>
<show_legend>1</show_legend>
<show_3d>0</show_3d>
<percent_left>0.0000</percent_left>
<percent_right>0.0000</percent_right>
<ymin_type_1>0</ymin_type_1>
<ymax_type_1>0</ymax_type_1>
<ymin_item_1>0</ymin_item_1>
<ymax_item_1>0</ymax_item_1>
<graph_items>
<graph_item>
<sortorder>0</sortorder>
<drawtype>0</drawtype>
<color>2774A4</color>
<yaxisside>0</yaxisside>
<calc_fnc>2</calc_fnc>
<type>0</type>
<item>
<host>Template AMD GPUs Performance</host>
<key>gpu.temp[{#GPUINDEX}]</key>
</item>
</graph_item>
</graph_items>
</graph_prototype>
<graph_prototype>
<name>GPU {#GPUINDEX} Utilization</name>
<width>900</width>
<height>200</height>
<yaxismin>0.0000</yaxismin>
<yaxismax>100.0000</yaxismax>
<show_work_period>1</show_work_period>
<show_triggers>1</show_triggers>
<type>0</type>
<show_legend>1</show_legend>
<show_3d>0</show_3d>
<percent_left>0.0000</percent_left>
<percent_right>0.0000</percent_right>
<ymin_type_1>0</ymin_type_1>
<ymax_type_1>0</ymax_type_1>
<ymin_item_1>0</ymin_item_1>
<ymax_item_1>0</ymax_item_1>
<graph_items>
<graph_item>
<sortorder>0</sortorder>
<drawtype>0</drawtype>
<color>199C0D</color>
<yaxisside>0</yaxisside>
<calc_fnc>2</calc_fnc>
<type>0</type>
<item>
<host>Template AMD GPUs Performance</host>
<key>gpu.utilization[{#GPUINDEX}]</key>
</item>
</graph_item>
</graph_items>
</graph_prototype>
</graph_prototypes>
<host_prototypes/>
<jmx_endpoint/>
<timeout>3s</timeout>
<url/>
<query_fields/>
<posts/>
<status_codes>200</status_codes>
<follow_redirects>1</follow_redirects>
<post_type>0</post_type>
<http_proxy/>
<headers/>
<retrieve_mode>0</retrieve_mode>
<request_method>0</request_method>
<allow_traps>0</allow_traps>
<ssl_cert_file/>
<ssl_key_file/>
<ssl_key_password/>
<verify_peer>0</verify_peer>
<verify_host>0</verify_host>
</discovery_rule>
</discovery_rules>
<httptests/>
<macros/>
<templates/>
<screens/>
</template>
</templates>
</zabbix_export>
Après quelques instants vous devriez voir la data remonter dans le frontend Zabbix.
L’ensemble des sources sont disponibles sur github : https://github.com/VeilleurTrytoFix/