Zabbix Server Health
Description
Now that we have a few hosts configured in different template configurations and on different networks, we can experiment with Zabbix server health.
Values processed per second
Values processed per second (VPS) indicates how busy your Zabbix server is. The number may be high or low; treat it as a baseline that helps you anticipate when other issues may start to occur. If the number is higher than usual but none of the other graphs indicate a problem, you can consider that OK. You can manage this value by enabling or disabling items, triggers and discovery rules for your hosts.
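As a rough illustration of how enabling or disabling items moves this number, the sketch below estimates VPS as the sum of item count divided by update interval. The function name and the item counts are hypothetical, and the real figure on the graph will also depend on things the sketch ignores, such as unsupported items and trapper data.

```python
def estimated_vps(item_groups):
    """Estimate new values per second.

    item_groups is a list of (item_count, update_interval_seconds) pairs.
    """
    return sum(count / interval for count, interval in item_groups)


# Hypothetical setup: 600 items checked every 60 s and 100 items every 10 s.
groups = [(600, 60), (100, 10)]
print(estimated_vps(groups))  # 20.0
```

Disabling the 100 fast items in this example would halve the estimate, which is why short update intervals matter more than the raw item count.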
Utilization of data collectors
Depending on the types of items you have set up for your hosts, different pollers (data collectors) will be used to perform the task of requesting or receiving the item data.
Passive checks are managed by the poller data collector, ping checks by the ICMP data collector, web scenarios by the http data collector, and the trapper data collector handles incoming checks from active hosts. There are many other collectors handling different protocols.
When you make changes to a host, you can review this graph to see what impact it had.
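If one collector stays busy, the usual remedy is to start more processes of that type in zabbix_server.conf. The sketch below maps four collectors to the parameter that controls how many of them run. The busy percentages are made-up readings from the graph, and the 75% threshold is an assumption for illustration, not a value taken from this course.

```python
# Data collector -> zabbix_server.conf parameter that sets how many processes run.
COLLECTOR_PARAMETERS = {
    "poller": "StartPollers",
    "icmp pinger": "StartPingers",
    "http poller": "StartHTTPPollers",
    "trapper": "StartTrappers",
}


def busy_collectors(busy_percent, threshold=75.0):
    """Return (collector, parameter, busy %) for collectors above the threshold."""
    return [
        (name, COLLECTOR_PARAMETERS.get(name, "unknown"), busy)
        for name, busy in busy_percent.items()
        if busy > threshold
    ]


# Hypothetical readings from the "Utilization of data collectors" graph.
readings = {"poller": 91.0, "icmp pinger": 12.5, "http poller": 3.0, "trapper": 40.0}
for name, parameter, busy in busy_collectors(readings):
    print(f"{name} is {busy}% busy: consider raising {parameter}")
```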
Utilization of internal processes
Zabbix runs many internally scheduled tasks, such as housekeeping the SQL database, managing LLD, alerting, preprocessing, writing logs and more. Monitor this graph as well to understand the impact of any changes you make.
Cache usage
The value cache is used to speed up the calculation of trigger expressions, calculated items, dependent items and other things within Zabbix where it is faster to pull historical data straight from memory than to re-query the database tables every time a value is needed.
The graph summarizes several caches used within Zabbix.
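If any of the cache usages goes above 80%, consider increasing the matching cache size setting on the Zabbix server, starting with CacheSize.
The CacheSize setting is in the zabbix_server.conf file and sizes the configuration cache. The other caches have their own settings in the same file: HistoryCacheSize, HistoryIndexCacheSize, TrendCacheSize and ValueCacheSize. The default for CacheSize is 8M, although this can differ between Zabbix versions, and you can set it anywhere from 128K to 64G. You will need to increase it as you manage more hosts, especially if they have many triggers, calculated items, dependent items and other host-related statistics and properties stored in the cache.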
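The 80% rule can be sketched as a small check that pairs each cache with the zabbix_server.conf parameter that sizes it. The usage figures below are hypothetical readings from the graph, and the cache labels are chosen for this example, so they may not match the legend in your Zabbix version word for word.

```python
# Cache -> zabbix_server.conf parameter that sizes it.
CACHE_PARAMETERS = {
    "configuration cache": "CacheSize",
    "history write cache": "HistoryCacheSize",
    "history index cache": "HistoryIndexCacheSize",
    "trend write cache": "TrendCacheSize",
    "value cache": "ValueCacheSize",
}


def caches_to_review(usage_percent, threshold=80.0):
    """Return (cache, parameter, used %) for caches above the threshold."""
    return [
        (name, CACHE_PARAMETERS.get(name, "unknown"), used)
        for name, used in usage_percent.items()
        if used > threshold
    ]


# Hypothetical readings from the "Cache usage" graph.
readings = {"configuration cache": 86.5, "history write cache": 4.0, "value cache": 42.0}
for name, parameter, used in caches_to_review(readings):
    print(f"{name} is {used}% used: consider raising {parameter}")
```

After changing any of these parameters, the Zabbix server has to be restarted for the new size to take effect.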
Value cache effectiveness
The two important values shown in this graph are hits and misses. A hit is when a value was retrieved from memory. A miss happens when the data is not currently in memory and has to be retrieved from the database first. Aim to have as few misses as possible by increasing the size of the value cache if necessary (the ValueCacheSize setting in zabbix_server.conf), or by reducing the number of items and triggers you are processing for a host.
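One way to turn the two lines on the graph into a single number is the hit ratio, hits divided by hits plus misses. This is a minimal sketch with made-up counts; the function name is hypothetical and Zabbix does not report this ratio itself.

```python
def hit_ratio(hits, misses):
    """Return the fraction of value cache lookups served from memory."""
    total = hits + misses
    if total == 0:
        return None  # no lookups yet, so the ratio is undefined
    return hits / total


# Hypothetical counts: 950 hits and 50 misses over the same period.
print(hit_ratio(950, 50))  # 0.95
```

A ratio that keeps falling after you add hosts or triggers suggests the value cache is too small for the new workload.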
Queue size
Checks are placed into a queue and the request/response is handled as soon as possible. Some requests to hosts don't resolve quickly, for many possible reasons: the host may be switched off, may be in the process of restarting, or may be experiencing resource issues such as high CPU, low memory or low network bandwidth. When that happens, a backlog of unanswered requests builds up in the queue.
In the course we can see that one of the hosts has many unresolved requests in the queue. This can be caused by changing templates often or by other adjustments you make to a host's configuration. In this example, my issue is caused by many checks going unanswered because my hosts are switched off at times.
To see a list of items in the queue, and which host they relate to, visit the page Administration ⇾ Queue ⇾ Queue details.
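If you copy the rows from that page into a script, grouping them by host makes the worst offender obvious. The sketch below assumes the rows are available as (host, item) pairs. The host names and item keys are invented for the example, and the script does not read anything from Zabbix itself.

```python
from collections import Counter


def queue_by_host(delayed_items):
    """Count delayed items per host, busiest host first."""
    return Counter(host for host, _item in delayed_items).most_common()


# Hypothetical rows copied from the Queue details page.
delayed = [
    ("web01", "agent.ping"),
    ("web01", "system.cpu.load"),
    ("db01", "agent.ping"),
]
for host, count in queue_by_host(delayed):
    print(f"{host}: {count} delayed items")
```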
Summary
When adding hosts or making other changes to Zabbix, recheck the Zabbix Health dashboard regularly to get a good feel for what your change has done. Also note that the supplied templates have many items, triggers, discovery rules and more enabled by default that you may not actually need. When Zabbix health starts to indicate problems, disable everything that isn't critical for your use case to save resources.