Zabbix Server Health
Now that we have a few hosts configured with different templates and on different networks, we can experiment with Zabbix server health.
Values processed per second
Values processed per second (VPS) indicates how busy your Zabbix server is. There is no single correct value; use it as a baseline that helps you anticipate when other issues may start to occur. If the number is higher than usual but none of the other graphs indicate a problem, you can consider that acceptable. You can manage this value by enabling or disabling items, triggers and discovery rules for your hosts.
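As a rough illustration of why enabling or disabling items moves this number, the sketch below estimates VPS from item counts and update intervals. The item counts and intervals are made-up figures, and the function name is my own; a real server also processes values that this simple sum ignores.

```python
# Hypothetical inventory: (number of enabled items, update interval in seconds).
# These figures are invented for illustration, not taken from a real server.
def estimate_vps(item_groups):
    """Return the expected number of new values per second."""
    return sum(count / interval for count, interval in item_groups)


item_groups = [
    (300, 60),    # 300 items collected every minute
    (120, 30),    # 120 items collected every 30 seconds
    (50, 3600),   # 50 items collected hourly
]

print(f"Estimated VPS: {estimate_vps(item_groups):.2f}")
```

Disabling a group of items, or lengthening its update interval, removes or shrinks its term in the sum, which is why those changes show up in the graph.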
Utilization of data collectors
Depending on the types of items you have set up for your hosts, different pollers (data collectors) will be used to perform the task of requesting or receiving the item data.
Passive checks are managed by the poller data collector, ping checks by the ICMP data collector, and web scenarios by the http data collector. The trapper data collector handles incoming checks from active hosts, and there are many other collectors handling different protocols.
When you make changes to a host, you can review this graph to see what impact it had.
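To make the relationship concrete, here is a small lookup sketch built only from the pairs named above. The dictionary keys are informal labels I chose for this example, not Zabbix identifiers, and anything not listed is simply reported as "other".

```python
# Check type -> data collector, using only the pairs named in the text.
# The keys are informal labels for this sketch, not Zabbix identifiers.
COLLECTOR_FOR_CHECK = {
    "passive check": "poller",
    "ping check": "ICMP",
    "web scenario": "http",
    "active check": "trapper",
}


def collectors_in_use(check_types):
    """Return the sorted set of collectors a host's checks will exercise."""
    return sorted({COLLECTOR_FOR_CHECK.get(c, "other") for c in check_types})


# A hypothetical host with mostly passive checks plus one ping check.
print(collectors_in_use(["passive check", "passive check", "ping check"]))
```

This gives you an idea of which lines in the graph to watch after you change a host's items or templates.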
Utilization of internal processes
Zabbix runs many internally scheduled tasks, such as housekeeping the SQL database, managing LLD, alerting, preprocessing, writing logs and more. Monitor this graph as well to understand the impact of any changes you make.
Cache usage
The value cache is used to speed up calculations of trigger expressions, calculated items, dependent items and other things within Zabbix where it is more efficient to pull historical data straight from memory rather than re-querying the database tables every time a value is needed.
The graph summarizes several caches used within Zabbix.
If any cache usage goes above 80%, consider increasing the Zabbix server's CacheSize setting in the zabbix_server.conf file. The default is 8M, and you can raise it up to 64G. You will need to adjust this as you manage more hosts, especially if they have many triggers, calculated items, dependent items and other host-related statistics and properties stored in the cache.
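The 80% guideline is easy to express as a check. The sketch below is only an illustration: the usage readings are invented, and in practice you would read them from the graph rather than type them in.

```python
USAGE_THRESHOLD = 80.0  # percent used, the guideline mentioned above


def caches_needing_attention(usage_by_cache, threshold=USAGE_THRESHOLD):
    """Return the caches whose usage is strictly above the threshold."""
    return {
        name: used
        for name, used in usage_by_cache.items()
        if used > threshold
    }


# Invented sample readings, in percent used.
sample = {
    "configuration cache": 86.5,
    "history cache": 12.0,
    "value cache": 80.0,
}

for name, used in caches_needing_attention(sample).items():
    print(f"{name} is at {used}% used, consider increasing its size")
```

Note that a reading of exactly 80% is not flagged here, since the guideline is about going above that level.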
Value cache effectiveness
The two important values shown in this graph are hits and misses. A hit is when a value is retrieved from memory. A miss happens when the data is not currently in memory and needs to be retrieved from the database first. Aim to have as few misses as possible by increasing the value cache size (the ValueCacheSize setting) if necessary, or by reducing the number of items and triggers you are processing for a host.
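If the idea of hits and misses is new to you, the toy cache below may help. It is not how Zabbix implements its value cache; it is a minimal read-through cache of my own, with a stand-in function in place of a real database query, that counts how often a value came from memory.

```python
class ValueCacheSketch:
    """Toy read-through cache that counts hits and misses."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db  # hypothetical callable: key -> value
        self._memory = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._memory:
            self.hits += 1      # served straight from memory
        else:
            self.misses += 1    # had to go to the "database" first
            self._memory[key] = self._load_from_db(key)
        return self._memory[key]

    def hit_ratio(self):
        """Fraction of lookups served from memory, or None if no lookups yet."""
        total = self.hits + self.misses
        return self.hits / total if total else None


def fake_db_lookup(key):
    # Stand-in for a slow database query.
    return f"value-for-{key}"


cache = ValueCacheSketch(fake_db_lookup)
for key in ["cpu", "cpu", "mem", "cpu"]:
    cache.get(key)

print(cache.hits, cache.misses, cache.hit_ratio())
```

The first request for each key is always a miss. A cache that is too small to keep frequently used values in memory would turn more of the repeat requests into misses as well, which is what the graph helps you spot.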
Queue size
Passive checks are placed into a queue, and each request/response is handled as soon as possible. Some requests don't resolve quickly, for many reasons: the query may be complicated to answer, or the host may be experiencing resource issues such as high CPU, low memory or low network bandwidth, or it may be switched off or in the process of restarting. When that happens, a backlog of unanswered requests builds up, waiting to be resolved.
Ideally you want the values in this graph to be as low as possible.
In the course we can see that one of the hosts has many unresolved requests in the queue. This can be caused by changing templates often or by other configuration adjustments that you make to a host. In this example, my issue is caused by many passive checks not being handled quickly by my Windows host behind the proxy. I can fix this queue issue by reconfiguring the host to use the active version of the Windows template. The fix won't always be the same for you. I could also have reduced the number of items being retrieved in order to make the template less demanding on my host.
To see a list of items in the queue, and which host they relate to, visit the page Administration ⇾ Queue ⇾ Queue details.
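Once you have that list, the useful question is usually which host owns most of the backlog. The sketch below groups queued items by host. The rows, host names and item names are hand-written examples, and the code does not talk to Zabbix at all.

```python
from collections import Counter


def queued_items_per_host(queue_rows, min_delay_seconds=0):
    """Count queued items per host, busiest host first.

    queue_rows is an iterable of (host, item, delay_in_seconds) tuples.
    """
    counts = Counter(
        host
        for host, _item, delay in queue_rows
        if delay >= min_delay_seconds
    )
    return counts.most_common()


# Hand-written example rows, not real output.
rows = [
    ("windows-host", "cpu.load", 120),
    ("windows-host", "mem.free", 95),
    ("linux-host", "uptime", 7),
]

for host, count in queued_items_per_host(rows, min_delay_seconds=30):
    print(f"{host}: {count} items delayed 30 seconds or more")
```

A host that dominates this count is a good candidate for the kinds of fixes described above, such as switching to active checks or reducing the number of items.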
When adding hosts or making other changes to Zabbix, recheck the Zabbix Health dashboard regularly to get a good feel for what your change has done. Also note that the supplied templates have many items, triggers, discovery rules and more enabled by default that you may not actually need. When Zabbix health starts to indicate problems, disable everything that isn't critical for your use case to save resources.