Top 10 hosts dashlet
Allow defining a dashlet that filters on some host information and shows only the top 10 hits from the results. This lets me as an admin focus on specific servers (example: top 10 by CPU, disk usage, memory consumption).
Under consideration Suggested by: Thomas Lippert • Upvoted: 19 Feb • Comments: 9
Enhance it with AI! (we are in 2022)
Having the Top 10 by CPU or Memory is fine (better than nothing), but it does not always show you the really critical ones.
For some hosts it's normal to have high load at a certain time of the day/week (due to some regular job handling). With the simple approach, this will lead to many false alarms in the dashboards/dashlets.
This can easily be avoided by AI which is aware of it and only shows you the anomalies which are actually related to a problem.
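To make the idea above concrete, here is a minimal sketch that ranks hosts by how far they deviate from their own usual value for that hour of the week, instead of by raw value. It is a plain statistical baseline, not AI and not anything Checkmk provides; the function names, the `(host, hour_of_week, value)` sample format and the `min_std` floor are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_baseline(samples):
    """samples: iterable of (host, hour_of_week, value) from past weeks (hypothetical format).

    Returns {(host, hour_of_week): (mean, standard deviation)}.
    """
    buckets = defaultdict(list)
    for host, hour, value in samples:
        buckets[(host, hour)].append(value)
    return {key: (mean(vals), pstdev(vals)) for key, vals in buckets.items()}


def top_anomalies(current, baseline, n=10, min_std=1.0):
    """Rank current readings by deviation from each host's own baseline.

    min_std is an assumed floor so that very steady hosts do not produce
    huge scores from tiny changes.
    """
    scored = []
    for host, hour, value in current:
        if (host, hour) not in baseline:
            continue  # no history for this slot, so nothing to compare against
        mu, sigma = baseline[(host, hour)]
        scored.append(((value - mu) / max(sigma, min_std), host))
    scored.sort(reverse=True)
    return [(host, round(score, 2)) for score, host in scored[:n]]


# A batch host that is always busy at hour 2, and a web host that normally is not.
history = [("batch01", 2, v) for v in (90, 92, 94)] + [("web01", 2, v) for v in (20, 22, 18)]
ranking = top_anomalies([("batch01", 2, 93), ("web01", 2, 60)], build_baseline(history))
print(ranking)
```

With this ranking the web host comes first even though its raw value is lower, because the batch host is within its normal range for that time slot.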
Christian Friedrich Merged
Sometimes we have problems with the load of a slave site. We always find it difficult to locate the problem in this situation. It would be helpful to have more views within checkmk that show you e.g. which hosts or which services use an especially large number of checkers/fetchers. Or generally the top 10 hosts that are responsible for the most load on the backends. This would definitely be helpful in narrowing down the causes of performance problems.
Marcel Arentz Admin
"Troubleshooting; Top Hosts / Services (Backend Performance)" (suggested by Christian Friedrich on 2022-07-13), including upvotes (2) and comments (0), was merged into this suggestion.
Thomas Lippert Admin
The challenge with this feature is that in distributed monitoring, the master needs to get all top 10s from the different sites to create a consolidated Top 10 list. Currently, the core does not allow for this, making the implementation very expensive.
Question: If this feature only works for the actual site, would it still provide value?
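For illustration, the consolidation step described in this comment can be sketched as follows. This only shows the merge logic and says nothing about what the core supports today. It assumes every host belongs to exactly one site, in which case the global top 10 is always contained in the union of the per-site top 10s. The function names and the `{host: value}` input format are hypothetical.

```python
import heapq


def site_top_n(metrics, n=10):
    """Top n of one site. metrics: {host_name: value} (hypothetical format)."""
    return heapq.nlargest(n, metrics.items(), key=lambda item: item[1])


def consolidated_top_n(per_site_tops, n=10):
    """Merge per-site top lists into one list of (site, host, value).

    per_site_tops: {site_id: [(host, value), ...]}, one short list per site.
    """
    candidates = (
        (site, host, value)
        for site, top in per_site_tops.items()
        for host, value in top
    )
    return heapq.nlargest(n, candidates, key=lambda row: row[2])


per_site = {
    "site_a": site_top_n({"h1": 90, "h2": 10, "h3": 50}, n=2),
    "site_b": site_top_n({"h4": 95, "h5": 60}, n=2),
}
merged = consolidated_top_n(per_site, n=2)
print(merged)
```

In this sketch the master only ever handles at most n rows per site, so the merge itself is cheap; the expensive part is getting each site to compute and deliver its own list.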
Chiming in here: For us, the top 10 (or "top any", really) should show information from all sites.
Thomas Lippert Admin
Thanks Thierry for your opinion on this topic. I can fully understand the demand, but it makes this feature rather expensive :-(
We requested, discussed and planned it in 2019; it was supposed to already be part of the Capacity Mgmt "light" implementation.
Then came a "last minute" change of plans: "can't be implemented, not possible to do in the short time left before the release". Now, about X releases later, we are still desperately waiting for it, and we are not the only ones, as it seems.
So can we please schedule this for 2.3?
It would certainly bring the Capacity Management to a proper level, where you can finally work on it in a view/dashboard way with all functionality and export it in reports.
We believe that with a proper integration you could not only make your customers happy, but also regain a bigger lead over your direct competitors.
@Thomas: I can imagine certain bottlenecks, too. We currently run ~80 sites. Constantly fetching information from every site would have a heavy impact on performance, I guess. Our operations engineers often ask for this functionality, though.
If this feature were to be considered, maybe some kind of compromise could be implemented? Like "Top 10 hosts/services (last 15 minutes)"? This way, information from all sites would only have to be fetched once every 15 minutes. IMO the interval could be configurable by the user. If the interval is set below a certain lower limit, you could show a warning message regarding probable performance implications.
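A rough sketch of that compromise, assuming a simple time-based cache in front of whatever does the expensive multi-site fetch. `CachedTopList`, the `fetch` callable and the 60-second lower limit are made up for this example and are not part of Checkmk.

```python
import time
import warnings


class CachedTopList:
    """Refetch the top list at most once per interval (hypothetical helper)."""

    def __init__(self, fetch, interval_s=900, min_interval_s=60, clock=time.monotonic):
        if interval_s < min_interval_s:
            # Assumed lower limit: warn instead of refusing the setting.
            warnings.warn(
                f"Refresh interval of {interval_s}s is below {min_interval_s}s "
                "and may affect performance on all sites."
            )
        self._fetch = fetch            # callable doing the expensive multi-site query
        self._interval_s = interval_s
        self._clock = clock
        self._fetched_at = None        # None means nothing has been fetched yet
        self._value = None

    def get(self):
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._interval_s:
            self._value = self._fetch()
            self._fetched_at = now
        return self._value


# Usage with a stand-in fetch function and a controllable clock.
fetch_calls = []
fake_now = [0.0]


def fake_fetch():
    fetch_calls.append(fake_now[0])
    return ["host-a", "host-b"]


cache = CachedTopList(fake_fetch, interval_s=900, clock=lambda: fake_now[0])
cache.get()
cache.get()          # served from the cache, no second fetch
fake_now[0] = 901.0
cache.get()          # interval elapsed, fetched again
```

Every dashboard reload within the interval is then answered from memory, so the number of multi-site fetches depends on the interval, not on how many users have the dashlet open.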
Another way could be to asynchronously collect information on every site directly (again, in fixed intervals or dynamically per host & service as they get checked). Then the main site would only have to query a pre-computed table.
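The second variant could look roughly like this on each site: the table is updated as check results arrive, and the main site only reads the finished snapshot. `SiteTopTable` and its methods are hypothetical names for this sketch, and how a site would actually hook into check results is left open.

```python
import heapq
import threading


class SiteTopTable:
    """Keeps a pre-computed top n per site, updated as results come in (sketch)."""

    def __init__(self, n=10):
        self._n = n
        self._latest = {}              # host -> most recent value
        self._table = []               # pre-computed top n, highest first
        self._lock = threading.Lock()  # results may arrive from several threads

    def record(self, host, value):
        """Call whenever a host gets a new value for the metric."""
        with self._lock:
            self._latest[host] = value
            # Recomputing on every result is the simplest option; recomputing
            # on a fixed interval instead would reduce the work per result.
            self._table = heapq.nlargest(
                self._n, self._latest.items(), key=lambda item: item[1]
            )

    def snapshot(self):
        """What the main site would query: a copy of the finished table."""
        with self._lock:
            return list(self._table)


table = SiteTopTable(n=2)
table.record("h1", 50)
table.record("h2", 80)
table.record("h3", 10)
table.record("h1", 90)   # a newer value for h1 replaces the old one
print(table.snapshot())
```

The main site's query then costs the same no matter how many hosts the remote site monitors, since it only reads n rows.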
This should be mitigated by using some caching functionality. Top 10 problems are not that time-dependent.