Integrated & automatic agent failover from one site to another (backup) site

6 votes

Hello,
in a distributed setup, if a site fails, all agents attached to it currently just disappear from the master GUI. There is no built-in way to migrate those agents automatically to another site. You can write your own script to do this, but such scripts can run for hours if a site has, for example, 1000 hosts, because each agent has to be discovered again. I would like to see an integrated agent failover solution in which an agent is automatically moved to another site (ideally without the need for rediscovery). The backup site should be definable, possibly via a dropdown field in the host config. If the failed site comes back online at a later time, it should be possible to switch back manually or automatically (selectable).
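
To make the request concrete, here is a minimal Python sketch of the behavior described above. It is only an illustration and not Checkmk code: `HostConfig`, `backup_site`, `Failback` and `reassign` are hypothetical names, and the site availability is assumed to be supplied from elsewhere as a simple mapping.

```python
from dataclasses import dataclass
from enum import Enum


class Failback(Enum):
    """How a host returns to its primary site (hypothetical setting)."""
    MANUAL = "manual"
    AUTOMATIC = "automatic"


@dataclass
class HostConfig:
    name: str
    primary_site: str
    backup_site: str  # hypothetical per-host "backup site" dropdown value
    failback: Failback = Failback.MANUAL
    active_site: str = ""

    def __post_init__(self):
        # A host starts out on its primary site unless stated otherwise.
        if not self.active_site:
            self.active_site = self.primary_site


def reassign(hosts, site_up):
    """Move hosts between primary and backup site.

    site_up maps a site name to True (reachable) or False (down).
    Returns a list of (host name, old site, new site) for every move made.
    """
    moves = []
    for host in hosts:
        primary_ok = site_up.get(host.primary_site, False)
        backup_ok = site_up.get(host.backup_site, False)
        target = host.active_site

        if host.active_site == host.primary_site and not primary_ok and backup_ok:
            # Primary is down and the backup is available: fail over.
            target = host.backup_site
        elif (host.active_site == host.backup_site and primary_ok
              and host.failback is Failback.AUTOMATIC):
            # Primary is back and automatic failback was selected.
            target = host.primary_site

        if target != host.active_site:
            moves.append((host.name, host.active_site, target))
            host.active_site = target
    return moves
```

With `failback` left at `MANUAL`, a host stays on the backup site after the primary returns until someone moves it back by hand. The sketch deliberately ignores what happens to service discovery results and historical data.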

As an enterprise customer, I would expect such a feature in the enterprise version of Checkmk, as I know it from other monitoring tools (e.g. IBM Tivoli Monitoring).

Under consideration • Site management • Suggested by: Christian Friedrich (28 Apr, '23) • Upvoted: 08 May, '23 • Comments: 3

Comments: 3

01 May, '23
Thomas Lippert Admin
Christian, what should happen with the data (state, RRD, ACK/downtimes) that is stored on the primary Checkmk site before the failover? As the Checkmk site is down, it cannot be accessed. And what should happen once the original site reappears?
02 May, '23
Christian Friedrich
Hi Thomas,
I already mentioned what should happen when the original site is back online: "If the failed site comes back online at a later time, it should be possible to switch back manually or automatically (selectable)."
What should happen with the RRD data etc. is more or less exactly the question. But there are already other feature requests asking for a solution to the issue of agent data in a distributed setup. Maybe the two can be combined. And how do you handle this with your appliance? It seems to work there. Since hundreds of agents are often affected by a site outage, copying the agent data on the spot would not be a solution; it would probably take too long. But this could perhaps be discussed in an open session at the conference.
Best regards
15 May, '23
Thomas Lippert Admin
Hi Christian,

The behavior you describe is currently provided by the appliance. If the site fails, a failover to the mirrored site takes place. During normal operation, all data is already cloned to the failover site.
I see some risks in doing the failover on the agent side. For example, in case of a network outage, it appears to the agent that the site is gone, while the site is happily humming along. A failover in such a case can create chaos. How should a site recognize network issues, especially once you run the agent in push mode?
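
The risk can be illustrated with a small Python sketch. It is not how Checkmk or the appliance decides on a failover; it only shows one generic way to avoid acting on a single viewpoint, namely requiring several independent observers to agree that the site is unreachable. The function name `should_fail_over` and the observer names are hypothetical.

```python
def should_fail_over(observations, quorum=None):
    """Decide on a failover from several independent reachability checks.

    observations maps an observer name to True (site reachable from there)
    or False (site unreachable from there). A failover is only suggested if
    at least `quorum` observers report the site as unreachable. By default
    a strict majority of the observers is required.
    """
    if not observations:
        # Without any observation there is no basis for a failover.
        return False
    unreachable = sum(1 for reachable in observations.values() if not reachable)
    needed = quorum if quorum is not None else len(observations) // 2 + 1
    return unreachable >= needed


# Only the agent has lost contact: most likely a network problem near the agent.
print(should_fail_over({"agent": False, "central_site": True, "peer_site": True}))

# Agent and central site both cannot reach the site: a failover is plausible.
print(should_fail_over({"agent": False, "central_site": False, "peer_site": True}))
```

An agent that decides on its own corresponds to a single observer, which is exactly the case where a local network outage is indistinguishable from a site failure.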

FYI: The push agent stores all data packages for 10 minutes, so minor connection outages can be recovered from without data loss.
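
As a rough illustration of such a time-limited buffer, the following Python sketch keeps packages for a fixed retention window and drops anything older. It is not the push agent's actual implementation; `RetentionBuffer` and its methods are hypothetical, and only the 10-minute window is taken from the statement above.

```python
import time
from collections import deque

RETENTION_SECONDS = 10 * 60  # the 10-minute window mentioned above


class RetentionBuffer:
    """Keep data packages for a limited time while the connection is down."""

    def __init__(self, retention=RETENTION_SECONDS, clock=time.monotonic):
        self._retention = retention
        self._clock = clock  # injectable so the behavior can be tested
        self._packages = deque()  # (timestamp, payload), oldest first

    def _expire(self):
        # Drop every package that is older than the retention window.
        cutoff = self._clock() - self._retention
        while self._packages and self._packages[0][0] < cutoff:
            self._packages.popleft()

    def add(self, payload):
        """Store a package that could not be delivered yet."""
        self._expire()
        self._packages.append((self._clock(), payload))

    def drain(self):
        """Return all packages still within the window and empty the buffer."""
        self._expire()
        pending = [payload for _, payload in self._packages]
        self._packages.clear()
        return pending
```

Once the connection is back, `drain()` would hand over whatever is still inside the window; an outage longer than the retention time loses the oldest packages.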

Thomas