What do you expect from a modern cluster? If a replica fails, it should be rescheduled. If a node goes down, all the services that were running on it should be redistributed among the healthy nodes. Schedulers (Swarm, Kubernetes, Mesos/Marathon) already provide self-healing, making sure that the system is (almost) always in the desired state.
The problem with self-healing is that it does not account for constant change. A cluster and the services inside it need to adapt continuously. Services need to be scaled up and down. Nodes need to be created and added to the cluster, and it must be possible to remove them soon thereafter.
How about converting adaptation into self-adaptation? Can you remove humans from the process and make a system that is self-sufficient?
Watch this webinar to learn:
More about the following schedulers: Swarm, Kubernetes, Mesos/Marathon
Steps required to convert adaptation into self-adaptation
How to design a self-healing system that will continue to operate efficiently even when you are on vacation