Let’s assume you’re not following the Ceph hype, for whatever reasons or experiences, and you’ve found a working solution for clustered block storage in Sheepdog. You’re probably using Proxmox, where Sheepdog is very well integrated (it uses the existing Corosync cluster). Well done ;)
Proxmox sometimes ships Corosync updates, which means the Corosync cluster service gets restarted. Since Sheepdog relies on Corosync, it might get confused and dazed.
Here’s how to keep a consistent PVE + Sheepdog cluster at all times while upgrading/rebooting/…
- on another node, keep an eye on the sheepdog log:
tail -f /var/lib/sheepdog/sheep.log
- also on another node, keep an eye on the automatic cluster recovery process:
watch "dog node recovery"
- on the node you want to maintain, stop the Sheepdog daemon:
systemctl stop sheepdog.service
(Now you should see something happen in the second terminal)
- wait until the cluster recovery has finished
- perform your maintenance tasks
- start your Sheepdog again:
systemctl start sheepdog.service
(Again, you should see something happen in the second terminal)
- wait until the cluster recovery has finished
Done! Be happy! Now move on to integrate this into the orchestration tool of your choice (e.g. Ansible) to make sure you never break stuff by accident.
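If you want something to start from before going full Ansible, the steps above can be sketched as a small shell helper. This is only a sketch under a few assumptions: the names `PEER`, `POLL_SECONDS`, `recovery_running`, `wait_for_recovery` and `maintain_node` are made up for this example, `pve2` is a placeholder hostname for another cluster node reachable via SSH, and the recovery check assumes that `dog node recovery` prints two header lines followed by one line per node that is still recovering. Check that against the output of your Sheepdog version before trusting it.

```shell
#!/bin/sh
# Hypothetical helper, not part of Sheepdog or Proxmox.

# Another cluster node to ask about recovery (placeholder hostname).
# We ask a peer because the local sheep daemon is stopped during maintenance.
PEER="${PEER:-pve2}"
# Seconds to wait between recovery checks.
POLL_SECONDS="${POLL_SECONDS:-5}"

# Succeeds (returns 0) while recovery still seems to be running.
# Assumption: two header lines, then one line per recovering node.
recovery_running() {
    # If the peer cannot be queried, stay on the safe side and keep waiting.
    out=$(ssh "$PEER" dog node recovery) || return 0
    lines=$(printf '%s\n' "$out" | wc -l)
    [ "$lines" -gt 2 ]
}

wait_for_recovery() {
    # Give the cluster a moment to notice the membership change first.
    sleep "$POLL_SECONDS"
    while recovery_running; do
        sleep "$POLL_SECONDS"
    done
}

# Usage: maintain_node <command> [args...]
maintain_node() {
    systemctl stop sheepdog.service || return 1
    wait_for_recovery
    # Run the maintenance task and remember whether it worked.
    "$@"
    status=$?
    # Bring Sheepdog back even if the maintenance task failed.
    systemctl start sheepdog.service || return 1
    wait_for_recovery
    return "$status"
}
```

Source the file on the node you want to maintain and call, for example, `maintain_node apt-get dist-upgrade`. Note that the loop waits forever if the peer is unreachable; that is deliberate here, since blocking is safer than carrying on while the cluster state is unknown.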