This is really a difficult problem.
Originally, I built a 5-node cluster using the nodes' public IP addresses. After that, I converted the cluster to private IP addresses by connecting all the nodes over WireGuard. No issues so far. One node was in the US but its hardware performance was poor, so I removed it and arranged a new node in CA (OVH Bare Metal).
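For context, every node runs a WireGuard interface for the private cluster network. A minimal sketch of one node's config; all keys, addresses, and endpoints below are placeholders, not my real values:

```
# /etc/wireguard/wg0.conf (sketch only; keys and IPs are placeholders)
[Interface]
Address = 10.10.10.1/24          # this node's private cluster address
ListenPort = 51820
PrivateKey = <this-node-private-key>

[Peer]                           # one [Peer] block per other node
PublicKey = <peer-public-key>
Endpoint = peer1.example.net:51820
AllowedIPs = 10.10.10.2/32
PersistentKeepalive = 25
```

Each node brings the tunnel up with `wg-quick up wg0`.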
I did the same on the new node: installed Proxmox, WireGuard, and all the same hardening. Then I joined the new node to my cluster over the WireGuard connection. It crashed my cluster. I could not connect to any of the nodes.
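The join itself was done with pvecm over the tunnel, roughly like this, where 10.10.10.1 is an existing member's WireGuard address and 10.10.10.6 is the new node's (both placeholders):

```
# run on the new node
pvecm add 10.10.10.1 --link0 10.10.10.6
```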
I did various testing. I was wondering whether this was:
- a PVE firewall issue?
- an OVH firewall issue?
- a PVE version mismatch? The existing nodes are all v7 and the new node is v8.
- a WireGuard / UDP issue?
Basic checks for each are sketched below.
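These are the kinds of checks I mean. The commands are standard tools, but the addresses and interface names are placeholders for my setup:

```
pve-firewall status              # is the PVE firewall active on this node?
ss -ulpn | grep 540              # corosync/knet listens on UDP 5405 by default
corosync -v                      # compare corosync versions between v7 and v8 nodes
ping -M do -s 1400 10.10.10.1    # probe whether large packets survive the tunnel
tcpdump -ni wg0 udp port 5405    # confirm corosync traffic actually arrives
```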
After a couple of tests, I connected the new node to one existing node using public IPs, forming a temporary cluster. Then I switched them to private IPs. It had retransmit issues, but they resolved on their own about 15 minutes later.
So I removed the temp cluster and, the next day, did the same to connect all my nodes over public IPs. The initial result was the same as the temp cluster the night before; however, it stopped working after I switched to private IPs. The retransmit issue came back, even after I waited around two to three hours for it to resolve on its own.
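For reference, switching the cluster to private addresses means editing the ring addresses in /etc/pve/corosync.conf and raising config_version so the change propagates. A sketch for one node, with placeholder names and addresses:

```
# /etc/pve/corosync.conf (excerpt; edit a copy, raise config_version, then replace)
nodelist {
  node {
    name: pve1
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.10.10.1   # was the public IP before the switch
  }
  # ...one node { } block per cluster member
}
```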
Then I brought up the nested PVE nodes that I had created earlier for testing, configured WireGuard to connect to them, and ran the same test. Failed.
I then created a new guest OS with nested PVE on the new node and configured WireGuard on it. I wanted to avoid crashing my PVE system, so I planned to test everything in the nested PVE environment.
This is strange: everything looks fine on the nested PVE nodes. The settings on the nested PVE are exactly the same as on my production nodes. The only difference is that production has five nodes and the nested cluster has four.
Then I disconnected one node in production and tried joining the new node again. Things are better now, but there are still retransmit issues.
I focused on the “[TOTEM ] Retransmit List” messages and made some changes to corosync.conf. I ended up adding two lines under the totem section:
window_size: 300
netmtu: 1317
I set window_size to 300, as recommended on some forums. I set the network MTU to 1317 because, while troubleshooting corosync with systemctl status corosync, I saw it had changed the MTU to 1317 on one of my existing nodes (presumably knet's path-MTU discovery accounting for the WireGuard tunnel overhead).
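For placement, this is roughly how the totem section looks with those two lines added; the cluster name and config_version here are placeholders, and config_version must be incremented from its previous value so the nodes accept the change:

```
totem {
  cluster_name: mycluster
  config_version: 12         # must be higher than the previous value
  version: 2
  ip_version: ipv4-6
  window_size: 300           # allow more unacknowledged messages in flight
  netmtu: 1317               # match the MTU knet negotiated over WireGuard
}
```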
It now looks fine with my 4 nodes using these settings. I think I need more time to monitor it.
References:
- http://blog.itpub.net/133735/viewspace-767620/
- https://linux.die.net/man/5/corosync.conf