Hey mates!
We have a two-node active/standby cluster running on a pair of 5100 appliances. One security gateway node is on R81.20 Take 53 and the other node is on R80.40 Take 173.
MVC (Multi-Version Cluster) is enabled on the R81.20 node. We enabled it so we could test the new R81.20 OS in our environment before upgrading the other node. We had planned to stay in MVC mode only briefly and upgrade the remaining node the following week.
The initial upgrade took place at the beginning of April. Due to an emergency medical issue, I needed to take a month off and am just now getting back to work.
Shortly after the upgrade, we noticed the R81.20 node was rebooting on its own, after which the cluster would fail over to the R80.40 node. After this happened a few times, we decided to leave the R80.40 node as the active node, and I would look into the issue when I came back to the office.
After opening a ticket with TAC and having them examine my kernel crash files in /var/log/crash for the R81.20 node, they made this determination:
******************************************************************************
Primarily, the issue is happening because of the members being on different versions. The stack trace generated matched a previous task internally, TM-63720, which identifies the crash as a result of a change in table format.
The kernel table in question is: SEP_my_IKE_packet_gtid.
The table values were changed from {local ip, peer ip} to {local id, peer ip} between versions R80.40 and R81.20.
Issue occurs when trying to sync the table between R80.40 and R81.20: we get owner member as IP and not as ID.
Recommendation:
Please make sure that both members are running with R81.20 and that will resolve the issue.
********************************************************************
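To check that I'm reading TAC's explanation correctly, here is a rough Python sketch of the mismatch as I understand it. This is not Check Point code. The member IDs, function names, and addresses are all made up for illustration, and the real kernel table layout is surely more involved.

```python
import ipaddress

# Hypothetical member IDs for a two-node cluster (assumption for illustration).
VALID_MEMBER_IDS = {0, 1}


def old_format_entry(local_ip: str, peer_ip: str) -> tuple:
    """Entry shaped like the layout TAC describes for R80.40: {local ip, peer ip}."""
    return (int(ipaddress.IPv4Address(local_ip)),
            int(ipaddress.IPv4Address(peer_ip)))


def new_format_entry(local_id: int, peer_ip: str) -> tuple:
    """Entry shaped like the layout TAC describes for R81.20: {local id, peer ip}."""
    return (local_id, int(ipaddress.IPv4Address(peer_ip)))


def owner_is_valid(entry: tuple) -> bool:
    """Check a receiver expecting {local id, peer ip} might apply to the first field."""
    owner, _peer = entry
    return owner in VALID_MEMBER_IDS


# Documentation-range addresses, not our real ones.
synced_from_old = old_format_entry("192.0.2.1", "198.51.100.7")
native_new = new_format_entry(1, "198.51.100.7")

print(owner_is_valid(native_new))       # True: first field is a member ID
print(owner_is_valid(synced_from_old))  # False: first field is an IP, not an ID
```

If that picture is right, the R81.20 member ends up reading an IP address where it expects a member ID whenever the entry comes from the R80.40 side, which would fit the "owner member as IP and not as ID" wording.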
So I get this and agree.
However, the whole point of MVC mode in this case was to be able to test the new OS, which I really can't do. The reboots are too frequent (sometimes daily). The R81.20 node reboots even while in standby mode.
What are our options if we want to test before upgrading the other node?
Can we disable MVC? Will this break the cluster and cause weirder things to happen?
Can we manually edit the table referenced above on the R80.40 node, thereby getting rid of the sync issue?
I suppose we can break the cluster and simply run on the new node, but this makes me nervous. I really would rather not do this.
Thanks, everyone! Interested in hearing your thoughts.
Joe