Hi,
We have an R81.10 cluster that we tried to upgrade to R81.20, but the upgrade failed and caused a partial outage. After some investigation, it appears to be ClusterXL/OSPF related.
Please see the sequence of events below:
* The secondary firewall was upgraded to R81.20 first, and we lost connection to it because my management connection also goes through OSPF. cphaprob on the active member showed ACTIVE/LOST.
* I changed the version to R81.20 in SmartConsole and pushed the policy to the newly upgraded secondary without any issues.
* TAC suggested upgrading the primary as well so that the cluster could be established and we could troubleshoot further.
* At this point the outage started, with none of the OSPF interfaces able to establish adjacencies. On the peer routers the neighbor state was INIT. Traffic was now passing through the secondary.
* Both firewalls were now upgraded and I could connect to both of them, but all traffic dependent on OSPF was still down.
* At this stage I was asked to roll back the change, since I had already gone over the outage window.
* I started the downgrade on the primary, which was not passing traffic, and immediately lost connection to it as the process began.
* Install policy worked on the primary node after I changed the version number in SmartConsole.
* I downgraded the secondary as well, and everything came back as it was. Connectivity was restored.
At the time I didn't know it was an OSPF issue, but I reviewed the routed logs the next day and saw many OSPF error messages indicating that the OSPF instance did not start:
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_id(1026): Failed to get cluster master information
Jul 29 20:17:07.194789 [routed] ERROR: api_get_member_info(485): failed to get member info selector = 0x81602ad, data = (nil)
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_id(1026): Failed to get cluster master information
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_sync_ip(1070): failed to get master id
Jul 29 20:17:07.194789 [routed] ERROR: api_get_member_info(485): failed to get member info selector = 0x81602ad, data = (nil)
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_id(1026): Failed to get cluster master information
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_sync_ip(1070): failed to get master id
Jul 29 20:17:09.241050 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.169.250.11(bond1.11)
Jul 29 20:17:09.241712 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.169.250.41(bond1.12)
Jul 29 20:17:09.242370 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.11(bond1.13)
Jul 29 20:17:09.243022 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.41(bond1.14)
Jul 29 20:17:09.243660 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.76(bond1.15)
Jul 29 20:17:09.244312 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.106(bond1.16)
Jul 29 20:17:09.244983 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.141(bond1.17)
Jul 29 20:17:09.245623 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.162(bond1.21)
Jul 29 20:17:09.246281 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.170.250.194(bond1.22)
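For anyone triaging a similar log, here is a minimal Python sketch that groups the routed errors by the function that raised them and lists the interfaces on which OSPF was not started. The regular expressions and the names `summarize`, `ERROR_RE` and `IFACE_RE` are my own assumptions, based only on the line format in the excerpt above; this is not a Check Point tool, and other routed messages may need different patterns.

```python
import re
from collections import Counter

# A few lines copied from the routed log excerpt above.
SAMPLE = """\
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_id(1026): Failed to get cluster master information
Jul 29 20:17:07.194789 [routed] ERROR: cpcl_cxl_master_sync_ip(1070): failed to get master id
Jul 29 20:17:09.241050 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.169.250.11(bond1.11)
Jul 29 20:17:09.241712 [routed] ERROR: OSPF2 instance default OspfInterfaceUp(4655): not starting protocol on interface 10.169.250.41(bond1.12)
"""

# Assumed format: "[routed] ERROR: [OSPF2 instance <name> ]<function>(<line>): <message>"
ERROR_RE = re.compile(
    r"\[routed\] ERROR: (?:OSPF2 instance \S+ )?(\w+)\((\d+)\): (.*)$"
)
# Assumed format of the interface suffix: "on interface <ip>(<ifname>)"
IFACE_RE = re.compile(r"on interface (\S+?)\((\S+?)\)")


def summarize(lines):
    """Return (error count per function, list of (ifname, ip) where OSPF did not start)."""
    by_function = Counter()
    ospf_not_started = []
    for line in lines:
        match = ERROR_RE.search(line)
        if not match:
            continue  # not a routed ERROR line in the assumed format
        func, _source_line, message = match.groups()
        by_function[func] += 1
        iface = IFACE_RE.search(message)
        if func == "OspfInterfaceUp" and iface:
            ospf_not_started.append((iface.group(2), iface.group(1)))
    return by_function, ospf_not_started


if __name__ == "__main__":
    counts, interfaces = summarize(SAMPLE.splitlines())
    for func, count in counts.most_common():
        print(f"{count:3d}  {func}")
    for ifname, ip in interfaces:
        print(f"OSPF not started on {ifname} ({ip})")
```

Run against the full log, this makes it easier to see that the cluster master lookup errors (cpcl_cxl_master_id, cpcl_cxl_master_sync_ip) are logged about two seconds before the "not starting protocol" errors on every OSPF interface.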
I went through the OSPF configs, and the router ID is the cluster IP on both members. Nothing looks out of the ordinary. I would appreciate any insights you might have on this issue. Apologies for the long description, and thank you for your time.