mgades
Contributor

OSPFv3 changes state during failover (ClusterXL)

Hi there,

I'm about to roll out IPv6 in our enterprise network and am testing various scenarios in the lab before going to production. Something is puzzling me and telling me this is not right...

We're running 2 x 5800 in HA (active/passive). For IPv4 we are running OSPFv2 to announce a default route and a few connected routes, and this works like a charm: not a single hiccup or lost ping when failing over between the two nodes. The other OSPF neighbors see no state change either, as the process is clustered.

But when I try to mimic the same setup for IPv6 (using the OSPFv3 protocol), the OSPF session changes to INIT (as seen on a neighboring device), which leads to downtime until it converges and changes to FULL.

I have both an IPv4 and an IPv6 ping running towards two distant hosts. When flipping over the nodes (using clusterXL_admin down on the active node), there are no timeouts on the IPv4 ping, but IPv6 fails immediately and only comes back when OSPFv3 reconverges (after about 16 ping timeouts - at least 15 too many 😉).

<CR_LAB>%Nov 23 10:53:53:483 2020 CR_LAB OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received 1-Way and its state from FULL to INIT.

<CR_LAB>
<CR_LAB>%Nov 23 10:54:06:456 2020 CR_LAB OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received LoadingDone and its state from LOADING to FULL.


CR_LAB is a core router (HPE Comware) - router-id 172.20.127.11
FW-A and FW-B are Checkpoint R80.40 JHF Take_87 nodes.

<CR_LAB>disp ospfv3 peer 172.20.10.10

OSPFv3 Process 1 with Router ID 172.20.127.11

Area 0.0.0.0 interface Vlan-interface10's neighbors
Router ID: 172.20.10.10 Address: FE80::131:10
State: Full Mode: Nbr is slave Priority: 1
DR: 172.20.127.11 BDR: 172.20.127.81 MTU: 1500
Options is 0x000013 (-|R|-|x|E|V6)
Dead timer due in 00:00:37
Neighbor is up for 03:38:55
Neighbor state change count: 16
Database Summary List 0
Link State Request List 0
Link State Retransmission List 0
Neighbor interface ID: 168461066
GR state: Normal
Grace period: 0 Grace period timer: Off
DD Rxmt Timer: Off LS Rxmt Timer: Off

The Checkpoint cluster members use the same router-id, and I can confirm the link-local IP is identical on the ClusterXL interface.

These are the relevant OSPFv3 configuration lines:

set ipv6 ospf3 instance default rfc1583-compatibility off
set ipv6 ospf3 instance default graceful-restart-helper on
set ipv6 ospf3 instance default area backbone on
set ipv6 ospf3 instance default interface eth1 area backbone on
set ipv6 ospf3 instance default interface eth1 cost 100
set ipv6 ospf3 instance default interface eth1 priority 1
set ipv6 ospf3 instance default export-routemap export_ipv6 preference 1 on

set routemap export_ipv6 id 100 on
set routemap export_ipv6 id 100 allow
set routemap export_ipv6 id 100 match network ::/0 exact
set routemap export_ipv6 id 100 match protocol static


I know GR (Graceful Restart) isn't applicable when using ClusterXL, but the helper setting above is the default, and the behavior is the same when I turn it off.

Does anyone have a clue what I've done wrong? Is the clustered OSPFv3 process using ClusterXL really supposed to change the neighbor state during failover?

 

Thanks in advance,

Morten Gade Sørensen

 

mgades
Contributor

Is anyone running OSPFv3 in an HA cluster?

JackPrendergast
Advisor

Do you have graceful restart enabled?

mgades
Contributor

Hi Jack

"I know GR aren't applicable when using ClusterXL, but that's the default setting and same behavior when turning it off."

As mentioned above, I tried both settings with the same result (GR = Graceful Restart).
But since the OSPFv3 process should actually be clustered, GR doesn't make much sense (assuming it has feature parity with how OSPFv2 behaves in ClusterXL). I actually think GR is only supported when using VRRP. In ClusterXL the routing table/state is synced (I assume).

 

JackPrendergast
Advisor

Hi @mgades 

It is applicable and required for v3. 

See below extract from the OSPF SK.

 

In cluster environment, in OSPFv2 all cluster members must have the same OSPF Router ID value. During a failover, one of the Standby members (Backup) becomes the new Active member (Master) and then continues where the former Active member (Master) failed. As a result, there should be no traffic outage and no need for OSPFv2 graceful restart. The above-mentioned sync of OSPF database does not happen in OSPFv3 therefore Graceful Restart is needed and supported for OSPFv3 with ClusterXL.

mgades
Contributor

EUREKA!! You're right!

I was misled by a comment in the post below and was under the impression that GR isn't supported with OSPFv3, but that actually applied only to OSPFv2 with ClusterXL:
https://community.checkpoint.com/t5/Enterprise-Appliances-and-Gaia/Dynamic-Routing-Real-World-Experi...

With Graceful Restart enabled for OSPFv3 on both the Checkpoint devices and the other OSPFv3 neighbors (in this case our core routers), the failover is instant, with no(ish) lost IPv6 packets. According to the logs on the HPE Comware device, the OSPFv3 neighbor goes from FULL to EXSTART and back to FULL within the same second:

%Nov 30 14:45:41:166 2020 xxx-CR OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received SeqNumberMismatch and its state from FULL to EXSTART.
%Nov 30 14:45:41:211 2020 xxx-CR OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received LoadingDone and its state from LOADING to FULL.
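
For anyone who wants to put a number on the difference, here is a minimal Python sketch that measures how long the neighbor was out of FULL in the two log excerpts from this thread. The helper names (`log_time`, `gap_seconds`) and the `LOGS` dictionary are purely illustrative and not part of any Check Point or HPE tooling. The timestamp layout is assumed from the lines shown here, and month-name parsing assumes an English/C locale.

```python
from datetime import datetime

# Log lines copied verbatim from this thread (HPE Comware OSPFv3_NBR_CHG).
# Each pair is: (adjacency leaves FULL, adjacency returns to FULL).
LOGS = {
    "without graceful restart": (
        "<CR_LAB>%Nov 23 10:53:53:483 2020 CR_LAB OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received 1-Way and its state from FULL to INIT.",
        "<CR_LAB>%Nov 23 10:54:06:456 2020 CR_LAB OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received LoadingDone and its state from LOADING to FULL.",
    ),
    "with graceful restart": (
        "%Nov 30 14:45:41:166 2020 xxx-CR OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received SeqNumberMismatch and its state from FULL to EXSTART.",
        "%Nov 30 14:45:41:211 2020 xxx-CR OSPFV3/5/OSPFv3_NBR_CHG: OSPFv3 1 Neighbor 172.20.10.10(Vlan-interface10) received LoadingDone and its state from LOADING to FULL.",
    ),
}


def log_time(line: str) -> datetime:
    """Parse the timestamp that follows the '%' marker, e.g. 'Nov 23 10:53:53:483 2020'."""
    # Skip any CLI prompt before '%', then keep month, day, time and year.
    stamp = " ".join(line[line.index("%") + 1:].split()[:4])
    # The last field of the time is milliseconds; %f accepts 1 to 6 digits.
    return datetime.strptime(stamp, "%b %d %H:%M:%S:%f %Y")


def gap_seconds(down_line: str, up_line: str) -> float:
    """Seconds between the 'left FULL' message and the 'back to FULL' message."""
    return (log_time(up_line) - log_time(down_line)).total_seconds()


if __name__ == "__main__":
    for label, (down, up) in LOGS.items():
        print(f"{label}: {gap_seconds(down, up):.3f} s out of FULL")
```

Run as a script, this reports roughly 13 seconds out of FULL for the failover without graceful restart and about 45 milliseconds with it. It only measures the adjacency state as logged by the neighbor, not actual packet loss.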

Thanks again 🙂

Morten

JackPrendergast
Advisor

Anytime! 🙂

 

Have a good week!
