Dilian_Chernev
Collaborator

VSX and Bond interfaces going down after few hours

Hello mates,

I am fighting with a very strange issue: bond interfaces go down a few hours after reconfiguring interfaces on Virtual Systems.

There is a cluster of two 19200 appliances (R81.20 JHF Take 92) in VSLS, with bond interfaces to Cisco switches using LACP and vPC.
After configuring two VSs (interfaces, VLANs, routes, and a blank any-any-allow policy), everything is fine.
The only thing is that no VLANs are configured on the switches, because these VSs are prepared to replace existing plain devices that have the same IPs. So, to make sure everything stays OK until the migration date, there is no traffic on the VS interfaces.

After 4-6 hours, most of the bonds go down; the Cisco switches report the ports as disabled, and there is no way to bring them back up.
On the Check Point side, the bonds show different Aggregator IDs, the interfaces are "churned", and the only way to bring them up is to reboot the appliances.

This has happened three times so far, every time several hours after reconfiguring the VS interfaces.

I opened a ticket after the first occurrence, but nothing useful came out of it, only sk115516, which does not help prevent it from happening again.
There is nothing useful in /var/log/messages.

Does anyone have similar problems?
Any idea which log files to check or what debugs could be run? I am pretty sure this can be reproduced.

Thanks,

Dilian

6 Replies
AkosBakos
Leader

Hi @Dilian_Chernev 

Interesting, strange behaviour.

Regarding "churned": https://support.checkpoint.com/results/sk/sk169760

One of the peer's LACP (etherchannel) interfaces is suspended or is otherwise no longer active as an LACP interface.

Regarding the Cisco-side bond:

Is the bond ID the same on the newly generated LACP group as on the existing one? Is there anything in common between the existing switch config and the new one?

Akos

----------------
\m/_(>_<)_\m/
Dilian_Chernev
Collaborator

Not sure how to respond to this 😞 but after restarting the appliances everything works fine.
Tomorrow I will try editing the VS config to see if the issue happens again.

Timothy_Hall
Legend

If it happens again I'd suggest disabling UPPAK from cpconfig to see if it affects the issue.  UPPAK has its tendrils sunk pretty deeply into the network drivers via DPDK, and it being the cause of your bond issue is not outside the realm of possibility.

Attend my 60-minute "Be your Own TAC: Part Deux" Presentation
Exclusively at CPX 2025 Las Vegas Tuesday Feb 25th @ 1:00pm
Duane_Toler
Advisor

Check the (very long) output of cat /proc/net/bonding/<bond name> before and after the event occurs. Pay attention to the "details partner lacp pdu" section for each member interface, to see if the remote side changes its LACP information.

You mentioned the interfaces being "churned", so you have likely already seen this, though.
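To make the comparison easier, something like this may help (a minimal sketch; bond4 and the file paths are just placeholders for your bond name): snapshot the bonding state now, take another snapshot after the event, and diff the two, watching the partner system ID, key, and port state:

[Expert@fw2:0]# cat /proc/net/bonding/bond4 > /var/log/bond4_before.txt
# ... after the bonds go down ...
[Expert@fw2:0]# cat /proc/net/bonding/bond4 > /var/log/bond4_after.txt
[Expert@fw2:0]# diff /var/log/bond4_before.txt /var/log/bond4_after.txt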

On the Cisco side, if this is Nexus VPC, then check the status of the etherchannel to see if it has suspended the port-channel member interface.  You can run a "debug port-channel error" or "debug port-channel trace" to hopefully catch any switch-side errors.

On IOS-XE, it's "debug etherchannel ..." for similar.
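For reference, a few standard show commands on the Nexus side that may help confirm a suspended member before going to debugs (the interface and port-channel numbers below are placeholders, adjust to your setup):

switch# show port-channel summary
switch# show lacp interface ethernet 1/1
switch# show vpc consistency-parameters interface port-channel 40

A member flagged as suspended in the port-channel summary, or a vPC consistency mismatch, would point at the switch side rather than the gateway.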

 

Timothy_Hall
Legend

  • Is this a new bond implementation on your 19200? 
  • Were the bonds ever stable? 
  • What is the interface speed and driver type of the physical interfaces (ethtool -i ethXX)?

This issue sounds somewhat similar to a supposedly-fixed limitation of Lightspeed cards:

Bond may become unstable because of LACP packet losses (on the network or in the interface).

Workaround - Configure the LACP "slow" rate for this Bond on each side

Because you are on a Quantum Force appliance, it will utilize UPPAK by default, just like a Lightspeed appliance, so the above may apply to you.  If setting both sides to the slow rate doesn't help, the last thing to try would be to disable UPPAK via cpconfig to go back to KPPAK and see if that impacts the problem.
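If you want to try the slow-rate workaround, a rough sketch follows (the bond group, interface, and port-channel numbers are placeholders; please verify the exact syntax for your Gaia and NX-OS versions).

Gaia side (clish), per bond group:

fw2:0> set bonding group 4 lacp-rate slow
fw2:0> save config

Cisco NX-OS side, on each member interface of the vPC port-channel:

switch(config)# interface ethernet 1/1
switch(config-if)# lacp rate normal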

Attend my 60-minute "Be your Own TAC: Part Deux" Presentation
Exclusively at CPX 2025 Las Vegas Tuesday Feb 25th @ 1:00pm
Dilian_Chernev
Collaborator

It is a new bond implementation, but it was configured almost 4 months ago.
It has been stable, except for these three occasions when VS interface changes were made.

Here is the ethtool info; it is identical on all involved interfaces (10Gb SFP+):

[Expert@fw2:0]# ethtool -i eth1-04
driver: net_ice
version: DPDK 20.11.7.4.0 (29 Mar 24)
firmware-version: 4.20 0x800178e2 1.3346.0
expansion-rom-version:
bus-info: 0000:17:00.7
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

On both sides the LACP rate is slow/normal:

 

fw2:0> show bonding group 4
Bond Configuration
xmit-hash-policy layer2
down-delay 200
primary Not configured
lacp-rate slow
mode 8023AD
up-delay 200
mii-interval 100
min-links 0
Bond Interfaces
eth1-04
eth3-04

#### edit 

There is something I just remembered: the bond on the Check Point appliance is built with one port from line card 1 (model CPAC-8-1/10F-D) and a second port from line card 3 (model CPAC-4-10/25F-D).
There is a difference in firmware, but the driver is the same:

[Expert@fw2:0]# ethtool -i eth1-03
driver: net_ice
version: DPDK 20.11.7.4.0 (29 Mar 24)
firmware-version: 4.20 0x800178e2 1.3346.0
expansion-rom-version:
bus-info: 0000:17:00.5
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

[Expert@fw2:0]# ethtool -i eth3-04
driver: net_ice
version: DPDK 20.11.7.4.0 (29 Mar 24)
firmware-version: 4.30 0x8001b94f 1.3415.0
expansion-rom-version:
bus-info: 0000:b1:00.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
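For anyone wanting to compare the members quickly, a one-liner like this (assuming bond members eth1-04 and eth3-04, as in my bond group 4) pulls just the driver and firmware lines:

[Expert@fw2:0]# for i in eth1-04 eth3-04; do echo "== $i =="; ethtool -i $i | egrep 'driver|firmware'; done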

 

