Dilian_Chernev
Collaborator

VSX and Bond interfaces going down after few hours

Hello mates,

I am fighting with a very strange issue: bond interfaces go down a few hours after reconfiguring interfaces on Virtual Systems.

There is a cluster of two 19200 appliances (R81.20 JHF Take 92) in VSLS, with bond interfaces to Cisco switches using LACP and vPC.
After configuring two VSs (interfaces, VLANs, routes, and a blank any-any-allow policy), everything is fine.
The only thing is that no VLANs are configured on the switches, because these VSs are prepared to replace existing plain devices that have the same IPs. So, to make sure everything stays OK until the migration date, there is no traffic on the VS interfaces.

After 4-6 hours, most of the bonds go down; the Cisco switches report the ports as disabled, and there is no way to bring them back up.
On the Check Point side, the bonds show different Aggregator IDs, the interfaces are "churned", and the only way to bring them up is to reboot the appliances.

This has happened three times so far, every time several hours after reconfiguring the VS interfaces.

I opened a ticket after the first occurrence, but nothing useful came out of it, only sk115516, which does not help prevent it from happening again.
There is nothing useful in /var/log/messages.

Does anyone have similar problems?
Any idea which log files to check or what debugs could be run? I am pretty sure this can be reproduced.

Thanks,

Dilian

6 Replies
AkosBakos
Leader

Hi @Dilian_Chernev 

Interesting, strange behaviour.

Regarding "churned": https://support.checkpoint.com/results/sk/sk169760

One of the peer's LACP (etherchannel) interfaces is suspended or is otherwise no longer active as an LACP interface.

Regarding the Cisco-side bond:

Is the bond ID the same on the newly generated LACP group as on the existing one? Is there anything in common between the existing switch config and the new one?

Akos

----------------
\m/_(>_<)_\m/
Dilian_Chernev
Collaborator

Not sure how to respond to this 😞 but after restarting the appliances everything works fine.
Tomorrow I will try editing the VS config to see if the issue happens again.

Timothy_Hall
Legend

If it happens again I'd suggest disabling UPPAK from cpconfig to see if it affects the issue.  UPPAK has its tendrils sunk pretty deeply into the network drivers via DPDK, and it being the cause of your bond issue is not outside the realm of possibility.

Attend my 60-minute "Be your Own TAC: Part Deux" Presentation
Exclusively at CPX 2025 Las Vegas Tuesday Feb 25th @ 1:00pm
Duane_Toler
Advisor

Check the (very long) output of cat /proc/net/bonding/<bond name> before and after the event occurs. Pay attention to the "details partner lacp pdu" section for each member interface, to see if the remote side changes its LACP information.

You mentioned the interfaces being "churned", so you have likely already seen this, though.
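To make the comparison easier, something like this may help (a minimal sketch; bond4 and the file paths are just placeholders for your bond name): snapshot the bonding state now, take another snapshot after the event, and diff the two, watching the partner system ID, key, and port state:

[Expert@fw2:0]# cat /proc/net/bonding/bond4 > /var/log/bond4_before.txt
# ... after the bonds go down ...
[Expert@fw2:0]# cat /proc/net/bonding/bond4 > /var/log/bond4_after.txt
[Expert@fw2:0]# diff /var/log/bond4_before.txt /var/log/bond4_after.txt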

On the Cisco side, if this is Nexus VPC, then check the status of the etherchannel to see if it has suspended the port-channel member interface.  You can run a "debug port-channel error" or "debug port-channel trace" to hopefully catch any switch-side errors.

On IOS-XE, it's "debug etherchannel ..." for similar.
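For reference, a few standard show commands on the Nexus side that may help confirm a suspended member before going to debugs (the interface and port-channel numbers below are placeholders, adjust to your setup):

switch# show port-channel summary
switch# show lacp interface ethernet 1/1
switch# show vpc consistency-parameters interface port-channel 40

A member flagged as suspended in the port-channel summary, or a vPC consistency mismatch, would point at the switch side rather than the gateway.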

 

Timothy_Hall
Legend

  • Is this a new bond implementation on your 19200? 
  • Were the bonds ever stable? 
  • What is the interface speed and driver type of the physical interfaces (ethtool -i ethXX)?

This issue sounds somewhat similar to a supposedly-fixed limitation of Lightspeed cards:

Bond may become unstable because of LACP packet losses (on the network or in the interface).

Workaround - Configure the LACP "slow" rate for this Bond on each side

Because you are on a Quantum Force appliance, it will utilize UPPAK by default, just like a Lightspeed appliance, so the above may apply to you.  If setting both sides to the slow rate doesn't help, the last thing to try would be to disable UPPAK via cpconfig to go back to KPPAK and see if that impacts the problem.
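If you want to try the slow-rate workaround, a rough sketch follows (the bond group, interface, and port-channel numbers are placeholders; please verify the exact syntax for your Gaia and NX-OS versions).

Gaia side (clish), per bond group:

fw2:0> set bonding group 4 lacp-rate slow
fw2:0> save config

Cisco NX-OS side, on each member interface of the vPC port-channel:

switch(config)# interface ethernet 1/1
switch(config-if)# lacp rate normal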

Attend my 60-minute "Be your Own TAC: Part Deux" Presentation
Exclusively at CPX 2025 Las Vegas Tuesday Feb 25th @ 1:00pm
Dilian_Chernev
Collaborator

It is a new bond implementation, but it was configured almost 4 months ago.
It has been stable, except for these three occasions when VS interface changes were made.

Here is the ethtool info; it is identical on all involved interfaces (10Gb SFP+):

[Expert@fw2:0]# ethtool -i eth1-04
driver: net_ice
version: DPDK 20.11.7.4.0 (29 Mar 24)
firmware-version: 4.20 0x800178e2 1.3346.0
expansion-rom-version:
bus-info: 0000:17:00.7
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

On both sides the LACP rate is slow/normal:

 

fw2:0> show bonding group 4
Bond Configuration
xmit-hash-policy layer2
down-delay 200
primary Not configured
lacp-rate slow
mode 8023AD
up-delay 200
mii-interval 100
min-links 0
Bond Interfaces
eth1-04
eth3-04

#### edit 

There is something I just remembered: the bond on the Check Point appliance is built with one port from line card 1 (model CPAC-8-1/10F-D) and a second port from line card 3 (model CPAC-4-10/25F-D).
There is a difference in firmware, but the driver is the same:

[Expert@fw2:0]# ethtool -i eth1-03
driver: net_ice
version: DPDK 20.11.7.4.0 (29 Mar 24)
firmware-version: 4.20 0x800178e2 1.3346.0
expansion-rom-version:
bus-info: 0000:17:00.5
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes

[Expert@fw2:0]# ethtool -i eth3-04
driver: net_ice
version: DPDK 20.11.7.4.0 (29 Mar 24)
firmware-version: 4.30 0x8001b94f 1.3415.0
expansion-rom-version:
bus-info: 0000:b1:00.2
supports-statistics: yes
supports-test: no
supports-eeprom-access: no
supports-register-dump: no
supports-priv-flags: yes
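For anyone wanting to compare the members quickly, a one-liner like this (assuming bond members eth1-04 and eth3-04, as in my bond group 4) pulls just the driver and firmware lines:

[Expert@fw2:0]# for i in eth1-04 eth3-04; do echo "== $i =="; ethtool -i $i | egrep 'driver|firmware'; done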

 

