Re: HCP found errors in bonding

kamilazat

Hello everyone.

HCP report shows issues with bonding interfaces due to churn state.

This is a 4600 device with R80.40.

/proc/net/bonding shows the following:

Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

802.3ad info
LACP rate: slow
Min links: 0
Aggregator selection policy (ad_select): stable
System priority: 65535
System MAC address: 00:1c:7f:xx:xx:xx
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 1
Actor Key: 9
Partner Key: 1
Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth7
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:1c:7f:xx:xx:xx
Slave queue ID: 0
Aggregator ID: 1
Actor Churn State: churned
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 2
details actor lacp pdu:
system priority: 65535
system mac address: 00:1c:7f:xx:xx:xx
port key: 9
port priority: 255
port number: 1
port state: 69
details partner lacp pdu:
system priority: 65535
system mac address: 00:00:00:00:00:00
oper key: 1
port priority: 255
port number: 1
port state: 1

Slave Interface: eth6
MII Status: up
Speed: 1000 Mbps
Duplex: full
Link Failure Count: 1
Permanent HW addr: 00:1c:7f:3d:a7:d4
Slave queue ID: 0
Aggregator ID: 2
Actor Churn State: none
Partner Churn State: churned
Actor Churned Count: 1
Partner Churned Count: 2
details actor lacp pdu:
system priority: 65535
system mac address: 00:1c:7f:xx:xx:xx
port key: 9
port priority: 255
port number: 2
port state: 77
details partner lacp pdu:
system priority: 65535
system mac address: 00:00:00:00:00:00
oper key: 1
port priority: 255
port number: 1
port state: 1

What I see is that the Aggregator IDs differ between the slaves, and only eth6 is processing traffic. At the same time, RX drops are increasing, and the buffer size is 256 on all interfaces. I don't want to increase the buffer size until I have resolved this bond issue and observed the system.
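
To make the aggregator mismatch easy to spot, here is a minimal Python sketch that parses text in the /proc/net/bonding format and lists the slaves whose Aggregator ID differs from the active aggregator. The function name `aggregator_mismatches` and the `SAMPLE` string (a trimmed copy of the output above) are hypothetical, introduced only for illustration. In 802.3ad mode the Linux bonding driver normally keeps a single aggregator active, so a slave in a different aggregator would not be expected to carry traffic.

```python
import re

# Trimmed copy of the bonding output above (illustrative sample only).
SAMPLE = """\
Active Aggregator Info:
Aggregator ID: 2
Number of ports: 1

Slave Interface: eth7
MII Status: up
Aggregator ID: 1

Slave Interface: eth6
MII Status: up
Aggregator ID: 2
"""

def aggregator_mismatches(text):
    """Return (active_id, {slave: aggregator_id}) for slaves outside the active aggregator."""
    active_id = None
    current_slave = None
    slave_ids = {}
    for raw in text.splitlines():
        line = raw.strip()
        slave = re.match(r"Slave Interface:\s*(\S+)", line)
        if slave:
            current_slave = slave.group(1)
            continue
        agg = re.match(r"Aggregator ID:\s*(\d+)", line)
        if agg:
            value = int(agg.group(1))
            if current_slave is None:
                # Seen before any slave section: the bond's active aggregator.
                if active_id is None:
                    active_id = value
            else:
                slave_ids.setdefault(current_slave, value)
    mismatched = {s: a for s, a in slave_ids.items() if a != active_id}
    return active_id, mismatched

active, mismatched = aggregator_mismatches(SAMPLE)
print("active aggregator:", active)
print("slaves in another aggregator:", mismatched)
```

On the real system the same function could be fed the contents of the bond's file under /proc/net/bonding instead of the sample string.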

What would your recommendations be?

Cheers!

Chris_Atkinson

What is the accompanying switch-side configuration, device, and software version?

The transmit hash policy being Layer-2 may also explain traffic preferring one slave.

Note both the hardware & software mentioned here are EOL.

CCSM R77/R80/ELITE

kamilazat

What exactly should I be looking at on the partner side? Exactly the same configuration, or some other setting like system priority?

I'm aware that it's already long EOL 🙂 But I couldn't find the 4600 device in either R81.10 or R81.20 Release Notes. This machine has only 2 CPUs and runs standalone with only 4 Gigs of RAM. Scary...

Lesley

Chris means this sk with the layer 2 hash:

https://support.checkpoint.com/results/sk/sk111823

You cannot upgrade a 4600 to R81.10 or R81.20; the latest version you can install is R80.40.

So you are stuck at both the software and hardware level.

-------
If you like this post please give a thumbs up(kudo)! 🙂

G_W_Albrecht

Looking at https://www.checkpoint.com/support-services/support-life-cycle-policy/ we find that the 4600 appliance was first available in Oct-2011, End of Engineering Support was Jun-2020, and final End of Support was Jun-2022.

CCSP - CCSE / CCTE / CTPS / CCME / CCSM Elite / SMB Specialist

Timothy_Hall

My guess is the bond including eth6 leads to what I'd call a "transit" VLAN where only the firewall and a router are present. In that case the default Layer 2 Transmit Hash Policy will cause all traffic to only utilize one physical interface, even if there is more than one interface defined as part of the bond and the bond is set for Active-Active. This is quite common and will be discussed this Thursday March 27th at my "Be Your Own TAC Part Deux" webinars for both EMEA and Americas. Bottom line is you need to set Layer 3+4 for the Transmit Hash policy, ideally on both the firewall and router/switch sides.
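
A rough Python sketch of why this happens, assuming the simplified hash formulas described in older Linux bonding documentation (layer2: source MAC XOR destination MAC, modulo slave count; layer3+4: port XOR combined with the low 16 bits of the IP XOR, modulo slave count). The real driver differs in detail between kernel versions, and the MAC addresses, IP addresses, and function names below are hypothetical values chosen for illustration.

```python
import ipaddress

def mac_to_int(mac):
    """Convert a colon-separated MAC address to an integer."""
    return int(mac.replace(":", ""), 16)

def layer2_slave(src_mac, dst_mac, slave_count):
    # Simplified layer2 policy: only the two MAC addresses feed the hash.
    return (mac_to_int(src_mac) ^ mac_to_int(dst_mac)) % slave_count

def layer34_slave(src_ip, dst_ip, src_port, dst_port, slave_count):
    # Simplified layer3+4 policy: IP addresses and ports feed the hash.
    ip_xor = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % slave_count

# Hypothetical transit VLAN: one firewall MAC, one router MAC.
FW_MAC = "00:1c:7f:00:00:01"
ROUTER_MAC = "00:0c:42:00:00:02"

# Eight flows that differ only in source port.
flows = [("10.0.0.5", "192.0.2.10", 40000 + i, 443) for i in range(8)]

l2_slaves = {layer2_slave(FW_MAC, ROUTER_MAC, 2) for _ in flows}
l34_slaves = {layer34_slave(s, d, sp, dp, 2) for s, d, sp, dp in flows}

print("layer2 uses slave index(es):", sorted(l2_slaves))
print("layer3+4 uses slave index(es):", sorted(l34_slaves))
```

With a single MAC pair the layer2 result never changes, so every frame leaves on the same slave, while the layer3+4 result varies per flow and spreads the flows across both slaves.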

Attend my online "Be your Own TAC: Part Deux" CheckMates event
March 27th with sessions for both the EMEA and Americas time zones

kamilazat

Hi Tim. Thank you for the explanation. I tried to replicate it in my lab (R81.20 + Mikrotik) and first set the hash policy to Layer 2. Only one interface handled traffic. Then I changed the hash policy to Layer 3+4, and both interfaces started processing traffic. At every step I ran an HCP report to see whether the errors had cleared.

Interestingly, even though I see traffic going through both interfaces I still see errors in HCP:

Bond Name	Finding	Severity
bond1	Slave eth3 aggregator ID 2 is different than the bond aggregator ID 1.	ERROR
bond1	Slave eth6 actor port state 77 indicates that the port is not synced with the partner.	ERROR
bond1	Slave eth6 partner port state 1 indicates that the port is not synced with the partner.	ERROR
bond1	Slave eth3 actor port state 69 indicates that the port is not synced with the partner.	ERROR
bond1	Slave eth3 partner port state 1 indicates that the port is not synced with the partner.	ERROR

I double-checked that all the settings on both sides are exactly the same, and rebuilt the bonds after changing hash policy.
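
For reference, the port state values in these findings are bitmasks. Below is a minimal Python sketch that decodes them using the LACP state bit layout from IEEE 802.3ad/802.1AX (bit 0 = LACP_Activity through bit 7 = Expired). The function name `decode_port_state` is hypothetical.

```python
# LACP port state bits as defined by IEEE 802.3ad / 802.1AX.
LACP_STATE_BITS = [
    (0x01, "LACP_Activity"),
    (0x02, "LACP_Timeout"),
    (0x04, "Aggregation"),
    (0x08, "Synchronization"),
    (0x10, "Collecting"),
    (0x20, "Distributing"),
    (0x40, "Defaulted"),
    (0x80, "Expired"),
]

def decode_port_state(value):
    """Return the names of the LACP state flags set in a port state value."""
    return [name for bit, name in LACP_STATE_BITS if value & bit]

for state in (69, 77, 1):
    print(state, decode_port_state(state))
```

Decoded this way, 69 and 77 both include Defaulted and neither includes Collecting or Distributing, while the partner value 1 has only LACP_Activity set. Defaulted means the actor is using default partner information, which would be consistent with the firewall not receiving LACPDUs from the other side on those ports, so the partner's LACP configuration and the cabling seem worth checking.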

_Val_

It seems you are running an unsupported version of an appliance that was out of all lines of support two years ago. I would not expect any solution to the issue other than replacing your gateways.

AkosBakos

Hi @kamilazat

I don't think it is related to the version, although R80.40 is no longer supported.

The important part is "Partner Churn State: churned"

Check this: https://support.checkpoint.com/results/sk/sk169760

Maybe the bond members are not connected to the same bond on the other side (you mixed up the cables).
- if not: shut down all of the bond members at the same time (set interface ethX state off)
- then: turn the interfaces back on

Or there is a layer-1 solution: pull the cables out of the box, wait a little (30 sec), then plug them back in.

I ran into this situation 4 weeks ago.

Akos

----------------
\m/_(>_<)_\m/

kamilazat

Yes, that was the first sk I found 🙂 Before doing anything in a prod environment I wanted to check it in my lab, although it's R81.20.

I have already destroyed and rebuilt everything between the GW and a Mikrotik router. First I changed the hash policy to layer3+4 as Tim recommended. Then I tried adding the interfaces in a different order and triple-checked that all the settings are exactly the same on both sides. I also tried manually changing the aggregator ID inside /sys/class/net/bond1, but that ended in permission errors, so I didn't push further.

It kinda messes with me now that I can't get those aggregator IDs to be the same.
