Gateway is dropping packets every minute
Hello mates,
We are dealing with a very weird issue these days: the gateway is dropping traffic every minute, e.g. at 11:15:02, 11:16:02, 11:17:02...
On each drop, the following lines appear in /var/log/messages:
Jun 19 10:08:03 2023 mdc-fw kernel: [fw4_3];fwmultik_f2p_cookie_outbound: fwmultik_f2p_packet_outbound Failed.
Jun 19 10:09:03 2023 mdc-fw kernel: [fw4_0];fwmultik_f2p_cookie_outbound: fwmultik_f2p_packet_outbound Failed.
Jun 19 10:10:03 2023 mdc-fw kernel: [fw4_11];fwmultik_f2p_cookie_outbound: fwmultik_f2p_packet_outbound Failed.
Jun 19 10:10:04 2023 mdc-fw kernel: [fw4_6];fwmultik_f2p_cookie_outbound: fwmultik_f2p_packet_outbound Failed.
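For reference, a quick way to confirm the once-a-minute pattern is to count these messages per minute straight from /var/log/messages (a rough sketch using standard bash tools; adjust the awk field positions if your log line format differs from the samples above):

# Count fwmultik_f2p failures per minute ($3 is the HH:MM:SS timestamp in the lines above)
grep 'fwmultik_f2p_packet_outbound Failed' /var/log/messages \
  | awk '{print $1, $2, substr($3, 1, 5)}' \
  | sort | uniq -c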
The gateway was on R80.40 when we first experienced the problem; it has since been upgraded to R81.10 JHF95, but the issue is still the same.
The device is a CP 6800 and is pretty loaded: around 240k connections, 3 GB of traffic, and CPU at about 65%.
A service request has been filed, but things are moving slowly, so I just wanted to ask if someone has had a similar issue.
Thanks,
Dilian
That message indicates that a worker/instance core has completed its inspection and is trying to hand the packet back to a dispatcher core for outbound transmission, but that operation failed. Please provide the output of the enabled_blades command and the Super Seven commands: S7PAC - Super Seven Performance Assessment Commands.
It seems unlikely to be a dispatcher code problem, as the issue followed you through a major upgrade.
It is possible that your NICs and/or bus are hanging up every minute (you would see a big spike in "hi" reported by top if so), and queued packets are backing up in the dispatchers trying to reach the NICs/bus until their queues are full and they can't accept any more. Please also provide the output of fwaccel stats -l and cpstat -f sensors os. Are there any messages about NIC lockups in /var/log/messages? Are all expansion NIC cards firmly seated?
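For what it's worth, those checks can be gathered in one pass, e.g. (a rough sketch; the grep pattern is only an illustration, the rest are the commands named above):

# Look for a spike in the "hi" (hardware interrupt) column around a drop
top -b -n 1 | head -5
# SecureXL statistics, including drop counters
fwaccel stats -l
# Hardware sensor readings
cpstat -f sensors os
# Any NIC lockup or link messages around the drop times
grep -iE 'nic|link|reset|hang' /var/log/messages | tail -20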
now available at maxpowerfirewalls.com
I have seen that issue a few times, and every single time I fixed it by running cpconfig, disabling CoreXL, rebooting, re-enabling CoreXL, and rebooting again. I know it probably can't be done during work hours, but that's what seemed to fix it.
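For anyone following along, the procedure looks roughly like this (cpconfig is menu-driven, so this is an outline rather than exact commands; the menu wording varies by version):

cpconfig    # select the Check Point CoreXL option, disable it, then exit
reboot
# ...once the member is back up...
cpconfig    # select the Check Point CoreXL option, re-enable it, then exit
reboot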
Cheers,
Andy
Thanks for the reply.
Did you try this on a cluster system?
When one member has CoreXL disabled, which device will be the active one after the first reboot?
Just to be prepared 🙂
Every time I had to do this with customers, it was on a cluster. CoreXL state has nothing to do with cluster state: if you disable CoreXL and reboot the current master, the other member will become active, UNLESS you have preempt mode enabled, which I would not recommend due to the traffic issues when a failover happens.
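If it helps, you can watch which member is active before and after each reboot with the standard ClusterXL commands, e.g.:

# Run on both members; shows which one is Active / Standby
cphaprob state
# Cluster interface status after the reboot
cphaprob -a if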
Andy
Unfortunately, this didn't solve our problem. As we found out later, it was related to SecureXL traffic.
Disabling CoreXL had a catastrophic effect: going from 20 cores to 1, the machine was barely breathing.
Also, ClusterXL counts the available cores, and the machine with fewer cores becomes the active one.
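As a side note, the instance count on each member can be compared with the standard CoreXL status command, e.g.:

# Lists the CoreXL firewall instances and their CPU assignments; run on both
# members to see the mismatch while one of them has CoreXL disabled
fw ctl multik stat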
Thanks,
Dilian
I find that really odd, because I tested this exact scenario in my R81.20 lab (with a cluster) and never had that issue.
Andy