RX drops on one interface of a 21400 Appliance (SAM)

Chinmaya_Naik · 2020-06-15

Hi Checkmates,

Gateway Version : R77.30

We are seeing a massive number of RX drops during production hours.

We have a 21400 Appliance with SAM hardware. We only use the Firewall blade.

There are nearly 200,000 concurrent connections (fw tab -t connections -s).

The gateways have 12 cores. By default, SND has 2 cores and the FW workers have 10 cores.

Because of the RX drops, we changed the CoreXL configuration using the cpconfig utility, assigning 4 cores to SND and 8 cores to the FW workers.

After the changes, we verified the result using the command below.

FW1

Interface eth3-01 (irq 179): CPU 2
Interface eth3-04 (irq 140): CPU 1
Interface eth3-12 (irq 85): CPU 3
Kernel fw_0: CPU 11
Kernel fw_1: CPU 10
Kernel fw_2: CPU 9
Kernel fw_3: CPU 8
Kernel fw_4: CPU 7
Kernel fw_5: CPU 6
Kernel fw_6: CPU 5
Kernel fw_7: CPU 4
Daemon pepd: CPU all
Daemon fwd: CPU all
Daemon pdpd: CPU all
Daemon lpd: CPU all
Daemon rtmd: CPU all
Daemon mpdaemon: CPU all
Daemon cpd: CPU all
Daemon cprid: CPU all
Interface eth-bp1d1: has multi queue enabled
Interface eth-bp1d2: has multi queue enabled

FW2

Interface eth3-01 (irq 83): CPU 0
Interface eth1-01 (irq 123): CPU 0
Interface eth1-02 (irq 187): CPU 1
Interface eth3-04 (irq 171): CPU 2
Interface eth2-01 (irq 156): CPU 2
Interface eth2-02 (irq 196): CPU 3
Interface eth3-12 (irq 236): CPU 1
Kernel fw_0: CPU 11
Kernel fw_1: CPU 10
Kernel fw_2: CPU 9
Kernel fw_3: CPU 8
Kernel fw_4: CPU 7
Kernel fw_5: CPU 6
Kernel fw_6: CPU 5
Kernel fw_7: CPU 4
Daemon lpd: CPU all
Daemon pdpd: CPU all
Daemon fwd: CPU all
Daemon mpdaemon: CPU all
Daemon rtmd: CPU all
Daemon pepd: CPU all
Daemon cprid: CPU all
Daemon cpd: CPU all
Interface eth-bp1d1: has multi queue enabled
Interface eth-bp1d2: has multi queue enabled
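For comparing the two listings above, here is a minimal Python sketch that groups interfaces by the CPU servicing their IRQ and records the CPU of each firewall kernel instance. The function name summarize_affinity and the regular expressions are illustrative assumptions based only on the line format pasted in this post; they are not part of any Check Point tool.

```python
import re
from collections import defaultdict

# Patterns are assumptions derived from the listing format shown in this post.
IFACE_RE = re.compile(r"^Interface (\S+) \(irq \d+\): CPU (\d+)$")
KERNEL_RE = re.compile(r"^Kernel (fw_\d+): CPU (\d+)$")


def summarize_affinity(text):
    """Return (interfaces_by_cpu, kernel_cpu) parsed from an affinity listing.

    interfaces_by_cpu maps a CPU number to the interfaces whose IRQ it handles.
    kernel_cpu maps a kernel instance name (e.g. "fw_0") to its CPU number.
    Daemon lines and "has multi queue enabled" lines are ignored.
    """
    interfaces_by_cpu = defaultdict(list)
    kernel_cpu = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = IFACE_RE.match(line)
        if match:
            interfaces_by_cpu[int(match.group(2))].append(match.group(1))
            continue
        match = KERNEL_RE.match(line)
        if match:
            kernel_cpu[match.group(1)] = int(match.group(2))
    return dict(interfaces_by_cpu), kernel_cpu
```

Running this on each member's listing makes it easy to see that both members have the same kernel instance layout (fw_0 to fw_7 on CPUs 11 down to 4) while the interface-to-CPU mapping on the SND cores differs.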

Accelerated conns/Total conns : 211008/216112 (97%)
Accelerated pkts/Total pkts : 39971509687/40950216403 (97%)
F2Fed pkts/Total pkts : 545116700/40950216403 (1%)
PXL pkts/Total pkts : 433590016/40950216403 (1%)
QXL pkts/Total pkts : 0/40950216403 (0%)

NOTE: eth1-01 is the busiest interface of all.

I need some clarification on the points below:

The SND core assignment does not match between the two members, so could we face any issue if a failover happens?

We use cpconfig for the CoreXL configuration, which automatically assigns interfaces to cores, so why is there a mismatch?

Since we only use the Firewall blade, can we increase SND to at least 6 or 8 cores to resolve the RX drop issue?

Can we enable Multi-Queue and assign dedicated cores to eth1-01, which handles the most traffic in our production?

We also plan to upgrade to R80.30; will this help?

Also, please suggest any alternative solutions.

Regards

@Chinmaya_Naik

Timothy_Hall · 2020-06-15

Really need to see the "Super Seven" outputs (https://community.checkpoint.com/t5/General-Topics/Super-Seven-Performance-Assessment-Commands-s7pac...) to get a full picture of your configuration, but I'll take a shot based on what you have provided so far.

> I need some clarification on the points below:

> The SND core assignment does not match between the two members, so could we face any issue if a failover happens?

You have the same number of Firewall Worker/Kernel instances on both cluster members which is all that matters in a ClusterXL configuration, so you are fine there. Generally the operation/state of SND/SecureXL is purely local and not sync'ed between cluster members in your version.

> We use cpconfig for the CoreXL configuration, which automatically assigns interfaces to cores, so why is there a mismatch?

Automatic interface affinity will move interface SoftIRQ processing for individual interfaces around on the SND cores based on traffic loads, which will be quite different on the active vs. standby cluster member. This is expected behavior and not a problem.

> Since we only use the Firewall blade, can we increase SND to at least 6 or 8 cores to resolve the RX drop issue?

Given that 97% of your traffic is accelerated, yes, I'd recommend reducing the number of workers from 8 to 6 to try a 6/6 CoreXL split. Note that you can allocate more than 6 SNDs, but in your version the locking and coordination overhead between more than 6 SNDs starts to exact more of a performance toll. Even if you have a large-looking number of RX-DRPs, if they are less than 0.1% of total traffic on the interface you are fine. I'd guess that moving to a 6/6 split and then enabling Multi-Queue on eth1-01 will reduce RX-DRPs below 0.1%, without the need for more than 6 SNDs.
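The 0.1% rule of thumb above can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and it assumes the interface's RX-OK counter (as reported by a tool such as netstat -ni) stands in for "total traffic on the interface".

```python
def rx_drop_percent(rx_drp, rx_ok):
    """Return RX-DRP as a percentage of frames received on the interface.

    Assumption: rx_ok (frames received) is used as the total-traffic figure.
    """
    if rx_ok <= 0:
        return 0.0
    return 100.0 * rx_drp / rx_ok


def exceeds_threshold(rx_drp, rx_ok, threshold_percent=0.1):
    """True when the drop rate is above the 0.1% rule of thumb."""
    return rx_drop_percent(rx_drp, rx_ok) > threshold_percent
```

For example, 100,000 drops against 1,000,000,000 received frames works out to 0.01%, which is under the threshold even though the raw drop count looks large.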

> Can we enable Multi-Queue and assign dedicated cores to eth1-01, which handles the most traffic in our production?

Yes it looks like you currently have Multi-Queue enabled on two interfaces so adding a third is fine; the total limit in your version is five interfaces. But I'd recommend adjusting for a 6/6 CoreXL split first before enabling Multi-Queue on this interface, as enabling Multi-Queue on more interfaces when SNDs are already overloaded can actually make overall performance worse.

> We also plan to upgrade to R80.30; will this help?

Definitely.

Gaia 4.18 (R82) Immersion Tips, Tricks, & Best Practices Video Course
Now Available at https://shadowpeak.com/gaia4-18-immersion-course

Chinmaya_Naik · 2020-06-15

Hi @Timothy_Hall

Thank you very much for the update.

I just need to add that our gateways are running in simple VRRP (Master/Backup) mode.

As you mentioned:

> Generally the operation/state of SND/SecureXL is purely local and not sync'ed between cluster members in your version.

Is this synced between cluster members in R80.x?

> But in your version the locking and coordination overhead between more than 6 SNDs starts to exact more of a performance toll

In R80.x, can we configure more SND cores than FW workers if required?

Also, will increasing the ring buffer size help?

I will also share the "Super Seven" output.

Thanks and Regards

@Chinmaya_Naik

Timothy_Hall · 2020-06-15

> Is this synced between cluster members in R80.x?

State sync works more or less the same regardless of whether you are using VRRP or ClusterXL. Generally the state information for SecureXL/SND is not sync'ed between cluster members in R80.10 or earlier as SecureXL calculations were handled locally, but then there is this:

sk121753: Both ClusterXL High Availability members are Active

This may well have changed in R80.20+ with the revamp of SecureXL. The R80.20 ClusterXL Administration Guide states as a requirement "SecureXL status - SecureXL on all members has to be either enabled, or disabled" which would seem to imply that the state of SecureXL (and therefore its sync) between members does matter. This would probably be a question for R&D, paging @PhoneBoy.

> In R80.x, can we configure more SND cores than FW workers if required?

You are allowed to configure more than 6 SND cores in R80.10 and earlier; it is just that the performance you gain by adding cores beyond six starts to be offset more and more by the overhead of keeping them all coordinated. So you can certainly go beyond six SNDs to increase performance, but 6 SNDs in R80.10 and earlier is a bit of a "sweet spot" that you shouldn't go past without good reason. This SND scalability issue was fully resolved in R80.20+.

> Also, will increasing the ring buffer size help?

As a last resort yes, but doing so is only addressing the symptom (RX-DRP) and not the underlying problem (ring buffers not being emptied fast enough by existing SNDs). To combat RX-DRP of >0.1%, one should always add more SND cores first, then ensure Multi-Queue is enabled on the interface, and as a last resort increase ring buffer size. If you end up having to increase it after doing all that though, your firewall is probably underpowered.


PhoneBoy · 2020-06-15

The state of SecureXL on both members definitely matters, as mentioned in the linked SK.
Also, in this particular case, since SAM cards are involved, you definitely have to be using SecureXL.
The state of SecureXL will definitely impact how the cores are used, which I presume would also manifest itself in the sync process.

Chinmaya_Naik · 2020-06-18

Hi @Timothy_Hall

Thank you for the update.

Just a quick query:

You mentioned that an RX drop rate above 0.1% is the point at which we need to think about it. In our environment, however, we see a huge number of RX drops only at night, roughly from 8 to 10 o'clock: more than 100,000 drops.

So if we calculate the overall value it shows something like 0.01%, but I don't think that is OK, because the RX drops occur only during that particular period, when we also see nearly 200,000 connections.
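To make the time-window concern concrete, here is a minimal Python sketch that computes the drop rate between two counter samples instead of over the counters' whole lifetime. The CounterSample type and the function name are hypothetical, and the sketch assumes the interface counters were recorded once at the start and once at the end of the busy period.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Interface counters read at one point in time (assumed fields)."""
    rx_ok: int   # frames received so far
    rx_drp: int  # frames dropped so far


def window_drop_percent(start, end):
    """Drop rate in percent for the interval between two samples."""
    delta_ok = end.rx_ok - start.rx_ok
    delta_drp = end.rx_drp - start.rx_drp
    if delta_ok <= 0:
        return 0.0  # no traffic in the window, or counters were reset
    return 100.0 * delta_drp / delta_ok
```

With purely hypothetical numbers, 100,000 drops during a window in which 50,000,000 frames were received gives 0.2% for that window, even if the same drops are only about 0.01% of the lifetime total.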

Thanks and Regards

@Chinmaya_Naik