Amir_Arama
Advisor

XMT ERROR - What does it mean?

Can anyone tell me what XMT ERROR means in the fwaccel stats -d output?

I have big numbers that keep increasing.

R80.40 gateway, Take 139. Runs: fw+vpn.

Thanks

Timothy_Hall
Champion

I am assuming that this counter indicates a situation where the SecureXL driver is trying to place a packet into the egress interface ring buffer and it is full, or that the operation failed for some other reason.  Please run netstat -ni; do you see any nonzero TX-* error counters for any of your interfaces?  If so, please run ethtool -S (interface) for that interface and post the results.

Note that this counter does not necessarily indicate packet loss, as the SecureXL driver may just hold the packet and try to transmit it again later.
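If you want to pull just the error and drop counters without wading through the full output, a quick grep sketch like this should do (substitute the interface that shows nonzero TX-* counters):

netstat -ni
ethtool -S <interface> | grep -Ei 'err|drop|restart|fifo'    # only the error/drop/restart lines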

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

[Expert@:0]# netstat -ni
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth4 1500 0 6188485515 0 0 1499988 12917789433 0 0 0 BMRU
eth5 1500 0 11568439168 0 492 72667 5269677013 0 0 0 BMRU
eth6 1500 0 1386045553 0 0 24 732863806 0 0 0 BMRU
eth7 1500 0 4298949 0 0 0 3812975 0 0 0 BMRU
lo 65536 0 985391 0 0 0 985391 0 0 0 ALdNRU
[Expert@:0]# ethtool -S eth5
NIC statistics:
rx_packets: 11642195714
tx_packets: 5307764897
rx_bytes: 16121420460043
tx_bytes: 1042794755950
rx_broadcast: 14398748
tx_broadcast: 120565
rx_multicast: 493
tx_multicast: 0
multicast: 493
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 2
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 16121420460043
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 72667
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 5261646326
tx_queue_0_bytes: 991340161337
tx_queue_0_restart: 2328068
tx_queue_1_packets: 1583334
tx_queue_1_bytes: 255272249
tx_queue_1_restart: 0
tx_queue_2_packets: 1170911
tx_queue_2_bytes: 229048559
tx_queue_2_restart: 0
tx_queue_3_packets: 1333170
tx_queue_3_bytes: 219112446
tx_queue_3_restart: 0
tx_queue_4_packets: 2074616
tx_queue_4_bytes: 277662939
tx_queue_4_restart: 0
tx_queue_5_packets: 1984555
tx_queue_5_bytes: 339575490
tx_queue_5_restart: 0
tx_queue_6_packets: 31494
tx_queue_6_bytes: 24862553
tx_queue_6_restart: 0
rx_queue_0_packets: 1706897155
rx_queue_0_bytes: 2359134322721
rx_queue_0_drops: 8189
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 1612403402
rx_queue_1_bytes: 2228968123386
rx_queue_1_drops: 6506
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
rx_queue_2_packets: 1660705787
rx_queue_2_bytes: 2311342525115
rx_queue_2_drops: 8683
rx_queue_2_csum_err: 0
rx_queue_2_alloc_failed: 0
rx_queue_3_packets: 1526985150
rx_queue_3_bytes: 2098399714779
rx_queue_3_drops: 16072
rx_queue_3_csum_err: 0
rx_queue_3_alloc_failed: 0
rx_queue_4_packets: 1715716699
rx_queue_4_bytes: 2358823581624
rx_queue_4_drops: 12146
rx_queue_4_csum_err: 0
rx_queue_4_alloc_failed: 0
rx_queue_5_packets: 2185583629
rx_queue_5_bytes: 3033398172997
rx_queue_5_drops: 7839
rx_queue_5_csum_err: 0
rx_queue_5_alloc_failed: 0
rx_queue_6_packets: 1160507659
rx_queue_6_bytes: 1582404516739
rx_queue_6_drops: 8309
rx_queue_6_csum_err: 0
rx_queue_6_alloc_failed: 0
[Expert@:0]#

 

By the way, we get tons of these all the time in fw ctl zdebug + drop for some hosts communicating with other hosts:

@;325163061;[cpu_0];[SIM-207024420];do_packet_finish: cut-through: XMT FAILED!!! xmt_rc=-2, conn:<10.x.x.x,1500,10.x.x.x,59009,6>;
@;325163061;[cpu_0];[SIM-207024420];do_packet_finish: cut-through: XMT FAILED!!! xmt_rc=-2, conn:<10.x.x.x,1500,10.x.x.x,59010,6>;

Timothy_Hall
Champion

1) What kind of firewall hardware is this?  Check Point appliance or open hardware?

2) Also please provide the output of ethtool -i eth5 so we can see driver type.  Seems like your NICs are reporting inbound overruns (RX-OVR) but not very many RX-DRPs which is a little strange and indicates possible issues at the NIC hardware level.

3) You've got something messed up with your Multi-Queue configuration, and it is manifesting itself here:

tx_queue_0_packets: 5261646326
tx_queue_0_bytes: 991340161337
tx_queue_0_restart: 2328068

Looks like for eth5, tx_queue_0 is getting way, WAY more outbound traffic than the other TX queues 1-6, which should not be happening.  TX queue 0 is getting so swamped that it is filling up, rejecting packets, and having to restart accepting packets again; during that period XMTs from SecureXL will fail (a quick way to eyeball the per-queue counters is sketched after this list).

4) There is also this, which may indicate jumbo frames in use that are larger than the interface's MTU and may be related to the inbound overruns:

rx_long_byte_count: 16121420460043

5) Have you tried to manually tune Multi-Queue?  This is a BIG no-no on the Gaia 3.10 OS (generally R80.40 and higher) and can result in these types of imbalances.  Please provide output of mq_mng -o -v from expert mode.
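As a quick way to see the per-queue TX imbalance at a glance, a rough awk sketch over the ethtool output like this should work (counter names assumed from your paste):

ethtool -S eth5 | awk -F: '/tx_queue_[0-9]+_(packets|restart)/ {gsub(/ /,"",$2); print $1, $2}'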

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

Open server - HP DL360 G9.

Let me give you some more background:

Initially I configured auto MQ, but it balanced 8 cores (the max supported by the NICs) across all NICs combined, and I had huge CPU utilization. I wanted to understand which NIC was causing most of it, so I separated cores per interface until I found that eth5 was causing it.

Working with TAC we discovered that VPN is causing most of the CPU utilization: when removing VPN and letting traffic go unencrypted, the cores were at 95% idle, while with VPN enabled they were at something like 5-40% idle. After upgrading to ongoing Take 150 and changing the encryption algorithm it's better, but still not by much. I didn't think VPN was related at first, because eth5 faces the local LAN and eth4 faces the peer VPN gateway, but it seems that the SNDs of eth5 are also doing the encryption work, or are affected by it somehow.

Here are the outputs you requested:

[Expert@:0]# ethtool -i eth5
driver: igb
version: 5.3.5.20
firmware-version: 1.70, 0x80000f44, 1.2028.0
expansion-rom-version:
bus-info: 0000:04:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no

 

Posting the updated ethtool -S again:

[Expert@:0]# ethtool -S eth5
NIC statistics:
rx_packets: 4964018578
tx_packets: 2248161359
rx_bytes: 6930159736278
tx_bytes: 472471347313
rx_broadcast: 5515000
tx_broadcast: 47680
rx_multicast: 241
tx_multicast: 0
multicast: 241
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 6930159736278
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 9145
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 738644974
tx_queue_0_bytes: 161876078815
tx_queue_0_restart: 42049
tx_queue_1_packets: 691795
tx_queue_1_bytes: 81135487
tx_queue_1_restart: 0
tx_queue_2_packets: 1504911959
tx_queue_2_bytes: 288430955930
tx_queue_2_restart: 510681
tx_queue_3_packets: 555859
tx_queue_3_bytes: 78171677
tx_queue_3_restart: 0
tx_queue_4_packets: 2490259
tx_queue_4_bytes: 2495498088
tx_queue_4_restart: 0
tx_queue_5_packets: 440930
tx_queue_5_bytes: 60110650
tx_queue_5_restart: 0
tx_queue_6_packets: 28210
tx_queue_6_bytes: 4076532
tx_queue_6_restart: 0
tx_queue_7_packets: 2622
tx_queue_7_bytes: 1356208
tx_queue_7_restart: 0
rx_queue_0_packets: 390728994
rx_queue_0_bytes: 516183302822
rx_queue_0_drops: 303
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 594522390
rx_queue_1_bytes: 825697479777
rx_queue_1_drops: 2115
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
rx_queue_2_packets: 607234731
rx_queue_2_bytes: 851926240989
rx_queue_2_drops: 99
rx_queue_2_csum_err: 0
rx_queue_2_alloc_failed: 0
rx_queue_3_packets: 804427571
rx_queue_3_bytes: 1138657768456
rx_queue_3_drops: 2634
rx_queue_3_csum_err: 0
rx_queue_3_alloc_failed: 0
rx_queue_4_packets: 883199019
rx_queue_4_bytes: 1236082372690
rx_queue_4_drops: 1798
rx_queue_4_csum_err: 0
rx_queue_4_alloc_failed: 0
rx_queue_5_packets: 812398978
rx_queue_5_bytes: 1146114511637
rx_queue_5_drops: 1229
rx_queue_5_csum_err: 0
rx_queue_5_alloc_failed: 0
rx_queue_6_packets: 511234866
rx_queue_6_bytes: 704483327773
rx_queue_6_drops: 592
rx_queue_6_csum_err: 0
rx_queue_6_alloc_failed: 0
rx_queue_7_packets: 359382343
rx_queue_7_bytes: 489991856392
rx_queue_7_drops: 107
rx_queue_7_csum_err: 0
rx_queue_7_alloc_failed: 0

 

[Expert@:0]# mq_mng -o -v
Total 16 cores. Multiqueue 14 cores: 0,8,1,9,2,10,3,11,4,12,5,13,6,14
i/f type state mode cores
------------------------------------------------------------------------------------------------
eth4 igb Up Manual (6/6) 0(94),1(101),2(105),3(106),4(107),5(108)
eth5 igb Up Manual (8/8) 6(96),8(102),9(109),10(110),11(111),12(112),13(113),14(114)
eth6 igb Up Manual (6/6) 0(98),1(103),2(115),3(116),4(117),5(118)
eth7 igb Up Manual (6/6) 0(100),1(104),2(119),3(120),4(121),5(122)

core interfaces queue irq rx packets tx packets
------------------------------------------------------------------------------------------------
0 eth7 eth7-TxRx-0 100 330909 163420
eth6 eth6-TxRx-0 98 61509682 110784055
eth4 eth4-TxRx-0 94 852416352 983677832
1 eth7 eth7-TxRx-1 104 442589 448258
eth6 eth6-TxRx-1 103 54020929 9
eth4 eth4-TxRx-1 101 930887 987425974
2 eth7 eth7-TxRx-2 119 265478 138919
eth6 eth6-TxRx-2 115 145991072 166638954
eth4 eth4-TxRx-2 105 1714115580 753778613
3 eth7 eth7-TxRx-3 120 447326 240353
eth6 eth6-TxRx-3 116 122485504 5
eth4 eth4-TxRx-3 106 745493 920895922
4 eth7 eth7-TxRx-4 121 100454 430922
eth6 eth6-TxRx-4 117 85561230 9
eth4 eth4-TxRx-4 107 1226037 964544196
5 eth7 eth7-TxRx-5 122 170053 185712
eth6 eth6-TxRx-5 118 56023660 28
eth4 eth4-TxRx-5 108 627845 883814781
6 eth5 eth5-TxRx-0 96 391544257 738646606
8 eth5 eth5-TxRx-1 102 596438493 693538
9 eth5 eth5-TxRx-2 109 607669561 1512735496
10 eth5 eth5-TxRx-3 110 807574510 557285
11 eth5 eth5-TxRx-4 111 886419689 2492481
12 eth5 eth5-TxRx-5 112 815035662 442178
13 eth5 eth5-TxRx-6 113 513408486 28301
14 eth5 eth5-TxRx-7 114 360884778 2622

 

Also, here is top from a random capture:

%Cpu0 : 0.0 us, 1.0 sy, 0.0 ni, 97.0 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu1 : 0.0 us, 1.0 sy, 0.0 ni, 97.1 id, 0.0 wa, 1.0 hi, 1.0 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 1.0 si, 0.0 st
%Cpu4 : 0.0 us, 0.0 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 2.0 si, 0.0 st
%Cpu5 : 0.0 us, 1.0 sy, 0.0 ni, 97.0 id, 0.0 wa, 1.0 hi, 1.0 si, 0.0 st
%Cpu6 : 0.0 us, 12.6 sy, 0.0 ni, 32.2 id, 0.0 wa, 0.0 hi, 55.2 si, 0.0 st
%Cpu7 : 1.0 us, 4.0 sy, 0.0 ni, 94.0 id, 0.0 wa, 1.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.0 us, 0.0 sy, 0.0 ni, 53.3 id, 0.0 wa, 0.0 hi, 46.7 si, 0.0 st
%Cpu9 : 0.0 us, 1.3 sy, 0.0 ni, 41.0 id, 0.0 wa, 0.0 hi, 57.7 si, 0.0 st
%Cpu10 : 0.0 us, 0.0 sy, 0.0 ni, 42.7 id, 0.0 wa, 0.0 hi, 57.3 si, 0.0 st
%Cpu11 : 0.0 us, 0.0 sy, 0.0 ni, 36.2 id, 0.0 wa, 1.2 hi, 62.5 si, 0.0 st
%Cpu12 : 0.0 us, 33.3 sy, 0.0 ni, 21.1 id, 0.0 wa, 0.0 hi, 45.6 si, 0.0 st
%Cpu13 : 0.0 us, 7.7 sy, 0.0 ni, 23.1 id, 0.0 wa, 0.0 hi, 69.2 si, 0.0 st
%Cpu14 : 0.0 us, 19.8 sy, 0.0 ni, 30.2 id, 0.0 wa, 0.0 hi, 50.0 si, 0.0 st
%Cpu15 : 1.0 us, 4.0 sy, 0.0 ni, 92.9 id, 0.0 wa, 1.0 hi, 1.0 si, 0.0 st
KiB Mem : 65193124 total, 47614248 free, 6902084 used, 10676792 buff/cache
KiB Swap: 33551748 total, 33551748 free, 0 used. 57282704 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
102 admin 20 0 0 0 0 R 58.4 0.0 170:33.76 13 ksoftirqd/13
88 admin 20 0 0 0 0 R 46.5 0.0 396:23.62 11 ksoftirqd/11
81 admin 20 0 0 0 0 R 39.6 0.0 324:49.96 10 ksoftirqd/10
95 admin 20 0 0 0 0 R 38.6 0.0 254:35.82 12 ksoftirqd/12
53 admin 20 0 0 0 0 R 35.6 0.0 131:12.51 6 ksoftirqd/6
74 admin 20 0 0 0 0 R 34.7 0.0 206:23.04 9 ksoftirqd/9
109 admin 20 0 0 0 0 R 33.7 0.0 83:29.70 14 ksoftirqd/14
67 admin 20 0 0 0 0 R 11.9 0.0 141:55.80 8 ksoftirqd/8
12136 admin 20 0 0 0 0 R 5.9 0.0 11:40.47 6 snd
12144 admin 20 0 0 0 0 S 5.9 0.0 21:48.59 14 snd
10199 admin 20 0 0 0 0 S 3.0 0.0 105:56.14 15 fw_worker_0
10200 admin 20 0 0 0 0 R 3.0 0.0 62:30.12 7 fw_worker_1
12143 admin 20 0 0 0 0 S 2.0 0.0 16:33.75 13 snd
99 admin 20 0 0 0 0 S 1.0 0.0 0:00.61 7 rcuos/12
159 admin 20 0 0 0 0 S 1.0 0.0 1:13.72 6 kworker/6:1
162 admin 20 0 0 0 0 S 1.0 0.0 0:48.71 9 kworker/9:1
163 admin 20 0 0 0 0 S 1.0 0.0 1:27.20 10 kworker/10:1
2596 admin 20 0 0 0 0 S 1.0 0.0 0:41.58 14 kworker/14:3

Here is perf top -c on each of the SND cores related to this NIC:

49.46% [kernel] [k] intel_pmu_handle_irq
16.80% [kernel] [k] native_write_msr_safe
14.47% [kernel] [k] native_apic_msr_write
8.06% [kernel] [k] __kprobes_text_start
3.70% [kernel] [k] nmi
2.71% [kernel] [k] trigger_load_balance
1.72% [kernel] [k] perf_event_task_tick
1.59% [kernel] [k] idle_cpu
0.54% [kernel] [k] scheduler_tick
0.29% [kernel] [k] perf_pmu_enable
0.23% [kernel] [k] x86_pmu_enable
0.21% [kernel] [k] raise_softirq
0.12% [kernel] [k] intel_bts_enable_local
0.04% [kernel] [k] ctx_resched
0.03% [kernel] [k] __perf_event_enable
0.01% [kernel] [k] event_function
0.01% [kernel] [k] perf_ctx_unlock
0.01% [kernel] [k] flush_smp_call_function_queue
0.00% [kernel] [k] remote_function
0.00% [kernel] [k] irq_work_run

 

And perf top on the SND cores related to the WAN link:


77.65% [kernel] [k] perf_pmu_sched_task
16.41% [kernel] [k] x86_pmu_enable
3.18% [kernel] [k] perf_ctx_unlock
1.94% [kernel] [k] scheduler_tick
0.51% [kernel] [k] perf_event_task_tick
0.32% [kernel] [k] trigger_load_balance
[Expert@:0]# ^C

I'm also including statistics and info about the WAN interface:

[Expert@FWDRPMATE:0]# ethtool -i eth4
driver: igb
version: 5.3.5.20
firmware-version: 1.70, 0x80000f44, 1.2028.0
expansion-rom-version:
bus-info: 0000:04:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
[Expert@FWDRPMATE:0]# ethtool -S eth4
NIC statistics:
rx_packets: 3024706563
tx_packets: 6331976966
rx_bytes: 789286747698
tx_bytes: 9300475868971
rx_broadcast: 1
tx_broadcast: 351
rx_multicast: 0
tx_multicast: 751
multicast: 0
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 1
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 789286747698
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 184166
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 1140407572
tx_queue_0_bytes: 1618048127881
tx_queue_0_restart: 19574
tx_queue_1_packets: 1157649682
tx_queue_1_bytes: 1692427247932
tx_queue_1_restart: 10527
tx_queue_2_packets: 810060850
tx_queue_2_bytes: 1190384132595
tx_queue_2_restart: 126
tx_queue_3_packets: 1076756729
tx_queue_3_bytes: 1606619311065
tx_queue_3_restart: 7839
tx_queue_4_packets: 1126786549
tx_queue_4_bytes: 1661535812825
tx_queue_4_restart: 22133
tx_queue_5_packets: 1019443857
tx_queue_5_bytes: 1504911051092
tx_queue_5_restart: 8927
rx_queue_0_packets: 852523761
rx_queue_0_bytes: 229485827889
rx_queue_0_drops: 2776
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 1060558
rx_queue_1_bytes: 266814990
rx_queue_1_drops: 0
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
rx_queue_2_packets: 2167570424
rx_queue_2_bytes: 546087756117
rx_queue_2_drops: 181390
rx_queue_2_csum_err: 0
rx_queue_2_alloc_failed: 0
rx_queue_3_packets: 849165
rx_queue_3_bytes: 212643338
rx_queue_3_drops: 0
rx_queue_3_csum_err: 0
rx_queue_3_alloc_failed: 0
rx_queue_4_packets: 1401896
rx_queue_4_bytes: 610824371
rx_queue_4_drops: 0
rx_queue_4_csum_err: 0
rx_queue_4_alloc_failed: 0
rx_queue_5_packets: 715849
rx_queue_5_bytes: 240438500
rx_queue_5_drops: 0
rx_queue_5_csum_err: 0
rx_queue_5_alloc_failed: 0

 

About the XMT FAILED errors we are seeing in fw ctl zdebug: it actually looks like RX drops and not TX drops, because I can see the direction coming from eth5 (LAN) to eth4 (WAN) being dropped, and the line shows the core that is the SND of eth5 (the receiving side).
I can also see reversed packets from the WAN to the LAN that are dropped at the SND of the WAN (the receiving side), and those cores also seem to hit high peaks frequently.

 

About the jumbo frames: I checked, and the backbone switch in the LAN that sits before the firewall does not forward packets larger than the regular MTU; if I try to ping with a greater size with DF set, it sends the "fragmentation needed" itself. Also, the LAN firewall in front of this firewall doesn't accept a larger MTU.
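(For reference, the kind of test I ran, from a Linux host; the target address is just a placeholder. -M do sets DF, and -s 1472 plus 28 bytes of ICMP/IP headers gives a full 1500-byte frame:)

ping -M do -s 1472 10.x.x.x    # fits in a 1500 MTU, succeeds
ping -M do -s 1600 10.x.x.x    # oversized with DF set, gets "frag needed" instead of passing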

 

By the way, a couple of years ago I increased those NICs' ring buffers to 1024 (from roughly 512) because I had lots of RX drops and increasing the SNDs wasn't enough.
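(That change was done with ethtool, roughly like below; on Gaia the equivalent interface setting should also be applied so it survives a reboot:)

ethtool -g eth5            # show current and maximum RX/TX ring sizes
ethtool -G eth5 rx 1024    # raise the RX ring from the default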

 

Another weird thing: the firewall at the other end, which receives the same throughput and the same pps from this gateway, decrypts it and forwards it, doesn't show any CPU utilization. I have also checked another gateway that encrypts 550 Mbps and 50k pps with only 1 SND for all interfaces (no MQ), and it uses only 10-15% utilization on that one core. All the firewalls are basically the same hardware, which makes me think something is specifically wrong with this server, and the resources it burns are not legitimate for the work it's doing. I thought maybe HT was enabled in the BIOS, but from the Linux commands it seems that it's not.

Timothy_Hall
Champion

Pretty sure your XMT Failed is on the TX side of eth4, as cpu_0 is reporting that in the zdebug and CPU 0 lines up with TX queue 0, which is the one experiencing the restarts.  You are taking some RX-DRPs on eth4 that look like a big number, but the RX-DRP rate is only 0.02% and therefore negligible; these are probably only happening during policy installs or other brief periods of high CPU load.
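As a rough sanity check of that rate (assuming the eth4 overrun figure from your earlier netstat -ni output against its RX-OK count), the math looks like this:

awk 'BEGIN { printf "%.3f%%\n", 1499988 / 6188485515 * 100 }'    # about 0.024%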

I wouldn't worry too much about the jumbo frames as I believe they will still be accepted by the firewall since a default MTU of 1500 just limits the size of what the firewall interface can transmit, not what it will receive.  However jumbo frames will occupy more than one ring buffer slot which may exacerbate queuing drops and also require fragmentation.

I understand that TAC took you through some manual tuning of Multi-Queue, which was certainly valid prior to R80.40 under the 2.6.18 kernel, but to be blunt this is a huge no-no when the Gaia 3.10 kernel is present: once you start making manual changes, all attempts by Multi-Queue to automatically reassign traffic to keep everything balanced are abandoned.  This is indicated in your mq_mng output showing "Manual".  I believe this is why TX queue 0 is getting pounded, resulting in your XMT Failures.  Not sure if this results in packet loss or if SecureXL just buffers the packet and tries again later, but given SecureXL's implementation I would assume the former.

It looks like you have a total of 16 cores, and 14 of those are also being used as SNDs which are overlapping with your Workers/Instances which is going to just make things worse.  Please provide output of fw ctl affinity -l -a -v to verify your split.  Depending on what blades you have enabled (enabled_blades) and level of acceleration (fwaccel stats -s) you will probably need to reset your split, my shot in the dark would be a 6/10 split.

It is not clear to me based on your server specs how many actual physical cores you have on your server irrespective of how many threads per core are set.  Can you determine that?
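A few standard ways to check that from expert mode (generic Linux commands, so exact field names may vary):

grep -c ^processor /proc/cpuinfo                  # logical CPUs (threads) seen by the OS
lscpu | grep -E 'Socket|Core|Thread'              # sockets, cores per socket, threads per core
dmidecode -t processor | grep -E 'Core|Thread'    # what the BIOS reports per physical CPU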

My recommendation is to set all your interfaces back to Automatic Mode, make sure you have no manual affinity adjustments in fwaffinity.conf, and assess what happens with your current split.  You'll probably need to then adjust your split.  Making further manual adjustments to Multi-Queue is just going to dig the hole you are already in even deeper.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

As of now we have only 2 fw workers, since the workers don't do much and the SNDs were needed a LOT.

I have changed mq_mng back to auto on all interfaces, and it now has 8 cores on all 4 interfaces.

So far (30 min) it looks like a slight improvement: the XMT errors no longer grow all the time (they grow more slowly), and the ping response time was reduced at first but then went back to 40-60 ms.

Still, the CoreXL_SND average utilization in cpview is around 60% (CoreXL_FW is around 20%).
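(The way I'm watching the counter, roughly; this assumes the error line in fwaccel stats -d contains "XMT":)

while true; do date; fwaccel stats -d | grep -i xmt; sleep 60; done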

 

[Expert@:0]# mq_mng -o -v
Total 16 cores. Multiqueue 14 cores: 0,8,1,9,2,10,3,11,4,12,5,13,6,14
i/f type state mode cores
------------------------------------------------------------------------------------------------
eth4 igb Up Auto (8/8) 0(94),8(101),1(105),9(106),2(107),10(108),3(123),11(124)
eth5 igb Up Auto (8/8) 0(96),8(102),1(109),9(110),2(111),10(112),3(113),11(114)
eth6 igb Up Auto (8/8) 0(98),8(103),1(115),9(116),2(117),10(118),3(125),11(126)
eth7 igb Up Auto (8/8) 0(100),8(104),1(119),9(120),2(121),10(122),3(127),11(128)

core interfaces queue irq rx packets tx packets
------------------------------------------------------------------------------------------------
0 eth7 eth7-TxRx-0 100 4054 2614
eth6 eth6-TxRx-0 98 4337 2389295
eth5 eth5-TxRx-0 96 1431950059 4787390404
eth4 eth4-TxRx-0 94 69991933 20191761
1 eth7 eth7-TxRx-2 119 7482 3570
eth6 eth6-TxRx-2 115 635321 0
eth5 eth5-TxRx-2 109 1799048202 2904810018
eth4 eth4-TxRx-2 105 18046 18743539
2 eth7 eth7-TxRx-4 121 3550 4169
eth6 eth6-TxRx-4 117 160583 0
eth5 eth5-TxRx-4 111 3804379172 4513782
eth4 eth4-TxRx-4 107 17924 15181281
3 eth7 eth7-TxRx-6 127 3622 4611
eth6 eth6-TxRx-6 125 1058229 0
eth5 eth5-TxRx-6 113 1598328597 114939
eth4 eth4-TxRx-6 123 16817 18803238
8 eth7 eth7-TxRx-1 104 14297 11289
eth6 eth6-TxRx-1 103 847619 0
eth5 eth5-TxRx-1 102 1490849757 2368795
eth4 eth4-TxRx-1 101 21762 19148078
9 eth7 eth7-TxRx-3 120 3112 3614
eth6 eth6-TxRx-3 116 874458 0
eth5 eth5-TxRx-3 110 2544794164 1846379
eth4 eth4-TxRx-3 106 23225 18328112
10 eth7 eth7-TxRx-5 122 13273 12902
eth6 eth6-TxRx-5 118 73330 0
eth5 eth5-TxRx-5 112 2688024236 1522012
eth4 eth4-TxRx-5 108 31633 15783661
11 eth7 eth7-TxRx-7 128 2314 7709
eth6 eth6-TxRx-7 126 861339 0
eth5 eth5-TxRx-7 114 1631642001 23084
eth4 eth4-TxRx-7 124 20099 19661388

 

[Expert@:0]# fw ctl affinity -l -a -v
Kernel fw_0: CPU 15
Kernel fw_1: CPU 7
Daemon mpdaemon: CPU 7 15
Daemon fwd: CPU 7 15
Daemon in.asessiond: CPU 7 15
Daemon cprid: CPU 7 15
Daemon lpd: CPU 7 15
Daemon in.geod: CPU 7 15
Daemon vpnd: CPU 7 15
Daemon cprid: CPU 7 15
Daemon cpd: CPU 7 15
Interface eth4: has multi queue enabled
Interface eth5: has multi queue enabled
Interface eth6: has multi queue enabled
Interface eth7: has multi queue enabled


[Expert@:0]# enabled_blades
fw vpn mon

(P.S. no traffic counters are enabled in Monitoring)


[Expert@:0]# fwaccel stats -s
Accelerated conns/Total conns : 1927/1996 (96%)
Accelerated pkts/Total pkts : 26501229324/27030948142 (98%)
F2Fed pkts/Total pkts : 529718818/27030948142 (1%)
F2V pkts/Total pkts : 4149287/27030948142 (0%)
CPASXL pkts/Total pkts : 0/27030948142 (0%)
PSLXL pkts/Total pkts : 163745402/27030948142 (0%)
CPAS pipeline pkts/Total pkts : 0/27030948142 (0%)
PSL pipeline pkts/Total pkts : 0/27030948142 (0%)
CPAS inline pkts/Total pkts : 0/27030948142 (0%)
PSL inline pkts/Total pkts : 0/27030948142 (0%)
QOS inbound pkts/Total pkts : 0/27030948142 (0%)
QOS outbound pkts/Total pkts : 0/27030948142 (0%)
Corrected pkts/Total pkts : 0/27030948142 (0%)


A random top now:

Tasks: 319 total, 2 running, 317 sleeping, 0 stopped, 0 zombie
%Cpu0 : 0.0 us, 1.0 sy, 0.0 ni, 30.7 id, 0.0 wa, 1.0 hi, 67.3 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni, 46.5 id, 0.0 wa, 0.0 hi, 53.5 si, 0.0 st
%Cpu2 : 0.0 us, 0.0 sy, 0.0 ni, 53.0 id, 0.0 wa, 0.0 hi, 47.0 si, 0.0 st
%Cpu3 : 0.0 us, 0.0 sy, 0.0 ni, 49.5 id, 0.0 wa, 1.0 hi, 49.5 si, 0.0 st
%Cpu4 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu5 : 0.0 us, 1.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu6 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu7 : 1.0 us, 9.2 sy, 0.0 ni, 89.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu8 : 0.0 us, 2.0 sy, 0.0 ni, 44.0 id, 0.0 wa, 0.0 hi, 54.0 si, 0.0 st
%Cpu9 : 0.0 us, 0.0 sy, 0.0 ni, 39.6 id, 0.0 wa, 0.0 hi, 60.4 si, 0.0 st
%Cpu10 : 0.0 us, 0.0 sy, 0.0 ni, 41.0 id, 0.0 wa, 0.0 hi, 59.0 si, 0.0 st
%Cpu11 : 0.0 us, 0.0 sy, 0.0 ni, 44.4 id, 0.0 wa, 0.0 hi, 55.6 si, 0.0 st
%Cpu12 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu13 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu14 : 0.0 us, 1.0 sy, 0.0 ni, 99.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu15 : 2.0 us, 10.2 sy, 0.0 ni, 87.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65193124 total, 47527864 free, 6910096 used, 10755164 buff/cache
KiB Swap: 33551748 total, 33551748 free, 0 used. 57263388 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND
10199 admin 20 0 0 0 0 R 7.9 0.0 296:46.88 15 fw_worker_0
10200 admin 20 0 0 0 0 S 7.9 0.0 225:31.69 7 fw_worker_1
12072 admin 20 0 618944 118872 42884 S 2.0 0.2 37:43.00 7 fw_full
13123 admin 20 0 3900 1680 1072 R 2.0 0.0 0:00.10 15 top
3 admin 20 0 0 0 0 S 1.0 0.0 40:05.55 0 ksoftirqd/0
74 admin 20 0 0 0 0 S 1.0 0.0 423:48.77 9 ksoftirqd/9
12135 admin 20 0 0 0 0 S 1.0 0.0 40:27.52 5 snd
12136 admin 20 0 0 0 0 S 1.0 0.0 24:24.55 6 snd
12138 admin 20 0 0 0 0 S 1.0 0.0 38:29.45 8 snd
12144 admin 20 0 0 0 0 S 1.0 0.0 48:17.69 14 snd

 

About HT, I'm not sure how to determine it from the CLI.

We have two processors, each containing 8 cores.

In dmidecode I see this for each of the processors:

Core Count: 8
Core Enabled: 8
Thread Count: 16

So it seems like maybe HT was kept enabled by default, but I'm not sure that's conclusive from this output, because I see the same on other FWs and I'm sure I disabled HT in the BIOS before each installation. Also, in top I see only 16 cores in total and not 32.

Timothy_Hall
Champion

Based on the blades you have enabled and the acceleration stats, I concur with your current 14/2 split.  However it looks like you are bumping against an 8-queue limit for the igb driver, which may also be driven by your NIC hardware.  So while you have 14 threads assigned to SND, the same 8 SND CPU threads (0,1,2,3,8,9,10,11) are having to handle all the SND load.  CPU threads 4-6 and 12-14 that are assigned to SND are doing absolutely nothing, which is confirmed by your top output.
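If you want to confirm the queue limit on the NIC itself rather than inferring it from mq_mng, ethtool should show it, assuming this igb build supports the channels query:

ethtool -l eth5    # "Combined" under "Pre-set maximums" is the hardware/driver queue limit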

So this is going to sound a bit strange, but it might be advantageous to drop the server from 16 threads to 8 threads via the BIOS, and set a 7/1 split by disabling CoreXL (or maybe a 6/2 split with CoreXL still enabled).  You have a special situation where a very high amount of connections and packets are accelerated causing practically all your processing to happen on the SNDs.  The overhead of CoreXL coordination between multiple firewall workers is not helping you at all.   

SMT/Hyperthreading actually hurts the performance of the SND cores under high load due to the rapid-fire, non-waiting nature of SecureXL operations as the different SND threads stomp on each other trying to get to the same physical core.  Firewall Workers on the other hand benefit from SMT because they spend a lot of time waiting for an event to occur (like the next packet of a connection) and another Firewall Worker in another thread can jump onto the same physical CPU and get some work done during the wait.  

Just want to reiterate that this is a very special situation and the above recommendations should most definitely NOT be implemented on the vast majority of firewalls out there.

 

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

Thank you Timothy,

So I replaced the open server with another HP G10 (because I suspected something was wrong with the hardware/BIOS), but with the same NIC unfortunately, and I optimized the BIOS per https://community.checkpoint.com/t5/General-Topics/R80-x-Performance-Tuning-Tip-BIOS/td-p/95897

I also changed MQ back to auto as you recommended, and there is a huge improvement. CPU utilization since then is around 20% on the 8 SND cores (we were at 40-60% before that).

Although I still see TX restarts on the queues of the WAN interface, incrementing randomly across the TX queues (could be 6-20 restarts total over all queues per minute); I'm not sure whether that's acceptable and would like your opinion.
Most of the traffic direction is: enter from the LAN and go out to the WAN. There are no drops on the RX side of the WAN, and also no drops or restarts at all on the LAN interface.

WAN:

tx_heartbeat_errors: 0
tx_queue_0_packets: 709988150
tx_queue_0_bytes: 1035258595614
tx_queue_0_restart: 10265
tx_queue_1_packets: 701211508
tx_queue_1_bytes: 1033151214528
tx_queue_1_restart: 615
tx_queue_2_packets: 687362666
tx_queue_2_bytes: 999566423562
tx_queue_2_restart: 317
tx_queue_3_packets: 709671927
tx_queue_3_bytes: 1043638846375
tx_queue_3_restart: 357
tx_queue_4_packets: 590150259
tx_queue_4_bytes: 868109592884
tx_queue_4_restart: 290
tx_queue_5_packets: 707698304
tx_queue_5_bytes: 1039255347264
tx_queue_5_restart: 901
tx_queue_6_packets: 710325578
tx_queue_6_bytes: 1038235705785
tx_queue_6_restart: 3716
tx_queue_7_packets: 702520493
tx_queue_7_bytes: 1021865020122
tx_queue_7_restart: 574
rx_queue_0_packets: 730774

 

I also still experience high response times in ping: from the LAN to the VPN peer FW (over the tunnel) it's 20 ms, but to PCs in the VPN peer's encryption domain it's 60-90 ms (from the VPN peer to its local LAN PCs it's under 1 ms). Not sure if it's related to performance, or to VPN, or to something else.

 

I would also like to thank you. I appreciate your help here and on other posts. I learn a LOT from you, things that I never learned elsewhere, including from Check Point staff. So thanks!

Timothy_Hall
Champion

As long as the WAN queues are reasonably balanced (which they are in your latest output) I wouldn't worry about queue restarts, as long as you are not experiencing actual drops.  The queue restarts were an issue on the original server in that they were a red flag that queue 0 was getting way overloaded due to improper queue balancing.

Very possible that something in the BIOS settings of the original server was hampering your performance, usually the culprit is settings involving energy conservation.

As far as the latency you are seeing, there should be very little delay introduced by the firewall if practically all of the traffic is fully accelerated by SecureXL.  I'd suggest running the pathping (Windows) or tracepath (Linux) commands from inside the network, through the firewall, to somewhere on the WAN.  These commands are similar to tracert/traceroute but take the time to flood each hop with a lot of traffic, which helps isolate precisely where latency or loss is being introduced in the network path.  I highly doubt the firewall is the source of 60-90ms latency in your scenario, but if it is, run the Super Seven commands and post back to this thread so we can investigate further.
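For the Linux side, a minimal example would be something like this from a host on the LAN (the address is a placeholder):

tracepath -n 10.x.x.x    # per-hop latency and path MTU; -n skips DNS lookups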

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

Thank you

So, pathping from a DC LAN PC behind FW1 to a DR LAN PC behind FW2:

Source to Here This Node/Link
Hop RTT Lost/Sent = Pct Lost/Sent = Pct Address
0 Source PC
1/ 100 = 0% |
2 0ms 0/ 100 = 0% 0/ 100 = 0% FW1(IN)
0/ 100 = 0% |
3 41ms 0/ 100 = 0% 0/ 100 = 0% FW2(DR-wan)
0/ 100 = 0% |
4 42ms 0/ 100 = 0% 0/ 100 = 0% DST PC

P.S. On FW2 the MQ is still manual; hopefully tomorrow I will have a window to make this change. Also, there are onboard NICs in use (eth0-eth3) which don't support MQ.

So all the delay happens between FW1 and FW2, and maybe a bit more through FW2 (I don't know what to blame, it could also be the line). A ping from the FW to this PC comes back in less than 1 ms, so it's not the switch/cables etc.

 

Super Seven on FW1 and then on FW2 (P.S. FW1 is the firewall we discussed throughout this post):

[Expert@FW1:0]# fwaccel stat

fwaccel stats -s
+---------------------------------------------------------------------------------+
|Id|Name |Status |Interfaces |Features |
+---------------------------------------------------------------------------------+
|0 |SND |enabled |eth6,eth7,eth8,eth9 |Acceleration,Cryptography |
| | | | |Crypto: Tunnel,UDPEncap,MD5, |
| | | | |SHA1,NULL,3DES,DES,AES-128, |
| | | | |AES-256,ESP,LinkSelection, |
| | | | |DynamicVPN,NatTraversal, |
| | | | |AES-XCBC,SHA256,SHA384 |
+---------------------------------------------------------------------------------+

Accept Templates : enabled
Drop Templates : disabled by Firewall
NAT Templates : enabled
[Expert@FW1:0]#

[Expert@FW1:0]# fwaccel stats -s

Accelerated conns/Total conns : 2574/2591 (99%)
Accelerated pkts/Total pkts : 9706394917/9871760070 (98%)
F2Fed pkts/Total pkts : 165365153/9871760070 (1%)
F2V pkts/Total pkts : 1312154/9871760070 (0%)
CPASXL pkts/Total pkts : 0/9871760070 (0%)
PSLXL pkts/Total pkts : 13643297/9871760070 (0%)
CPAS pipeline pkts/Total pkts : 0/9871760070 (0%)
PSL pipeline pkts/Total pkts : 0/9871760070 (0%)
CPAS inline pkts/Total pkts : 0/9871760070 (0%)
PSL inline pkts/Total pkts : 0/9871760070 (0%)
QOS inbound pkts/Total pkts : 0/9871760070 (0%)
QOS outbound pkts/Total pkts : 0/9871760070 (0%)
Corrected pkts/Total pkts : 0/9871760070 (0%)
grep -c ^processor /proc/cpuinfo
[Expert@FW1:0]#
[Expert@FW1:0]#
[Expert@FW1:0]# grep -c ^processor /proc/cpuinfo
16
[Expert@FW1:0]# /sbin/cpuinfo
HyperThreading=disabled
[Expert@FW1:0]# fw ctl affinity -l -r
CPU 0:
CPU 1:
CPU 2:
CPU 3:
CPU 4:
CPU 5:
CPU 6:
CPU 7:
CPU 8:
CPU 9:
CPU 10:
CPU 11:
CPU 12:
CPU 13:
CPU 14: fw_1
mpdaemon fwd rtmd in.asessiond in.geod lpd vpnd cprid cprid cpd
CPU 15: fw_0
mpdaemon fwd rtmd in.asessiond in.geod lpd vpnd cprid cprid cpd
All:
Interface eth6: has multi queue enabled
Interface eth7: has multi queue enabled
Interface eth8: has multi queue enabled
Interface eth9: has multi queue enabled
[Expert@FW1:0]# netstat -ni
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth6 1500 0 3313160984 0 0 126 6557462840 0 0 0 BMRU
eth7 1500 0 5781200040 0 310 0 2902647307 0 0 0 BMRU
eth8 1500 0 782059878 0 0 0 409229880 0 0 0 BMRU
eth9 1500 0 2562908 0 0 0 2242381 0 0 0 BMRU
lo 65536 0 469393 0 0 0 469393 0 0 0 ALdPRU

[Expert@FW1:0]# fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 15 | 1594 | 3301
1 | Yes | 14 | 1733 | 3538
[Expert@FW1:0]# cpstat os -f multi_cpu -o 1

 

Processors load
---------------------------------------------------------------------------------
|CPU#|User Time(%)|System Time(%)|Idle Time(%)|Usage(%)|Run queue|Interrupts/sec|
---------------------------------------------------------------------------------
| 1| 0| 27| 73| 27| ?| 78532|
| 2| 0| 27| 73| 27| ?| 78531|
| 3| 0| 23| 77| 23| ?| 78531|
| 4| 0| 41| 59| 41| ?| 78531|
| 5| 0| 25| 75| 25| ?| 78531|
| 6| 0| 26| 74| 26| ?| 78532|
| 7| 0| 26| 74| 26| ?| 78532|
| 8| 0| 16| 84| 16| ?| 78532|
| 9| 0| 0| 100| 0| ?| 78532|
| 10| 0| 0| 100| 0| ?| 78532|
| 11| 0| 0| 100| 0| ?| 78531|
| 12| 0| 0| 100| 0| ?| 78531|
| 13| 0| 1| 99| 1| ?| 78531|
| 14| 0| 0| 100| 0| ?| 78532|
| 15| 2| 10| 88| 12| ?| 78532|
| 16| 2| 10| 88| 12| ?| 78532|
---------------------------------------------------------------------------------

 

 

 

________________________________________________________________________________________________

 

[Expert@FW2:0]# fwaccel stat

fwaccel stats -s
+---------------------------------------------------------------------------------+
|Id|Name |Status |Interfaces |Features |
+---------------------------------------------------------------------------------+
|0 |SND |enabled |eth0,eth1,eth4,eth2,eth5,|
| | | |eth3,eth6 |Acceleration,Cryptography |
| | | | |Crypto: Tunnel,UDPEncap,MD5, |
| | | | |SHA1,NULL,3DES,DES,AES-128, |
| | | | |AES-256,ESP,LinkSelection, |
| | | | |DynamicVPN,NatTraversal, |
| | | | |AES-XCBC,SHA256,SHA384 |
+---------------------------------------------------------------------------------+

Accept Templates : enabled
Drop Templates : disabled
NAT Templates : enabled
[Expert@FW2:0]#
[Expert@FW2:0]# fwaccel stats -s


Accelerated conns/Total conns : 2506/2577 (97%)
Accelerated pkts/Total pkts : 67769268945/69022063506 (98%)
F2Fed pkts/Total pkts : 1252794561/69022063506 (1%)
F2V pkts/Total pkts : 11758117/69022063506 (0%)
CPASXL pkts/Total pkts : 0/69022063506 (0%)
PSLXL pkts/Total pkts : 637498648/69022063506 (0%)
CPAS pipeline pkts/Total pkts : 0/69022063506 (0%)
PSL pipeline pkts/Total pkts : 0/69022063506 (0%)
CPAS inline pkts/Total pkts : 0/69022063506 (0%)
PSL inline pkts/Total pkts : 0/69022063506 (0%)
QOS inbound pkts/Total pkts : 0/69022063506 (0%)
QOS outbound pkts/Total pkts : 0/69022063506 (0%)
Corrected pkts/Total pkts : 0/69022063506 (0%)
[Expert@FW2:0]#
[Expert@FW2:0]#
[Expert@FW2:0]# grep -c ^processor /proc/cpuinfo
16
[Expert@FW2:0]# /sbin/cpuinfo
HyperThreading=disabled

[Expert@FW2:0]# fw ctl affinity -l -r

CPU 0: eth0 eth1 eth2 eth3
CPU 1: eth0 eth1 eth2 eth3
CPU 2: eth0 eth1 eth2 eth3
CPU 3: eth0 eth1 eth2 eth3
CPU 4: eth0 eth1 eth2 eth3
CPU 5: eth0 eth1 eth2 eth3
CPU 6: eth0 eth1 eth2 eth3
fw_1
mpdaemon fwd rtmd in.asessiond cprid lpd vpnd in.geod cprid cpd
CPU 7: eth0 eth1 eth2 eth3
fw_0
mpdaemon fwd rtmd in.asessiond cprid lpd vpnd in.geod cprid cpd
CPU 8:
CPU 9:
CPU 10:
CPU 11:
CPU 12:
CPU 13:
CPU 14:
CPU 15:
All:
The current license permits the use of CPUs 0, 1, 2, 3, 4, 5, 6, 7 only.
Interface eth4: has multi queue enabled
Interface eth5: has multi queue enabled
Interface eth6: has multi queue enabled

[Expert@FW2:0]# netstat -ni
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0 1500 0 0 0 0 0 0 0 0 0 BMU
eth1 1500 0 2534394497 0 0 0 4785562940 0 0 0 BMRU
eth2 1500 0 1179698797 0 75 0 1837174159 0 0 0 BMRU
eth3 1500 0 4312584 0 0 0 3849835 0 0 0 BMRU
eth4 1500 0 16585992386 0 5 126026 35580000391 0 0 0 BMRU
eth4.804 1500 0 1245708180 0 0 0 2580565915 0 24940 0 BMRU
eth4.805 1500 0 5603159236 0 0 0 11239315006 0 100102 0 BMRU
eth4.806 1500 0 91609315 0 0 0 150146869 0 5465 0 BMRU
eth4.807 1500 0 51160937 0 6 0 61940005 0 456 0 BMRU
eth4.808 1500 0 7036798557 0 0 0 19363874380 0 208406 0 BMRU
eth4.809 1500 0 1163206519 0 0 0 2494053966 0 9128 0 BMRU
eth4.810 1500 0 139 0 0 0 159 0 0 0 BMRU
eth4.811 1500 0 352578574 0 0 0 541759921 0 5697 0 BMRU
eth4.812 1500 0 892044 0 0 0 673961 0 0 0 BMRU
eth4.813 1500 0 75580132 0 0 0 117268872 0 1751 0 BMRU
eth4.814 1500 0 2623752 0 0 0 5078609 0 29 0 BMRU
eth4.815 1500 0 21219715 0 0 0 8812657 0 439 0 BMRU
eth4.816 1500 0 439854053 0 0 0 959487004 0 4538 0 BMRU
eth4.1901 1500 0 269413905 0 840 0 318701740 0 2698 0 BMRU
eth4.1902 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1903 1500 0 8241238 0 810 0 6467051 0 54 0 BMRU
eth4.1904 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1905 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1906 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1911 1500 0 121657497 0 0 0 115869761 0 12482 0 BMRU
eth4.1912 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1914 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1915 1500 0 0 0 0 0 0 0 0 0 BMRU
eth4.1919 1500 0 2373787841 0 0 0 1813929500 0 5177 0 BMRU
eth4.1920 1500 0 0 0 0 0 0 0 0 0 BMRU
eth5 1500 0 35117151 0 0 0 36511606 0 0 0 BMRU
eth6 1500 0 41420306330 0 1 318697 19837947720 0 0 0 BMRU
lo 65536 0 3596603 0 0 0 3596603 0 0 0 LMdPNRU


[Expert@FW2:0]# fw ctl multik stat
ID | Active | CPU | Connections | Peak
----------------------------------------------
0 | Yes | 7 | 1694 | 5181
1 | Yes | 6 | 1700 | 3530


[Expert@FW2:0]# cpstat os -f multi_cpu -o 1

 

Processors load
---------------------------------------------------------------------------------
|CPU#|User Time(%)|System Time(%)|Idle Time(%)|Usage(%)|Run queue|Interrupts/sec|
---------------------------------------------------------------------------------
| 1| 0| 5| 95| 5| ?| 77275|
| 2| 0| 5| 95| 5| ?| 77274|
| 3| 0| 5| 95| 5| ?| 77273|
| 4| 0| 62| 38| 62| ?| 77274|
| 5| 0| 13| 87| 13| ?| 77273|
| 6| 0| 4| 96| 4| ?| 77273|
| 7| 4| 8| 88| 12| ?| 77271|
| 8| 2| 13| 85| 15| ?| 77271|
| 9| 0| 0| 100| 0| ?| 77271|
| 10| 0| 0| 100| 0| ?| 77272|
| 11| 0| 0| 100| 0| ?| 77272|
| 12| 0| 0| 100| 0| ?| 77271|
| 13| 0| 0| 100| 0| ?| 77270|
| 14| 0| 0| 100| 0| ?| 77270|

Timothy_Hall
Champion

Strange that you are picking up 40ms at FW2 like that; try a pathping directly to the following three addresses from the inside and post the results, to help determine if it is the line between the two:

1) Externally-facing IP address of FW1

2) Internally-facing IP address of FW2

3) Externally-facing IP address of FW2

Also, which interfaces on FW1 and FW2 face each other?  In other words, what interface name on FW1 is facing FW2, and which interface on FW2 is facing FW1?

FW1 looks good after your adjustments.  FW2 is seeing a bunch of TX-DRPs on your eth4 subinterfaces, hopefully those will go away once you make the auto MQ adjustments to FW2 that were already made to FW1.  Also it looks like you are limited by license to only 8 cores on FW2, so in that case I'd definitely recommend changing the number of threads from 16 to 8 in the BIOS of FW2 and going with a 6/2 split, as the extra 8 cores over the license limit aren't doing you any good and just causing needless overhead.  Also make sure the BIOS settings are optimized for FW2 while you are in there adjusting the thread count.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

Thank you!

About your recommendations - will do.

eth6 of FW2 faces eth6 of FW1.

FW1 internal NICs: eth7+8+9
FW2 internal NICs: eth1+2+3+4+5


C:\Users\aarama>pathping fw1_wan

0/ 100 = 0% |
1 0ms 0/ 100 = 0% 0/ 100 = 0% self
0/ 100 = 0% |
2 0ms 0/ 100 = 0% 0/ 100 = 0% fw1_wan

 

pathping fw2_wan


Computing statistics for 75 seconds...
Source to Here This Node/Link
Hop RTT Lost/Sent = Pct Lost/Sent = Pct Address
0
0/ 100 = 0% |
1 0ms 0/ 100 = 0% 0/ 100 = 0% self
0/ 100 = 0% |
2 0ms 0/ 100 = 0% 0/ 100 = 0% fw1_lan
0/ 100 = 0% |
3 25ms 0/ 100 = 0% 0/ 100 = 0% fw2_wan

Trace complete.

 

pathping fw2_internal_vlan_on_eth4

Computing statistics for 75 seconds...
Source to Here This Node/Link
Hop RTT Lost/Sent = Pct Lost/Sent = Pct Address

0/ 100 = 0% |
1 0ms 0/ 100 = 0% 0/ 100 = 0% self
0/ 100 = 0% |
2 0ms 0/ 100 = 0% 0/ 100 = 0% fw1_lan
0/ 100 = 0% |
3 37ms 0/ 100 = 0% 0/ 100 = 0% FW1_one_of_internal_vlans on eth4

Trace complete.

Timothy_Hall
Champion

Yeah, that's strange that you are picking up 25ms on the near side of FW2; hopefully that will improve once the tuning is done.  If it doesn't, there might be some kind of switch or interface congestion going on where eth6 is attached; although the eth6 interfaces themselves seem mostly fine on the two firewalls, eth6 does look to be struggling a bit on FW2.  Please provide the output of ethtool -S eth6 on both firewalls.

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

By the way, the L2 line is 1G, and I see that the traffic between those interfaces is 985 Mbps. Maybe the line is choked, don't you think?

[Expert@FW1:0]# ethtool -S eth6
NIC statistics:
rx_packets: 7541861113
tx_packets: 14903608200
rx_bytes: 1679632347626
tx_bytes: 21917720574002
rx_broadcast: 1
tx_broadcast: 826
rx_multicast: 0
tx_multicast: 155
multicast: 0
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 28
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 1679632347626
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 126
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 2101149431
tx_queue_0_bytes: 3071829453586
tx_queue_0_restart: 12055
tx_queue_1_packets: 1816877870
tx_queue_1_bytes: 2669239350811
tx_queue_1_restart: 1300
tx_queue_2_packets: 1841642136
tx_queue_2_bytes: 2701629590453
tx_queue_2_restart: 810
tx_queue_3_packets: 2034719160
tx_queue_3_bytes: 2997635073671
tx_queue_3_restart: 1167
tx_queue_4_packets: 1571701216
tx_queue_4_bytes: 2320291594232
tx_queue_4_restart: 979
tx_queue_5_packets: 1918395970
tx_queue_5_bytes: 2813492300087
tx_queue_5_restart: 1909
tx_queue_6_packets: 2100834983
tx_queue_6_bytes: 3091780741411
tx_queue_6_restart: 12901
tx_queue_7_packets: 1518288320
tx_queue_7_bytes: 2192206499466
tx_queue_7_restart: 1134
rx_queue_0_packets: 2023684
rx_queue_0_bytes: 522527366
rx_queue_0_drops: 0
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 1813431
rx_queue_1_bytes: 293831685
rx_queue_1_drops: 0
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
rx_queue_2_packets: 1530388
rx_queue_2_bytes: 435814688
rx_queue_2_drops: 0
rx_queue_2_csum_err: 0
rx_queue_2_alloc_failed: 0
rx_queue_3_packets: 7529875242
rx_queue_3_bytes: 1645802731288
rx_queue_3_drops: 126
rx_queue_3_csum_err: 0
rx_queue_3_alloc_failed: 0
rx_queue_4_packets: 1176848
rx_queue_4_bytes: 444175315
rx_queue_4_drops: 0
rx_queue_4_csum_err: 0
rx_queue_4_alloc_failed: 0
rx_queue_5_packets: 1326995
rx_queue_5_bytes: 276222361
rx_queue_5_drops: 0
rx_queue_5_csum_err: 0
rx_queue_5_alloc_failed: 0
rx_queue_6_packets: 1882324
rx_queue_6_bytes: 348914547
rx_queue_6_drops: 0
rx_queue_6_csum_err: 0
rx_queue_6_alloc_failed: 0
rx_queue_7_packets: 2216383
rx_queue_7_bytes: 1338162787
rx_queue_7_drops: 0
rx_queue_7_csum_err: 0
rx_queue_7_alloc_failed: 0


[Expert@FW2:0]# ethtool -S eth6
NIC statistics:
rx_packets: 54899149880
tx_packets: 26798553331
rx_bytes: 80562061846139
tx_bytes: 6446687664068
rx_broadcast: 3185
tx_broadcast: 6930
rx_multicast: 1
tx_multicast: 140
multicast: 1
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 2
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 80562061846139
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 399061
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 3810684423
tx_queue_0_bytes: 945387648630
tx_queue_0_restart: 0
tx_queue_1_packets: 6582592081
tx_queue_1_bytes: 1736765833299
tx_queue_1_restart: 0
tx_queue_2_packets: 4688819205
tx_queue_2_bytes: 1002968806055
tx_queue_2_restart: 4
tx_queue_3_packets: 5206007635
tx_queue_3_bytes: 1088407924818
tx_queue_3_restart: 0
tx_queue_4_packets: 3772192627
tx_queue_4_bytes: 850635740873
tx_queue_4_restart: 7
rx_queue_0_packets: 133294941
rx_queue_0_bytes: 193391346127
rx_queue_0_drops: 0
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 219528931
rx_queue_1_bytes: 320542519041
rx_queue_1_drops: 0
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
rx_queue_2_packets: 21769718264
rx_queue_2_bytes: 31892240591035
rx_queue_2_drops: 352001
rx_queue_2_csum_err: 0
rx_queue_2_alloc_failed: 0
rx_queue_3_packets: 173450414
rx_queue_3_bytes: 254586366194
rx_queue_3_drops: 0
rx_queue_3_csum_err: 0
rx_queue_3_alloc_failed: 0
rx_queue_4_packets: 27451797622
rx_queue_4_bytes: 40226636192810
rx_queue_4_drops: 46556
rx_queue_4_csum_err: 0
rx_queue_4_alloc_failed: 0

Timothy_Hall
Champion

Yes, utilization was going to be my next question. The outbound path (eth6 FW1 TX and eth6 FW2 RX) seems to be struggling much more than the inbound path, does that seem right?  Is the major flow of traffic outbound?  The eth6 physical medium is running clean, but the extremely high load is overwhelming it.  Are there switchport counters that can be examined where the two eth6 interfaces are connected?  Gotta think the switch is struggling too, unless FW1 and FW2 are direct wired on eth6.

Definitely seems like an 802.3ad bond with two 1Gb ports on each firewall, or if possible 10Gb interfaces, is in order here.
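For reference, bonding two ports on Gaia would look roughly like this in clish (a sketch from memory, so verify against the Gaia admin guide; the group number and the second port name are placeholders, and member interfaces must have no IP assigned before being added):

add bonding group 1
add bonding group 1 interface eth6
add bonding group 1 interface ethX
set bonding group 1 mode 8023AD
set bonding group 1 xmit-hash-policy layer3+4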

Gateway Performance Optimization R81.20 Course
now available at maxpowerfirewalls.com
Amir_Arama
Advisor

Yes

The traffic comes in from eth7+eth8, goes out eth6 on FW1, is accepted by FW2 on eth6, and from there is distributed over multiple LAN interfaces. Basically the line is used for system synchronization from the DC (FW1) to the DR (FW2).

There is no switch; they are connected directly to a Bezeq modem.

I thought about making it optical or a bond too, but that requires putting a switch in the middle on both sides, and I wasn't sure it was worth the trouble.
