Hello everyone,
Environment:
Cluster of SG4800 appliances running R77.30 with Jumbo Hotfix Take 216.
I just noticed that from time to time the following messages appear in /var/log/messages:
Dec 28 07:23:22 2018 GWB kernel: e1000e 0000:0f:00.0: eth4: Detected Hardware Unit Hang:
Dec 28 07:23:22 2018 GWB kernel: TDH <218>
Dec 28 07:23:22 2018 GWB kernel: TDT <21b>
Dec 28 07:23:22 2018 GWB kernel: next_to_use <21b>
Dec 28 07:23:22 2018 GWB kernel: next_to_clean <218>
Dec 28 07:23:22 2018 GWB kernel: buffer_info[next_to_clean]:
Dec 28 07:23:22 2018 GWB kernel: time_stamp <62cafbcd>
Dec 28 07:23:22 2018 GWB kernel: next_to_watch <218>
Dec 28 07:23:22 2018 GWB kernel: jiffies <62cb0098>
Dec 28 07:23:22 2018 GWB kernel: next_to_watch.status <0>
Dec 28 07:23:22 2018 GWB kernel: MAC Status <80783>
Dec 28 07:23:22 2018 GWB kernel: PHY Status <796d>
Dec 28 07:23:22 2018 GWB kernel: PHY 1000BASE-T Status <3800>
Dec 28 07:23:22 2018 GWB kernel: PHY Extended Status <3000>
Dec 28 07:23:22 2018 GWB kernel: PCI Status <10>
Dec 28 01:32:53 2018 GWB kernel: e1000e 0000:0f:00.0: eth4: Detected Hardware Unit Hang:
Dec 28 01:32:53 2018 GWB kernel: TDH <1fc>
Dec 28 01:32:53 2018 GWB kernel: TDT <1ff>
Dec 28 01:32:53 2018 GWB kernel: next_to_use <1ff>
Dec 28 01:32:53 2018 GWB kernel: next_to_clean <1fc>
Dec 28 01:32:53 2018 GWB kernel: buffer_info[next_to_clean]:
Dec 28 01:32:53 2018 GWB kernel: time_stamp <618a1136>
Dec 28 01:32:53 2018 GWB kernel: next_to_watch <1fc>
Dec 28 01:32:53 2018 GWB kernel: jiffies <618a15f2>
Dec 28 01:32:53 2018 GWB kernel: next_to_watch.status <0>
Dec 28 01:32:53 2018 GWB kernel: MAC Status <80783>
Dec 28 01:32:53 2018 GWB kernel: PHY Status <796d>
Dec 28 01:32:53 2018 GWB kernel: PHY 1000BASE-T Status <3800>
Dec 28 01:32:53 2018 GWB kernel: PHY Extended Status <3000>
Dec 28 01:32:53 2018 GWB kernel: PCI Status <10>
Something appears to be wrong with eth4. This interface is part of a bond interface together with eth3; the bond is used as the Sync link between the two cluster members. All interfaces are 1G copper (TP), and the distance between the two cluster members is 40 km.
[Expert@GWB:0]# cphaconf show_bond bond1
Bond name: bond1
Bond mode: Load Sharing
Bond status: UP
Balancing mode: 802.3ad Layer3+4 Load Balancing
Configured slave interfaces: 2
In use slave interfaces: 2
Required slave interfaces: 1
Slave name | Status | Link
----------------+-----------------+-------
eth3 | Active | Yes
eth4 | Active | Yes
[Expert@GWB:0]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
802.3ad info
LACP rate: slow
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 17
Partner Key: 33071
Partner Mac Address: 00:23:04:ea:cd:05
Slave Interface: eth3
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1c:7f:35:1e:67
Aggregator ID: 1
Slave Interface: eth4
MII Status: up
Link Failure Count: 3
Permanent HW addr: 00:1c:7f:35:1e:69
Aggregator ID: 1
Interface statistics:
[Expert@GWB:0]# ethtool -i eth3
driver: e1000e
version: 2.1.4-NAPI
firmware-version: 2.1-0
bus-info: 0000:0b:00.0
[Expert@GWB:0]# ethtool -i eth4
driver: e1000e
version: 2.1.4-NAPI
firmware-version: 2.1-0
bus-info: 0000:0f:00.0
[Expert@GWB:0]# ifconfig eth3
eth3 Link encap:Ethernet HWaddr 00:1C:7F:35:1E:67
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:1755761152 errors:0 dropped:0 overruns:0 frame:0
TX packets:1722666300 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2574922192 (2.3 GiB) TX bytes:4123666606 (3.8 GiB)
Interrupt:185 Memory:fe9e0000-fea00000
[Expert@GWB:0]# ifconfig eth4
eth4 Link encap:Ethernet HWaddr 00:1C:7F:35:1E:67
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:74346835 errors:0 dropped:0 overruns:0 frame:0
TX packets:129992397 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2213595757 (2.0 GiB) TX bytes:2373454427 (2.2 GiB)
Interrupt:185 Memory:febe0000-fec00000
[Expert@GWB:0]# netstat -ani
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
bond1 1500 0 1830071571 0 0 0 1852622059 0 0 0 BMmRU
eth3 1500 0 1755725351 0 0 0 1722632073 0 0 0 BMsRU
eth4 1500 0 74346220 0 0 0 129989986 0 0 0 BMsRU
[Expert@GWB:0]# ethtool -S eth4
NIC statistics:
rx_packets: 74346141
tx_packets: 129989592
rx_bytes: 36870531721
tx_bytes: 75906486583
rx_broadcast: 72425523
tx_broadcast: 128068474
rx_multicast: 1917491
tx_multicast: 1917919
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 1917491
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 201
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 36870531721
rx_csum_offload_good: 72427507
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
[Expert@GWB:0]# ethtool -S eth3
NIC statistics:
rx_packets: 1755721963
tx_packets: 1722628786
rx_bytes: 181392964784
tx_bytes: 212883823014
rx_broadcast: 1753601078
tx_broadcast: 1720420399
rx_multicast: 1917328
tx_multicast: 1917924
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 1917328
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 181392964784
rx_csum_offload_good: 1753575650
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
Sync stats:
[Expert@GWB:0]# fw ctl pstat
System Capacity Summary:
Memory used: 24% (317 MB out of 1318 MB) - below watermark
Concurrent Connections: 95 (Unlimited)
Aggressive Aging is not active
Hash kernel memory (hmem) statistics:
Total memory allocated: 134217728 bytes in 32768 (4096 bytes) blocks using 32 pools
Total memory bytes used: 18721768 unused: 115495960 (86.05%) peak: 37117084
Total memory blocks used: 6293 unused: 26475 (80%) peak: 9638
Allocations: 3370097210 alloc, 0 failed alloc, 3369886583 free
System kernel memory (smem) statistics:
Total memory bytes used: 306889492 peak: 310893628
Total memory bytes wasted: 24536339
Blocking memory bytes used: 6230064 peak: 6722532
Non-Blocking memory bytes used: 300659428 peak: 304171096
Allocations: 21376530 alloc, 0 failed alloc, 21372364 free, 0 failed free
vmalloc bytes used: 6291456 expensive: yes
Kernel memory (kmem) statistics:
Total memory bytes used: 191218320 peak: 207504120
Allocations: 3391446510 alloc, 0 failed alloc
3391233868 free, 0 failed free
External Allocations: 0 for packets, 93818736 for SXL
Cookies:
3761509268 total, 42662 alloc, 42662 free,
3777276 dup, 4288962429 get, 247007545 put,
3969813062 len, 119751361 cached len, 0 chain alloc,
0 chain free
Connections:
110810622 total, 58628503 TCP, 46986057 UDP, 5196047 ICMP,
15 other, 0 anticipated, 1473 recovered, 95 concurrent,
4935 peak concurrent
Fragments:
309498393 fragments, 112535931 packets, 19491 expired, 0 short,
0 large, 0 duplicates, 0 failures
NAT:
9569835/0 forw, 7439782/0 bckw, 7345619 tcpudp,
234246 icmp, 2178810-2955934 alloc
Sync:
Version: new
Status: Able to Send/Receive sync packets
Sync packets sent:
total : 256132610, retransmitted : 945, retrans reqs : 254, acks : 1120386
Sync packets received:
total : 143020690, were queued : 1949226, dropped by net : 928271
retrans reqs : 443, received 2619713 acks
retrans reqs for illegal seq : 0
dropped updates as a result of sync overload: 0
Callback statistics: handled 69537 cb, average delay : 1, max delay : 56
[Expert@GWB:0]# cphaprob syncstat
Sync Statistics (IDs of F&A Peers - 1 ):
Other Member Updates:
Sent retransmission requests................... 254
Avg missing updates per request................ 1
Old or too-new arriving updates................ 126
Unsynced missing updates....................... 0
Lost sync connection (num of events)........... 133
Timed out sync connection ..................... 0
Local Updates:
Total generated updates ....................... 21792644
Recv Retransmission requests................... 443
Recv Duplicate Retrans request................. 0
Blocking Events................................ 0
Blocked packets................................ 0
Max length of sending queue.................... 0
Avg length of sending queue.................... 0
Hold Pkts events............................... 69537
Unhold Pkt events.............................. 69537
Not held due to no members..................... 1
Max held duration (sync ticks)................. 0
Avg held duration (sync ticks)................. 0
Timers:
Sync tick (ms)................................. 100
CPHA tick (ms)................................. 500
Queues:
Sending queue size............................. 512
Receiving queue size........................... 512
I am not sure whether this is a Check Point issue or a Linux-related bug.
Any ideas?
If I understand correctly, tx_restart_queue represents the number of times transmits were delayed because the ring buffer was full.
This could happen because of flow control events, or a hardware hang of some sort.
I recommend engaging the TAC on this.
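If you want to confirm whether the counter is still climbing over time, a loop along these lines can be left running in expert mode (a rough sketch, not an official tool; the interface name and the 60-second interval are just assumptions):
# Sample tx_restart_queue once a minute with a timestamp, so the readings
# can later be matched against the hang messages in /var/log/messages.
IFACE=eth4
while true; do
    ts=$(date '+%Y-%m-%d %H:%M:%S')
    restarts=$(ethtool -S "$IFACE" | awk '/tx_restart_queue/ {print $2}')
    echo "$ts $IFACE tx_restart_queue=$restarts"
    sleep 60
done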
It looks like a very similar issue is fixed in the R80.10: New Jumbo Hotfix (Take 169) GA-Release:
PMTR-20425, PMTR-14191, PMTR-20370 | Gaia OS | In some scenarios, machines with the igb driver (on-board Mgmt/Sync and 1G expansion cards) receive "Detected Tx Unit Hang" messages in the /var/log/messages file.
Hi,
we have had, and still have, the same issue with our 12200 appliances on eth1 (1 Gbit/s ports). Complete hardware replacements did not solve the issue, and neither did replacing the interface cards.
We are running R77.30 JHFA Take 336.
We have not noticed any impact, and I think we have been seeing these messages for more than two years.
Regards
I am seeing the same message on my 12000 cluster (Mgmt port), but I do not see any performance issue. I called Check Point support; they could not find what is wrong, but offered an RMA if I want one.
Has anyone seen a performance issue caused by this message, and does anyone know why it happens?
I think tx_restart_queue indicates that the interrupt rate used by the NIC is not able to keep up with emptying the TX ring buffer in a timely fashion. But because of the hardware hang message, I believe the TX queue getting restarted is just a symptom of the problem and not the actual problem itself. The hardware hang message indicates that the NIC driver did not get a response from the NIC card within an acceptable amount of time, so the driver essentially has to reinitialize communication with the NIC card; once that is done, the emptying of the TX queue is restarted. I'm not completely sure whether this causes the loss of everything that was in the TX ring buffer/queue at that time, but I don't think it does.
However this is a tough problem to diagnose because I've personally seen 3 different and separate solutions for it:
1) Replace NIC card hardware
2) Upgrade NIC driver
3) Apply a Jumbo HFA that I know for a fact did not update the NIC driver. My suspicion with this last one is that some other kernel code (perhaps SND or some other part of Gaia) improperly monopolizes the SND core handling traffic for that interface just long enough to cause a timeout between the NIC driver and the NIC hardware (a couple of quick before/after checks are sketched below).
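Whichever of those you end up trying, it is worth capturing the driver version, ring sizes and hang history before and after the change so you can tell whether anything actually moved (a small sketch; eth4 is simply the interface from the original post):
ethtool -i eth4                                            # driver name and version currently loaded
ethtool -g eth4                                            # current vs. maximum RX/TX ring sizes
grep -c "Detected Hardware Unit Hang" /var/log/messages*   # per-file count of hang messages so far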
@Timothy_Hall , I know this is an old post, but could these issues be traced to "Balancing mode: 802.3ad Layer3+4 Load Balancing"?
The L3+4 mode is not, strictly speaking, compliant with 802.3ad, but I see it used in this configuration often.
Hmm, I don't see how; 802.3ad doesn't directly care about the transmit hash policy in use, other than specifying that it cannot cause out-of-order frame delivery (bad) or frame duplication (very bad). I don't think the transmit policy in use is even part of the initial or ongoing LACP PDUs. I suppose it could be some kind of bug in the NIC driver that is triggered by that particular transmit hash algorithm (see cause #2 in my post above).
By the way, the following issue is fixed in R80.30 Jumbo Take 228:
PRJ-18609, PMTR-60804 | Gaia OS | A Bond interface in XOR mode or 802.3ad (LACP) mode may experience suboptimal performance if the Transmit Hash Policy on the Bond interface is configured to "Layer 3+4" and Multi-Queue is enabled.
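To check whether this particular fix is relevant to a given gateway, you can compare the installed Jumbo Take with the bond's transmit hash policy and the Multi-Queue state (a sketch; command availability varies between releases):
cpinfo -y all | grep -i jumbo                    # installed Jumbo Hotfix Take
grep -i "hash policy" /proc/net/bonding/bond1    # transmit hash policy in use on the bond
mq_mng --show                                    # Multi-Queue state on R80.40 and later (older releases use cpmq get)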
Do you have CPU cores dedicated to the NICs? (sim affinity -l)
Also, CPU 0 should not be assigned to a NIC interface.
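For reference, the current assignment can be reviewed with something like the following (a sketch; output formats differ between versions):
sim affinity -l               # SecureXL interface-to-core affinities
fw ctl affinity -l -r         # per-core view of interfaces, kernel instances and daemons
grep -i eth /proc/interrupts  # which CPUs are actually servicing the NIC interrupts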
Hello all,
We are having the same issue on a 6800 VSX running R80.40 Take 91. A TAC case has been open for more than 4 months without any solution.
On any VS that we move to the problematic node we see frequent failovers:
# cphaprob show_failover
Last cluster failover event:
Transition to new ACTIVE: Member 1 -> Member 2
Reason: Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
Event time: Thu Mar 18 23:12:30 2021
Cluster failover count:
Failover counter: 603
Time of counter reset: Wed Dec 9 10:17:47 2020 (reboot)
Cluster failover history (last 20 failovers since reboot/reset on Wed Dec 9 10:17:47 2020):
No. Time: Transition: CPU: Reason:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
1 Thu Mar 18 23:12:30 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
2 Thu Mar 18 23:12:28 2021 Member 2 -> Member 1 00 VSX PNOTE
3 Thu Mar 18 22:32:29 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
4 Thu Mar 18 22:32:20 2021 Member 2 -> Member 1 00 VSX PNOTE
5 Thu Mar 18 22:32:11 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
6 Thu Mar 18 22:32:09 2021 Member 2 -> Member 1 00 VSX PNOTE
7 Thu Mar 18 22:06:38 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
8 Thu Mar 18 22:06:37 2021 Member 2 -> Member 1 00 VSX PNOTE
9 Thu Mar 18 22:05:54 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
10 Thu Mar 18 22:05:51 2021 Member 2 -> Member 1 00 VSX PNOTE
11 Thu Mar 18 19:39:53 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
12 Thu Mar 18 19:39:51 2021 Member 2 -> Member 1 00 VSX PNOTE
13 Thu Mar 18 18:47:04 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
14 Thu Mar 18 18:47:02 2021 Member 2 -> Member 1 00 VSX PNOTE
15 Thu Mar 18 18:18:38 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
16 Thu Mar 18 18:18:37 2021 Member 2 -> Member 1 00 VSX PNOTE
17 Thu Mar 18 18:01:59 2021 Member 2 -> Member 1 00 VSX PNOTE
18 Thu Mar 18 18:09:40 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
19 Thu Mar 18 18:01:59 2021 Member 2 -> Member 1 00 FULLSYNC PNOTE - cpstop
20 Thu Mar 18 18:09:40 2021 Member 1 -> Member 2 00 Member state has been changed due to higher priority of remote cluster member 2 in PRIMARY-UP cluster
Any active VS on this box is unusable, as the frequent failovers cause traffic outages, OSPF failures, and so on, so we are running all VSs on the primary node only, except one for testing.
We have tried everything we could think of; nothing has helped and the issue is still there.
We are running out of ideas, so any suggestion is welcome!
Thanks in advance!
Regards,
--
Marko
Is the Hardware Unit Hang always happening on the same single interface?
Are the hangs confined to onboard interfaces (ethX), expansion-slot interfaces (ethX-0X), or both?
Please provide ethtool -S (interface) and ethtool -i (interface) output for the interface(s) experiencing the hang(s).
The 6000 series onboard interfaces (Sync, Mgmt, ethX, etc.) use the I211 hardware controller, which has various special limitations, such as only supporting 2 queues for Multi-Queue. Is it just those onboard interfaces experiencing the hangs? If so, see this SK:
sk165516: Multi-Queue and Dynamic Split cannot add more than 2 CoreXL SNDs on 6500 appliance
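As a quick way to see that limitation on the box itself, you can compare the queue counts exposed by an onboard interface and an expansion-card interface (a sketch; interface names taken from later in this thread):
ethtool -l Sync    # onboard I211: limited to 2 combined queues
ethtool -l eth8    # expansion-card port: typically exposes more queues
ethtool -i Sync    # confirms the driver (igb) and firmware behind the interface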
Hi Timothy,
The Hardware Unit Hang is always happening on the same box, probably on the Sync interface, as only this interface has lots of queue restarts.
We have MQ with only 2 cores enabled for this sync bond. The issue also happens with MQ disabled.
15 minutes ago we shut the onboard Sync interface and left only the eth8 interface enabled in the sync bond. The issue is still happening.
The times of the "Detected Tx Unit Hang" errors in the messages log do not correlate with the failovers or with OSPF restarting.
[Expert@cpfw-dmz2:0]# ethtool -S Sync
NIC statistics:
rx_packets: 4817481
tx_packets: 277920534
rx_bytes: 1303799237
tx_bytes: 36060961929
rx_broadcast: 95703
tx_broadcast: 2
rx_multicast: 28193
tx_multicast: 26079
multicast: 28193
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 3
rx_missed_errors: 1055
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 1303799237
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 1055
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 266044504
tx_queue_0_bytes: 31092290336
tx_queue_0_restart: 832
tx_queue_1_packets: 11876124
tx_queue_1_bytes: 3856962415
tx_queue_1_restart: 0
rx_queue_0_packets: 2231456
rx_queue_0_bytes: 613063098
rx_queue_0_drops: 0
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 2586025
rx_queue_1_bytes: 671466215
rx_queue_1_drops: 0
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
[Expert@cpfw-dmz2:0]# ethtool -i Sync
driver: igb
version: 5.3.5.18
firmware-version: 0. 6-2
expansion-rom-version:
bus-info: 0000:0d:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
[Expert@cpfw-dmz2:0]#
[Expert@cpfw-dmz2:0]# ethtool -S eth8
NIC statistics:
rx_packets: 464407127
tx_packets: 723574
rx_bytes: 247311946248
tx_bytes: 135522639
rx_broadcast: 7154
tx_broadcast: 119065
rx_multicast: 28224
tx_multicast: 26120
multicast: 28224
collisions: 0
rx_crc_errors: 0
rx_no_buffer_count: 5422
rx_missed_errors: 2394408
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 247311946248
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
tx_hwtstamp_timeouts: 0
rx_hwtstamp_cleared: 0
rx_errors: 0
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 2394408
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 472657
tx_queue_0_bytes: 70294975
tx_queue_0_restart: 0
tx_queue_1_packets: 102163
tx_queue_1_bytes: 26064801
tx_queue_1_restart: 0
rx_queue_0_packets: 110213332
rx_queue_0_bytes: 109653744464
rx_queue_0_drops: 0
rx_queue_0_csum_err: 0
rx_queue_0_alloc_failed: 0
rx_queue_1_packets: 353352659
rx_queue_1_bytes: 135360682126
rx_queue_1_drops: 0
rx_queue_1_csum_err: 0
rx_queue_1_alloc_failed: 0
[Expert@cpfw-dmz2:0]# ethtool -i eth8
driver: igb
version: 5.3.5.18
firmware-version: 3.29, 0x8000021a
expansion-rom-version:
bus-info: 0000:09:00.3
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: no
Regards,
--
Marko
Please provide the output of clish command show asset all and expert mode command lspci. I want to see if the I211 controller is just used for the Mgmt and Sync interfaces on the 6800, or if it is also used for the other onboard ethX interfaces. If only the Mgmt and Sync interfaces are utilizing the I211, it might be interesting to move the IP addresses off the Mgmt/Sync interfaces onto a non-I211 controller NIC, completely disable the Mgmt/Sync interfaces, and see if that has an effect on the problem.
Here it is. Only Mgmt and Sync are I211.
At the moment the sync network runs only over the eth8 interface, as the onboard Sync interface is shut down on the switch.
cpfw-dmz2:0> show asset all
Platform: QM-30-00
Model: Check Point 6800
Serial Number: XXXXXXXXXX
CPU Model: Intel(R) Xeon(R) CPU E5-2640 v4
CPU Frequency: 2394.513 Mhz
Number of Cores: 20
CPU Hyperthreading: Enabled
Number of disks: 2
Disk 1 Model: INTEL SSDSC2KG480G8
Disk 1 Capacity: 480 GB
Disk 2 Model: INTEL SSDSC2KG480G8
Disk 2 Capacity: 480 GB
Total Disks size: 960 GB
Total Memory: 65536 MB
Memory Slot 1 Size: 8192 MB
Memory Slot 2 Size: 8192 MB
Memory Slot 3 Size: 8192 MB
Memory Slot 4 Size: 8192 MB
Memory Slot 5 Size: 8192 MB
Memory Slot 6 Size: 8192 MB
Memory Slot 7 Size: 8192 MB
Memory Slot 8 Size: 8192 MB
Power supply 1 name: Power Supply #1
Power supply 1 status: Up
Power supply 2 name: Power Supply #2
Power supply 2 status: Up
LOM Status: Installed
LOM Firmware Revision: 3.35
Number of line cards: 1
Line card 1 model: CPAC-4-10F-6500/6800-C
Line card 1 type: 4 ports 10GbE SFP+ Rev 3.1
AC Present: No
[Expert@cpfw-dmz2:0]# lspci | grep Net
05:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
05:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
06:00.0 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
06:00.1 Ethernet controller: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ Network Connection (rev 01)
08:00.0 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
08:00.1 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
08:00.2 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
08:00.3 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
09:00.0 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
09:00.1 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
09:00.2 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
09:00.3 Ethernet controller: Intel Corporation 82580 Gigabit Network Connection (rev 01)
0d:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
0e:00.0 Ethernet controller: Intel Corporation I211 Gigabit Network Connection (rev 03)
Regards,
--
Marko
Given everything else that has been tried thus far, I'd suggest taking the I211 controllers out of the picture to see if that improves things. Move the IP addresses off Mgmt/Sync and set those interfaces to "state off", and even unplug them physically if it is convenient. I have a feeling there may be something going on between these limited I211 controllers and your CPAC-4-10F-6500/6800-C.
Also, please post one of the hardware hang messages you are seeing; is it generic for the igb driver or tied to a specific interface?
The next step would be to focus on your CPAC-4-10F-6500/6800-C. It isn't a Mellanox card, is it (mlx* driver)? If so: sk141812: Firmware Update Hotfix for 40GBE and 100/25GBE Cards
Bit of a long shot, but verify your appliance has the latest BIOS: sk120915: Check Point Appliances BIOS Firmware versions map
From the lspci above, the CPAC-4-10F is PCI addresses 05:00.0, 05:00.1, 06:00.0, and 06:00.1. It's two Intel 82599ES controllers with two functions each. All four will be run by ixgbe.
Good catch on the ixgbe. I still think the problem has to do with the low-power I211 interfaces, perhaps in how they interact with the PCI bus. The only mention of hangs with I210/I211 controllers I could find involved the use of MSI vs. MSI-X; I'm not sure how to check for this:
https://community.intel.com/t5/Ethernet-Products/I210-TX-RX-hang/td-p/482055
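For what it's worth, whether a given NIC ended up on MSI-X or plain MSI can usually be read from the PCI capability list and the interrupt table (a sketch; 0d:00.0 is the Sync I211 address from the lspci output above):
lspci -vv -s 0d:00.0 | grep -i msi       # look for "MSI-X: Enable+" versus "MSI: Enable+"
grep -iE "Sync|eth" /proc/interrupts     # MSI-X NICs usually show one IRQ line per queue (e.g. eth8-TxRx-0)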
Thanks for the suggestions! We will try it in the next maintenance window, and I will post the results.
Regards,
--
Marko
Hi Marko,
Did you manage to solve the issue?
We have a similar issue: sudden failovers, OSPF not working after a failover, and these errors on interfaces that reside on one expansion module.
You'll need to provide a lot more information to get any kind of useful answer. Appliance model? NIC type? Code and JHFA level? Any other messages/errors? Troubleshooting steps you've taken? Is this a new deployment or did this start happening to an existing deployment? If an existing deployment, what was changed recently?
The "Detected Hardware Unit Hang" message is generally just the symptom of the underlying problem, but not the actual cause. Without the above information trying to speculate on the cause of your problem is pointless, even if it looks similar to what is being described in this thread.
Hello all,
Sorry for the late response on this issue.
After a few remote sessions with TAC and involving R&D in the analysis of the collected debugs, we resolved the issue by disabling Priority Queues (more info here: Firewall Priority Queues in R77.30 / R80.x):
[Expert@fw01:0]# fw ctl multik prioq
Current mode is On
Available modes:
0. Off
1. Evaluator-only
2. On
Choose the desired mode number: (or 3 to Quit)
0
New mode is: Off
Please reboot the system
After disabling PrioQ, the system has been running stably for more than a month now.
We never found out why it was happening in the first place, but in our case this procedure helped.
Regards,
--
Marko
Thanks for the follow-up. As I mentioned earlier in this thread, the "Detected Hardware Unit Hang" is just a symptom of the underlying problem and not the problem itself. It is quite odd that Priority Queues would cause this effect; the PQ code must have been improperly monopolizing kernel CPU time, impeding communication between the NIC driver and the NIC hardware for just long enough to trigger the error message.
Hello Team,
for the last couple of days I have also been seeing these logs:
[Expert@XXXXXXXXXXX::ACTIVE]# grep "Unit" mes*
messages:Feb 10 15:15:19 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.3: Detected Tx Unit Hang
messages:Feb 11 08:19:37 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 9 12:59:35 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 08:34:39 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 08:34:40 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.3: Detected Tx Unit Hang
messages.1:Feb 10 10:14:41 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.0: Detected Tx Unit Hang
messages.1:Feb 10 10:14:42 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 11:09:40 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 11:59:38 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 12:24:32 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.0: Detected Tx Unit Hang
messages.1:Feb 10 12:24:32 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 12:49:38 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 13:14:34 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 13:19:35 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.3: Detected Tx Unit Hang
messages.1:Feb 10 13:19:35 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 14:05:02 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.1:Feb 10 14:05:02 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.2:Feb 8 10:24:30 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.2:Feb 8 11:34:35 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
messages.2:Feb 8 14:29:31 2022 XXXXXXXXXXX kernel: igb 0000:8b:00.1: Detected Tx Unit Hang
I see these are the 1G ports ("Line card 2 model: CPAC-8-1C-B").
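To confirm which interface names sit behind the PCI addresses in those messages (e.g. 0000:8b:00.1), something like this can be run from expert mode (a sketch):
# Map every interface name to its PCI address...
for dev in /sys/class/net/*; do
    pci=$(readlink -f "$dev/device" 2>/dev/null) && echo "$(basename "$dev") -> $(basename "$pci")"
done
# ...or resolve a single address from a hang message directly:
ls /sys/bus/pci/devices/0000:8b:00.1/net/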
It is a 15600 appliance running R81 with Jumbo Take 44.
A TAC case has been opened.
Mostly we don't feel any impact, but yesterday an elephant flow, a policy install and a driver restart all got stuck at precisely the same minute, which was not a good thing!
Has anybody already fixed this issue?