JozkoMrkvicka
Platinum

Detected Hardware Unit Hang

Hello everyone,

Environment:

Cluster of two SG4800 appliances running R77.30 with Jumbo Hotfix Take 216.

I just noticed that the following messages show up from time to time in /var/log/messages:

Dec 28 07:23:22 2018 GWB kernel: e1000e 0000:0f:00.0: eth4: Detected Hardware Unit Hang:
Dec 28 07:23:22 2018 GWB kernel:   TDH                  <218>
Dec 28 07:23:22 2018 GWB kernel:   TDT                  <21b>
Dec 28 07:23:22 2018 GWB kernel:   next_to_use          <21b>
Dec 28 07:23:22 2018 GWB kernel:   next_to_clean        <218>
Dec 28 07:23:22 2018 GWB kernel: buffer_info[next_to_clean]:
Dec 28 07:23:22 2018 GWB kernel:   time_stamp           <62cafbcd>
Dec 28 07:23:22 2018 GWB kernel:   next_to_watch        <218>
Dec 28 07:23:22 2018 GWB kernel:   jiffies              <62cb0098>
Dec 28 07:23:22 2018 GWB kernel:   next_to_watch.status <0>
Dec 28 07:23:22 2018 GWB kernel: MAC Status             <80783>
Dec 28 07:23:22 2018 GWB kernel: PHY Status             <796d>
Dec 28 07:23:22 2018 GWB kernel: PHY 1000BASE-T Status  <3800>
Dec 28 07:23:22 2018 GWB kernel: PHY Extended Status    <3000>
Dec 28 07:23:22 2018 GWB kernel: PCI Status             <10>

Dec 28 01:32:53 2018 GWB kernel: e1000e 0000:0f:00.0: eth4: Detected Hardware Unit Hang:
Dec 28 01:32:53 2018 GWB kernel:   TDH                  <1fc>
Dec 28 01:32:53 2018 GWB kernel:   TDT                  <1ff>
Dec 28 01:32:53 2018 GWB kernel:   next_to_use          <1ff>
Dec 28 01:32:53 2018 GWB kernel:   next_to_clean        <1fc>
Dec 28 01:32:53 2018 GWB kernel: buffer_info[next_to_clean]:
Dec 28 01:32:53 2018 GWB kernel:   time_stamp           <618a1136>
Dec 28 01:32:53 2018 GWB kernel:   next_to_watch        <1fc>
Dec 28 01:32:53 2018 GWB kernel:   jiffies              <618a15f2>
Dec 28 01:32:53 2018 GWB kernel:   next_to_watch.status <0>
Dec 28 01:32:53 2018 GWB kernel: MAC Status             <80783>
Dec 28 01:32:53 2018 GWB kernel: PHY Status             <796d>
Dec 28 01:32:53 2018 GWB kernel: PHY 1000BASE-T Status  <3800>
Dec 28 01:32:53 2018 GWB kernel: PHY Extended Status    <3000>
Dec 28 01:32:53 2018 GWB kernel: PCI Status             <10>
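
For anyone reading these dumps, here is a rough Python sketch of the arithmetic behind them. It takes the first report, computes how many descriptors sit between the transmit head (TDH) and tail (TDT), and how many jiffies passed since the stuck buffer's time_stamp. The names parse_hang and summarize are made up for illustration, and the sketch assumes the ring has not wrapped between head and tail and that the jiffies value shown is a 32-bit counter.

```python
import re

# Fields from the first hang report in the log above (values are hexadecimal).
SAMPLE = """\
Dec 28 07:23:22 2018 GWB kernel:   TDH                  <218>
Dec 28 07:23:22 2018 GWB kernel:   TDT                  <21b>
Dec 28 07:23:22 2018 GWB kernel:   time_stamp           <62cafbcd>
Dec 28 07:23:22 2018 GWB kernel:   jiffies              <62cb0098>
"""

# Matches "kernel:   <name>   <hexvalue>" lines of the report.
FIELD = re.compile(r"kernel:\s+(\S.*?)\s+<([0-9a-fA-F]+)>")


def parse_hang(text):
    """Return {field name: integer value} for one hang report."""
    fields = {}
    for line in text.splitlines():
        match = FIELD.search(line)
        if match:
            fields[match.group(1)] = int(match.group(2), 16)
    return fields


def summarize(fields):
    """Return (descriptors between TDH and TDT, jiffies since time_stamp)."""
    # Assumption: the tail has not wrapped around the ring since the head.
    pending = fields["TDT"] - fields["TDH"]
    # Assumption: 32-bit counter; masking keeps the delta sane after a wrap.
    age = (fields["jiffies"] - fields["time_stamp"]) & 0xFFFFFFFF
    return pending, age


pending, age = summarize(parse_hang(SAMPLE))
print(f"pending descriptors: {pending}, age in jiffies: {age}")
# prints: pending descriptors: 3, age in jiffies: 1227
```

The same arithmetic on the second report gives 3 descriptors and 1212 jiffies. Turning jiffies into seconds needs the kernel's HZ value, which is not shown in this post.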

It looks like something is wrong with eth4. This interface is part of a bond interface, together with eth3. The bond serves as the Sync link between the two cluster members. All interfaces are 1G TP. The distance between the two cluster members is 40 km.

[Expert@GWB:0]# cphaconf show_bond bond1

Bond name:      bond1
Bond mode:      Load Sharing
Bond status:    UP
Balancing mode: 802.3ad Layer3+4 Load Balancing
Configured slave interfaces: 2
In use slave interfaces:     2
Required slave interfaces:   1

Slave name      | Status          | Link
----------------+-----------------+-------
eth3            | Active          | Yes
eth4            | Active          | Yes

[Expert@GWB:0]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)

Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200

802.3ad info
LACP rate: slow
Active Aggregator Info:
        Aggregator ID: 1
        Number of ports: 2
        Actor Key: 17
        Partner Key: 33071
        Partner Mac Address: 00:23:04:ea:cd:05

Slave Interface: eth3
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1c:7f:35:1e:67
Aggregator ID: 1

Slave Interface: eth4
MII Status: up
Link Failure Count: 3
Permanent HW addr: 00:1c:7f:35:1e:69
Aggregator ID: 1

Interface statistics:

[Expert@GWB:0]# ethtool -i eth3
driver: e1000e
version: 2.1.4-NAPI
firmware-version: 2.1-0
bus-info: 0000:0b:00.0

[Expert@GWB:0]# ethtool -i eth4
driver: e1000e
version: 2.1.4-NAPI
firmware-version: 2.1-0
bus-info: 0000:0f:00.0

[Expert@GWB:0]# ifconfig eth3
eth3        Link encap:Ethernet  HWaddr 00:1C:7F:35:1E:67
            UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
            RX packets:1755761152 errors:0 dropped:0 overruns:0 frame:0
            TX packets:1722666300 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:2574922192 (2.3 GiB)  TX bytes:4123666606 (3.8 GiB)
            Interrupt:185 Memory:fe9e0000-fea00000

[Expert@GWB:0]# ifconfig eth4
eth4        Link encap:Ethernet  HWaddr 00:1C:7F:35:1E:67
            UP BROADCAST RUNNING SLAVE MULTICAST  MTU:1500  Metric:1
            RX packets:74346835 errors:0 dropped:0 overruns:0 frame:0
            TX packets:129992397 errors:0 dropped:0 overruns:0 carrier:0
            collisions:0 txqueuelen:1000
            RX bytes:2213595757 (2.0 GiB)  TX bytes:2373454427 (2.2 GiB)
            Interrupt:185 Memory:febe0000-fec00000

[Expert@GWB:0]# netstat -ani
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
bond1      1500   0 1830071571      0      0      0 1852622059      0      0      0 BMmRU
eth3       1500   0 1755725351      0      0      0 1722632073      0      0      0 BMsRU
eth4       1500   0 74346220      0      0      0 129989986      0      0      0 BMsRU

[Expert@GWB:0]# ethtool -S eth4
NIC statistics:
     rx_packets: 74346141
     tx_packets: 129989592
     rx_bytes: 36870531721
     tx_bytes: 75906486583
     rx_broadcast: 72425523
     tx_broadcast: 128068474
     rx_multicast: 1917491
     tx_multicast: 1917919
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 1917491
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 201
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 36870531721
     rx_csum_offload_good: 72427507
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0


[Expert@GWB:0]# ethtool -S eth3
NIC statistics:
     rx_packets: 1755721963
     tx_packets: 1722628786
     rx_bytes: 181392964784
     tx_bytes: 212883823014
     rx_broadcast: 1753601078
     tx_broadcast: 1720420399
     rx_multicast: 1917328
     tx_multicast: 1917924
     rx_errors: 0
     tx_errors: 0
     tx_dropped: 0
     multicast: 1917328
     collisions: 0
     rx_length_errors: 0
     rx_over_errors: 0
     rx_crc_errors: 0
     rx_frame_errors: 0
     rx_no_buffer_count: 0
     rx_missed_errors: 0
     tx_aborted_errors: 0
     tx_carrier_errors: 0
     tx_fifo_errors: 0
     tx_heartbeat_errors: 0
     tx_window_errors: 0
     tx_abort_late_coll: 0
     tx_deferred_ok: 0
     tx_single_coll_ok: 0
     tx_multi_coll_ok: 0
     tx_timeout_count: 0
     tx_restart_queue: 0
     rx_long_length_errors: 0
     rx_short_length_errors: 0
     rx_align_errors: 0
     tx_tcp_seg_good: 0
     tx_tcp_seg_failed: 0
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
     rx_long_byte_count: 181392964784
     rx_csum_offload_good: 1753575650
     rx_csum_offload_errors: 0
     rx_header_split: 0
     alloc_rx_buff_failed: 0
     tx_smbus: 0
     rx_smbus: 0
     dropped_smbus: 0
     rx_dma_failed: 0
     tx_dma_failed: 0

Sync stats:

[Expert@GWB:0]# fw ctl pstat

System Capacity Summary:
  Memory used: 24% (317 MB out of 1318 MB) - below watermark
  Concurrent Connections: 95 (Unlimited)
  Aggressive Aging is not active

Hash kernel memory (hmem) statistics:
  Total memory allocated: 134217728 bytes in 32768 (4096 bytes) blocks using 32 pools
  Total memory bytes  used: 18721768   unused: 115495960 (86.05%)   peak: 37117084
  Total memory blocks used:     6293   unused:    26475 (80%)   peak:     9638
  Allocations: 3370097210 alloc, 0 failed alloc, 3369886583 free

System kernel memory (smem) statistics:
  Total memory  bytes  used: 306889492   peak: 310893628
  Total memory bytes wasted: 24536339
    Blocking  memory  bytes   used:  6230064   peak:  6722532
    Non-Blocking memory bytes used: 300659428   peak: 304171096
  Allocations: 21376530 alloc, 0 failed alloc, 21372364 free, 0 failed free
  vmalloc bytes  used:  6291456 expensive: yes

Kernel memory (kmem) statistics:
  Total memory  bytes  used: 191218320   peak: 207504120
  Allocations: 3391446510 alloc, 0 failed alloc
               3391233868 free, 0 failed free
  External Allocations: 0 for packets, 93818736 for SXL

Cookies:
        3761509268 total, 42662 alloc, 42662 free,
        3777276 dup, 4288962429 get, 247007545 put,
        3969813062 len, 119751361 cached len, 0 chain alloc,
        0 chain free

Connections:
        110810622 total, 58628503 TCP, 46986057 UDP, 5196047 ICMP,
        15 other, 0 anticipated, 1473 recovered, 95 concurrent,
        4935 peak concurrent

Fragments:
        309498393 fragments, 112535931 packets, 19491 expired, 0 short,
        0 large, 0 duplicates, 0 failures

NAT:
        9569835/0 forw, 7439782/0 bckw, 7345619 tcpudp,
        234246 icmp, 2178810-2955934 alloc

Sync:
        Version: new
        Status: Able to Send/Receive sync packets
        Sync packets sent:
         total : 256132610,  retransmitted : 945, retrans reqs : 254,  acks : 1120386
        Sync packets received:
         total : 143020690,  were queued : 1949226, dropped by net : 928271
         retrans reqs : 443, received 2619713 acks
         retrans reqs for illegal seq : 0
         dropped updates as a result of sync overload: 0
        Callback statistics: handled 69537 cb, average delay : 1,  max delay : 56


[Expert@GWB:0]# cphaprob syncstat

Sync Statistics (IDs of F&A Peers - 1 ):

Other Member Updates:
Sent retransmission requests...................  254
Avg missing updates per request................  1
Old or too-new arriving updates................  126
Unsynced missing updates.......................  0
Lost sync connection (num of events)...........  133
Timed out sync connection .....................  0

Local Updates:
Total generated updates .......................  21792644
Recv Retransmission requests...................  443
Recv Duplicate Retrans request.................  0

Blocking Events................................  0
Blocked packets................................  0
Max length of sending queue....................  0
Avg length of sending queue....................  0
Hold Pkts events...............................  69537
Unhold Pkt events..............................  69537
Not held due to no members.....................  1
Max held duration (sync ticks).................  0
Avg held duration (sync ticks).................  0

Timers:
Sync tick (ms).................................  100
CPHA tick (ms).................................  500

Queues:
Sending queue size.............................  512
Receiving queue size...........................  512
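
To put the sync counters above in proportion, here is a small Python sketch that parses the "Sync packets received" line from fw ctl pstat and works out what share of received sync packets was counted as "dropped by net". The helper name parse_counters is hypothetical, and the sketch only does the arithmetic; it does not say what ratio is acceptable.

```python
import re

# "Sync packets received" line copied from the fw ctl pstat output above.
SYNC_RX = "total : 143020690,  were queued : 1949226, dropped by net : 928271"


def parse_counters(line):
    """Split a 'name : value, name : value' line into {name: int}."""
    pairs = re.findall(r"([a-z ]+?)\s*:\s*(\d+)", line)
    return {name.strip(): int(value) for name, value in pairs}


counters = parse_counters(SYNC_RX)
dropped_pct = 100.0 * counters["dropped by net"] / counters["total"]
print(f"dropped by net: {dropped_pct:.3f}% of received sync packets")
```

With the numbers from this post that comes out at roughly 0.65%.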

Not sure whether this is a Check Point issue or a Linux-related bug...

Any ideas?

Kind regards,
Jozko Mrkvicka
3 Replies
Admin
Admin

Re: Detected Hardware Unit Hang

If I understand correctly, tx_restart_queue represents the number of times that transmits were delayed because the ring buffer is full.

This could happen because of flow control events...or a hardware hang of some sort.
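
As a quick way to check this against the posted numbers, here is a Python sketch that parses ethtool -S output and lists only the non-zero counters whose names hint at trouble. The keyword list and the function names are my own choice for illustration, not anything defined by ethtool or the driver, and the sample is a subset of the eth4 output from the original post.

```python
# Subset of the "ethtool -S eth4" output from the original post.
SAMPLE = """\
NIC statistics:
     rx_packets: 74346141
     tx_packets: 129989592
     rx_errors: 0
     tx_errors: 0
     tx_timeout_count: 0
     tx_restart_queue: 201
     rx_flow_control_xon: 0
     rx_flow_control_xoff: 0
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
"""

# Assumption: counter names containing these words are worth a look.
KEYWORDS = ("error", "drop", "fail", "timeout", "restart", "flow_control")


def parse_ethtool_stats(text):
    """Return {counter name: int} from 'name: value' lines."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        if sep and value.strip().isdigit():
            stats[name] = int(value)
    return stats


def suspicious(stats):
    """Keep non-zero counters whose name contains one of KEYWORDS."""
    return {
        name: value
        for name, value in stats.items()
        if value and any(word in name for word in KEYWORDS)
    }


print(suspicious(parse_ethtool_stats(SAMPLE)))
# prints: {'tx_restart_queue': 201}
```

On the posted eth4 counters only tx_restart_queue is non-zero; the four flow control counters are all 0, and on eth3 tx_restart_queue is 0 as well.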

I recommend engaging with the TAC on this. 

JozkoMrkvicka
Platinum

Re: Detected Hardware Unit Hang

It looks like a very similar issue is fixed in the R80.10 New Jumbo Hotfix (Take 169) GA release:

PMTR-20425, PMTR-14191, PMTR-20370 (Gaia OS): In some scenarios, machines with the igb driver (on-board Mgmt/Sync and 1G expansion cards) receive the "Detected Tx Unit Hang" messages in the /var/log/messages file.

Kind regards,
Jozko Mrkvicka

Re: Detected Hardware Unit Hang

Hi,

We had, and still have, the same issue with eth1 (1 Gbit/s ports) on our 12200 appliances. Complete hardware replacements did not solve it, and neither did replacing the interface cards.

We are running R77.30 JHFA Take 336.

We have not noticed any actual problems, and we have been seeing these messages for more than two years, I think.


Regards