Hello everyone,
Environment:
Cluster of two SG4800 appliances running R77.30 with Jumbo Hotfix Take 216.
I just noticed that, from time to time, the following messages appear in /var/log/messages:
Dec 28 07:23:22 2018 GWB kernel: e1000e 0000:0f:00.0: eth4: Detected Hardware Unit Hang:
Dec 28 07:23:22 2018 GWB kernel: TDH <218>
Dec 28 07:23:22 2018 GWB kernel: TDT <21b>
Dec 28 07:23:22 2018 GWB kernel: next_to_use <21b>
Dec 28 07:23:22 2018 GWB kernel: next_to_clean <218>
Dec 28 07:23:22 2018 GWB kernel: buffer_info[next_to_clean]:
Dec 28 07:23:22 2018 GWB kernel: time_stamp <62cafbcd>
Dec 28 07:23:22 2018 GWB kernel: next_to_watch <218>
Dec 28 07:23:22 2018 GWB kernel: jiffies <62cb0098>
Dec 28 07:23:22 2018 GWB kernel: next_to_watch.status <0>
Dec 28 07:23:22 2018 GWB kernel: MAC Status <80783>
Dec 28 07:23:22 2018 GWB kernel: PHY Status <796d>
Dec 28 07:23:22 2018 GWB kernel: PHY 1000BASE-T Status <3800>
Dec 28 07:23:22 2018 GWB kernel: PHY Extended Status <3000>
Dec 28 07:23:22 2018 GWB kernel: PCI Status <10>
Dec 28 01:32:53 2018 GWB kernel: e1000e 0000:0f:00.0: eth4: Detected Hardware Unit Hang:
Dec 28 01:32:53 2018 GWB kernel: TDH <1fc>
Dec 28 01:32:53 2018 GWB kernel: TDT <1ff>
Dec 28 01:32:53 2018 GWB kernel: next_to_use <1ff>
Dec 28 01:32:53 2018 GWB kernel: next_to_clean <1fc>
Dec 28 01:32:53 2018 GWB kernel: buffer_info[next_to_clean]:
Dec 28 01:32:53 2018 GWB kernel: time_stamp <618a1136>
Dec 28 01:32:53 2018 GWB kernel: next_to_watch <1fc>
Dec 28 01:32:53 2018 GWB kernel: jiffies <618a15f2>
Dec 28 01:32:53 2018 GWB kernel: next_to_watch.status <0>
Dec 28 01:32:53 2018 GWB kernel: MAC Status <80783>
Dec 28 01:32:53 2018 GWB kernel: PHY Status <796d>
Dec 28 01:32:53 2018 GWB kernel: PHY 1000BASE-T Status <3800>
Dec 28 01:32:53 2018 GWB kernel: PHY Extended Status <3000>
Dec 28 01:32:53 2018 GWB kernel: PCI Status <10>
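To make the two dumps easier to compare, here is a small Python sketch that decodes the ring pointers and the timestamps. The values are copied from the messages above; the helper name decode_hang is made up for illustration, and the 32-bit mask is an assumption based on the 8-digit hex values printed. Converting jiffies to seconds depends on the kernel HZ setting, which I have not verified on this appliance.

```python
# Values copied from the two "Detected Hardware Unit Hang" dumps (all hex).
DUMPS = {
    "07:23:22": {"TDH": "218", "TDT": "21b",
                 "time_stamp": "62cafbcd", "jiffies": "62cb0098"},
    "01:32:53": {"TDH": "1fc", "TDT": "1ff",
                 "time_stamp": "618a1136", "jiffies": "618a15f2"},
}


def decode_hang(dump):
    """Return (descriptors pending, jiffies since the stuck buffer was queued)."""
    # Tail minus head; ring wraparound is ignored because it does not
    # occur in either of these dumps.
    pending = int(dump["TDT"], 16) - int(dump["TDH"], 16)
    # Assumption: the counters are 32 bits wide, so mask the difference.
    stalled = (int(dump["jiffies"], 16) - int(dump["time_stamp"], 16)) & 0xFFFFFFFF
    return pending, stalled


for when, dump in DUMPS.items():
    pending, stalled = decode_hang(dump)
    print(f"{when}: {pending} descriptors pending, stalled for {stalled} jiffies")
```

Both events look the same: 3 descriptors were still waiting between TDH and TDT, and the oldest one had been queued roughly 1200 jiffies earlier.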
It looks like something is wrong with eth4. This interface is part of a bond interface, together with eth3. The bond serves as the Sync link between the two cluster members. All interfaces are 1G TP (twisted pair). The distance between the two cluster members is 40 km.
[Expert@GWB:0]# cphaconf show_bond bond1
Bond name: bond1
Bond mode: Load Sharing
Bond status: UP
Balancing mode: 802.3ad Layer3+4 Load Balancing
Configured slave interfaces: 2
In use slave interfaces: 2
Required slave interfaces: 1
Slave name | Status | Link
----------------+-----------------+-------
eth3 | Active | Yes
eth4 | Active | Yes
[Expert@GWB:0]# cat /proc/net/bonding/bond1
Ethernet Channel Bonding Driver: v3.2.4 (January 28, 2008)
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up
MII Polling Interval (ms): 100
Up Delay (ms): 200
Down Delay (ms): 200
802.3ad info
LACP rate: slow
Active Aggregator Info:
Aggregator ID: 1
Number of ports: 2
Actor Key: 17
Partner Key: 33071
Partner Mac Address: 00:23:04:ea:cd:05
Slave Interface: eth3
MII Status: up
Link Failure Count: 1
Permanent HW addr: 00:1c:7f:35:1e:67
Aggregator ID: 1
Slave Interface: eth4
MII Status: up
Link Failure Count: 3
Permanent HW addr: 00:1c:7f:35:1e:69
Aggregator ID: 1
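To track whether eth4 keeps flapping more often than eth3, the per-slave "Link Failure Count" can be pulled out of /proc/net/bonding/bond1 with a short script. This is only a sketch: the function name link_failures is hypothetical, and the sample text is a trimmed copy of the output above rather than a live read of the file.

```python
import re


def link_failures(bonding_text):
    """Map each slave interface name to its 'Link Failure Count'."""
    counts = {}
    slave = None
    for line in bonding_text.splitlines():
        line = line.strip()
        m = re.match(r"Slave Interface:\s*(\S+)", line)
        if m:
            slave = m.group(1)  # following lines belong to this slave
            continue
        m = re.match(r"Link Failure Count:\s*(\d+)", line)
        if m and slave is not None:
            counts[slave] = int(m.group(1))
    return counts


# Trimmed copy of the bonding output shown above.
SAMPLE = """\
Slave Interface: eth3
MII Status: up
Link Failure Count: 1
Slave Interface: eth4
MII Status: up
Link Failure Count: 3
"""

print(link_failures(SAMPLE))
```

On the gateway itself the same function could be fed the contents of /proc/net/bonding/bond1 instead of SAMPLE.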
Interface statistics:
[Expert@GWB:0]# ethtool -i eth3
driver: e1000e
version: 2.1.4-NAPI
firmware-version: 2.1-0
bus-info: 0000:0b:00.0
[Expert@GWB:0]# ethtool -i eth4
driver: e1000e
version: 2.1.4-NAPI
firmware-version: 2.1-0
bus-info: 0000:0f:00.0
[Expert@GWB:0]# ifconfig eth3
eth3 Link encap:Ethernet HWaddr 00:1C:7F:35:1E:67
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:1755761152 errors:0 dropped:0 overruns:0 frame:0
TX packets:1722666300 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2574922192 (2.3 GiB) TX bytes:4123666606 (3.8 GiB)
Interrupt:185 Memory:fe9e0000-fea00000
[Expert@GWB:0]# ifconfig eth4
eth4 Link encap:Ethernet HWaddr 00:1C:7F:35:1E:67
UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1
RX packets:74346835 errors:0 dropped:0 overruns:0 frame:0
TX packets:129992397 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:2213595757 (2.0 GiB) TX bytes:2373454427 (2.2 GiB)
Interrupt:185 Memory:febe0000-fec00000
[Expert@GWB:0]# netstat -ani
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
bond1 1500 0 1830071571 0 0 0 1852622059 0 0 0 BMmRU
eth3 1500 0 1755725351 0 0 0 1722632073 0 0 0 BMsRU
eth4 1500 0 74346220 0 0 0 129989986 0 0 0 BMsRU
[Expert@GWB:0]# ethtool -S eth4
NIC statistics:
rx_packets: 74346141
tx_packets: 129989592
rx_bytes: 36870531721
tx_bytes: 75906486583
rx_broadcast: 72425523
tx_broadcast: 128068474
rx_multicast: 1917491
tx_multicast: 1917919
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 1917491
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 201
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 36870531721
rx_csum_offload_good: 72427507
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
[Expert@GWB:0]# ethtool -S eth3
NIC statistics:
rx_packets: 1755721963
tx_packets: 1722628786
rx_bytes: 181392964784
tx_bytes: 212883823014
rx_broadcast: 1753601078
tx_broadcast: 1720420399
rx_multicast: 1917328
tx_multicast: 1917924
rx_errors: 0
tx_errors: 0
tx_dropped: 0
multicast: 1917328
collisions: 0
rx_length_errors: 0
rx_over_errors: 0
rx_crc_errors: 0
rx_frame_errors: 0
rx_no_buffer_count: 0
rx_missed_errors: 0
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
tx_restart_queue: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 181392964784
rx_csum_offload_good: 1753575650
rx_csum_offload_errors: 0
rx_header_split: 0
alloc_rx_buff_failed: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
rx_dma_failed: 0
tx_dma_failed: 0
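Since the two counter lists are long, a quick way to spot what differs is to parse the "name: value" lines and keep only the non-zero problem counters. The sketch below uses hypothetical helper names and a keyword list I picked myself; the sample input is a shortened copy of the eth4 and eth3 output above.

```python
# Keywords chosen for illustration; they are not an official list.
PROBLEM_TAGS = ("error", "drop", "fail", "timeout", "restart", "missed", "no_buffer")


def parse_ethtool_stats(text):
    """Parse 'name: value' lines from `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips the 'NIC statistics:' header
            stats[name.strip()] = int(value)
    return stats


def nonzero_problem_counters(stats):
    """Keep only counters whose name looks like a problem and whose value is not 0."""
    return {k: v for k, v in stats.items()
            if v != 0 and any(tag in k for tag in PROBLEM_TAGS)}


# Shortened copies of the output above.
ETH4 = """NIC statistics:
rx_errors: 0
tx_errors: 0
rx_crc_errors: 0
tx_timeout_count: 0
tx_restart_queue: 201
"""
ETH3 = """NIC statistics:
rx_errors: 0
tx_errors: 0
rx_crc_errors: 0
tx_timeout_count: 0
tx_restart_queue: 0
"""

print("eth4:", nonzero_problem_counters(parse_ethtool_stats(ETH4)))
print("eth3:", nonzero_problem_counters(parse_ethtool_stats(ETH3)))
```

In the full output above, the only such counter that is not zero is tx_restart_queue: 201 on eth4 versus 0 on eth3.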
Sync stats:
[Expert@GWB:0]# fw ctl pstat
System Capacity Summary:
Memory used: 24% (317 MB out of 1318 MB) - below watermark
Concurrent Connections: 95 (Unlimited)
Aggressive Aging is not active
Hash kernel memory (hmem) statistics:
Total memory allocated: 134217728 bytes in 32768 (4096 bytes) blocks using 32 pools
Total memory bytes used: 18721768 unused: 115495960 (86.05%) peak: 37117084
Total memory blocks used: 6293 unused: 26475 (80%) peak: 9638
Allocations: 3370097210 alloc, 0 failed alloc, 3369886583 free
System kernel memory (smem) statistics:
Total memory bytes used: 306889492 peak: 310893628
Total memory bytes wasted: 24536339
Blocking memory bytes used: 6230064 peak: 6722532
Non-Blocking memory bytes used: 300659428 peak: 304171096
Allocations: 21376530 alloc, 0 failed alloc, 21372364 free, 0 failed free
vmalloc bytes used: 6291456 expensive: yes
Kernel memory (kmem) statistics:
Total memory bytes used: 191218320 peak: 207504120
Allocations: 3391446510 alloc, 0 failed alloc
3391233868 free, 0 failed free
External Allocations: 0 for packets, 93818736 for SXL
Cookies:
3761509268 total, 42662 alloc, 42662 free,
3777276 dup, 4288962429 get, 247007545 put,
3969813062 len, 119751361 cached len, 0 chain alloc,
0 chain free
Connections:
110810622 total, 58628503 TCP, 46986057 UDP, 5196047 ICMP,
15 other, 0 anticipated, 1473 recovered, 95 concurrent,
4935 peak concurrent
Fragments:
309498393 fragments, 112535931 packets, 19491 expired, 0 short,
0 large, 0 duplicates, 0 failures
NAT:
9569835/0 forw, 7439782/0 bckw, 7345619 tcpudp,
234246 icmp, 2178810-2955934 alloc
Sync:
Version: new
Status: Able to Send/Receive sync packets
Sync packets sent:
total : 256132610, retransmitted : 945, retrans reqs : 254, acks : 1120386
Sync packets received:
total : 143020690, were queued : 1949226, dropped by net : 928271
retrans reqs : 443, received 2619713 acks
retrans reqs for illegal seq : 0
dropped updates as a result of sync overload: 0
Callback statistics: handled 69537 cb, average delay : 1, max delay : 56
[Expert@GWB:0]# cphaprob syncstat
Sync Statistics (IDs of F&A Peers - 1 ):
Other Member Updates:
Sent retransmission requests................... 254
Avg missing updates per request................ 1
Old or too-new arriving updates................ 126
Unsynced missing updates....................... 0
Lost sync connection (num of events)........... 133
Timed out sync connection ..................... 0
Local Updates:
Total generated updates ....................... 21792644
Recv Retransmission requests................... 443
Recv Duplicate Retrans request................. 0
Blocking Events................................ 0
Blocked packets................................ 0
Max length of sending queue.................... 0
Avg length of sending queue.................... 0
Hold Pkts events............................... 69537
Unhold Pkt events.............................. 69537
Not held due to no members..................... 1
Max held duration (sync ticks)................. 0
Avg held duration (sync ticks)................. 0
Timers:
Sync tick (ms)................................. 100
CPHA tick (ms)................................. 500
Queues:
Sending queue size............................. 512
Receiving queue size........................... 512
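To put the sync counters into proportion, the drop and retransmission rates can be worked out from the fw ctl pstat numbers above. This is a rough sketch with the values copied by hand; the dictionary keys and the pct helper are names I made up, and the counters are cumulative since the last reset, so the rates are long-term averages only.

```python
# Counters copied from the "Sync:" section of `fw ctl pstat` above.
SYNC = {
    "sent_total": 256132610,
    "sent_retransmitted": 945,
    "recv_total": 143020690,
    "recv_dropped_by_net": 928271,
}


def pct(part, whole):
    """Return part as a percentage of whole (0.0 if whole is zero)."""
    return 100.0 * part / whole if whole else 0.0


drop_rate = pct(SYNC["recv_dropped_by_net"], SYNC["recv_total"])
retrans_rate = pct(SYNC["sent_retransmitted"], SYNC["sent_total"])

print(f"received sync packets dropped by net: {drop_rate:.3f}%")
print(f"sent sync packets retransmitted:      {retrans_rate:.5f}%")
```

That works out to roughly 0.65% of received sync packets counted as "dropped by net", while retransmissions are a tiny fraction of what was sent.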
I am not sure whether this is a Check Point issue or a Linux-related bug.
Any ideas?
Kind regards,
Jozko Mrkvicka