Hello Timothy,
thank you for your reply, and please excuse my delayed answer. I had to get back to work first to gather the information you need.
Here is the output of the two commands:
Kernel Interface table
Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR TX-OK TX-ERR TX-DRP TX-OVR Flg
Sync 1500 0 7328294935 0 0 0 41413517 0 0 0 BMsRU
bond0 1500 0 9048414719 4 0 0 22994929759 0 0 0 BMmRU
bond1 9216 0 8744713065 0 0 0 7744539947 0 0 0 BMmRU
bond1.411 9216 0 349523906 0 0 0 325117534 0 0 0 BMmRU
bond1.3073 9216 0 8391272904 0 0 0 7416811534 0 0 0 BMmRU
bond2 9216 0 362424970693 745 1151959 1151959 316723998361 0 0 0 BMmRU
bond2.10 9216 0 362419538404 0 0 0 316721387585 0 0 0 BMmRU
bond3 9216 0 480266691808 0 2443 0 523231391186 0 0 0 BMmRU
bond3.11 9216 0 2379446058 0 0 0 1167994665 0 0 0 BMmRU
bond3.12 9216 0 1526445627 0 0 0 468645893 0 0 0 BMmRU
bond3.13 9216 0 1572019373 0 0 0 322065773 0 0 0 BMmRU
bond3.14 9216 0 1524996780 0 0 0 301482132 0 0 0 BMmRU
bond3.21 9216 0 139283659108 0 0 0 127485706715 0 0 0 BMmRU
bond3.24 9216 0 214975474536 0 0 0 295802590825 0 0 0 BMmRU
bond3.101 9216 0 7590048591 0 0 0 509953757 0 0 0 BMmRU
bond3.102 9216 0 320384413 0 0 0 301545910 0 0 0 BMmRU
bond3.103 9216 0 61471579174 0 0 0 34935517725 0 0 0 BMmRU
bond3.104 9216 0 5398435810 0 0 0 5789230865 0 0 0 BMmRU
bond3.105 9216 0 4688468783 0 0 0 4270428459 0 0 0 BMmRU
bond3.106 9216 0 320384423 0 0 0 301542853 0 0 0 BMmRU
bond3.107 9216 0 320425547 0 0 0 301609300 0 0 0 BMmRU
bond3.108 9216 0 1109861580 0 0 0 2150879185 0 0 0 BMmRU
bond3.109 9216 0 320384401 0 0 0 301582175 0 0 0 BMmRU
bond3.110 9216 0 320384402 0 0 0 301582546 0 0 0 BMmRU
bond3.111 9216 0 10056794219 0 0 0 18185207716 0 0 0 BMmRU
bond3.112 9216 0 320384405 0 0 0 301653645 0 0 0 BMmRU
bond3.113 9216 0 320384416 0 0 0 301555121 0 0 0 BMmRU
bond3.114 9216 0 320384414 0 0 0 301565349 0 0 0 BMmRU
bond3.115 9216 0 347088603 0 0 0 305894329 0 0 0 BMmRU
bond3.116 9216 0 320770148 0 0 0 301992744 0 0 0 BMmRU
bond3.121 9216 0 24256798972 0 0 0 27685762455 0 0 0 BMmRU
bond3.161 9216 0 335102012 0 0 0 316038632 0 0 0 BMmRU
bond3.162 9216 0 344942645 0 0 0 333455397 0 0 0 BMmRU
bond3.1172 9216 0 259854298 0 0 0 240195171 0 0 0 BMmRU
bond3.1173 9216 0 255319208 0 0 0 240490853 0 0 0 BMmRU
eth1-01 1500 0 1720119793 4 0 0 22953516347 0 0 0 BMsRU
eth1-02 1500 0 22263179382 654 24 24 16275195207 0 0 0 BMRU
eth1-03 9216 0 322319396 0 0 0 190421067416 0 0 0 BMsRU
eth1-04 9216 0 362102653198 745 1151959 1151959 126302933128 0 0 0 BMsRU
eth2-01 9216 0 5257767598 0 0 0 4170483319 0 0 0 BMsRU
eth2-02 9216 0 130950036244 0 152 0 121342122054 0 0 0 BMsRU
eth3-01 9216 0 3486945473 0 0 0 3574056632 0 0 0 BMsRU
eth3-02 9216 0 138743790512 0 0 0 118395834415 0 0 0 BMsRU
eth3-03 9216 0 136903393940 0 1460 0 166674595766 0 0 0 BMsRU
eth3-04 9216 0 73669474382 0 831 0 116818841944 0 0 0 BMsRU
lo 16436 0 25593339 0 0 0 25593339 0 0 0 LRU
ethtool -S eth1-04
NIC statistics:
rx_packets: 362103638440
tx_packets: 126303479177
rx_bytes: 294503297876960
tx_bytes: 61213346090829
rx_broadcast: 4028
tx_broadcast: 654446
rx_multicast: 67084922
tx_multicast: 1958037
multicast: 67084922
collisions: 0
rx_crc_errors: 602
rx_no_buffer_count: 33786
rx_missed_errors: 1151959
tx_aborted_errors: 0
tx_carrier_errors: 0
tx_window_errors: 0
tx_abort_late_coll: 0
tx_deferred_ok: 0
tx_single_coll_ok: 0
tx_multi_coll_ok: 0
tx_timeout_count: 0
rx_long_length_errors: 0
rx_short_length_errors: 0
rx_align_errors: 0
tx_tcp_seg_good: 0
tx_tcp_seg_failed: 0
rx_flow_control_xon: 0
rx_flow_control_xoff: 0
tx_flow_control_xon: 0
tx_flow_control_xoff: 0
rx_long_byte_count: 294503297876960
tx_dma_out_of_sync: 0
lro_aggregated: 0
lro_flushed: 0
lro_recycled: 0
tx_smbus: 0
rx_smbus: 0
dropped_smbus: 0
os2bmc_rx_by_bmc: 0
os2bmc_tx_by_bmc: 0
os2bmc_tx_by_host: 0
os2bmc_rx_by_host: 0
rx_errors: 745
tx_errors: 0
tx_dropped: 0
rx_length_errors: 0
rx_over_errors: 0
rx_frame_errors: 0
rx_fifo_errors: 1151959
tx_fifo_errors: 0
tx_heartbeat_errors: 0
tx_queue_0_packets: 126303479177
tx_queue_0_bytes: 60101692092153
tx_queue_0_restart: 26
rx_queue_0_packets: 362103638442
rx_queue_0_bytes: 291606479635784
rx_queue_0_drops: 0
rx_queue_0_csum_err: 1447709
rx_queue_0_alloc_failed: 0
The problematic interface (bond2) consists of the two interfaces eth1-03 and eth1-04, and we see the errors on eth1-04.
It may be of interest to you that both interfaces are copper GBit. We don't see this behavior on the 10GBit interfaces. I have noticed similar problems on another cluster, so I could imagine that we have a general issue with the configuration of the GBit copper interfaces.
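To put the counters into proportion, here is a minimal Python sketch that reads a few rows copied from the interface table above and reports RX errors plus drops relative to RX-OK. It is only an illustration: the function name `rx_problem_ratios` is made up, and it assumes the 12-column layout shown in the table header.

```python
# Subset of rows copied from the interface table above.
NETSTAT_ROWS = """\
bond2 9216 0 362424970693 745 1151959 1151959 316723998361 0 0 0 BMmRU
eth1-03 9216 0 322319396 0 0 0 190421067416 0 0 0 BMsRU
eth1-04 9216 0 362102653198 745 1151959 1151959 126302933128 0 0 0 BMsRU
"""


def rx_problem_ratios(text):
    """Return {iface: (RX-ERR + RX-DRP) / RX-OK} for rows with RX errors or drops.

    Assumes the columns: Iface MTU Met RX-OK RX-ERR RX-DRP RX-OVR
    TX-OK TX-ERR TX-DRP TX-OVR Flg (hypothetical helper, not a vendor tool).
    """
    result = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 12:
            continue  # skip headers or malformed lines
        iface = fields[0]
        rx_ok, rx_err, rx_drp = (int(v) for v in fields[3:6])
        if rx_ok and (rx_err or rx_drp):
            result[iface] = (rx_err + rx_drp) / rx_ok
    return result


ratios = rx_problem_ratios(NETSTAT_ROWS)
for iface, ratio in sorted(ratios.items()):
    print(f"{iface}: {ratio:.2e} RX errors+drops per RX-OK frame")
```

With these three rows, only bond2 and eth1-04 are reported, which matches the observation that the errors come from eth1-04 and not from eth1-03.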
Thank you in advance for your help.
Best regards
Sascha