Hi All,
I posted about a year ago regarding unscheduled failovers between a pair of Check Point SMB 1590s running R80.20.35. Having opened a case with Check Point support, we tried changing the DNS probes to use IP addresses when testing for internet connectivity, which vastly reduced the number of failovers. Within the past few weeks, failovers have started to occur more regularly again. I am speaking with the customer regarding their internet connectivity, as initial investigation shows that both nodes can display the following before a failover:
2023 Mar 1 16:44:33 Gatekeeper2 user.info cposd: [CPOSD] WAN connection "Internet1": Internet connection probe status has changed to Disconnected. servers: 3, fails: 10, attempts: 30
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: reading /etc/resolv.conf
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.8.8#53
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.4.4#53
2023 Mar 1 16:44:33 Gatekeeper2 daemon.info dnsmasq: read /var/hosts - 31 addresses
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: reading /etc/resolv.conf
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.8.8#53
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.4.4#53
2023 Mar 1 16:44:42 Gatekeeper2 daemon.info dnsmasq: read /var/hosts - 31 addresses
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: reading /etc/resolv.conf
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.8.8#53
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: using nameserver 8.8.4.4#53
2023 Mar 1 16:44:44 Gatekeeper2 daemon.info dnsmasq: read /var/hosts - 31 addresses
2023 Mar 1 16:44:45 Gatekeeper2 user.info cposd: [CPOSD] WAN connection "Internet1": Internet connection probe status has changed to Connected. servers: 3, fails: 9, attempts: 30
2023 Mar 1 16:44:46 Gatekeeper2 user.info lua: [Security Settings] A policy change has been applied
2023 Mar 1 16:44:46 Gatekeeper2 user.info lua: [Security Settings] High Availability policy change has been applied
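In case it helps anyone watching for the same pattern, the flap can be caught live by tailing the syslog on both members. A rough sketch of what I run (I'm assuming syslog writes to /var/log/messages on the Embedded image; adjust the path if yours differs):

tail -f /var/log/messages | grep -Ei 'cposd|dnsmasq'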
Upon further investigation with the software vendor these were purchased from, we saw the following in dmesg:
[29837928.418739] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838238.763245] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838389.930050] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838758.413897] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838807.134983] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29838866.139433] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29839793.944978] mvpp2 f2000000.ethernet eth0: bad rx status 13008514 (crc error), size=66
[29839895.013737] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29839925.978426] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29841751.537571] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29841907.975809] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29842040.630989] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
[29843430.323976] mvpp2 f2000000.ethernet eth0: bad rx status 12018514 (crc error), size=1420
Relative to the total number of packets, the number of CRC errors is low:
eth0 Link encap:Ethernet HWaddr 00:1C:7F:AE:0A:A2
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:106054832559 errors:11201 dropped:0 overruns:0 frame:0
TX packets:105562308040 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:2048
RX bytes:98857319801669 (89.9 TiB) TX bytes:97218218207773 (88.4 TiB)
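For context, that works out to roughly one errored frame per 9.5 million received. A quick back-of-the-envelope check (plain awk, nothing Check Point specific):

awk 'BEGIN { printf "%.6f%%\n", 11201 / 106054832559 * 100 }'
# prints 0.000011% - about 1 bad frame per ~9.5 million received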
Now, I read in a previous SK that the eth0 MAC address is the CPU interface to which the LAN1 to LAN8 ports are connected, hence all interfaces have the same MAC (default behaviour), so I cannot easily identify whether the issue is caused by a LAN cable, a switch port, etc. Additionally, none of the LAN1-8 interfaces are showing RX errors. I also read about possible *cosmetic errors* on virtual interfaces in another SK.
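For completeness, this is the quick sweep I use to confirm the per-port counters are clean (interface names LAN1 to LAN8 as they appear on these units; whether ethtool ships on the Embedded image is an assumption on my part, so the second command may not be available):

for i in $(seq 1 8); do echo "LAN$i:"; ifconfig LAN$i | grep errors; done
ethtool -S eth0 | grep -i err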
In the logs I also see:
Event Code: CLUS-114704
State change: STANDBY -> ACTIVE
Reason for state change: No other ACTIVE members have been found in the cluster
This is from cphaprob state. Is it stating that one cluster interface is not seeing traffic from its peer, or perhaps just the sync interface?
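When the next failover happens I'll try to capture the cluster's own interface view rather than inferring from the event code alone. A sketch of what I plan to run, assuming the standard cphaprob syntax behaves the same on Embedded as it does on full Gaia:

cphaprob state     # overall member states
cphaprob -a if     # monitored interfaces and their up/down status
cphaprob syncstat  # sync transport statistics, to rule the sync link in or out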
Both units have been up for well over 300 days, but we have an outage window to patch (possibly up to R81) and reboot within the next two weeks, at which point I can check any cabling. I don't know if there are any specific types of LAN cables we should be using. The issue has only been seen on one of the two units.
I will check the switch statistics shortly too.
I was wondering if anyone had any views or experiences with this?
Thanks and Regards
Dek